Archive

Posts Tagged ‘Drupal’

How to upgrade Drupal 6 to Drupal 7

Another detailed article, this time on how to upgrade from D6 to D7: http://www.ostraining.com/blog/drupal/migrate-drupal-6-to-drupal-7/

Categories: Drupal, English

Nginx + CDN + GoogleBot or how to avoid many useless Googlebot hits

If you’re like me and you’ve set up CDN subdomains for your website’s content (while waiting for SPDY to be widely adopted and available in mainstream distributions), you might have noticed that the Googlebot frequently scans your CDN subdomains, which can put noticeable extra load on your website.

After all, the goal of the CDN (there are several possible goals, but in my case the only one) is to distribute content across subdomains so that your browser loads the page resources faster (otherwise it gets held back by the HTTP limit on simultaneous downloads from a single host, or any higher limit set by your browser).

Hell, in my case, this is the number of page scans per day originating from the Googlebot on only one of my CDN-enabled sites (I think there are like 5 different subdomains). And these are only the IPs that requested the site the most:


3398: 66.249.73.186
1380: 66.249.73.27
1328: 66.249.73.15
1279: 66.249.73.214
1277: 66.249.73.179
1109: 66.249.73.181
1109: 66.249.73.48
1015: 66.249.73.38
822: 66.249.73.112
738: 66.249.73.182

As you can see, it adds up to about 13,500 requests in just 24h (13,455 for these ten IPs alone). On the main site (the www. prefixed one), I still get 10,000 requests per day from the Googlebot.
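
A per-IP count like the one listed above can be pulled out of an access log with standard tools. The following is only a sketch: it assumes an access log that covers the 24h you are interested in, where the client IP is the first field of each line and the user agent appears somewhere on the line. The function name is made up for the example.

```shell
# Hypothetical helper: list the IPs that sent the most requests with a
# Googlebot user agent, formatted as "count: ip".
# $1: path to an access log (client IP assumed to be the first field)
# $2: how many IPs to show (defaults to 10)
top_googlebot_ips() {
  grep -i 'googlebot' "$1" \
    | awk '{ print $1 }' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head -n "${2:-10}" \
    | awk '{ print $1 ": " $2 }'
}
```

You would call it with the path to the access log of the CDN virtual host (whatever that path is on your server), optionally followed by the number of IPs to show.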

So, if you want to avoid that: fixing it in Apache is out of scope here, but you could easily do it with a RewriteCond line.
Doing it in Nginx should be relatively easy if you have different virtual host files for your main site and the CDN (which is recommended, as they generally have different caching behaviour, etc.). Find the top “location” block in your Nginx configuration. In my case, it looks like this:

        location / {
                index  index.php index.html index.htm;
                try_files $uri $uri/ @rewrite;
        }

Change it to the following (replace yoursite.com with the name of your site):

        location / {
                index  index.php index.html index.htm;
                # Avoid Googlebot in here
                if ($http_user_agent ~ Googlebot) {
                    return 301 http://www.yoursite.com$request_uri;
                }
                try_files $uri $uri/ @rewrite;
        }

Reload your Nginx configuration and… done.

To test it, use the User Agent Switcher extension for Firefox. Beware that your browser generally uses DNS caching, so if you have already loaded the page, you will probably have to restart your browser (or maybe use a new browser instance with firefox --no-remote and install the extension in that one *before* loading the page).

Once the extension is installed, choose one of the Googlebot user agents in Tools -> Default User Agent -> Spider – Search, then load your cdn page: you should get redirected to the www page straight away.
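
If you prefer the command line to a browser extension, the same check can be sketched with curl. This is a hedged example: the function name and the cdn1.yoursite.com host are placeholders, and the user agent string is simply one that contains “Googlebot”, which is all the Nginx rule above looks at.

```shell
# Hypothetical helper: request a URL while pretending to be Googlebot and
# print the HTTP status code followed by the redirect target (if any).
# $1: URL on the CDN subdomain, e.g. http://cdn1.yoursite.com/ (placeholder)
googlebot_check() {
  curl -s -o /dev/null -I \
    -A 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' \
    -w '%{http_code} %{redirect_url}\n' \
    "$1"
}
```

With the rule in place, the line printed should start with 301 and show the www address. Since curl keeps no cache between runs, there is no need to restart anything between tests.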

Cache management in Drupal 6 & 7

September 24, 2012

There’s a nice article from Makina-Corpus on the topic here: http://www.makina-corpus.org/blog/separate-cache-backends-drupal6-and-drupal7

Notably, the cache_backport module in Drupal 6 backports the cache handlers from Drupal 7 (replaces cacherouter, in a way).

Categories: Drupal, English

Drupal 7 + HTTPS + Nginx + Varnish + Apache + Boost + APC + Securepages + Drupal

September 20, 2012

If you happen to develop large sites in Drupal, you might fall upon a case like this one, where different servers (namely at least one reverse proxy and one web server) interact, causing a series of chain reactions every time you change something.

It might be frustrating, at times, to try and boost a coordinated system like this, and end up getting your users frustrated because part of it doesn’t work, when the rest (the part that *does* work) is super-fast. Come on, you deserve some praise!

Let’s explain a little bit about what requirements might lead to this system…

First of all, you want a website. You don’t want to play with php-fpm yet, so you decide to use Apache + mod-PHP. Great.

Then you want to boost the page interpretation time a little bit, so you decide to use APC. Nothing wrong there.
You now have Apache + PHP + APC + Drupal in line.

Then you want to make sure anonymous users get pre-generated pages to go even faster. You add the Boost module to the loop.
You now have Apache + Boost + PHP + APC + Drupal.

Then you want to speed up loading small static resources, like icons and stuff. You decide to add Varnish in front of the queue.
You now have Varnish + Apache + Boost + PHP + APC + Drupal.

And then comes the late requirement for HTTPS… Damn… Varnish doesn’t support HTTPS!? Man, and now what? Well… you can use Nginx to decrypt HTTPS and hand things over to Varnish, to carry on down the normal queue.
You now have HTTPS + Nginx + Varnish + Apache + Boost + PHP + APC + Drupal.

And finally, your customer tells you that, because of the extra load (between 20% and 400%) generated by the HTTPS encryption, he only wants specific pages to be HTTPS. Apart from the mess of redirecting pages to HTTP when they shouldn’t be HTTPS (which can be done in Nginx’s config), you will also need to define rules to send HTTP pages to HTTPS when they should be secure. You can do that with Drupal’s Securepages module.
You finally have your (rather) complete schema of HTTPS + Nginx + Varnish + Apache + Boost + PHP + APC + Securepages + Drupal.

Here is a list of things to think about when doing this:

  1. the Boost module had a bug in early 2012, which didn’t define the DRUPAL_UID cookie very consistently (see previous article on this blog). Because of that, users might lose their session “from time to time” (which is more than frustrating for both you and your customer)
  2. in the specific case of requesting pages to be HTTPS without being logged in (which has the effect of not generating a Drupal session cookie), the user will first get to Varnish, which will pass on to Apache, which will pass on to Drupal (I’m skipping a few steps here), which will send an HTTP redirect to Varnish, and Varnish will (cache it and) pass it to the user. The user (the user’s browser, in practice) then calls the same URL over HTTPS. Now Nginx takes over, decrypting HTTPS into HTTP before passing the request on to Varnish. When Varnish receives the call, whether it is cached or not, the response it sends is a 301 redirect to the same URL over HTTPS. And so you are now in a loop.
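
One way to confirm from the command line that you have hit this kind of loop is to let curl follow the redirects with a cap on the number of hops. This is a sketch only: the function name is invented for the example, and it relies on curl exiting with status 47 when the cap set by --max-redirs is exceeded.

```shell
# Hypothetical helper: succeed (exit status 0) when following redirects from
# a URL exceeds the allowed number of hops, which is what a loop looks like.
# $1: URL to test
# $2: maximum number of redirects to follow (defaults to 10)
has_redirect_loop() {
  status=0
  curl -s -o /dev/null -L --max-redirs "${2:-10}" "$1" || status=$?
  # curl exits with 47 when it gives up because of too many redirects
  [ "$status" -eq 47 ]
}
```

Run it against one of the pages that Securepages is supposed to serve over HTTPS, without sending any session cookie, since that is the case described in point 2.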

Categories: Drupal, English, php, Techie

Drupal site with Varnish, returning page without style on CTRL+F5

I had serious problems with a Drupal website with many Varnish optimizations. It turned out that one of them, a return(lookup) on image and CSS extensions, was really the one causing the problem:

if (req.url ~ "\.(png|jpg|jpeg|swf|css|ico)") {
  return(lookup);
}

Now I don’t remember precisely why I added this condition in the first place (lookup means you force Varnish to re-use the version it has in cache) but apparently in my case it doesn’t suit my purposes.

The most common actions you can decide to ask Varnish to execute in the vcl_fetch can be found here: https://www.varnish-cache.org/docs/2.1/tutorial/vcl.html#actions

In short:

pass
When you call pass the request and subsequent response will be passed to and from the backend server. It won’t be cached. pass can be called in both vcl_recv and vcl_fetch.
lookup
When you call lookup from vcl_recv you tell Varnish to deliver content from cache even if the request otherwise indicates that the request should be passed. You can’t call lookup from vcl_fetch.
pipe
Pipe can be called from vcl_recv as well. Pipe short circuits the client and the backend connections and Varnish will just sit there and shuffle bytes back and forth. Varnish will not look at the data being sent back and forth, so your logs will be incomplete. Beware that with HTTP 1.1 a client can send several requests on the same connection and so you should instruct Varnish to add a “Connection: close” header before actually calling pipe.
deliver
Deliver the cached object to the client. Usually called in vcl_fetch.
esi
ESI-process the fetched document.

Spider a website with wget

This command might be useful if you want to auto-generate the Boost module cache files on a Drupal site

wget -r -l4 --spider -D thesite.com http://www.thesite.com

Let’s analyse the options…

-r indicates it’s recursive (so “follow the links” and look for more than one page)

-l indicates the number of levels we want to recurse. If you are on the first page and you follow a link, you are at level 1. If you follow a link on that last page, you are at level 2, etc

--spider indicates not to download anything (we just want to go through the pages, that’s all)

-D indicates the list (separated by commas) of domains where we think it’s acceptable to “spider” (that is, if a link points to “hello.com”, we won’t follow it)

This will create a hierarchy of directories where you start executing the command, but it’s mostly a list to know where it’s been. It doesn’t store anything (as per the “--spider” option).

If you know your site takes some time to deliver pages, you might want to set the timeout to something like 20 seconds. Although the wget documentation seems to say that the default is 900 seconds, for some reason it tends to give up earlier in my case.

You might also want to “fake” your user agent in case you have a website that reacts to mobile phones (in this case we simulate an iPhone):

 wget -r -l4 --spider --delete-after --user-agent="iOS 4_3 - iPhone - Safari 533.17.9" --timeout=20 -D m.thesite.com http://www.thesite.com

The --delete-after option tells wget to delete each file after it has downloaded it, which apparently might matter if you’re using a proxy (beware that in this case the file will be stored in your proxy and, next time you check it, it will come from there, as far as I understand it). In my case, it is not necessary.

Categories: Drupal, English

The Drupal 6 bootstrap easy debug

Just as a self reminder, and because I don’t much fancy digging into the Drupal core for debugging, here is a short explanation of how the Drupal 6 bootstrap mechanism works.

First of all, a bootstrap mechanism is a mechanism by which you progressively work your way through the full loading of a system, step by step, starting with the loading of simple elements that will allow you to load more complex elements. The Linux system also has a bootstrap mechanism (as do most OSes). For operating systems, bootstrapping means you first load a little bit of code which will enable the computer to know how to deal with memory and compiled code, which then allows you to load the Linux kernel, which then allows you to load (and execute) much more stuff, like your desktop interface, etc.

So in Drupal, as you might have realized, everything goes through the /index.php file, which looks like this:

require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

$return = menu_execute_active_handler();

// Menu status constants are integers; page content is a string.
if (is_int($return)) {
  switch ($return) {
    case MENU_NOT_FOUND:
      drupal_not_found();
      break;
    case MENU_ACCESS_DENIED:
      drupal_access_denied();
      break;
    case MENU_SITE_OFFLINE:
      drupal_site_offline();
      break;
  }
}
elseif (isset($return)) {
  // Print any value (including an empty string) except NULL or undefined:
  print theme('page', $return);
}

drupal_page_footer();

We are only interested in the first bit here (the bootstrap):

require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

This bit literally tells us “load the bootstrap library” then “call the bootstrap mechanism with the level of bootstrap DRUPAL_BOOTSTRAP_FULL”.

For practical reasons, the DRUPAL_BOOTSTRAP_FULL constant’s value is actually 8, but it is called DRUPAL_BOOTSTRAP_FULL to make it more human. Incidentally, this constant is defined at the beginning of the included file “bootstrap.inc”.

Now, as you can see, we call the drupal_bootstrap() function, also located in bootstrap.inc (quite far down the file, at line 1305). Let’s see what it does.

function drupal_bootstrap($phase) {
  static $phases = array(DRUPAL_BOOTSTRAP_CONFIGURATION, DRUPAL_BOOTSTRAP_EARLY_PAGE_CACHE, DRUPAL_BOOTSTRAP_DATABASE, DRUPAL_BOOTSTRAP_ACCESS, DRUPAL_BOOTSTRAP_SESSION, DRUPAL_BOOTSTRAP_LATE_PAGE_CACHE, DRUPAL_BOOTSTRAP_LANGUAGE, DRUPAL_BOOTSTRAP_PATH, DRUPAL_BOOTSTRAP_FULL), 
    $phase_index = 0;

  while ($phase >= $phase_index && isset($phases[$phase_index])) {
    $current_phase = $phases[$phase_index];
    unset($phases[$phase_index++]);
    _drupal_bootstrap($current_phase);
  }
}

As we can see here, we prepare an array of “steps”. Because of the definition of each of the constants used here, this array is actually: $phases[0=>0,1=>1,2=>2,3=>3,4=>4,5=>5,6=>6,7=>7,8=>8], 8 being DRUPAL_BOOTSTRAP_FULL as we have seen previously.

So when calling this function, the index.php calls it with the value 8. The while condition that follows says this:

While 8 is greater than or equal to $phase_index (initially 0) and while there is a value for the $phases array element with index $phase_index:

  1. set $current_phase to the $phases element at index $phase_index
  2. unset that element of the $phases array AND increase $phase_index (so the next iteration executes the next $phases array element)
  3. call the _drupal_bootstrap() function with $current_phase (0 in the first case, then 1, then 2, etc. up to 8)

So the idea is that drupal_bootstrap() is the progressive mechanism that makes several levels of bootstrap execute.

If we move to the next step (the _drupal_bootstrap() function), we see the following structure:

  switch ($phase) {

    case DRUPAL_BOOTSTRAP_CONFIGURATION:
      drupal_unset_globals();
      timer_start('page');
      conf_init();
      break;

    case DRUPAL_BOOTSTRAP_EARLY_PAGE_CACHE:
      require_once variable_get('cache_inc', './includes/cache.inc');
      if (variable_get('page_cache_fastpath', FALSE) && page_cache_fastpath()) {
        exit;
      }
      break;

    case DRUPAL_BOOTSTRAP_DATABASE:
      ...

So we are indeed simply executing a separate bit of code for each level of the bootstrap of Drupal.

Sometimes you might have problems occurring during any phase of the bootstrap. To help you identify which step is broken, one (quite extreme, so don’t use it in production) method is to add calls to the die() function in any of these steps, and see if your Drupal page prints the message you pass to the die() call.

For example, to hack the first level:

    case DRUPAL_BOOTSTRAP_CONFIGURATION:
      drupal_unset_globals();
      die('called drupal_unset_globals() successfully');
      timer_start('page');
      conf_init();
      break;

This type of construction will stop the page right after calling drupal_unset_globals(), but if you used to have a blank screen, you will now get a message that tells you that the previous step worked. And you can thus work your way down the bootstrap in much the same way:

    case DRUPAL_BOOTSTRAP_CONFIGURATION:
      drupal_unset_globals();
      timer_start('page');
      conf_init();
      die('called conf_init() successfully');
      break;

Once you really get a blank screen again, you know that the previous step failed. You have thus got much closer to the real problem. The next step would then be to get into the last function called and use the die() technique again.

If you want to do that on a production server, use error_log() instead of die(). This will not show anything on your web page, but it will record the message in your Apache (or other web server) error log (generally something like /var/log/apache2|httpd/error_log), which will let your normal users use other parts of the website without realizing you are debugging, but still let you debug.
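
When debugging this way, the last marker written to the log tells you the last step that still worked. The sketch below assumes that every error_log() call you added starts its message with the same made-up prefix, bootstrap-debug:, and that you pass it the path of your own error log. Both the prefix and the function name are inventions for this example.

```shell
# Hypothetical helper: print the most recent bootstrap marker in an error log.
# $1: path to the web server error log
# Assumes the markers were written from PHP with a call such as
# error_log('bootstrap-debug: called conf_init() successfully');
last_bootstrap_marker() {
  grep 'bootstrap-debug:' "$1" | tail -n 1
}
```

Reload the failing page, run the function against the error log, and the line it prints points at the last marker reached before things went wrong.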

I hope this might have helped some of you starting with Drupal development/debugging. Don’t hesitate to leave me a message if it did, I appreciate it.
