Posts Tagged ‘Varnish’

Varnish and Nginx with Joomla

Sunday, June 28th, 2009

Recently we had a client that had some performance issues with a Joomla installation. The site wasn’t getting an incredible amount of traffic, but, the traffic it was getting was just absolutely overloading the server.

Since the machine hadn’t been having issues before, the first thing we did was contact the client and ask what had changed. We already knew which site and database were using most of the CPU time, but the bandwidth graph didn’t suggest that it was traffic overrunning the server. Our client rescued this client from another hosting company because the site was unusable during prime time, so we inherited a problem. During the move, the site was upgraded from 1.0 to 1.5, so we didn’t even have a decent baseline to revert to.

The stopgap solution was to move the .htaccess mod_rewrite rules into the apache configuration, which helped somewhat. We identified a few sections of the code that were getting hit really hard and wrote a mod_rewrite rule to serve those images directly from disk — bypassing Joomla, which had been serving those images itself. This made a large impact and at least got the site responsive enough that we could leave it online and work through the admin to figure out what had gone wrong.
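The exact rules aren't preserved here, but the idea was along these lines (a hedged sketch only, placed ahead of Joomla's catch-all rewrite):

# Sketch only: if the requested image already exists on disk, stop rewriting
# so Apache serves the file directly instead of handing the request to
# Joomla's index.php.
RewriteEngine On
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule \.(png|gif|jpe?g|ico)$ - [L]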

Some of the modules that had been enabled contributed to quite a bit of the performance headache. One chat module polled every second for each logged-in user to see if there were any pending messages, generating a 404 for each check. Since Joomla is loaded for every 404, this added quite a bit of extra processing. Another quick modification to the configuration eliminated dozens of bad requests. At this point, the server is responsive, the client is happy, and we make notes in the trouble ticket system and our internal documentation for reference.
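The change amounted to answering those polling requests before Joomla's catch-all rewrite could run. Roughly, and only as a hedged sketch (the module path is taken from the nginx config further down):

# Sketch only: return a plain 404 for missing chat message files so the
# per-second polling never loads Joomla. Must appear before the SEF catch-all.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^/?modules/mod_oneononechat/chatfiles/ - [L,R=404]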

Three days later the machine alerts and our load problem is back. After all of the changes, something is still having problems. Upon deeper inspection, we find that the portions of the system dealing with the menus are being regenerated on every request. There’s no built-in caching, so the decision is to try Varnish. Varnish has worked in the past for WordPress sites that have gotten hit hard, so we figured if we could cache the images, CSS and some of the static pages that don’t require authentication, we could get the server to be responsive again.

Apart from the basic configuration, our varnish.vcl file looked like this:

sub vcl_recv {
  # normalize the host header (dots escaped so they only match a literal dot)
  if (req.http.host ~ "^(www\.)?domain\.com$") {
    set req.http.host = "domain.com";
  }

  # strip cookies from static files so they can be cached
  if (req.url ~ "\.(png|gif|jpg|ico|jpeg|swf|css|js)$") {
    unset req.http.cookie;
  }
}

sub vcl_fetch {
  set obj.ttl = 60s;
  if (req.url ~ "\.(png|gif|jpg|ico|jpeg|swf|css|js)$") {
    set obj.ttl = 3600s;
  }
}

To get the apache logs to report the real client IP rather than Varnish’s address, you need to modify the VirtualHost config to log the forwarded IP.
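The same format string is shown again in a later post; in short, the vhost logs the X-Forwarded-For header instead of the connecting address:

# log the forwarded client IP rather than 127.0.0.1
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" varnishcombined
CustomLog /var/log/apache2/domain.com-access.log varnishcombined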

The performance of the site after running Varnish in front of Apache was quite good. Apache was left handling only .php requests and the server was again responsive. It ran like this for a week or more without any issues, with only a slight load spike here or there.

However, Joomla doesn’t like the fact that every request’s REMOTE_ADDR is 127.0.0.1, and some addons stop working. In particular, an application that allows the client to upload .pdf files into a library requires a valid IP address for some reason, and another module that adds a sub-administration panel for a manager/editor also requires an address other than 127.0.0.1.

With some reservation, we decide to switch to Nginx + FastCGI which removes the reverse proxy and should fix the IP address problems.

Our configuration for Nginx with Joomla:

server {
        listen 66.55.44.33:80;
	server_name  www.domain.com;
 	rewrite ^(.*) http://domain.com$1 permanent;
}
server {
        listen 66.55.44.33:80;
	server_name  domain.com;

	access_log  /var/log/nginx/domain.com-access.log;

	root   /var/www/domain.com;
	index  index.html index.htm index.php;

	location / {
		# anything that doesn't exist on disk is handed to Joomla's index.php
		# so the Search Engine Friendly (SEF) URLs keep working
		if ( !-e $request_filename ) {
			rewrite (/|\.php|\.html|\.htm|\.feed|\.pdf|\.raw|/[^.]*)$ /index.php last;
		}
	}

	error_page   500 502 503 504  /50x.html;
	location = /50x.html {
		root   /var/www/nginx-default;
	}

	location ~ \.php$ {
		fastcgi_pass   unix:/tmp/php-fastcgi.socket;
		fastcgi_index  index.php;
		fastcgi_param  SCRIPT_FILENAME  /var/www/domain.com/$fastcgi_script_name;
		include	fastcgi_params;
	}

        location /modules/mod_oneononechat/chatfiles/ {
           # return 404 directly for missing chat message files instead of
           # letting the SEF rewrite hand them to Joomla
           if ( !-e $request_filename ) {
             return 404;
           }
        }
}

With this configuration, Joomla is handed any URL that doesn’t correspond to a file on disk, which is what allows the Search Engine Friendly (SEF) links to work. The separate 404 location handles the oneononechat module, which polls for messages destined for the logged-in user.

With Nginx, the site is again responsive. Load spikes still occur from time to time, but the site is stable, has a lot less trouble dealing with the load, and recovers from the spikes pretty well.

However, a module called Rokmenu, which was included with the template design, appears to have issues. Running php behind FastCGI sometimes gives different results than running under mod_php, and it appears that Rokmenu relies on the path being passed in and doesn’t normalize it properly. So, when the menu is generated, with SEF on or off, URLs look like /index.php/index.php/index.php/components/com_docman/themes/default/images/icons/16x16/pdf.png.

Obviously this creates a broken link and causes more 404s. We installed a fresh Joomla on Apache, imported the data from the copy running on Nginx, and Apache with mod_php appears to work properly. However, the performance is quite poor.

In order to troubleshoot, we made a list of every addon and ran through some debugging. With apachebench, we wrote up a quick command line that could be pasted in at the ssh prompt and decided upon some metrics. Within minutes, our first test revealed 90% of our performance issue. Two of the addons required compatibility mode because they were written for 1.0 and hadn’t been updated. Turning on compatibility mode on our freshly installed site resulted in 10x worse performance. As a test, we disabled the two modules that relied on compatibility mode and turned off compatibility mode and the load dropped immensely. We had disabled SEF early on thinking it might be the issue, but, we found the performance problem almost immediately. Enabling other modules and subsequent tests showed marginal performance changes. Compatibility mode was our culprit the entire time.
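The exact one-liner isn't preserved, but it was roughly of this shape (URL and request counts are illustrative):

# Sketch of the kind of quick benchmark used between module changes: hit the
# front page and keep only the throughput and latency lines for comparison.
ab -k -n 500 -c 10 http://domain.com/ | egrep "Requests per second|Time per request"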

The client started a search for two modules to replace the two that required compatibility mode and disabled them temporarily while we moved the site back to Apache to fix the url issue in Rokmenu. At this point, the site was responsive, though, pageloads with lots of images were not as quick as they had been with Nginx or Varnish. At a later point, images and static files will be served from Nginx or Varnish, but, the site is fairly responsive and handles the load spikes reasonably well when Googlebot or another spider hits.

In the end, the site ran on Apache because Varnish and Nginx each had minor issues with this particular deployment. Moving to Apache alternatives doesn’t always fix everything and may introduce side-effects that you cannot work around.

Varnish proves itself against a DDOS

Saturday, May 2nd, 2009

I’ve worked a lot with Varnish over the last few weeks, and we’ve had a rather persistent attacker sending a small but annoying DDOS at a client on one of our machines. Usually we isolate the client and move their affected sites to a machine where they won’t affect other clients; then we can modify firewall rules, find the issue, wait for the attack to end and move them back. This results in a bit of turmoil because not every client is easy to shuffle around: some have multiple databases, and the application they are running may take a bit more horsepower during the attack.

In this case, the application wasn’t too badly written; it was just a matter of firewalling certain types of packets and modifying the TCP settings so connections would time out a bit more quickly while the attack persisted. In order to do this seamlessly, we had to move the IP address that client was using to another machine running Varnish.
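The exact settings aren't listed in our notes; the TCP tuning was along these lines (illustrative values only):

# Illustrative values only: shed half-open and closing connections more
# quickly while the attack persists.
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_synack_retries=2
sysctl -w net.ipv4.tcp_fin_timeout=15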

What we ended up with was Varnish running on a machine where we could freely firewall packets, turn on more verbose packet logging, and pull the requests from the original machine. Short of moving the IP address and making config changes on the existing machine, it was straightforward (a small config sketch follows the steps):

Original Machine
* changed the apache config to listen on port 81 of a different IP address
* modified the firewall to allow port 81
* shut down the virtual ethernet interface carrying the original IP
* restarted apache

Varnish Machine
* set up the backend to request files from port 81 on the new IP assigned to the old machine
* copied the firewall rules from the Original Machine to the Varnish Machine
* brought up the IP from the original machine
* restarted varnish

Finally, we cleared the ARP cache in the switches that both machines were connected to.
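Roughly, the two relevant pieces of configuration looked like this (the IP is a placeholder):

# Original Machine, apache config: listen on the replacement IP, port 81
Listen 66.55.44.34:81

# Varnish Machine, VCL: backend pointing at the Original Machine on port 81
backend default {
  .host = "66.55.44.34";
  .port = "81";
}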

Within seconds, the load on the Original Machine dropped to half of what it was before. Varnish had been running on that machine, but the DDOS was still hitting the firewall rules and causing apache to open connections; moving both of those pieces of the equation off the machine resulted in an immediate improvement. Since the same CPU horsepower is still being spent on the scripts (Varnish passes those requests through) and we’ve only removed some static file serving from the machine, I believe we can safely conclude that it wasn’t the application that had the problem. Apache has roughly the same number of processes as it had when we were running Varnish on that machine, so the load reduction appears to be mostly related to the firewall rules and the traffic that was still coming through.

Since moving the traffic over to the other machine, we see the same issues being exhibited there. As that machine isn’t doing anything but caching the apache responses, we can reasonably assume that the firewall is adding quite a bit of overhead. The inbound traffic on the Original Machine was cut almost in half, with a corresponding jump on the Varnish machine. Because Varnish receives traffic both from the original machine and from the DDOS itself, it is difficult to attribute the inbound traffic with certainty; however, given the 90% cache hit rate and the size of the cached pages, the inbound traffic on that machine is higher than it should be, so it is evident that the DDOS traffic moved with the IP.

After moving one set of sites and analyzing the Original Machine, it appears that a second set of the client’s sites is also being targeted.

Varnish saves the day…. maybe

Tuesday, April 28th, 2009

We had a client with a machine where apache was being overrun… or so we thought.  Everything pointed at one set of domains, and in particular two sites with 100+ elements on the page: images, CSS, javascript and iframes composed their main pages.  Apache was handling things reasonably well, but it was immediately obvious that it could be better.

The conversion to Varnish was quite simple to do even on a live server.  Slight modifications to the Apache config file had it listening on port 81 for the set of domains in question, followed by a quick restart.  Varnish was configured to listen on port 80 on that particular IP, and some minor modifications were made to the startup.vcl file:

sub vcl_fetch {
  if (req.url ~ "\.(png|gif|jpg|swf|css|js)$") {
    set obj.ttl = 3600s;
  }
}

A one-hour cache, overriding the default of two minutes, should do a bit more good on these sites.  After an hour, it was evident that the sites did perform much more quickly, but we still had a load issue.  Some modifications to the apache config alleviated some of the other load problems after we dug further into things.

After 5 hours, we ended up with the following statistics from varnish:

0+05:18:24                                                               xxxxxx
Hitrate ratio:       10      100     1000
Hitrate avg:     0.9368   0.9231   0.9156

 62576      1.00     3.28   Client connections accepted
466684     57.88    24.43   Client requests received
411765     48.90    21.55   Cache hits
   148      0.00     0.01   Cache hits for pass
 32018      7.98     1.68   Cache misses
 54761      8.98     2.87   Backend connections success
     0      0.00     0.00   Backend connections failures
 45411      7.98     2.38   Backend connections reuses
 48598      7.98     2.54   Backend connections recycles

Varnish is doing a great job.  The site does load considerably faster, but it didn’t solve the entire problem.  It reduced the number of apache processes on that machine from 450 to 170 or so, freed up some RAM for cache, and made the server more responsive, but it probably only addressed about 50% of the issue.  The rest was cleaning up some poorly written php code, modifying a few mysql tables and adding some indexes to make things work more quickly.

After we fixed the code problems, we debated removing Varnish from their configuration.  Varnish did buy us time to fix the problem and does result in a better experience for surfers on the sites, but, after the backend changes, it is hard to tell whether it makes enough impact to keep a non-standard configuration running.  Since it is not caching the main page of the site and is only serving the static elements (the site sets an expire time on each generated page), the only real benefit is that we are removing the need for apache to serve the static elements.

While testing another application, we were able to override hardcoded expire times and force a minimal cache time on generated pages.  Even caching a generated page for two minutes could be the difference between a responsive server and a machine struggling to keep up.  Since WordPress, Joomla, Drupal and others set expire times using dates in the past, they ensure that the HTML they output is not cached.  Varnish allows us to ignore that and set our own cache time, which could save a site hit with a lot of traffic.

sub vcl_fetch {
  if (obj.ttl < 120s) {
    set obj.ttl = 120s;
  }
}

would give us a minimum two-minute cache, which would cut the requests to a dynamically generated page considerably.

It is a juggling act.  Where do you make the tradeoff, and what do you accelerate?  Too many times the solution to a website’s performance problem is to throw more hardware at it.  At some point you have to split the load across multiple servers, which adds new bottlenecks.  An application designed to run on a single machine is difficult to split across two or more machines, so many times we do what we can to keep things running on a single machine.

Varnish and Apache2

Tuesday, April 7th, 2009

One client had some issues with Apache2 and a WordPress site. WordPress isn’t really a great performer, but this client had multiple domains on the same IP, so dropping Nginx in didn’t seem like a sensible way to solve the immediate problem.

First things first, we evaluated where the issue was with WordPress and installed db-cache and wp-cache-2. We had tried wp-super-cache but had seen some issues with it in some configurations. Immediately the pageload time dropped from 41 seconds to 11 seconds. Since the machine was a quadcore with 4GB of RAM and was mostly idle, the only thing left was the 91 page elements being served. Each pageload, even with pipelining, still seemed to cause some stress. Two external javascripts and one external flash object caused some delay in rendering the page; the javascripts were actually responsible for holding up the rendering, which made the site seem even slower than it was. We made some minor modifications, but while apache2 was configured to serve things as best it could, we felt there was still some room for improvement.

Having tested Varnish in front of Apache2 before, I knew it would make an impact in this situation due to the number of elements on the page and the fact that apache did a lot of work to serve each request. Varnish and its VCL eliminate a lot of the overhead Apache has and should provide the capacity for roughly 70% better performance. For this installation, we removed the one IP in use by the problem domain from Apache, ran Varnish on that IP, and used 127.0.0.1 port 80 as the backend.

Converting a live production site is not for the fainthearted, but here are a few notes.

For Apache you’ll want to add a line like this to make sure your logs show the remote IP rather than the IP of the Varnish server:

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-A
gent}i\"" varnishcombined

Modify each of the VirtualHost configs to say:

<VirtualHost 127.0.0.1:80>

and change the line for the logfile to say:

CustomLog /var/log/apache2/domain.com-access.log varnishcombined

Add Listen directives so Apache no longer listens on port 80 of the IP address you want Varnish to answer, and comment out the default Listen 80:

#Listen 80
Listen 127.0.0.1:80
Listen 66.55.44.33:80

Configuration changes for Varnish:

backend default {
  .host = "127.0.0.1";
  .port = "80";
}

sub vcl_recv {
  # always serve static media, css and js from the cache
  if (req.url ~ "\.(jpg|jpeg|gif|png|tiff|tif|svg|swf|ico|mp3|mp4|m4a|ogg|mov|avi|wmv)$") {
    lookup;
  }

  if (req.url ~ "\.(css|js)$") {
    lookup;
  }
}

sub vcl_fetch {
  # strip Set-Cookie from non-POST responses so they can be cached
  if (req.request != "POST") {
    unset obj.http.set-cookie;
  }

  set obj.ttl = 600s;
  set obj.prefetch = -30s;
  deliver;
}

Shut down Apache, restart Apache with the new configuration, then start Varnish.

tail -f the Apache logfile for one of the domains that you have moved and go to the site. Varnish will fetch everything from the backend the first time, but successive reloads shouldn’t show requests for images, javascript or CSS. For this client we opted to hold things in cache for 10 minutes (600 seconds).
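For example (the log path and test URL will vary):

# only .php requests should keep appearing here once the cache is warm
tail -f /var/log/apache2/domain.com-access.log

# responses served by Varnish carry X-Varnish and Age headers
curl -I http://domain.com/images/header.png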

Overall, the process was rather seamless. Unlike converting a site to Nginx, we are not required to make changes to the rewrite config or worry about setting up a fastcgi server to answer .php requests. Clients will lose the ability to do some things like deny hotlinking, but Varnish runs almost invisibly. Short of the pages loading considerably quicker, the client was not aware we had made any server changes, and that is the true measure of success.

Apache, Varnish, nginx and lighttpd

Wednesday, April 1st, 2009

I’ve never been happy with Apache’s performance.  It always seemed to have problems with high-volume sites.  Even extremely tweaked configurations performed decently only up to a point, after which more hardware was required to keep going.  While I had been a huge fan of Tux, sadly, Tux doesn’t work with Linux 2.6 kernels very well.

So, the search was on.  I’ve used many webservers over the years, ranging from AOLServer to Paster to Caudium, looking for a robust, high-performance solution.  I’ve debated caching servers in front of Apache, or a separate server to handle just the static files with the web sites coded to use it, but I never really found the ultimate solution for these particular requirements.

This current problem is a php driven site with roughly 100 page elements plus the generated page itself.  The site receives quite a bit of traffic and we’ve had to tweak Apache quite a bit from our default configuration to keep the machine performing well.

Apache can be run many different ways.  Generally when a site uses php, we’ll run mod_php because it is faster.  Eaccelerator can help sometimes, though it does create a few small problems, but in general Apache-mpm-prefork runs quite well.  On sites where we’ve had issues with traffic, we’ve switched over to Apache-mpm-worker with a fastcgi php process.  This works quite well even though php scripts are slightly slower.

After considerable testing, I came up with three decent metrics to judge things by.  Almost all testing was done with ab (apachebench) running 10000 connections with keepalives and 50 concurrent sessions from a dual quad-core Xeon machine to a gigE-connected Core 2 Quad machine on the same switch.  On the first IP was bare apache, the second IP had lighttpd, the third IP ran nginx and the fourth IP ran Varnish in front of Apache.  Everything was set up so that no daemon restarts would be needed, and each test was run twice with the second result, generally the higher of the two, being used.  The linux kernel does some caching, and we’re after the performance after the kernel has done its caching, apache has forked its processes and hasn’t killed off the children, and so on.

First impressions were that Apache-mpm-prefork handled php exceedingly well but has never had great performance with static files.  This is why Tux prevailed for us: Apache handled what it did best and Tux handled what it did best.  Regrettably, Tux didn’t keep up with the 2.6 kernel and development ceased.  With new hardware, the 2.6 kernel and userland access to sendfile, large file transfers should be almost the same for all of the servers, so startup latency on the tiny files was what really seemed to hurt Apache.  Apache-mpm-worker with php running as fastcgi has always been a fallback for us to gain a little more serving capacity, as most sites have a much higher ratio of static files to dynamic pages.

But Apache seemed to have problems with the type of traffic our clients were putting through, and we felt there had to be a better way.  I’ve read page after page of people complaining that their Drupal installation could only take 50 users until they upgraded to nginx or lighttpd, and now their site doesn’t run into swap issues.  If your server is having problems with 50 simultaneous users under apache, you have serious problems with your setup.  It is not uncommon for us to push a P4/3.0ghz with 2GB of RAM to 80mb/sec of traffic with MySQL running 1000 queries per second, where the apache logfile for one domain reaches 6GB/day, not including the other 30 domains configured on the machine.  VBulletin will easily run 350 online users and 250 guests on the same hardware without any difficulties, and the same goes for Joomla, Drupal and the other CMS products out there.  If you can’t run 50 simultaneous users with any of those products, dig into the configs FIRST so that you are comparing a tuned configuration to a tuned configuration.  As an example, the MySQL status from one such machine:

Uptime: 593254  Threads: 571  Questions: 609585858  Slow queries: 1680967  Opens: 27182  Flush tables: 1  Open tables: 2337  Queries per second avg: 1027.529


Based on all of my reading, I expected Varnish in front of Apache2 to be the fastest, followed by nginx, lighttpd and bare Apache.  Lighttpd has some design issues that I believed would put it behind nginx, and I really expected Varnish to do well.  For this client we needed FLV streaming, so I knew I would be running nginx or lighttpd as a backend for the .flv files and contemplated running Varnish in front of whichever of those performed best.  Splitting things so that the .flv files were served from a different domain was no problem for this client, so we weren’t locked into a solution we couldn’t change later.

The testing methodology was based on numerous runs of ab where I tested and tweaked each setup.  I am reasonably sure that someone with vast knowledge of Varnish, nginx or lighttpd would not be able to substantially change the results.  Picking out the three or four valid pieces of information from all of the testing to give me a generalized result was difficult.

The first test was the raw speed on a small 6.3kb file with keepalives enabled, which made a good starting point.  The second test was a page that called phpinfo(); not an exceedingly difficult test, but it at least starts the php engine, processes a page and returns the result.  The third test was downloading a 21MB flv file.  All of the tests were run with 10000 iterations and 50 concurrent threads, except the 21MB flv file, which ran 100 iterations and 10 concurrent threads due to the time it took.
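The invocations looked roughly like this (hostnames and filenames are placeholders):

# small static file and phpinfo() page: 10000 requests, 50 concurrent, keepalives
ab -k -n 10000 -c 50 http://66.55.44.33/small.html
ab -k -n 10000 -c 50 http://66.55.44.33/info.php

# 21MB .flv download: 100 requests, 10 concurrent
ab -n 100 -c 10 http://66.55.44.33/video.flv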

Server               Small file req/sec   phpinfo() req/sec   .flv MB/sec   Min/max time to serve .flv   Total ab time for .flv test
Apache-mpm-prefork   1000                 164                 11.5          10-26 seconds                182 seconds
Apache-mpm-worker    1042                 132                 11.5          11-25 seconds                181 seconds
Lighttpd             1333                 181                 11.4          13-23 seconds                190 seconds
nginx                1800                 195                 11.5          14-24 seconds                187 seconds
Varnish              1701                 198                 11.3          18-30 seconds                188 seconds

Granted, I expected more from Varnish, though its caching nature does shine through.  It is considerably more powerful than nginx due to some of its internal features for load balancing, multiple backends, etc.  However, based on the results above, I have to believe that in this case nginx wins.

There are a number of things about the nginx documentation that were confusing.  The first was that the examples used an inet socket rather than a local Unix socket for communication with the php-cgi process; switching to a local socket alone bumped php up almost 30 transactions per second.  The documentation for nginx is sometimes very terse, and it required a bit more time to get configured correctly.  While I do have both php and perl cgi working with nginx natively, some perl cgi scripts have minor issues which I’m still working out.
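The difference is just the fastcgi_pass target; the socket path below matches the Joomla config earlier in this archive:

# documentation example (TCP):
#   fastcgi_pass 127.0.0.1:9000;
# local Unix socket used instead:
fastcgi_pass unix:/tmp/php-fastcgi.socket;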

Lighttpd performed about as well as I expected.  Some of its backend design decisions made me believe it wouldn’t be the top performer, and it is also older and more mature than Nginx and Varnish, which use today’s tricks to accomplish their magic.  File transfer speed is going to be roughly the same everywhere because the Linux kernel exposes APIs (sendfile) that allow a userspace application to ask the kernel to handle the transfer, and every application tested takes advantage of this.

Given the choice of Varnish or Nginx for a project that didn’t require .flv streaming, I might consider Varnish.  Lighttpd did have one very interesting module that prevented hotlinking of files in a much different manner than normal — I’ll be testing that for another application.  If you are used to Apache mod_rewrite rules, Nginx and Lighttpd structure theirs quite differently, though they work in almost the same manner with some minor syntax changes.  Varnish runs as a cache in front of your site, so everything works the same way it does under Apache: Varnish merely connects to your Apache backend and caches what it can, and its configuration language allows considerable control over the process.

Short of a few minor configuration tweaks, this particular client will be getting nginx.

Overall, I don’t believe you can take a one-size-fits-all approach to webservers.  Every client’s requirements are different, and they don’t all fit into the same category.  If you run your own web server, you can make choices to make sure your site runs as well as it can.  Judging from the number of pages showing stellar performance gains from switching from Apache to something else, if most of those writers spent the same time debugging their apache installation as they did migrating to a new web server, I would imagine 90% of them would find Apache meets their needs just fine.

The default out of the box configuration of MySQL and Apache in most Linux distributions leaves a lot to be desired.  To compare those configurations with a more sane default supplied by the software developers of competing products doesn’t really give a good comparison.  I use Debian, and their default configurations for Apache, MySQL and a number of other applications are terrible for any sort of production use.  Even Redhat has some fairly poor default configurations for many of the applications you would use to serve your website.  Do yourself a favor and do a little performance tuning with your current setup before you start making changes.  You might find the time invested well worth it.
