Posts Tagged ‘Varnish’

W3 Total Cache and Varnish

Thursday, July 21st, 2011

Last week I got called into a firestorm to fix a set of machines that were having problems. As Varnish was in the mix, the first thing I noticed was that the hit rate was extremely low, because Varnish’s VCL wasn’t configured well for WordPress. Since WordPress uses a lot of cookies and Varnish, by default, passes any request with a cookie to the backend, we have to know which cookies we can ignore so that we can get the cache hit rate up.

Obviously, static assets like javascript, css and images generally don’t need cookies, so those make a good first target. Since some ad networks set their own cookies on the domain, we also need to know which of those cookies can be ignored. However, to make a site resilient, we have to get a little more aggressive and tell Varnish to cache things against its judgement. When we do this, we don’t want surfers to see stale content, so we need to purge cached objects from Varnish when they change to keep the site interactive.

Caching is easy, purging is hard

This particular installation used W3 Total Cache, a plugin that does page caching, javascript/css minification and combining, and handles a number of other features. I was unable to find any suggested VCL, and several posts on the forums show a lack of interest in supporting Varnish.

In most cases, once we determine what we’re caching, we need to figure out what to purge. When a surfer posts a comment, we need to clear the cached representation of that post, the RSS feed and the front page of the site. This allows any post counters to be updated and keeps the RSS feed accurate.
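A minimal sketch of that purge list, written in Python for brevity. The `/feed/` path is an assumption (the usual WordPress feed location), and `purge_targets` is a hypothetical helper, not something W3TC provides.

```python
from urllib.parse import urljoin


def purge_targets(site_root, post_path, feed_path="/feed/"):
    """Return the URLs whose cached copies go stale when a comment is posted."""
    return [
        urljoin(site_root, post_path),  # the post that received the comment
        urljoin(site_root, feed_path),  # the RSS feed (assumed location)
        urljoin(site_root, "/"),        # the front page
    ]
```

Each URL in the returned list would then be handed to whatever sends the PURGE request to Varnish.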

W3TC includes the ability to purge, but it only works in a single-server setting. If you put a domain name in the config box, it should work fine. If you put in a series of IP addresses, either your VCL needs to override the hostname or you need to apply the following patch. There are likely to be bugs, so try this at your own risk.

If you aren’t using the Javascript/CSS Minification and combining or some of the CDN features that W3TC provides, then I would suggest WordPress-Varnish which is maintained by some people very close to the Varnish team.

I’ve maintained the original line of code from W3TC commented above any changes for reference.

--- w3-total-cache/inc/define.php	2011-06-21 23:22:54.000000000 -0400
+++ w3-total-cache-varnish/inc/define.php	2011-07-21 16:10:39.270111723 -0400
@@ -1406,11 +1406,15 @@
  * @param boolean $check_status
  * @return string
  */
-function w3_http_request($method, $url, $data = '', $auth = '', $check_status = true) {
+#cd34, 20110721, added $server IP for PURGE support
+# function w3_http_request($method, $url, $data = '', $auth = '', $check_status = true) {
+function w3_http_request($method, $url, $data = '', $auth = '', $check_status = true, $server = '') {
     $status = 0;
     $method = strtoupper($method);
 
-    if (function_exists('curl_init')) {
+#cd34, 20110721, don't use CURL for purge
+#    if (function_exists('curl_init')) {
+    if ( (function_exists('curl_init')) && ($method != 'PURGE') ) {
         $ch = curl_init();
 
         curl_setopt($ch, CURLOPT_URL, $url);
@@ -1474,7 +1478,13 @@
             $errno = null;
             $errstr = null;
 
-            $fp = @fsockopen($host, $port, $errno, $errstr, 10);
+#cd34, 20110721, if method=PURGE, connect to $server, not $host
+#            $fp = @fsockopen($host, $port, $errno, $errstr, 10);
+            if ( ($method == 'PURGE') && ($server != '') ) {
+                $fp = @fsockopen($server, $port, $errno, $errstr, 10);
+            } else {
+                $fp = @fsockopen($host, $port, $errno, $errstr, 10);
+            }
 
             if (!$fp) {
                 return false;
@@ -1543,8 +1553,9 @@
  * @param bool $check_status
  * @return string
  */
-function w3_http_purge($url, $auth = '', $check_status = true) {
-    return w3_http_request('PURGE', $url, null, $auth, $check_status);
+#cd34, 20110721, added server IP
+function w3_http_purge($url, $auth = '', $check_status = true, $server = '') {
+    return w3_http_request('PURGE', $url, null, $auth, $check_status, $server);
 }
 
 /**
diff -Naur w3-total-cache/lib/W3/PgCache.php w3-total-cache-varnish/lib/W3/PgCache.php
--- w3-total-cache/lib/W3/PgCache.php	2011-06-21 23:22:54.000000000 -0400
+++ w3-total-cache-varnish/lib/W3/PgCache.php	2011-07-21 16:04:07.247499682 -0400
@@ -693,7 +693,9 @@
                     $varnish =& W3_Varnish::instance();
 
                     foreach ($uris as $uri) {
-                        $varnish->purge($uri);
+#cd34, 20110721 Added $domain_url to build purge hostname
+#                        $varnish->purge($uri);
+                        $varnish->purge($domain_url, $uri);
                     }
                 }
             }
diff -Naur w3-total-cache/lib/W3/Varnish.php w3-total-cache-varnish/lib/W3/Varnish.php
--- w3-total-cache/lib/W3/Varnish.php	2011-06-21 23:22:54.000000000 -0400
+++ w3-total-cache-varnish/lib/W3/Varnish.php	2011-07-21 16:04:52.836919164 -0400
@@ -70,7 +70,7 @@
      * @param string $uri
      * @return boolean
      */
-    function purge($uri) {
+    function purge($domain, $uri) {
         @set_time_limit($this->_timeout);
 
         if (strpos($uri, '/') !== 0) {
@@ -78,9 +78,11 @@
         }
 
         foreach ((array) $this->_servers as $server) {
-            $url = sprintf('http://%s%s', $server, $uri);
+#cd34, 20110721, Replaced $server with $domain
+#            $url = sprintf('http://%s%s', $server, $uri);
+            $url = sprintf('%s%s', $domain, $uri);
 
-            $response = w3_http_purge($url, '', true);
+            $response = w3_http_purge($url, '', true, $server);
 
             if ($this->_debug) {
                 $this->_log($url, ($response !== false ? 'OK' : 'Bad response code.'));
diff -Naur w3-total-cache/w3-total-cache.php w3-total-cache-varnish/w3-total-cache.php
--- w3-total-cache/w3-total-cache.php	2011-06-21 23:22:54.000000000 -0400
+++ w3-total-cache-varnish/w3-total-cache.php	2011-07-21 15:56:53.275922099 -0400
@@ -2,7 +2,7 @@
 /*
 Plugin Name: W3 Total Cache
 Description: The highest rated and most complete WordPress performance plugin. Dramatically improve the speed and user experience of your site. Add browser, page, object and database caching as well as minify and content delivery network (CDN) to WordPress.
-Version: 0.9.2.3
+Version: 0.9.2.3.v
 Plugin URI: http://www.w3-edge.com/wordpress-plugins/w3-total-cache/
 Author: Frederick Townes
 Author URI: http://www.linkedin.com/in/w3edge
@@ -47,4 +47,4 @@
     require_once W3TC_LIB_W3_DIR . '/Plugin/TotalCache.php';
     $w3_plugin_totalcache = & W3_Plugin_TotalCache::instance();
     $w3_plugin_totalcache->run();
-}
\ No newline at end of file
+} 
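The idea behind the patch is to connect to each Varnish server by IP while still sending the site's hostname. Here is a rough sketch of that idea outside of PHP, in Python. The function names are hypothetical, and treating a 200 response as success is an assumption that depends on how your VCL answers PURGE.

```python
import http.client
from urllib.parse import urlsplit


def build_purges(domain_url, uri, servers):
    """Pair every Varnish server IP with the Host header and path to purge."""
    host = urlsplit(domain_url).netloc
    if not uri.startswith("/"):
        uri = "/" + uri
    return [(server, host, uri) for server in servers]


def send_purge(server, host, uri, port=80, timeout=10):
    """Connect to one Varnish server by IP and ask it to purge host + uri."""
    conn = http.client.HTTPConnection(server, port, timeout=timeout)
    try:
        # The TCP connection goes to the server IP; the Host header carries
        # the site's domain so Varnish matches the right cached object.
        conn.request("PURGE", uri, headers={"Host": host})
        return conn.getresponse().status == 200  # assumed success code
    finally:
        conn.close()
```

A caller would loop over `build_purges(...)` and call `send_purge` once per tuple, so a failure on one server does not stop the purge on the others.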

Gracefully Degrading Site with Varnish and High Load

Saturday, July 16th, 2011

If you run Varnish, you might want to gracefully degrade your site when traffic comes unexpectedly. There are other solutions listed on the net which maintain a Three State Throttle, but, it seemed like this could be done easily within Varnish without needing too many external dependencies.

The first challenge was to figure out how we wanted to handle state. Our backend director is set up with a ‘level1’ backend which doesn’t do any health checks. We need at least one node to never fail the health check since the ‘level2’ and ‘level3’ backends will go offline to signify to Varnish that we need to take action. While this scenario assumes that the failure modes cascade, i.e. level2 fails and then, if load continues to increase, level3 fails, there is nothing preventing you from having separate failure modes and different VCL for those conditions.

You could have VCL that replaced the front page of your site with ‘top news’ during an event which links to your secondary page. You can rewrite your VCL to handle almost any condition and you don’t need to worry about doing a VCL load to update the configuration.

While maintaining three configurations is easier, there are a few extra points of failure added in that system. If the load on the machine gets too high and the cron job or daemon that is supposed to update the VCL doesn’t run quickly enough or has issues with network congestion talking with Varnish, your site could run in a degraded mode much longer than needed. With this solution, in the event that there is too much network congestion or too much load for the backend to respond, Varnish automatically considers that a level3 failure and enacts those rules – without the backend needing to acknowledge the problem.

The basics

First, we set up the script that Varnish will probe. The script doesn’t need to be PHP; it only needs to respond with a 404 error to signal to Varnish that the probe request has failed.

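<?php
// The requested level is passed as the bare query string, e.g. /load.php?2
$level = $_SERVER['QUERY_STRING'];
// /proc/loadavg starts with the 1-minute load average; the float cast keeps only that number
$load = (float) file_get_contents('/proc/loadavg');
if ( ($level == 2) && ($load > 10) ) {
  header("HTTP/1.0 404 Get the bilge pumps working!");
}
if ( ($level == 3) && ($load > 20) ) {
  header("HTTP/1.0 404 All hands abandon ship");
}
?>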
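Since the probe script doesn’t have to be PHP, here is a sketch of the same decision in Python. The names `THRESHOLDS`, `probe_status` and `current_load` are made up for illustration, and wiring this into a web server is left out.

```python
import os

# Load limits per level, matching the PHP script above.
THRESHOLDS = {"2": 10.0, "3": 20.0}


def probe_status(level, load):
    """Return the HTTP status the probe should answer with."""
    limit = THRESHOLDS.get(level)
    if limit is not None and load > limit:
        return 404  # tells Varnish this level has failed
    return 200


def current_load():
    """1-minute load average (Unix only)."""
    return os.getloadavg()[0]
```

A request handler would call `probe_status(query_string, current_load())` and send back the returned status code.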

Second, we need to have our backend pool configured to call our probe script:

backend level1 {
  .host = "66.55.44.216";
  .port = "80";
}
backend level2 {
  .host = "66.55.44.216";
  .port = "80";
  .probe = {
    .url = "/load.php?2";
    .timeout = 0.3 s;
    .window = 3;
    .threshold = 3;
    .initial = 3;
  }
}
backend level3 {
  .host = "66.55.44.216";
  .port = "80";
  .probe = {
    .url = "/load.php?3";
    .timeout = 0.3 s;
    .window = 3;
    .threshold = 3;
    .initial = 3;
  }
}

director crisis random {
  {
# base that should always respond so we don't get an Error 503
    .backend = level1;
    .weight = 1;
  }
  {
    .backend = level2;
    .weight = 1;
  }
  {
    .backend = level3;
    .weight = 1;
  }
}

Since both of our probes go to the same backend, it doesn’t matter which director we use or what weight we assign. We just need to have one backend configured that won’t fail the probe along with our level2 and level3 probes. In this example, when the load on the server is greater than 10, it triggers a level2 failure. If the load is greater than 20, it triggers a level3 failure.
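To see what the probe settings mean in practice, here is a simplified Python model of my understanding of `.window`, `.threshold` and `.initial`: Varnish looks at the last `.window` probes and treats the backend as healthy when at least `.threshold` of them succeeded. Real Varnish may differ in the details, so treat this as a sketch.

```python
from collections import deque


class ProbeWindow:
    """Simplified model of a Varnish health probe's sliding window."""

    def __init__(self, window=3, threshold=3, initial=3):
        self.threshold = threshold
        # 'initial' probes are assumed good at startup.
        self.results = deque([True] * min(initial, window), maxlen=window)

    def record(self, ok):
        """Store one probe result; the oldest result falls off the window."""
        self.results.append(ok)

    @property
    def healthy(self):
        return sum(self.results) >= self.threshold
```

With the values used above (3/3/3), a single failed probe marks the level as failed, and it takes three good probes in a row before the level is considered healthy again.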

In this case, when the backend probe request fails, we just rewrite the URL. Any VCL can be added, but, you will have some duplication. Since the VCL is compiled into the Varnish server, it should have negligible performance impact.

sub vcl_recv {
  set req.backend = level2;
  if (!req.backend.healthy) {
    unset req.http.cookie;
    set req.url = "/level2.php";
  }
  set req.backend = level3;
  if (!req.backend.healthy) {
    unset req.http.cookie;
    set req.url = "/level3.php";
  }
  set req.backend = crisis;
}

In this case, when we have a level2 failure, we change any URL requested to serve the file /level2.php. In vcl_fetch, we make a few changes to the object ttl so that we prevent the backend from getting hit too hard. We also change the server name so that we can look at the headers to see what level our server is currently running at. In Firefox, there is an extension called Header Spy which will allow you to keep track of a header. I’ll often track X-Cache, which I set to HIT or MISS, to make sure Varnish is caching, but you could also track Server and be aware of whether things are running properly.

sub vcl_fetch {
  set beresp.ttl = 0s;

  set req.backend = level2;
  if (!req.backend.healthy) {
    set beresp.ttl = 5m;
    unset beresp.http.set-cookie;
    set beresp.http.Server = "(Level 2 - Warning)";
  }
  set req.backend = level3;
  if (!req.backend.healthy) {
    set beresp.ttl = 30m;
    unset beresp.http.set-cookie;
    set beresp.http.Server = "(Level 3 - Critical)";
  }
}

At this point, we’ve got a system that degrades gracefully, even if the backend cannot respond or update Varnish’s VCL and it self-heals based on the load checks. Ideally you’ll also want to put Grace timers and possibly run Saint mode to handle significant failures, but, this should help your system protect itself from meltdown.
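The decisions made in vcl_recv and vcl_fetch can be restated as one small function. This is a Python sketch for reasoning about the rules, not something Varnish runs. `degrade` is a hypothetical name and `None` stands for "leave the Server header alone".

```python
def degrade(url, level2_healthy, level3_healthy):
    """Return (url_to_fetch, ttl_seconds, server_header) for a request."""
    ttl, server = 0, None
    if not level2_healthy:
        url, ttl, server = "/level2.php", 5 * 60, "(Level 2 - Warning)"
    # Checked last, so a level3 failure overrides the level2 settings.
    if not level3_healthy:
        url, ttl, server = "/level3.php", 30 * 60, "(Level 3 - Critical)"
    return url, ttl, server
```

Because the level3 check runs last, it wins whenever both probes are failing, which is the same ordering the VCL relies on.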

Complete VCL

backend level1 {
  .host = "66.55.44.216";
  .port = "80";
}
backend level2 {
  .host = "66.55.44.216";
  .port = "80";
  .probe = {
    .url = "/load.php?2";
    .timeout = 0.3 s;
    .window = 3;
    .threshold = 3;
    .initial = 3;
  }
}
backend level3 {
  .host = "66.55.44.216";
  .port = "80";
  .probe = {
    .url = "/load.php?3";
    .timeout = 0.3 s;
    .window = 3;
    .threshold = 3;
    .initial = 3;
  }
}

director crisis random {
  {
# base that should always respond so we don't get an Error 503
    .backend = level1;
    .weight = 1;
  }
  {
    .backend = level2;
    .weight = 1;
  }
  {
    .backend = level3;
    .weight = 1;
  }
}

sub vcl_recv {
  set req.backend = level2;
  if (!req.backend.healthy) {
    unset req.http.cookie;
    set req.url = "/level2.php";
  }
  set req.backend = level3;
  if (!req.backend.healthy) {
    unset req.http.cookie;
    set req.url = "/level3.php";
  }
  set req.backend = crisis;
}

sub vcl_fetch {
  set beresp.ttl = 0s;

  set req.backend = level2;
  if (!req.backend.healthy) {
    set beresp.ttl = 5m;
    unset beresp.http.set-cookie;
    set beresp.http.Server = "(Level 2 - Warning)";
  }
  set req.backend = level3;
  if (!req.backend.healthy) {
    set beresp.ttl = 30m;
    unset beresp.http.set-cookie;
    set beresp.http.Server = "(Level 3 - Critical)";
  }

  if (req.url ~ "\.(gif|jpe?g|png|swf|css|js|flv|mp3|mp4|pdf|ico)(\?.*|)$") {
    set beresp.ttl = 365d;
  }
}

Updated WordPress VCL – still not complete, but, closer

Saturday, July 16th, 2011

Worked with a new client this week and needed to get the VCL working for their installation. They were running W3TC, but, this VCL should work for people running WP-Varnish or any plugin that allows Purging. This VCL is for Varnish 2.x.

There are still some tweaks, but, this appears to be working quite well.

backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

acl purge {
    "10.0.1.100";
    "10.0.1.101";
    "10.0.1.102";
    "10.0.1.103";
    "10.0.1.104";
}

sub vcl_recv {
  if (req.request == "PURGE") {
    if (!client.ip ~ purge) {
      error 405 "Not allowed.";
    }
    return(lookup);
  }

  if (req.http.Accept-Encoding) {
#revisit this list
    if (req.url ~ "\.(gif|jpg|jpeg|swf|flv|mp3|mp4|pdf|ico|png|gz|tgz|bz2)(\?.*|)$") {
      remove req.http.Accept-Encoding;
    } elsif (req.http.Accept-Encoding ~ "gzip") {
      set req.http.Accept-Encoding = "gzip";
    } elsif (req.http.Accept-Encoding ~ "deflate") {
      set req.http.Accept-Encoding = "deflate";
    } else {
      remove req.http.Accept-Encoding;
    }
  }
  if (req.url ~ "\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$") {
    unset req.http.cookie;
    set req.url = regsub(req.url, "\?.*$", "");
  }
  if (req.http.cookie) {
    if (req.http.cookie ~ "(wordpress_|wp-settings-)") {
      return(pass);
    } else {
      unset req.http.cookie;
    }
  }
}

sub vcl_fetch {
# this conditional can probably be left out for most installations
# as it can negatively impact sites without purge support. High
# traffic sites might leave it, but, it will remove the WordPress
# 'bar' at the top and you won't have the post 'edit' functions onscreen.
  if ( (!(req.url ~ "(wp-(login|admin)|login)")) || (req.request == "GET") ) {
    unset beresp.http.set-cookie;
# If you're not running purge support with a plugin, remove
# this line.
    set beresp.ttl = 5m;
  }
  if (req.url ~ "\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$") {
    set beresp.ttl = 365d;
  }
}

sub vcl_deliver {
# multi-server webfarm? set a variable here so you can check
# the headers to see which frontend served the request
#   set resp.http.X-Server = "server-01";
   if (obj.hits > 0) {
     set resp.http.X-Cache = "HIT";
   } else {
     set resp.http.X-Cache = "MISS";
   }
}

sub vcl_hit {
  if (req.request == "PURGE") {
    set obj.ttl = 0s;
    error 200 "OK";
  }
}

sub vcl_miss {
  if (req.request == "PURGE") {
    error 404 "Not cached";
  }
}
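The cookie handling in vcl_recv is the part that most affects the hit rate, so here is the same set of rules as a Python sketch that can be tried against sample URLs. `classify` is a hypothetical helper; it ignores the request method and the Accept-Encoding handling, and "lookup" stands for "falls through to Varnish's default handling without a cookie".

```python
import re

STATIC = re.compile(r"\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$")
LOGGED_IN = re.compile(r"(wordpress_|wp-settings-)")


def classify(url, cookie=None):
    """Return (action, url, cookie) mirroring the cookie rules in vcl_recv."""
    if STATIC.search(url):
        # Static asset: drop the cookie and any cache-breaking query string.
        return "lookup", re.sub(r"\?.*$", "", url), None
    if cookie and LOGGED_IN.search(cookie):
        # Logged-in WordPress user: send the request to the backend.
        return "pass", url, cookie
    # Anything else: ignore the cookie so the page can be cached.
    return "lookup", url, None
```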

When to Cache, What to Cache, How to Cache

Tuesday, June 21st, 2011

This post is a version of the slideshow presentation I did at Hack and Tell in Fort Lauderdale, Florida at The Collide Factory on Saturday, April 2, 2011. These are 5 minute talks where each slide auto-advances after fifteen seconds which limits the amount of detail that can be conveyed.

A brief introduction

What makes a page load quickly? While we can look at various metrics, there are quite a few things that impact pageloads. Even when the page is served quickly, the design of the page can often affect the way it is rendered in the browser, which can make a site appear sluggish. However, we’re going to focus on the mechanics of what it takes to get a page to serve quickly.

The Golden Rule – do as few calculations as possible to hand content to your surfer.

But my site is dynamic!

Do you really need to calculate the last ten posts entered on your blog every time someone visits the page? Surely you could cache that and purge the cache when a new post is entered. When someone adds a new comment, purge the cache and let it be recalculated once.
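As a rough illustration in Python, the "last ten posts" list can be computed once and thrown away only when something changes. `RecentPosts` and `load_from_db` are hypothetical names; the loader is whatever function runs the real query.

```python
class RecentPosts:
    """Cache the recent-posts list until it is explicitly purged."""

    def __init__(self, load_from_db, limit=10):
        self._load = load_from_db
        self._limit = limit
        self._cached = None

    def get(self):
        # Only hit the database when nothing is cached.
        if self._cached is None:
            self._cached = self._load(self._limit)
        return self._cached

    def purge(self):
        """Call this when a post or comment is added."""
        self._cached = None
```

Every page view calls `get()`, but the query only runs on the first view after a `purge()`.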

But my site has user personalization!

Can that personalization be broken into its own section of the webpage? Or, is it created by a cacheable function within your application? Even if you don’t support fragment caching on the edge, you can emulate that by caching your expensive SQL queries or even portions of your page.

Even writing a generated file to a static file and allowing your webserver to serve that static file provides an enormous boost. This is what most of the caching plugins for WordPress do. However, they are page caching, not fragment caching, which means that the two most expensive queries that WordPress executes, Category list and Tag Cloud, are generated each time a new page is hit until that page is cached.

One of the problems with high performance sites is the never-ending quest for that Time to First Byte. Each load balancer or proxy in front adds some latency. It also means a page needs to be pre-constructed before it is served, or, you need to do a little trickery. This eliminates being able to do any dynamic processing on the page in order to hand a response back as quickly as possible unless you’ve got plenty of spare computing horsepower.

With this, we’re left with a few options to have a dynamic site that has the performance of a statically generated site.

Amazon was one of the first to embrace the Page and Block method by using Mason, a mod_perl based framework. Each of the blocks on the page was generated ahead of time, and only the personalized blocks were generated ‘late’. This allowed the frontend to assemble these pieces, do minimal work to display the personal recommendations and present the page quickly.

Google took a different approach by having an immense amount of computing horsepower behind their results. Google’s method probably isn’t cost effective for most sites on the Internet.

Facebook developed bigpipe which generates pages and then dynamically loads portions of the page into the DOM units. This makes the page load quickly, but in stages. The viewer sees the rough page quickly as the rest of the page fills in.

The Presentation begins here

Primary Goal

Fast Pageloads – We want the page to load quickly and render quickly so that the websurfer doesn’t abandon the site.

Increased Scalability – Once we get more traffic, we want the site to be able to scale and provide websurfers with a consistent, fast experience while the site grows.

Metrics We Use

Time to First Byte – This is a measure of how quickly the site responds to an incoming request and starts sending the first byte of data. Some sites have to take time to analyze the request, build the page, etc before sending any data. This lag results in the browser sitting there with a blank screen.

Time to Completion – We want the entire page to load quickly enough that the web surfer doesn’t abandon. While we can do some tricky things with chunked encoding to fool websurfers into thinking our page loads more quickly than it really does, for 95% of the sites, this is a decent metric.

Number of Requests – The total number of requests for a page is a good indicator of overall performance. Since most browsers will only request a handful of static assets from a page per hostname, we can use a CDN, embed images in CSS or use Image Sprites to reduce the number of requests.

Why Cache?

Expecting Traffic

When we have an advertising campaign or holiday promotion going on, we don’t know what our expected traffic level might be, so, we need to prepare by having the caching in place.

Receiving Traffic

If we receive unexpected publicity, or our site is listed somewhere, we might cache to allow the existing system to survive a flood of traffic.

Fighting DDOS

When fighting a Distributed Denial of Service Attack, we might use caching to keep the backend servers from getting overloaded.

Expecting Traffic

There are several types of caching we can do when we expect to receive traffic.

* Page Cache – Varnish/Squid/Nginx provide page caching. A static copy of the rendered page is held and updated from time to time either by the content expiring or being purged from the cache.
* Query Cache – MySQL includes a query cache that can help on repetitive queries.
* Wrap Queries with functions and cache – We can take our queries and write our own caching using a key/value store, avoiding us having to hit the database backend.
* Wrap functions with caching – In Python, we can use Beaker to wrap a decorator around a function which does the caching magic for us. Other languages have similar facilities.
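To make the last two items concrete, here is a small sketch of wrapping a function with a time-limited cache using only the Python standard library. This is not Beaker's API; `cached` is a stand-in decorator written for this example, and the in-process dict stands in for a key/value store.

```python
import functools
import time


def cached(ttl_seconds):
    """Cache a function's results per argument tuple for ttl_seconds."""
    def decorator(func):
        store = {}  # maps args -> (time stored, value)

        @functools.wraps(func)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < ttl_seconds:
                return hit[1]
            value = func(*args)
            store[args] = (now, value)
            return value

        wrapper.invalidate = store.clear  # purge everything on demand
        return wrapper
    return decorator
```

A query function decorated with `@cached(300)` would hit the database at most once every five minutes per distinct set of arguments, or sooner if `invalidate()` is called.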

Receiving Traffic

* Page Caching – When we’re receiving traffic, the easiest thing to do is to put a page cache in place to save the backend/database servers from getting overrun. We lose some of the dynamic aspects, but, the site remains online.

* Fragment Caching – With fragment caching, we can break the page into zones that have separate expiration times or can be purged separately. This can give us a little more control over how interactive and dynamic the site appears while it is receiving traffic.

DDOS Handling

* Slow Client/Vampire Attacks – Certain DDOS attacks cause problems with some webserver software. Recent versions of Apache and most event/poll driven webservers have protection against this.
* Massive Traffic – With some infrastructures, we’re able to filter out the traffic ahead of time – before it hits the backend.

Caching Easy, Purging Hard

Caching is scalable. We can just add more caching servers to the pool and keep scaling to handle increased load. The problem we run into is keeping a site interactive and dynamic as content needs to be updated. At this point, purging/invalidating cached pages or regions requires communication with each cache.

Page Caching

Some of the caching servers that work well are Varnish, Squid and Nginx. Each of these allows you to do page caching, specify expire times, and handle most requests without having to talk to the backend servers.

Fragment Caching

Edge Side Includes or a Page and Block construction allow you to cache pieces of the page, as shown in the following diagram. With this, we can individually expire pieces of the page and allow our front end cache, Varnish, to reassemble the pieces to serve to the websurfer.

http://www.trygve-lie.com/blog/entry/esi_explained_simple

Cache Methods

* Hardware – Hard drives contain caches as do many controller cards.
* SQL Cache – adding memory to keep the indexes in memory or enabling the SQL query cache can help.
* Redis/Memcached – Using a key/value store can keep requests from hitting rotational media (disks)
* Beaker/Functional Caching – Either method can use a key/value store, preferably using RAM rather than disk, to prevent requests from having to hit the database backend.
* Edge/Frontend Caching – We can deploy a cache on the border to reduce the number of requests to the backend.

OS Buffering/Caching

* Hardware Caching on drive – Most hard drives today have caches – finding one with a large cache can help.
* Caching Controller – If you have a large ‘hot set’ of data that changes, using a caching controller can allow you to put a gigabyte or more of RAM in front of the disk to avoid having to hit the disk for requests. Make sure you get the battery backup card just in case your machine loses power – those disk writes are often reported as completed before they are physically written to the disk.
* Linux/FreeBSD/Solaris/Windows all use RAM for caching

MySQL Query Cache

The MySQL Query cache is simple yet effective. It isn’t smart and doesn’t cache based on query plan, but, if your code base executes queries where the arguments are in the same order, it can be quite a plus. If you are dynamically creating queries, assembling the queries to try and keep the conditions in the same order will help.
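One way to keep dynamically built queries identical is to always emit the conditions in a fixed order. A minimal Python sketch, assuming the table and column names come from your own code (never from user input); `build_query` is a hypothetical helper.

```python
def build_query(table, conditions):
    """Build a SELECT whose WHERE clause always lists columns in sorted order."""
    columns = sorted(conditions)
    where = " AND ".join(f"{col} = %s" for col in columns)
    params = [conditions[col] for col in columns]
    return f"SELECT * FROM {table} WHERE {where}", params
```

Two callers that pass the same conditions in a different order now produce the same query text, which gives the query cache a chance to match them.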

Redis/Memcached

* Key Value Store – you can store frequently requested data in memory.
* Nginx can read rendered pages right from Memcached.

Both methods use RAM rather than hitting slower disk media.

Beaker/Functional Caching

With Python, we can use the Beaker decorator to specify caching. This insulates us from having to write our own handler.

Edge/Front End Caching

* Define blocks that can be cached, portions of the templates.
* Page Caching
* JSON (CouchDB) – Even JSON responses can run behind Varnish.
* Bigpipe – Cache the page, and allow javascript to assemble the page.

Content Delivery Network (CDN)

When possible, use a Content Delivery Network to store static assets off net. This adds a separate hostname and sometimes a separate domain name which allows most browsers to fetch more resources at the same time. Preferably you want to use a separate domain name that won’t have any cookies set – which cuts down on the size of the request object sent from the browser to the server with the static assets.

Bigpipe

Facebook uses a technology called Bigpipe which caches the page template and the javascript required to build the page. Once that has loaded, Javascript fetches the data and builds the page. Some of the json data requested is also cached, leading to a very compact page being loaded and built while you’re viewing the page.

Google’s Answer

Google has spent many years building a tremendous distributed computer. When you request a site, their frontend servers use a deadline scheduler and request blocks from their advertising, personalization, search results and other page blocks. The page is then assembled and returned to the web surfer. If any block doesn’t complete quickly enough, it is left out from assembly – which motivates the advertising department to make sure their block renders quickly.
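The "leave out whatever misses the deadline" idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Google implements it; `assemble_page` and the block names are made up.

```python
import concurrent.futures


def assemble_page(blocks, deadline_seconds):
    """Render blocks in parallel; keep only those finished by the deadline."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=len(blocks))
    futures = {name: executor.submit(render) for name, render in blocks.items()}
    done, _late = concurrent.futures.wait(futures.values(), timeout=deadline_seconds)
    executor.shutdown(wait=False)  # do not wait for the slow blocks
    parts = []
    for name, future in futures.items():
        # Late or failed blocks are simply left out of the page.
        if future in done and future.exception() is None:
            parts.append(future.result())
    return "".join(parts)
```

`blocks` is a dict mapping a block name to a function that returns that block's HTML; blocks are assembled in the order they were listed.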

What else can we do?

* Reduce the number of calculations required to serve a page
* Reduce the number of disk operations
* Reduce the network Traffic

In general, do as few calculations as possible while handing the page to the surfer.

WordPress, Varnish and ESI Plugin

Sunday, June 5th, 2011

This post is a version of the slideshow presentation I did at Hack and Tell in Fort Lauderdale, Florida at The Whitetable Foundation on Saturday, June 4, 2011.

Briefly, I created a Plugin that enabled Fragment Caching with WordPress and Varnish. The problem we ran into with normal page caching methods was related to the fact that this particular client had people visiting many pages per visit, requiring the sidebar to be regenerated on uncached (cold) pages. By caching the sidebar and the page and assembling the page using Edge Side Includes, we can cache the sidebar which contains the most database intensive queries separately from the page. Thus, a visitor moving from one page to a cold page only needs to wait for the page itself to be generated; the sidebar is pulled from the cache.

What problem are we solving?

We had a high traffic site where surfers visited multiple pages, and a very interactive site. Surfers left a lot of comments which meant we were constantly purging the page cache. This resulted in the sidebar having to be regenerated numerous times – even when it wasn’t truly necessary.

What are our goals?

First, we want that Time to First Byte to be as quick as possible – surfers hate to wait and if you have a site that takes 12 seconds before they see any visible indication that there is something happening, most will leave.

We needed to keep the site interactive, which meant purging pages from cache when posts were made.

We had to have fast pageloads – accomplished by caching the static version of the page and doing as few calculations as possible to deliver the content.

We needed fast static content loading. Apache does very well, but, isn’t the fastest webserver out there.

How does the WordPress front page work?

The image above is a simple representation of a page that has a header, an article section where three articles are shown and a sidebar. Each of those elements is built from a number of SQL queries, assembled and displayed to the surfer. Each plugin that is used, especially filter plugins that look at content and modify it before output, adds a little latency – resulting in a slower page display.

How does an Article page work?

An article page works very similarly to the frontpage except our content block now only contains the contents from one post. Sometimes additional plugins are called to display the post content dealing with comments, social media sharing icons, greetings based on where you’re visiting from (Google, Digg, Reddit, Facebook, etc) and many more. We also see the same sidebar on our site which contains the site navigation, advertisements and other content.

What Options do we Have?

There are a number of existing caching plugins that I have benchmarked in the past. Notably we have:

* WP-Varnish
* W3 Total Cache
* WP Super Cache
* WordPress-Varnish-ESI
* and many others

Page Caching

With Page Caching, you take the entire generated page and cache it either in ram or on disk. Since the page doesn’t need to be generated from the database, the static version of the page is served much more quickly.

Fragment Caching

With Fragment Caching, we’re able to cache the page and a smaller piece that is often repeated, but, perhaps doesn’t change as often as the page. When a websurfer comments on a post, the sidebar doesn’t need to be regenerated, but, the page does.

WordPress and Varnish

Varnish doesn’t deal well with cookies, and WordPress uses a lot of cookies to maintain information about the current web surfer. Some plugins also add their own cookies to track things so that their plugin works.

Varnish can do domain name normalization which may be desired or not. Many sites redirect the bare domain to the www.domain.com. If you do this, you can modify your Varnish Configuration Language (VCL) to make sure it always hands back the proper host header.
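The normalization rule itself is simple. Here it is as a Python sketch, with `www.example.com` as a placeholder for your canonical hostname and `normalize_host` as a hypothetical helper. In practice the equivalent logic would live in vcl_recv.

```python
def normalize_host(host, canonical="www.example.com"):
    """Map the bare domain (with or without a port) to the canonical host."""
    host = host.lower().split(":")[0]
    bare = canonical[4:] if canonical.startswith("www.") else canonical
    return canonical if host in (bare, canonical) else host
```

Normalizing the host before the cache lookup means the bare and www variants share one cached copy instead of two.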

There are other issues with Varnish that affect how well it caches. There are a number of situations where Varnish doesn’t work as you would expect, but, this can all be addressed with VCL.

Purging – caching is easy, purging is hard once you graduate beyond a single server setup.

WordPress and Varnish with ESI

In this case, our plugin caches the page and the sidebar separately, and allows Varnish to assemble the page prior to sending it to the browser. This is going to be a little slower than page caching, but, in the long run, if you have a lot of page to page traffic, having that sidebar cached will make a significant impact.
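To show what "assembling the page" means, here is a toy Python version of the substitution Varnish performs when it finds an ESI include tag. The `/esi/sidebar` path is a made-up example and only the self-closing `<esi:include src="..."/>` form is handled, so this is a sketch of the idea rather than a full ESI processor.

```python
import re

ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


def assemble(page, fetch_fragment):
    """Replace each ESI include tag with the fragment fetched for its src."""
    return ESI_INCLUDE.sub(lambda m: fetch_fragment(m.group(1)), page)
```

`fetch_fragment` is any function that returns the cached HTML for a fragment URL, so the page and the sidebar can expire or be purged independently.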

Possible Solutions

You could hardcode templates and write modules to cache CPU or Database heavy widgets and in some cases, that is a good solution.

You could create a widget that handles the work to cache existing widgets. There is a plugin called Widget Cache, but, I didn’t find it to have much benefit when testing.

Many of the plugins could be rewritten to use client-side javascript. This way, caching would allow the javascript to be served and the actual computational work would be done on the client’s web browser.

Technical Problems

When the plugin was originally written, Varnish didn’t support compressing ESI assembled pages, which resulted in an infrastructure that was very difficult to manage.

WordPress uses a lot of cookies which need to be dealt with very carefully in Varnish’s configuration.

What sort of Improvement?

Before the ESI Widget          | After the ESI Widget
12 seconds time to first byte  | 0.087 seconds time to first byte
0.62 requests per second       | 567 requests per second
Huge number of elements        | Moved some elements to a ‘CDN’ URL

WordPress Plugin

In the above picture, we can see the ESI widget has been added to the sidebar, and we’ve added our desired widgets to the new ESI Widget Sidebar.

Varnish VCL – vcl_recv

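sub vcl_recv {
    # Let the plugin invalidate a cached URL by sending a BAN request
    if (req.request == "BAN") {
        ban("req.http.host == " + req.http.host +
            " && req.url == " + req.url);
        error 200 "Ban added";
    }
    # Static assets: drop cookies and strip cache-breaking query strings
    if (req.url ~ "\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$") {
        unset req.http.cookie;
        set req.url = regsub(req.url, "\?.*$", "");
    }
    # Everything outside wp-login and wp-admin is looked up without cookies
    if (!(req.url ~ "wp-(login|admin)")) {
        unset req.http.cookie;
    }
}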

In vcl_recv, we set up rules to allow the plugin to purge content, we do a little manipulation to cache static assets and ignore some of the cache breaking arguments specified after the ? and we aggressively remove cookies.
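One thing worth considering is who is allowed to send BAN requests. The sketch below, in Varnish 3.0 syntax, restricts them with an ACL. The ACL name and the addresses are assumptions for illustration and are not part of the plugin’s documented setup.

```vcl
# Assumption: only these addresses run the WordPress plugin.
acl purgers {
    "127.0.0.1";
    "192.168.0.0"/24;
}

sub vcl_recv {
    if (req.request == "BAN") {
        # Refuse ban requests from anyone not listed in the ACL.
        if (!(client.ip ~ purgers)) {
            error 405 "Not allowed";
        }
        ban("req.http.host == " + req.http.host +
            " && req.url == " + req.url);
        error 200 "Ban added";
    }
}
```

Without a check like this, any visitor who can reach Varnish could empty the cache one URL at a time.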

Varnish VCL – vcl_fetch

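sub vcl_fetch {
    # Drop Set-Cookie headers from the backend so the response can be cached
    if ( (!(req.url ~ "wp-(login|admin)")) || (req.request == "GET") ) {
        unset beresp.http.set-cookie;
    }
    # Cache for 12 hours regardless of the backend's expiry headers
    set beresp.ttl = 12h;

    if (req.url ~ "\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$") {
        # Static assets are kept for a year
        set beresp.ttl = 365d;
    } else {
        # Everything else is parsed for ESI includes
        set beresp.do_esi = true;
    }
}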

Here, we remove cookies set by the backend. We set our timeout to 12 hours, overriding any expire time. Since the widget purges cached content, we can set this to a longer expiration time, which eliminates additional CPU and database work. For static assets, we set a one year expiration time, and, if it isn’t a static asset, we parse it for ESI. The ESI parsing rule needs to be refined considerably as it currently parses objects that wouldn’t contain ESI.
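One possible refinement of the ESI rule, sketched here for Varnish 3.0, is to turn on ESI parsing only for HTML responses. This assumes the backend sends an accurate Content-Type header, and it is meant to stand in for the blanket else branch rather than to be added alongside it.

```vcl
sub vcl_fetch {
    # Assumption: only HTML pages generated by WordPress contain ESI tags.
    if (beresp.http.Content-Type ~ "text/html") {
        set beresp.do_esi = true;
    }
}
```

Feeds, JSON responses and other non-HTML objects then skip the ESI parser entirely.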

Did Things Break?

Purging broke things and revealed a bug in PHP’s socket handling.

Posting Comments initially broke as a result of cookie handling that was a little too aggressive.

Plugins that rely on being run on each pageload break, such as WP Greet Box and many of the Post Count and Statistics plugins.

Apache logs are rendered virtually useless since most of the queries are handled by Varnish and never hit the backend. You can log from varnishncsa, but, Google Analytics or some other webbug statistics program is a little easier to use.

End Result

Varnish 3.0, currently in beta, can compress ESI assembled pages and can now accept compressed content from the backend. This allows the Varnish server to exist at a remote location, possibly opening up avenues for companies to provide Varnish hosting in front of your WordPress site using this plugin.

Varnish ESI powered sites became much easier to deploy with 3.0. Before 3.0, you needed to run Varnish to do the ESI assembly and then pass the result through some other server like Nginx to compress the page before sending it to the surfer; otherwise, you were stuck handing uncompressed pages to your surfers.
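As a hedged illustration of the 3.0 behaviour, the fragment below asks Varnish to gzip text responses itself before storing them, so no second server is needed for compression. The list of content types is an assumption; adjust it to what your backend actually sends.

```vcl
sub vcl_fetch {
    # Assumption: the backend labels compressible responses with these content types.
    if (beresp.http.Content-Type ~ "text/|javascript|xml") {
        # Varnish 3.0 only: compress the object before it is stored in the cache.
        set beresp.do_gzip = true;
    }
}
```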

Other Improvements

* Minification/Combining Javascript and CSS
* Proper ordering of included static assets – i.e. include .css files before .js, use Async javascript includes.
* Spriting images – combining smaller images into one and using CSS to control which part is displayed, resulting in one image being downloaded rather than a dozen tiny social media buttons.
* Inline CSS for images – if your images are small enough, they could be included inline in your CSS – saving an additional fetch for the web browser.
* Multiple sidebars – currently, the ESI widget only handles one sidebar.

How can I get the code?

http://code.google.com/p/wordpress-varnish-esi/
