Finding my XFS Bug

October 6th, 2011

Recently one of our servers had some filesystem corruption – corruption that has occurred more than once over time. Since we use hardlinks heavily with rsync and --link-dest, I’m reasonably sure the issue is caused by the massive number of hardlinks and deletions that take place on that system.

I’ve written a small script to exercise that pattern repeatedly and started it running a few minutes ago. My guess is that the problem should show up within a few days.

#!/bin/bash
# Repeatedly snapshot a kernel tree with rsync --link-dest and prune old
# copies, to churn hardlink creation and deletion on the filesystem under test.

RSYNC=/usr/bin/rsync
REVISIONS=10

function rsync_kernel () {
  # snapshots live in /tmp so the glob below can find them on the next pass
  DATE="/tmp/`date +%Y%m%d%H%M%S`"

  BDATES=()
  loop=0
  for f in `ls -d1 /tmp/2011* 2>/dev/null`
  do
    BDATES[$loop]=$f
    loop=$(($loop+1))
  done

  CT=${#BDATES[*]}

  if (( $CT > 0 ))
  then
    # hardlink against the most recent snapshot
    RECENT=${BDATES[$(($CT-1))]}
    LINKDEST=" --link-dest=$RECENT"
  else
    RECENT="/tmp/linux-3.0.3"
    LINKDEST=" --link-dest=/tmp/linux-3.0.3"
  fi

  $RSYNC -aplxo $LINKDEST "$RECENT/" "$DATE/"

  # once we have $REVISIONS snapshots, delete the oldest ones
  if (( ${#BDATES[*]} >= $REVISIONS ))
  then
    DELFIRST=$(( ${#BDATES[*]} - $REVISIONS ))
    loop=0
    for d in ${BDATES[*]}
    do
      if (( $loop <= $DELFIRST ))
      then
        rm -rf "$d"
      fi
      loop=$(($loop+1))
    done
  fi
}

while true
do
  rsync_kernel
  echo .
  sleep 1
done

AVL Tree, doubly linked lists and iterations

September 28th, 2011

Over the past few months I’ve been mulling over a Network Performance and Intrusion Detection box that could sit on a network and make route suggestions to the border.

Some fundamental computer-science data structures and algorithms are involved in analyzing the data to find the source and preferred path for each incoming packet. When you’re dealing with flows and traffic capture at multiple gigabits per second, you need something that is very quick.

The problem we have is the representation of an IP address in memory. Obviously we can use a tuple to store the octets or words, but, the Internet isn’t fully allocated and there are a lot of empty netblocks.

Since the data is sparse, and for each packet we need to know whether it falls in a netblock we’ve seen, the originating ASN, and the preferred path, we have a small problem.

Our first issue comes from the fact that we might see an announcement of:

66.244.0.0/18

which contains all of the IP addresses from 66.244.0.0 to 66.244.63.255. So, large block announcements (and overlapping announcements) become a potential problem. Currently, there is an announcement for 4.0.0.0/8 and another for 4.0.0.0/9. Any IP address in 4.0.0.0/9 (over 8.3 million addresses) is handled by one announcement but is also contained within 4.0.0.0/8.
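To make the arithmetic concrete, here’s a quick sanity check of those ranges using Python’s ipaddress module (purely illustrative; it isn’t part of the system described here):

# Sanity-checking CIDR ranges and overlapping announcements (illustration only).
import ipaddress

block = ipaddress.ip_network('66.244.0.0/18')
print(block.network_address, block.broadcast_address)  # 66.244.0.0 66.244.63.255
print(block.num_addresses)                             # 16384

# An address such as 4.2.2.2 sits inside both the /9 and the covering /8.
for prefix in ('4.0.0.0/8', '4.0.0.0/9'):
    print(prefix, ipaddress.ip_address('4.2.2.2') in ipaddress.ip_network(prefix))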

With this in mind, we’ve ruled out a simple binary search tree, as it would end up quite imbalanced. We could store the individual tuples in a tree three levels deep, but, there are portions of the Internet filled with /24s that would be quite dense. Since those portions of netspace are not only densely packed but quite active, we know we’d spend a lot of time traversing those nodes.

Back to the issue at hand: we have an IP address, 66.244.19.68/32, that we need to find the source announcement for. Assuming for the moment that we ignore overlapping announcements, we need to find that the IP address is announced in 66.244.0.0/18, that it originates from ASN306, and which of our current peers offers the ‘best path’ according to BGP4.

BGP4 deals with routes based on the shortest path. Each hop through a provider adds an ASN to the path so that we know how to get to a destination. While it seems logical that the shortest AS path would also be the fastest path, that isn’t always the case. There are times when a packet traverses a provider’s backbone that has some congestion, and avoiding that provider could result in faster connection and transfer speeds to that destination. This happens much more frequently than you would think, as providers dump traffic onto free exchanges rather than paying for transit, and run those connections at 95% capacity during peak hours. Performance analysis is a separate issue, though; here we just need to find the most specific announcement for a particular destination.

There are several data structures that come to mind:

* Sorted list – iterate through it sequentially, use a binary search, etc.
* Red/Black tree – faster than a plain binary tree, but, route announcements can be withdrawn, and those deletions can leave the tree less well balanced.
* AVL tree – faster than a red/black tree for lookups and more evenly balanced; there’s a slight performance penalty on deletions, but, we don’t delete many items very often.

So, the AVL tree probably wins in this case, but, we’re not comparing for equality; we need to find the announcement that contains the IP address. While memory is cheap, and storing every single IP address with an associative pointer to its source announcement in a simple hash would be the fastest lookup, we instead need to modify our AVL tree lookup slightly. With IPv4, even if we used only the first three octets, we would still have roughly 16.8 million entries if we represented each netblock with an associative pointer to the source. With IPv6, the same hash would need on the order of 1.84×10^19 entries. In either case, populating that full hash would take a bit of time.

We need to find the node of the tree whose value is greater than our current IP address, then take the previous node and verify that the IP address is within that announcement. It should be, if our routing table is complete; if not, it could be a ‘martian’ or a destination that would fall through to our default route.
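A sketch of that lookup, here in Python purely for illustration, with a sorted list and a binary search standing in for the tree traversal (the prefixes and ASNs are sample data):

# Find the announcement covering an IP: locate the first entry whose start
# address is greater than the IP, step back one, and verify containment.
# Sketch only; overlapping prefixes are ignored, as above.
import bisect
import ipaddress

announcements = []  # (start, end, origin ASN) as integers, sorted by start
for prefix, asn in [('4.0.0.0/8', 3356), ('66.244.0.0/18', 306)]:
    net = ipaddress.ip_network(prefix)
    announcements.append((int(net.network_address), int(net.broadcast_address), asn))
announcements.sort()
starts = [a[0] for a in announcements]

def lookup(ip_str):
    ip = int(ipaddress.ip_address(ip_str))
    i = bisect.bisect_right(starts, ip)   # first announcement starting beyond ip
    if i == 0:
        return None                       # below every block we know about
    start, end, asn = announcements[i - 1]
    if start <= ip <= end:
        return asn
    return None                           # a 'martian', or falls to the default route

print(lookup('66.244.19.68'))   # 306
print(lookup('192.0.2.1'))      # None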

Since we’ve chosen Erlang, we get a General Balanced Tree built in (the gb_trees module), which has some decent performance improvements over a generic AVL tree, but we need to modify the traversal slightly for our use case. Luckily, we can do that.

I raise this issue because I read a blog post about a student claiming he’d never put anything he learned in college to use. I think he’s just not solving the right problems. I learned about trees, sorts, linked lists in college and still put things like this to use today.

I remember back in the mid ’80s, attending UMBC for a degree in Computer Science. Since I already had programming experience, I was able to take two experimental courses in place of the general CompSci track, which introduced me to Ada under one of the few experts at the time. A friend taking CS270 – Pascal mentioned that his teacher was giving a lecture on doubly linked lists and a new method of traversal he had come up with. As I lived on campus and my advisors had completely screwed up my class schedule, I had a lot of free time between classes and opted to sit in on the class.

Of course, the teacher decided to give a pop quiz that day prior to the lecture, which I passed over to my neighbor. The TA asked why I wasn’t taking the quiz, and I said I wasn’t in the class and was there to hear the lecture. This raised a small commotion, which brought the professor over to ask why I refused to take the quiz. I explained that I had heard through a friend that he was giving a lecture on doubly linked lists with a different method of traversal and I was interested in hearing the discussion – I wasn’t a member of the class, and if he preferred, I would leave, but I wanted to hear the lecture.

I think something happened – this teacher had a reputation for being very intelligent but very gruff. He asked me to come with him to the front of the class and said,

Ladies and Gentlemen, this is Chris Davies. He doesn’t attend this class, but, wanted to hear my lecture. Here’s a student that actually wanted to be here to learn while you’re all disappointed with having to show up to hear me talk.

He gave his lecture and we did talk a few more times after that about data retrieval and data structures.

Yes, twenty-six years later, I am still using the education I paid for.

Pyramid Apex – putting it in production

August 15th, 2011

After quite a bit of work, we’ve finally gotten Pyramid Apex to a point where I can deploy it on two production apps to make sure things are working as I expect.

If you’re developing a Pyramid Application and are using Authentication/Authorization, I18N/L10N, Flash Messages and a Form Library, take a look at Pyramid Apex, a library Matthew Housden and I wrote to make it easier to quickly develop Pyramid applications.

It supports OpenID, Local authentication storage using bcrypt and a number of other basic features.

W3 Total Cache and Varnish

July 21st, 2011

Last week I got called into a firestorm to fix a set of machines that were having problems. As Varnish was in the mix, the first thing I noticed was that the hit rate was extremely low, as the VCL wasn’t really configured for WordPress. Since WordPress uses a lot of cookies, and Varnish passes anything with a cookie to the backend, we have to know which cookies we can ignore so that we can get the cache hit rate up.

Obviously, static assets like JavaScript, CSS and images generally don’t need cookies, so those make a good first target. Since some ad networks set their own cookies on the domain, we need to know which ones we can safely ignore. However, to make a site resilient, we have to get a little more aggressive and tell Varnish to cache things against its better judgement. When we do this, we don’t want surfers to see stale content, so we need to purge cached objects from Varnish when they change, keeping the site current.

Caching is easy, purging is hard

This particular installation used W3 Total Cache, a plugin that does page caching, JavaScript/CSS minification and combining, and handles a number of other features. I was unable to find any suggested VCL, and several posts on the forums show little interest in supporting Varnish.

In most cases, once we determine what we’re caching, we need to figure out what to purge. When a surfer posts a comment, we need to clear the cached representation of that post, the RSS feed and the front page of the site. This allows any post counters to be updated and keeps the RSS feed accurate.
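For reference, a purge is just an HTTP request using the PURGE method, sent to the cache box itself. A rough sketch of what has to happen when a comment comes in (hypothetical IPs and paths; your VCL has to be set up to accept PURGE):

# Purge the post, the feed and the front page from each Varnish server.
# Hypothetical addresses and paths, for illustration only.
import http.client

VARNISH_SERVERS = ['192.0.2.10', '192.0.2.11']   # cache boxes, not the backend
SITE_HOST = 'www.example.com'

def purge(path):
    for server in VARNISH_SERVERS:
        conn = http.client.HTTPConnection(server, 80, timeout=10)
        # Connect to the Varnish IP but present the site's hostname,
        # the same idea the patch below applies to W3TC.
        conn.request('PURGE', path, headers={'Host': SITE_HOST})
        print(server, path, conn.getresponse().status)
        conn.close()

for path in ['/2011/07/some-post/', '/feed/', '/']:
    purge(path)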

W3TC includes the ability to purge, but it only works in a single-server setting. If you put a domain name in the config box, it should work fine. If you put in a series of IP addresses, your VCL either needs to override the hostname or you need to apply the following patch. There are likely to be bugs, so try this at your own risk.

If you aren’t using the JavaScript/CSS minification and combining or some of the CDN features that W3TC provides, then I would suggest WordPress-Varnish, which is maintained by people very close to the Varnish team.

I’ve kept the original W3TC lines, commented out above each change, for reference.

--- w3-total-cache/inc/define.php	2011-06-21 23:22:54.000000000 -0400
+++ w3-total-cache-varnish/inc/define.php	2011-07-21 16:10:39.270111723 -0400
@@ -1406,11 +1406,15 @@
  * @param boolean $check_status
  * @return string
  */
-function w3_http_request($method, $url, $data = '', $auth = '', $check_status = true) {
+#cd34, 20110721, added $server IP for PURGE support
+# function w3_http_request($method, $url, $data = '', $auth = '', $check_status = true) {
+function w3_http_request($method, $url, $data = '', $auth = '', $check_status = true, $server = '') {
     $status = 0;
     $method = strtoupper($method);
 
-    if (function_exists('curl_init')) {
+#cd34, 20110721, don't use CURL for purge
+#    if (function_exists('curl_init')) {
+    if ( (function_exists('curl_init')) && ($method != 'PURGE') ) {
         $ch = curl_init();
 
         curl_setopt($ch, CURLOPT_URL, $url);
@@ -1474,7 +1478,13 @@
             $errno = null;
             $errstr = null;
 
-            $fp = @fsockopen($host, $port, $errno, $errstr, 10);
+#cd34, 20110721, if method=PURGE, connect to $server, not $host
+#            $fp = @fsockopen($host, $port, $errno, $errstr, 10);
+            if ( ($method == 'PURGE') && ($server != '') ) {
+                $fp = @fsockopen($server, $port, $errno, $errstr, 10);
+            } else {
+                $fp = @fsockopen($host, $port, $errno, $errstr, 10);
+            }
 
             if (!$fp) {
                 return false;
@@ -1543,8 +1553,9 @@
  * @param bool $check_status
  * @return string
  */
-function w3_http_purge($url, $auth = '', $check_status = true) {
-    return w3_http_request('PURGE', $url, null, $auth, $check_status);
+#cd34, 20110721, added server IP
+function w3_http_purge($url, $auth = '', $check_status = true, $server = '') {
+    return w3_http_request('PURGE', $url, null, $auth, $check_status, $server);
 }
 
 /**
diff -Naur w3-total-cache/lib/W3/PgCache.php w3-total-cache-varnish/lib/W3/PgCache.php
--- w3-total-cache/lib/W3/PgCache.php	2011-06-21 23:22:54.000000000 -0400
+++ w3-total-cache-varnish/lib/W3/PgCache.php	2011-07-21 16:04:07.247499682 -0400
@@ -693,7 +693,9 @@
                     $varnish =& W3_Varnish::instance();
 
                     foreach ($uris as $uri) {
-                        $varnish->purge($uri);
+#cd34, 20110721 Added $domain_url to build purge hostname
+#                        $varnish->purge($uri);
+                        $varnish->purge($domain_url, $uri);
                     }
                 }
             }
diff -Naur w3-total-cache/lib/W3/Varnish.php w3-total-cache-varnish/lib/W3/Varnish.php
--- w3-total-cache/lib/W3/Varnish.php	2011-06-21 23:22:54.000000000 -0400
+++ w3-total-cache-varnish/lib/W3/Varnish.php	2011-07-21 16:04:52.836919164 -0400
@@ -70,7 +70,7 @@
      * @param string $uri
      * @return boolean
      */
-    function purge($uri) {
+    function purge($domain, $uri) {
         @set_time_limit($this->_timeout);
 
         if (strpos($uri, '/') !== 0) {
@@ -78,9 +78,11 @@
         }
 
         foreach ((array) $this->_servers as $server) {
-            $url = sprintf('http://%s%s', $server, $uri);
+#cd34, 20110721, Replaced $server with $domain
+#            $url = sprintf('http://%s%s', $server, $uri);
+            $url = sprintf('%s%s', $domain, $uri);
 
-            $response = w3_http_purge($url, '', true);
+            $response = w3_http_purge($url, '', true, $server);
 
             if ($this->_debug) {
                 $this->_log($url, ($response !== false ? 'OK' : 'Bad response code.'));
diff -Naur w3-total-cache/w3-total-cache.php w3-total-cache-varnish/w3-total-cache.php
--- w3-total-cache/w3-total-cache.php	2011-06-21 23:22:54.000000000 -0400
+++ w3-total-cache-varnish/w3-total-cache.php	2011-07-21 15:56:53.275922099 -0400
@@ -2,7 +2,7 @@
 /*
 Plugin Name: W3 Total Cache
 Description: The highest rated and most complete WordPress performance plugin. Dramatically improve the speed and user experience of your site. Add browser, page, object and database caching as well as minify and content delivery network (CDN) to WordPress.
-Version: 0.9.2.3
+Version: 0.9.2.3.v
 Plugin URI: http://www.w3-edge.com/wordpress-plugins/w3-total-cache/
 Author: Frederick Townes
 Author URI: http://www.linkedin.com/in/w3edge
@@ -47,4 +47,4 @@
     require_once W3TC_LIB_W3_DIR . '/Plugin/TotalCache.php';
     $w3_plugin_totalcache = & W3_Plugin_TotalCache::instance();
     $w3_plugin_totalcache->run();
-}
\ No newline at end of file
+} 

Gracefully Degrading Site with Varnish and High Load

July 16th, 2011

If you run Varnish, you might want to gracefully degrade your site when unexpected traffic arrives. There are other solutions on the net that maintain a Three State Throttle, but it seemed like this could be done easily within Varnish without too many external dependencies.

The first challenge was to figure out how we wanted to handle state. Our backend director is set up with a ‘level1’ backend that doesn’t do any health checks. We need at least one node that never fails its health check, since the ‘level2’ and ‘level3’ backends will go offline to signal to Varnish that we need to take action. While this scenario assumes the failure modes cascade (level2 fails first and, if load continues to climb, level3 fails as well), there is nothing preventing you from having separate failure modes and different VCL for each condition.

You could have VCL that replaces the front page of your site with a ‘top news’ page during an event, linking to your secondary pages. You can write VCL to handle almost any condition, and you don’t need to worry about doing a VCL load to update the configuration.

While maintaining three separate configurations is simpler, that approach adds a few extra points of failure. If the load on the machine gets too high and the cron job or daemon that is supposed to update the VCL doesn’t run quickly enough, or has trouble talking to Varnish due to network congestion, your site could run in a degraded mode much longer than needed. With this solution, if there is too much network congestion or too much load for the backend to respond, Varnish automatically treats that as a level3 failure and enacts those rules – without the backend needing to acknowledge the problem.

The basics

First, we set up the script that Varnish will probe. The script doesn’t need to be PHP; it only needs to respond with a 404 error to signify to Varnish that the probe request has failed.

<?php
$level = $_SERVER['QUERY_STRING'];
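// /proc/loadavg begins with the 1-minute load average; multiplying the string by 1 below casts it to that float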
$load = file_get_contents('/proc/loadavg') * 1;
if ( ($level == 2) and ($load > 10) ) {
  header("HTTP/1.0 404 Get the bilge pumps working!");
}
if ( ($level == 3) and ($load > 20) ) {
  header("HTTP/1.0 404 All hands abandon ship");
}
?>

Second, we need to have our backend pool configured to call our probe script:

backend level1 {
  .host = "66.55.44.216";
  .port = "80";
}
backend level2 {
  .host = "66.55.44.216";
  .port = "80";
  .probe = {
    .url = "/load.php?2";
    .timeout = 0.3 s;
    .window = 3;
    .threshold = 3;
    .initial = 3;
  }
}
backend level3 {
  .host = "66.55.44.216";
  .port = "80";
  .probe = {
    .url = "/load.php?3";
    .timeout = 0.3 s;
    .window = 3;
    .threshold = 3;
    .initial = 3;
  }
}

director crisis random {
  {
# base that should always respond so we don't get an Error 503
    .backend = level1;
    .weight = 1;
  }
  {
    .backend = level2;
    .weight = 1;
  }
  {
    .backend = level3;
    .weight = 1;
  }
}

Since both of our probes go to the same backend, it doesn’t matter which director we use or what weight we assign. We just need one backend that will never fail a probe, alongside our level2 and level3 backends. In this example, when the load on the server is greater than 10, it triggers a level2 failure. If the load is greater than 20, it triggers a level3 failure.

In this case, when the backend probe request fails, we just rewrite the URL. Any VCL can be added, but, you will have some duplication. Since the VCL is compiled into the Varnish server, it should have negligible performance impact.

sub vcl_recv {
  set req.backend = level2;
  if (!req.backend.healthy) {
    unset req.http.cookie;
    set req.url = "/level2.php";
  }
  set req.backend = level3;
  if (!req.backend.healthy) {
    unset req.http.cookie;
    set req.url = "/level3.php";
  }
  set req.backend = crisis;
}

In this case, when we have a level2 failure, we change any requested URL to serve the file /level2.php. In vcl_fetch, we make a few changes to the object TTL so that we prevent the backend from getting hit too hard. We also change the Server header so that we can look at the response headers and see what level the server is currently running at. In Firefox, there is an extension called Header Spy that allows you to keep track of a header. Oftentimes I’ll track X-Cache, which I set to HIT or MISS to make sure Varnish is caching, but you could also track Server to see whether things are running properly.

sub vcl_fetch {
  set beresp.ttl = 0s;

  set req.backend = level2;
  if (!req.backend.healthy) {
    set beresp.ttl = 5m;
    unset beresp.http.set-cookie;
    set beresp.http.Server = "(Level 2 - Warning)";
  }
  set req.backend = level3;
  if (!req.backend.healthy) {
    set beresp.ttl = 30m;
    unset beresp.http.set-cookie;
    set beresp.http.Server = "(Level 3 - Critical)";
  }
}

At this point, we’ve got a system that degrades gracefully even if the backend cannot respond or update Varnish’s VCL, and it self-heals based on the load checks. Ideally, you’ll also want to set grace timers and possibly run saint mode to handle significant failures, but this should help your system protect itself from meltdown.

Complete VCL

backend level1 {
  .host = "66.55.44.216";
  .port = "80";
}
backend level2 {
  .host = "66.55.44.216";
  .port = "80";
  .probe = {
    .url = "/load.php?2";
    .timeout = 0.3 s;
    .window = 3;
    .threshold = 3;
    .initial = 3;
  }
}
backend level3 {
  .host = "66.55.44.216";
  .port = "80";
  .probe = {
    .url = "/load.php?3";
    .timeout = 0.3 s;
    .window = 3;
    .threshold = 3;
    .initial = 3;
  }
}

director crisis random {
  {
# base that should always respond so we don't get an Error 503
    .backend = level1;
    .weight = 1;
  }
  {
    .backend = level2;
    .weight = 1;
  }
  {
    .backend = level3;
    .weight = 1;
  }
}

sub vcl_recv {
  set req.backend = level2;
  if (!req.backend.healthy) {
    unset req.http.cookie;
    set req.url = "/level2.php";
  }
  set req.backend = level3;
  if (!req.backend.healthy) {
    unset req.http.cookie;
    set req.url = "/level3.php";
  }
  set req.backend = crisis;
}

sub vcl_fetch {
  set beresp.ttl = 0s;

  set req.backend = level2;
  if (!req.backend.healthy) {
    set beresp.ttl = 5m;
    unset beresp.http.set-cookie;
    set beresp.http.Server = "(Level 2 - Warning)";
  }
  set req.backend = level3;
  if (!req.backend.healthy) {
    set beresp.ttl = 30m;
    unset beresp.http.set-cookie;
    set beresp.http.Server = "(Level 3 - Critical)";
  }

  if (req.url ~ "\.(gif|jpe?g|png|swf|css|js|flv|mp3|mp4|pdf|ico)(\?.*|)$") {
    set beresp.ttl = 365d;
  }
}
