Archive for the ‘Web Infrastructure’ Category

The Path to IPv6 from a webhosting perspective

Tuesday, March 29th, 2011

My goal in June 2010 was to be completely IPv4/IPv6 dual stack by the end of 2010. This started a long, arduous process: reworking portions of our network, upgrading the software on our border routers, increasing the memory on our border routers to hold the larger BGP table, removing a provider that refused to handle IPv6 in the data center we were located in, adding a separate provider so that we could have redundant IPv6 feeds, and a number of other issues. In the 7 days since we turned up IPv6 and started announcing two /48s, we’ve gotten 25% of our network configured for IPv6 and expect to transition the remaining 75% in the next 15 days.

Of course, with IPv6 comes a new kernel, as the kernel we had been running didn’t have IPv6 support. 2.6.38 also comes with automatic process grouping, which in early testing has had a positive impact on several machines with different workloads, so we have an additional reason to deploy new kernels on every machine.
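For anyone wanting to verify the scheduler change, automatic process grouping in 2.6.38 is exposed through a sysctl knob; a quick sketch:

# 1 means automatic process grouping is active
sysctl kernel.sched_autogroup_enabled

# toggle it off at runtime to compare a workload with and without it
echo 0 > /proc/sys/kernel/sched_autogroup_enabled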

Some of the issues we ran into:

* Router
** Initial problem with IPv6 and the OS on the router
** Current minor issue with OSPFv3
* Route Performance Control Box
** appears to ignore IPv6 traffic
* Aggregate Network
** OSPFv3 support; altered the network design to re-flatten it (after having unflattened it a few years back)
* Nameservers
** Currently using bind9, no issues, switching to PowerDNS for other reasons
** Glue records at the registrar required manual entry (the webform didn’t accept : in an IP address)
* MX Servers
** Postfix, no issues; set inet_protocols = ipv4, ipv6 in main.cf and restarted
** Some anti-spam software that depends on IP addressing acts a little oddly with IPv6
** The antivirus daemon appears to listen only on an IPv4 socket, but since it is an internal milter, it doesn’t cause any real problems for now.
** In the first 7 days, 247k emails were processed, 2 of them over IPv6
* Webservers
* Load Balancers
** very odd issue with the new kernel, udev, and the SSD drives; not network/IPv6 related
* Cluster
** No issues; GFS, DRBD, Apache, Dovecot, etc. all recognized IPv6
* General Machine issues
** Firewall software on each machine requires separate rulesets for IPv6. Not a huge problem, but one to consider.
* Client applications
** char(15) columns in MySQL are too small to store IPv6 addresses (see the sketch after this list)
** parsing of Apache CLF doesn’t understand IPv6 addresses
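On the MySQL point above: the text form of an IPv6 address runs up to 45 characters (39 for a full uncompressed address, 45 with an embedded IPv4 dotted quad), so char(15) silently truncates it. A sketch of the kind of schema change involved; the table and column names are hypothetical:

-- char(15) only fits a dotted-quad IPv4 address; widen it for the IPv6 text form
ALTER TABLE access_log MODIFY remote_addr VARCHAR(45) NOT NULL;

Storing the packed 128-bit form in a VARBINARY(16) is the other common approach, at the cost of converting addresses on the way in and out.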

One person testing with a Teredo tunnel was able to ping the site over IPv6, but wasn’t able to access it in a browser. After reading through a number of pages on the web, adding a DWORD value under:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Dnscache\Parameters

AddrConfigControl = 0

fixed the issue. For my home connection, I used TunnelBroker along with the script mentioned on the page Enable IPv6 on Mac OS X, the tunnelbroker.net way.
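For anyone applying the same fix, the value can also be set from an elevated command prompt with reg.exe (standard syntax; /f skips the confirmation prompt):

rem Add the AddrConfigControl value described above
reg add HKLM\SYSTEM\CurrentControlSet\services\Dnscache\Parameters /v AddrConfigControl /t REG_DWORD /d 0 /f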

After receiving this tweet, I decided that running this site with a separate hostname for IPv6 was probably not a great test, and put the AAAA records in DNS. So far, one person has mentioned difficulty reaching the site, but that was a problem with their ISP and transit: the ISP appears to be blocking protocol 41 packets. Switching to a different tunnel fixed that problem.
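The DNS side of this is a one-line change per hostname. A minimal sketch in BIND zone-file syntax, using placeholder addresses from the documentation ranges (192.0.2.0/24 and 2001:db8::/32):

; dual stack: the same name resolves over both protocols
www    IN    A       192.0.2.10
www    IN    AAAA    2001:db8::10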

All in all, most of the issues are very minor from a networking standpoint, but web applications are going to have the most trouble.

We’re working hard to make sure everything is dual-stack enabled by World IPv6 Day (June 8, 2011), but I suspect it will be 2020-2030 before we can deploy IPv6-only services.

IPv6 Readiness test

Saturday, March 26th, 2011

According to the site test-ipv6.com, my laptop at the data center, using our DNS, appears to be working correctly.

Still some prep work left, but we converted another dozen or so machines this evening.

WordPress Varnish ESI Widget is back. Thank you Varnish.

Wednesday, January 26th, 2011

Long ago I wrote the WordPress ESI Widget to help a client’s site stay online during a barrage of traffic. To solve the performance problems of a high-traffic WordPress site you have to use caching, but almost all of the caching add-ons for WordPress do page-level caching rather than fragment caching. After the site’s traffic slowed, I stopped development on the widget because of the infrastructure required to support compression.

To compress an ESI-assembled page, one needed to run Nginx in front of Varnish, and lost some performance as a result. Nginx would take the initial request and pass it to Varnish; Varnish would talk to the backend (which could be the same Nginx server, in a somewhat complex configuration), grab the parts, assemble them, and hand the page back to Nginx, which would then compress it and hand it to the surfer.

With Varnish compressing ESI-assembled pages itself, we no longer need that incredibly complex configuration to run ESI. We’re left with a very simple front-end cache in front of our backend servers.
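In Varnish 3.0 terms, enabling this is only a few lines of VCL. A minimal sketch, assuming you want dynamic pages assembled and text responses compressed; the URL test is illustrative:

sub vcl_fetch {
    # parse this response for <esi:include> tags and assemble the page
    if (req.url ~ "\.php$" || req.url == "/") {
        set beresp.do_esi = true;
    }
    # gzip the stored object if the backend didn't already
    if (beresp.http.content-type ~ "text") {
        set beresp.do_gzip = true;
    }
}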

Why is Fragment Caching important?

Fragment caching allows the cache to store pieces of a page that repeat across several pages and assemble those pieces with the rest of the page. The sidebar on your WordPress site only needs to be generated once as someone surfs through your site, which changes the nature of WordPress caching considerably. Compared to the fastest existing WordPress caching plugin, the Varnish ESI Widget was twice as fast, bested only by WP Varnish, a plugin that ran Varnish directly and managed cache expiration.
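The mechanism itself is just a placeholder tag in the generated page: the theme emits an include where the sidebar belongs, and the cache fetches, caches, and splices in that fragment separately from the page around it. The src path here is illustrative:

<div id="sidebar">
  <!-- replaced by the cache with the separately cached fragment -->
  <esi:include src="/esi/sidebar/" />
</div>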

ESI explained simply is probably the best example I have ever found for explaining how ESI works.

But something else is faster

WP Varnish is currently faster and, for all practical purposes, probably always will be on a very busy site. However, on a site that gets a lot of traffic to one page, the time to first byte for the second page a visitor requests should be faster with an ESI-assembled page, because the sidebar, which contains some of the most computationally expensive parts of the page, doesn’t need to be generated again. While we give up some raw speed, we gain an advantage when someone clicks through to read a second page. The perfect use case is a particular post on your WordPress site getting publicity, and those surfers deciding to read other articles you’ve written.

Varnish’s Original Announcement

From: Poul-Henning Kamp
Date: January 25, 2011 6:04:02 AM EST
To: varnish-misc@varnish-cache.org
Subject: Please help break Varnish GZIP/ESI support before 3.0


One of the major features of Varnish 3.0 is now feature complete, and
I need people to start beating it up and help me find the bugs before
we go into the 3.0 release cycle.


GZIP support
------------

Varnish will ask the backend for gzip'ed objects by default and for
the minority of clients that do not grok that, ungzip during delivery.

If the backend can not or will not gzip the objects, varnish can be
told in VCL to gzip during fetch from the backend.  (It can also
gunzip, but I don't know why would you do that ?)

In addition to bandwidth, this should save varnish storage (one gzip
copy, rather than two copies, one gzip'ed one not).

GZIP support is on by default, but can be disabled with a parameter.



ESI support
-----------

Well, we have ESI support already, the difference is that it also
understands GZIP'ing.  This required a total rewrite of the ESI
parser, much improving the readability of it, I might add.

So now you can use ESI with compression, something that has hitherto
been a faustian bargain, often requiring an afterburner of some kind
to do the compression.

There are a lot of weird cornercases in this code, (such as including
a gzip'ed object in an uncomressed object) so this code really needs
beaten up.

Original message

What else is there?

Another very important change is that Varnish will use gzip when requesting assets from the backend. While this doesn’t sound incredibly important, it is. Now you can run a Varnish server at another data center and not worry as much about latency. Before this version, any ESI-assembled page needed to be fetched uncompressed, and on large pages those uncompressed fetches add tiny bits of latency that result in a poorer surfing experience. Most installations run Varnish on the same machine or on a machine topologically close on the network, but this opens the door for a CDN to run ESI-enabled edge servers to supercharge your WordPress site hosted anywhere.
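A quick way to sanity-check that a backend actually serves gzip’ed objects for Varnish to fetch; the hostname is a placeholder:

# compare compressed vs. uncompressed download sizes from the backend
curl -s -o /dev/null -w '%{size_download}\n' -H 'Accept-Encoding: gzip' http://backend.example.com/
curl -s -o /dev/null -w '%{size_download}\n' http://backend.example.com/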

When will it be here?

Varnish moves quickly, and while the changes are substantial in terms of code rewritten, their code is very well written. I don’t expect we’ll see many bugs, and I expect it to be released in the next few months. This site and a number of other sites we work with will be running it later this week.

In short, caching for WordPress just got an incredible boost. Even before compressed ESI assembly and gzip’ed fetches from the backend, the ESI Widget was twice as fast as the fastest non-Varnish plugin and over 440 times faster than WordPress out of the box.

Original Info

* WordPress Cache Plugin Benchmarks
* WordPress, Varnish and Edge Side Includes
* ESI Widget Issues in the Varnish, ESI, WordPress experiment
* A WordPress Widget that Enables one to use Varnish and ESI

Adaptec 31205 under Debian

Saturday, September 25th, 2010

We have a storage server with eleven 2 TB drives in a RAID 5. During a recent visit we heard the alarm, but no red light was visible on any drive, nor was the light on the front of the chassis lit. Knowing it was a problem waiting to happen, but unable to see which drive had caused the array to fail, we scheduled a maintenance window that happened to coincide with a kernel upgrade.

In the meantime, we attempted to install the RPM and Java management system, to no avail, so we weren’t able to read the controller status to find out what the problem was.

When we rebooted the machine, the array status was degraded, and the controller prompted us to hit Enter to accept the configuration or Ctrl-A to enter the admin. We entered the admin, chose Manage array, and all drives were present and working. Immediately the array status changed to rebuilding, with no indication of which drive had failed and was being re-added.

As we exited the admin and saved the config, the client said to pull the machine offline until it was fixed. This started what seemed like an endless process. We figured we would let it rebuild while it was online but disabled from the cluster. We installed a new kernel, 2.6.36-rc5, and rebooted, and this is where the trouble started. On boot, the new kernel got an I/O error, the channel hung, the controller forced a reset, and then it sat there for about 45 seconds. After it continued, the kernel panicked because it was unable to read /dev/sda1.

Rebooting and entering the admin, we were faced with an array marked offline. After identifying each of the drives through Disk Utils to make sure they were recognized, we forced the array back online and rebooted into the old kernel. As it turns out, something in our 2.6.36-rc5 kernel disables the array and sets it offline. It took 18 hours to rebuild the array and return it to Optimal status.

After the machine came up, we knew we had a problem with one of the directories on the system, and this seemed like an opportune time to run xfs_repair. About 40 minutes in, we ran into an I/O error with a huge block number and, bam, the array was offline again.

In Disk Utils in the controller ROM, we started the test on the first drive. It took 5.5 hours to run through the first disk, which put us at an estimated 60+ hours to check all 11 drives in the array. smartctl couldn’t check the drives independently through the controller, so we fired up a second machine and mounted each of the suspect drives, looking for any possible telltale signs in the S.M.A.R.T. data stored on the drives. Two drives showed abnormal numbers, giving us an estimated 11 hours to check those disks. 5.5 hours later the first disk came back clean; less than 30 minutes after that, we had our culprit. Relocating a number of bad sectors caused the controller to hang again, yet there was no red fault light anywhere to be seen and no indication in the Adaptec manager that this drive was bad.
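For reference, this is the sort of per-drive check we were doing on the second machine; the device name varies, and the attribute list is just the usual suspects for a dying disk:

# full S.M.A.R.T. dump for one drive on a plain SATA port
smartctl -a /dev/sdb

# the telltale counters: anything nonzero and climbing is a bad sign
smartctl -A /dev/sdb | egrep 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'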

Replacing the drive and going back into the admin showed us a greyed-out drive, which immediately started reconstructing. We rebooted the system into the older kernel and started xfs_repair again. After two hours, it had run into a number of errors, but no I/O errors.

It is obvious we’ve had some corruption for quite some time. We had a directory we couldn’t delete because it claimed it had files, although no files were visible in it. We had two directories with files that we couldn’t do anything with, and we couldn’t even mv them to an area outside our working directories. We figured it was an xfs bug we had hit due to the 18-terabyte size of the partition, and guessed that an xfs_repair would fix it. It was a minor annoyance to the client until we could get to a maintenance interval, so we waited. In reality, this should have been a sign that we had deeper issues, and we should have pushed the client harder to let us diagnose this much earlier. There is some data corruption, but this is the second of a pair of backup servers for their cluster; resyncing the data from a known good source will fix it without too much difficulty.

After four hours, xfs_repair was reporting issues like:


bad directory block magic # 0 in block 0 for directory inode 21491241467
corrupt block 0 in directory inode 21491241467
        will junk block
no . entry for directory 21491241467
no .. entry for directory 21491241467
problem with directory contents in inode 21491241467
cleared inode 21491241467
        - agno = 6
        - agno = 7
        - agno = 8
bad directory block magic # 0 in block 1947 for directory inode 34377945042
corrupt block 1947 in directory inode 34377945042
        will junk block
bad directory block magic # 0 in block 1129 for directory inode 34973370147
corrupt block 1129 in directory inode 34973370147
        will junk block
bad directory block magic # 0 in block 3175 for directory inode 34973370147
corrupt block 3175 in directory inode 34973370147
        will junk block

It appears that we have quite a bit of data corruption due to a bad drive, which is precisely why we use RAID.

The array had failed, so why didn’t the Adaptec on-board manager know which drive was responsible? Had we gotten the Java application to run, I’m still not convinced it would have told us which drive was throwing the array into degraded status. Obviously the card knew something was wrong, since the alarm was on. Each drive has a fault light and an activity light, but all of the drives allowed the array to be rebuilt and the status was reported as Optimal. During initialization, the Adaptec lights the fault and activity lights for each drive, so it seems reasonable that when a drive encountered errors, the controller could have lit that drive’s fault light so we knew which one to replace. And when xfs_repair hit the I/O error on a block the drive couldn’t relocate, why didn’t the Adaptec controller immediately fail the drive?

All in all, I’m not too happy with Adaptec right now. A 2 TB hard drive failed, and it cost us roughly 60 hours to diagnose the problem and put the machine back into service. The failing drive should have been flagged and removed from the RAID set immediately. As it stands, even though the array was running in degraded mode, we shouldn’t have seen any corruption; however, xfs_repair is finding a considerable number of errors.

The drives report roughly 5600 hours online, which corresponds to the eight months we’ve had the machine in service. Based on the number of bad files xfs_repair is finding, I believe that drive had been failing for quite some time, and Adaptec has failed us. While we run a considerable number of Adaptec controllers, we’ve never seen a failure like this.

A weekend with Tornado

Tuesday, June 29th, 2010

After working on a Pylons project for a week or so, I decided that a minor part of it didn’t need the complexity of a framework. Some quick benchmarking of the most minimal Pylons/SQLAlchemy project I could muster came in around 200 requests per second, which put me at roughly 12 million requests per day based on the typical curve (flat-out, 200 requests per second would be about 17 million requests per day, but real traffic peaks and troughs).

Within 15 minutes of installing Tornado and starting from their simple hello-world example, I imported SQLAlchemy and ended up boosting this to 280 requests per second. As I really didn’t need any of the features of the ORM, I decided to use tornado.database, which isn’t much more than a bare wrapper around python-mysql. Even with a single worker process, I was able to get 870 requests per second. 56 million requests per day, without any tuning?
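For flavor, here is a minimal sketch of the sort of handler benchmarked above, using the hello-world layout from Tornado’s docs of that era plus its bundled tornado.database wrapper; the database name, credentials, and query are hypothetical:

import tornado.httpserver
import tornado.ioloop
import tornado.web
from tornado import database

# tornado.database is a thin wrapper around python-mysql (MySQLdb)
db = database.Connection("127.0.0.1", "exampledb", user="web", password="secret")

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # db.get() returns a single Row with attribute access, or None
        row = db.get("SELECT COUNT(*) AS total FROM widgets")
        self.write("Hello, world: %d widgets" % (row.total if row else 0))

application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8888)  # a single worker process
    tornado.ioloop.IOLoop.instance().start()

Requests per second for something like this would be measured with a standard HTTP benchmarking tool; ab -n 10000 -c 10 http://127.0.0.1:8888/ is the typical shape of such a test.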

I’m reasonably impressed. Once I put it on production hardware, I’m thinking I’ll easily be able to count on double those numbers, if not more.

Next weekend, Traffic Server.
