Archive for the ‘Scalability’ Category

GFS2 Kernel Oops

Sunday, October 30th, 2011

For a few years I’ve run a system using DRBD replication between two machines, with GFS2 running in dual-primary mode, to test a theory on a particular type of web hosting I’ve been developing.

For months the system will run fine; then, out of the blue, one of the nodes will drop from the cluster and reboot, and we’ve never seen anything in the logs. It’ll run another 120-180 days without incident and then reboot again with no real indication of the problem. We knew it was a kernel panic or kernel oops, but the logs were never flushed to disk before the machine rebooted.

Imagine our luck when, two days in a row, at roughly the same time of day, the node rebooted. Even though we have remote syslog set up, we had never caught it.

/etc/sysctl.conf was changed so that panic_on_oops was set to 0, a number of terminal sessions were opened from another machine tailing various logs, and we hoped the problem would occur again.

/etc/sysctl.conf:

kernel.panic=5
kernel.panic_on_oops=0
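
These settings can be applied without a reboot using the standard sysctl utility, either by setting the values directly or by reloading the file after editing it:

sysctl -w kernel.panic=5
sysctl -w kernel.panic_on_oops=0
sysctl -p /etc/sysctl.conf

With panic_on_oops disabled, the kernel kills the offending process instead of panicking and rebooting, which is what let us keep the dmesg output below.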

At 6:25am, coincidentally during log rotation, the GFS2 partition unmounted, but the machine didn’t reboot. Checking our terminals, we still had access to dmesg, and we had some logs:

GFS2: fsid=gamma:gfs1.0: fatal: invalid metadata block
GFS2: fsid=gamma:gfs1.0:   bh = 1211322 (magic number)
GFS2: fsid=gamma:gfs1.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 401
GFS2: fsid=gamma:gfs1.0: about to withdraw this file system
GFS2: fsid=gamma:gfs1.0: telling LM to unmount
GFS2: fsid=gamma:gfs1.0: withdrawn
Pid: 18047, comm: gzip Not tainted 3.0.0 #1
Call Trace:
 [] ? gfs2_lm_withdraw+0xd9/0x10a
 [] ? gfs2_meta_check_ii+0x3c/0x48
 [] ? gfs2_meta_indirect_buffer+0xf0/0x14a
 [] ? gfs2_block_map+0x1a3/0x9fe
 [] ? drive_stat_acct+0xf3/0x12e
 [] ? do_mpage_readpage+0x160/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
------------[ cut here ]------------
WARNING: at fs/buffer.c:1188 gfs2_block_map+0x2be/0x9fe()
Hardware name: PDSMi
VFS: brelse: Trying to free free buffer
Modules linked in:
Pid: 18047, comm: gzip Not tainted 3.0.0 #1
Call Trace:
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? warn_slowpath_common+0x78/0x8c
 [] ? warn_slowpath_fmt+0x45/0x4a
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? drive_stat_acct+0xf3/0x12e
 [] ? do_mpage_readpage+0x160/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
---[ end trace 54fad1a4877f173c ]---
BUG: unable to handle kernel paging request at ffffffff813b8f0f
IP: [] __brelse+0x7/0x26
PGD 1625067 PUD 1629063 PMD 12001e1 
Oops: 0003 [#1] SMP 
CPU 0 
Modules linked in:

Pid: 18047, comm: gzip Tainted: G        W   3.0.0 #1 Supermicro PDSMi/PDSMi+
RIP: 0010:[]  [] __brelse+0x7/0x26
RSP: 0018:ffff880185d85800  EFLAGS: 00010286
RAX: 00000000e8df8948 RBX: ffff8801f3fb6c18 RCX: ffff880185d857d0
RDX: 0000000000000010 RSI: 000000000002ccee RDI: ffffffff813b8eaf
RBP: 0000000000000000 R08: ffff880185d85890 R09: ffff8801f3fb6c18
R10: 00000000000029e0 R11: 0000000000000078 R12: ffff880147986000
R13: ffff880147986140 R14: 00000000000029c1 R15: 0000000000001000
FS:  00007fb7d320f700(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff813b8f0f CR3: 0000000212a73000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process gzip (pid: 18047, threadinfo ffff880185d84000, task ffff880147dc4ed0)
Stack:
 ffffffff811537f6 0000000000000000 00000000000022a1 0000000000000001
 000000000000ffff 0000000000000000 ffffffff8102a7ce 0000000000000001
 0000000100000008 00000000fffffffb 0000000000000001 ffff8801f3fb6c18
Call Trace:
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? warn_slowpath_common+0x7d/0x8c
 [] ? printk+0x43/0x48
 [] ? alloc_page_buffers+0x62/0xba
 [] ? block_read_full_page+0x141/0x260
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? do_mpage_readpage+0x49b/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
Code: 31 00 45 31 f6 fe 85 88 00 00 00 48 89 df e8 a2 1a fc ff eb 03 45 31 f6 5b 4c 89 f0 5d 41 5c 41 5d 41 5e c3 8b 47 60 85 c0 74 05  ff 4f 60 c3 48 c7 c2 96 7b 4d 81 be a4 04 00 00 31 c0 48 c7 
RIP  [] __brelse+0x7/0x26
 RSP 
CR2: ffffffff813b8f0f
---[ end trace 54fad1a4877f173d ]---

As I suspected, log rotation appeared to trigger the problem and handed us the backtrace above. Running fsck.gfs2 resulted in:


# fsck.gfs2 -y /dev/drbd1
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Error: resource group 7339665 (0x6ffe91): free space (64473) does not match bitmap (64658)
The rgrp was fixed.
Error: resource group 7405179 (0x70fe7b): free space (64249) does not match bitmap (64299)
(50 blocks were reclaimed)
The rgrp was fixed.
Error: resource group 7470693 (0x71fe65): free space (65456) does not match bitmap (65464)
(8 blocks were reclaimed)
The rgrp was fixed.

...snip...

Ondisk and fsck bitmaps differ at block 133061348 (0x7ee5ae4) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 133061349 (0x7ee5ae5) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
RG #133061031 (0x7ee59a7) free count inconsistent: is 65232 should be 65508
Inode count inconsistent: is 37 should be 0
Resource group counts updated
Inode count inconsistent: is 1267 should be 1266
Resource group counts updated
Pass5 complete      
The statfs file is wrong:

Current statfs values:
blocks:  188730628 (0xb3fcd04)
free:    176443034 (0xa844e9a)
dinodes: 644117 (0x9d415)

Calculated statfs values:
blocks:  188730628 (0xb3fcd04)
free:    177426468 (0xa935024)
dinodes: 493059 (0x78603)
The statfs file was fixed.
Writing changes to disk
gfs2_fsck complete 

The filesystem was remounted after a 7-minute fsck and we’ll see if it happens again tomorrow.

WordPress, Varnish and ESI Plugin

Sunday, June 5th, 2011

This post is a version of the slide presentation I gave at Hack and Tell at The Whitetable Foundation in Fort Lauderdale, Florida, on Saturday, June 4, 2011.

Briefly, I created a plugin that enables fragment caching with WordPress and Varnish. The problem we ran into with normal page-caching methods was that this particular client had people visiting many pages per visit, which required the sidebar to be regenerated on uncached (cold) pages. By caching the sidebar and the page separately and assembling the page using Edge Side Includes (ESI), we can cache the sidebar – which contains the most database-intensive queries – independently of the page. A visitor moving from one page to a cold page only needs to wait for the page to generate; the sidebar is pulled from the cache.

What problem are we solving?

We had a high-traffic, very interactive site where surfers visited multiple pages. Surfers left a lot of comments, which meant we were constantly purging the page cache. This resulted in the sidebar having to be regenerated numerous times – even when it wasn’t truly necessary.

What are our goals?

First, we want the Time to First Byte to be as quick as possible – surfers hate to wait, and if you have a site that takes 12 seconds before they see any visible indication that something is happening, most will leave.

We needed to keep the site interactive, which meant purging pages from cache when posts were made.

We had to have fast pageloads – accomplished by caching the static version of the page and doing as few calculations as possible to deliver the content.

We needed fast static content loading. Apache does very well, but it isn’t the fastest webserver out there.

How does the WordPress front page work?

The image above is a simple representation of a page that has a header, an article section where three articles are shown, and a sidebar. Each of those elements is built from a number of SQL queries, assembled and displayed to the surfer. Each plugin that is used – especially filter plugins that look at content and modify it before output – adds a little latency, resulting in a slower page display.

How does an Article page work?

An article page works very similarly to the front page, except our content block now only contains the contents of one post. Sometimes additional plugins are called to display the post content, dealing with comments, social media sharing icons, greetings based on where you’re visiting from (Google, Digg, Reddit, Facebook, etc.) and many more. We also see the same sidebar on our site, which contains the site navigation, advertisements and other content.

What Options do we Have?

There are a number of existing caching plugins that I have benchmarked in the past. Notably we have:

* WP-Varnish
* W3 Total Cache
* WP Super Cache
* WordPress-Varnish-ESI
* and many others

Page Caching

With page caching, you take the entire generated page and cache it either in RAM or on disk. Since the page doesn’t need to be regenerated from the database, the static version of the page is served much more quickly.

Fragment Caching

With fragment caching, we’re able to cache the page and, separately, a smaller piece that is often repeated but perhaps doesn’t change as often as the page. When a websurfer comments on a post, the page needs to be regenerated, but the sidebar doesn’t.
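
To make the idea concrete, the cached page body contains an ESI tag where the sidebar would normally be rendered, roughly like this (the src URL is illustrative, not the plugin’s actual path):

<esi:include src="/esi/sidebar/" />

Varnish replaces that tag with the separately cached sidebar fragment when it assembles the response.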

WordPress and Varnish

Varnish doesn’t deal well with cookies, and WordPress uses a lot of cookies to maintain information about the current web surfer. Some plugins also add their own cookies to track things so that their plugin works.

Varnish can do domain name normalization, which may or may not be desired. Many sites redirect the bare domain to www.domain.com; if you do this, you can modify your Varnish Configuration Language (VCL) to make sure it always hands back the proper host header.
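
A minimal sketch of that normalization in vcl_recv (example.com stands in for your own domain):

sub vcl_recv {
  if (req.http.host == "example.com") {
    set req.http.host = "www.example.com";
  }
}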

There are other issues with Varnish that affect how well it caches. There are a number of situations where Varnish doesn’t work as you would expect, but, this can all be addressed with VCL.

Purging – caching is easy, purging is hard once you graduate beyond a single server setup.

WordPress and Varnish with ESI

In this case, our plugin caches the page and the sidebar separately, and allows Varnish to assemble the page prior to sending it to the surfer. This is going to be a little slower than plain page caching, but, in the long run, if you have a lot of page-to-page traffic, having that sidebar cached will make a significant impact.

Possible Solutions

You could hardcode templates and write modules to cache CPU- or database-heavy widgets, and in some cases that is a good solution.

You could create a widget that handles the work to cache existing widgets. There is a plugin called Widget Cache, but, I didn’t find it to have much benefit when testing.

Many of the plugins could be rewritten to use client-side JavaScript. That way the JavaScript itself can be served from cache and the actual computational work is done in the client’s web browser.

Technical Problems

When the plugin was originally written, Varnish didn’t support compressing ESI assembled pages which resulted in a very difficult to manage infrastructure.

WordPress uses a lot of cookies which need to be dealt with very carefully in Varnish’s configuration.

What sort of Improvement?

Before the ESI Widget:

* 12 seconds time to first byte
* .62 requests per second
* Huge number of elements

After the ESI Widget:

* .087 seconds time to first byte
* 567 requests per second
* Moved some elements to a ‘CDN’ URL

WordPress Plugin

In the above picture, we can see the ESI widget has been added to the sidebar, and we’ve added our desired widgets to the new ESI Widget Sidebar.

Varnish VCL – vcl_recv

sub vcl_recv {
    if (req.request == "BAN") {
       ban("req.http.host == " + req.http.host +
              "&& req.url == " + req.url);
       error 200 "Ban added";
    }
    if (req.url ~ "\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$") {
      unset req.http.cookie;
      set req.url = regsub(req.url, "\?.*$", "");
    }
    if (!(req.url ~ "wp-(login|admin)")) {
      unset req.http.cookie;
    }
}

In vcl_recv, we set up rules to allow the plugin to purge content, we do a little manipulation to cache static assets and ignore some of the cache-breaking arguments specified after the ?, and we aggressively remove cookies.

Varnish VCL – vcl_fetch

sub vcl_fetch {
  if ( (!(req.url ~ "wp-(login|admin)")) || (req.request == "GET") ) {
                unset beresp.http.set-cookie;
  }
  set beresp.ttl = 12h;

  if (req.url ~ "\.(gif|jpg|jpeg|swf|css|js|flv|mp3|mp4|pdf|ico|png)(\?.*|)$") {
    set beresp.ttl = 365d;
  } else {
    set beresp.do_esi = true;
  }
}

Here, we remove cookies set by the backend. We set our TTL to 12 hours, overriding any expire time. Since the widget purges cached content, we can set this to a longer expiration time – eliminating additional CPU and database work. For static assets, we set a one-year expiration time, and, if it isn’t a static asset, we parse it for ESI. The ESI parsing rule needs to be refined considerably, as it currently parses objects that wouldn’t contain ESI.
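
One possible refinement (a sketch, not part of the plugin’s shipped VCL) is to only enable ESI parsing for HTML responses:

if (beresp.http.content-type ~ "text/html") {
  set beresp.do_esi = true;
}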

Did Things Break?

Purging broke things and revealed a bug in PHP’s socket handling.

Posting Comments initially broke as a result of cookie handling that was a little too aggressive.

Certain plugins that rely on being run on each pageload break, such as WP Greet Box and many of the post count and statistics plugins.

Apache’s logs are rendered virtually useless since most of the requests are handled by Varnish and never hit the backend. You can log from varnishncsa, but Google Analytics or some other web-bug statistics program is a little easier to use.

End Result

Varnish 3.0, currently in beta, allows compression of ESI-assembled pages and can now accept compressed content from the backend – allowing the Varnish server to live at a remote location and possibly opening up avenues for companies to provide Varnish hosting in front of your WordPress site using this plugin.

Varnish ESI-powered sites became much easier to deploy with 3.0. Before 3.0, you needed Varnish to do the ESI assembly and then had to pass the result through some other server, like Nginx, to compress the page before sending it to the surfer, or you would be stuck handing uncompressed pages to your surfers.

Other Improvements

* Minification/Combining Javascript and CSS
* Proper ordering of included static assets – i.e. include .css files before .js, use async JavaScript includes.
* Spriting images – combining smaller images into one and using CSS background positioning to show the right portion, resulting in one image being downloaded rather than a dozen tiny social media buttons.
* Inline CSS for images – if your images are small enough, they can be embedded directly in your CSS as data URIs – saving an additional fetch for the web browser.
* Multiple sidebars – currently, the ESI widget only handles one sidebar.

How can I get the code?

http://code.google.com/p/wordpress-varnish-esi/

Apache mod_rewrite Performance issue discussion and fix

Monday, May 16th, 2011

This weekend I was with a client that was having some issues unrelated to this post, but it raised an interesting point. Apache’s handlers have a load order dependent on the modules installed, and there are certain modules that slow down Apache enough that you want to avoid them on production servers – mod_status being one of those.

The story behind this one is probably something you’ve run into: a web app is written for one machine, the client base grows, and it is time to expand. Moving from one server to two is infinitely harder than moving from two to three. However, you have a legacy that you need to support – clients that won’t change the hyperlink pointing to some API you’ve designed – so you use mod_rewrite to fix the problem.

A simple mod_rewrite rule can redirect traffic from the old location to the new location so that you don’t need to worry about clients that aren’t going to change their HTML. Let’s also pretend this app was written before RESTful APIs were common, so we also need to pass the query string.

RewriteEngine on
RewriteRule ^specialapi.php$ http://newserver.superapp.com/specialapi.php [R=301,L,QSA]

So, after some testing, we’re satisfied that things work as expected and we’re happy that we could split things effectively.

What happens in a request for that URL

Our original API does some processing of the request based on some arguments passed in, and redirects the person elsewhere. When we make a normal request for this object, using strace, we get the following output:

accept(4, {sa_family=AF_INET6, sin6_port=htons(49632), inet_pton(AF_INET6, "2001:470:5:590::cd34", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 10
fcntl64(10, F_GETFD)                    = 0
fcntl64(10, F_SETFD, FD_CLOEXEC)        = 0
getsockname(10, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "2604:3500::c:21", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
fcntl64(10, F_GETFL)                    = 0x2 (flags O_RDWR)
fcntl64(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
gettimeofday({1305564609, 151401}, NULL) = 0
gettimeofday({1305564609, 151686}, NULL) = 0
read(10, "GET /test/api.php HTTP/1.1\r\nHost"..., 8000) = 435
gettimeofday({1305564609, 153050}, NULL) = 0
gettimeofday({1305564609, 153303}, NULL) = 0
gettimeofday({1305564609, 153521}, NULL) = 0
gettimeofday({1305564609, 153741}, NULL) = 0
gettimeofday({1305564609, 153933}, NULL) = 0
gettimeofday({1305564609, 154152}, NULL) = 0
gettimeofday({1305564609, 154317}, NULL) = 0
gettimeofday({1305564609, 154533}, NULL) = 0
gettimeofday({1305564609, 154722}, NULL) = 0
gettimeofday({1305564609, 154914}, NULL) = 0
gettimeofday({1305564609, 155103}, NULL) = 0
gettimeofday({1305564609, 155295}, NULL) = 0
gettimeofday({1305564609, 155483}, NULL) = 0
gettimeofday({1305564609, 156089}, NULL) = 0
gettimeofday({1305564609, 156279}, NULL) = 0
gettimeofday({1305564609, 156496}, NULL) = 0
gettimeofday({1305564609, 156685}, NULL) = 0
gettimeofday({1305564609, 156877}, NULL) = 0
gettimeofday({1305564609, 157065}, NULL) = 0
stat64("/var/www/uc/test/api.php", {st_mode=S_IFREG|0644, st_size=22, ...}) = 0
open("/var/www/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/var/www/uc/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 11
fcntl64(11, F_GETFD)                    = 0x1 (flags FD_CLOEXEC)
fcntl64(11, F_SETFD, FD_CLOEXEC)        = 0
fstat64(11, {st_mode=S_IFREG|0644, st_size=30, ...}) = 0
read(11, "ErrorDocument 404 /index.html\n", 4096) = 30
read(11, "", 4096)                      = 0
close(11)                               = 0
open("/var/www/uc/test/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/var/www/uc/test/api.php/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOTDIR (Not a directory)
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={60, 0}}, NULL) = 0
rt_sigaction(SIGPROF, {0xb70c1a60, [PROF], SA_RESTART}, {0xb70c1a60, [PROF], SA_RESTART}, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [PROF], NULL, 8) = 0
umask(077)                              = 022
umask(022)                              = 077
getcwd("/", 4095)                       = 2
chdir("/var/www/uc/test")               = 0
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={30, 0}}, NULL) = 0
time(NULL)                              = 1305564609
open("/var/www/uc/test/api.php", O_RDONLY|O_LARGEFILE) = 11
fstat64(11, {st_mode=S_IFREG|0644, st_size=22, ...}) = 0
fstat64(11, {st_mode=S_IFREG|0644, st_size=22, ...}) = 0
fstat64(11, {st_mode=S_IFREG|0644, st_size=22, ...}) = 0
mmap2(NULL, 22, PROT_READ, MAP_SHARED, 11, 0) = 0xb6de7000
munmap(0xb6de7000, 22)                  = 0
close(11)                               = 0
chdir("/")                              = 0
umask(022)                              = 022
open("/dev/urandom", O_RDONLY)          = 11
read(11, "\247q\340\"", 4)              = 4
close(11)                               = 0
open("/dev/urandom", O_RDONLY)          = 11
read(11, "\216\241*W", 4)               = 4
close(11)                               = 0
open("/dev/urandom", O_RDONLY)          = 11
read(11, "\270\267\22+", 4)             = 4
close(11)                               = 0
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
writev(10, [{"HTTP/1.1 200 OK\r\nDate: Mon, 16 M"..., 237}, {"\37\213\10\0\0\0\0\0\0\3", 10}, {"+I-.\1\0", 6}, {"\f~\177\330\4\0\0\0", 8}], 4) = 261
gettimeofday({1305564609, 174811}, NULL) = 0
gettimeofday({1305564609, 175003}, NULL) = 0
read(10, 0xb93489e0, 8000)              = -1 EAGAIN (Resource temporarily unavailable)
write(7, "2001:470:5:590::cd34 - - [16/May"..., 214) = 214
write(8, "vhost_combined\n", 15)        = 15

Briefly, the request comes in for the asset http://testserver.com/test/api.php, as you can see from the read() call near the top of the trace.

Apache checks to see if the file exists:

stat64("/var/www/uc/test/api.php", {st_mode=S_IFREG|0644, st_size=22, ...}) = 0

And does something odd:

open("/var/www/uc/test/api.php/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOTDIR (Not a directory)

Even though the file exists and isn’t a directory, Apache is checking to see whether there is a .htaccess file inside a directory named api.php. This is where part of the issue comes to light.

Eventually, Apache changes to the directory and serves the content:

chdir("/var/www/uc/test")               = 0
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={30, 0}}, NULL) = 0
time(NULL)                              = 1305564609
open("/var/www/uc/test/api.php", O_RDONLY|O_LARGEFILE) = 11
fstat64(11, {st_mode=S_IFREG|0644, st_size=22, ...}) = 0
fstat64(11, {st_mode=S_IFREG|0644, st_size=22, ...}) = 0
fstat64(11, {st_mode=S_IFREG|0644, st_size=22, ...}) = 0

So, a normal request works, and we’re able to see what Apache is doing. Now, let’s put our modified rule in to redirect people to the new location:

accept(4, {sa_family=AF_INET6, sin6_port=htons(50286), inet_pton(AF_INET6, "2001:470:5:590::cd34", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 10
fcntl64(10, F_GETFD)                    = 0
fcntl64(10, F_SETFD, FD_CLOEXEC)        = 0
getsockname(10, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "2604:3500::c:21", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
fcntl64(10, F_GETFL)                    = 0x2 (flags O_RDWR)
fcntl64(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
gettimeofday({1305565527, 718766}, NULL) = 0
gettimeofday({1305565527, 718990}, NULL) = 0
read(10, "GET /test/api.php HTTP/1.1\r\nHost"..., 8000) = 435
gettimeofday({1305565527, 719683}, NULL) = 0
gettimeofday({1305565527, 719909}, NULL) = 0
gettimeofday({1305565527, 720127}, NULL) = 0
gettimeofday({1305565527, 720347}, NULL) = 0
gettimeofday({1305565527, 720539}, NULL) = 0
gettimeofday({1305565527, 720732}, NULL) = 0
gettimeofday({1305565527, 720921}, NULL) = 0
gettimeofday({1305565527, 721936}, NULL) = 0
gettimeofday({1305565527, 722127}, NULL) = 0
gettimeofday({1305565527, 722343}, NULL) = 0
gettimeofday({1305565527, 722533}, NULL) = 0
gettimeofday({1305565527, 722724}, NULL) = 0
gettimeofday({1305565527, 722913}, NULL) = 0
gettimeofday({1305565527, 723106}, NULL) = 0
gettimeofday({1305565527, 723295}, NULL) = 0
gettimeofday({1305565527, 723487}, NULL) = 0
gettimeofday({1305565527, 723676}, NULL) = 0
gettimeofday({1305565527, 723869}, NULL) = 0
gettimeofday({1305565527, 724058}, NULL) = 0
stat64("/var/www/uc/test/api.php", {st_mode=S_IFREG|0644, st_size=22, ...}) = 0
open("/var/www/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/var/www/uc/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 11
fcntl64(11, F_GETFD)                    = 0x1 (flags FD_CLOEXEC)
fcntl64(11, F_SETFD, FD_CLOEXEC)        = 0
fstat64(11, {st_mode=S_IFREG|0644, st_size=30, ...}) = 0
read(11, "ErrorDocument 404 /index.html\n", 4096) = 30
read(11, "", 4096)                      = 0
close(11)                               = 0
open("/var/www/uc/test/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 11
fcntl64(11, F_GETFD)                    = 0x1 (flags FD_CLOEXEC)
fcntl64(11, F_SETFD, FD_CLOEXEC)        = 0
fstat64(11, {st_mode=S_IFREG|0644, st_size=72, ...}) = 0
read(11, "RewriteEngine on\nRewriteRule ^ap"..., 4096) = 72
read(11, "", 4096)                      = 0
close(11)                               = 0
open("/var/www/uc/test/api.php/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOTDIR (Not a directory)
writev(10, [{"HTTP/1.1 301 Moved Permanently\r\n"..., 303}, {"\37\213\10\0\0\0\0\0\0\3", 10}, {"mP\301N\3030\f\275\367+LOpX\334\26\t!\224E\32k\21\2236\250D9p\364\32\263"..., 236}, {"\314\226,\242>\1\0\0", 8}], 4) = 557
gettimeofday({1305565527, 734362}, NULL) = 0
gettimeofday({1305565527, 734577}, NULL) = 0
read(10, 0xb93489e0, 8000)              = -1 EAGAIN (Resource temporarily unavailable)
write(7, "2001:470:5:590::cd34 - - [16/May"..., 215) = 215
write(8, "vhost_combined\n", 15)        = 15

In this case, we see something that shouldn’t really happen. Even though our mod_rewrite rule has rewritten the URL, Apache is still checking to see if api.php and api.php/.htaccess exist:

stat64("/var/www/uc/test/api.php", {st_mode=S_IFREG|0644, st_size=22, ...}) = 0

open("/var/www/uc/test/api.php/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOTDIR (Not a directory)

So, even with the mod_rewrite rule passing the request over to another machine, Apache is still testing for the existence of the file and of a directory named api.php containing a .htaccess file. The latter check is the one we’re going to fix. With api.php moved out of the way on the old server, the same request produces:

accept(4, {sa_family=AF_INET6, sin6_port=htons(50516), inet_pton(AF_INET6, "2001:470:5:590::cd34", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 10
fcntl64(10, F_GETFD)                    = 0
fcntl64(10, F_SETFD, FD_CLOEXEC)        = 0
getsockname(10, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "2604:3500::c:21", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
fcntl64(10, F_GETFL)                    = 0x2 (flags O_RDWR)
fcntl64(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
gettimeofday({1305565791, 419574}, NULL) = 0
gettimeofday({1305565791, 419798}, NULL) = 0
read(10, "GET /test/api.php HTTP/1.1\r\nHost"..., 8000) = 409
gettimeofday({1305565791, 420459}, NULL) = 0
gettimeofday({1305565791, 420687}, NULL) = 0
gettimeofday({1305565791, 420905}, NULL) = 0
gettimeofday({1305565791, 421126}, NULL) = 0
gettimeofday({1305565791, 421319}, NULL) = 0
gettimeofday({1305565791, 421603}, NULL) = 0
gettimeofday({1305565791, 421891}, NULL) = 0
gettimeofday({1305565791, 422112}, NULL) = 0
gettimeofday({1305565791, 422360}, NULL) = 0
gettimeofday({1305565791, 422585}, NULL) = 0
gettimeofday({1305565791, 422809}, NULL) = 0
gettimeofday({1305565791, 423063}, NULL) = 0
gettimeofday({1305565791, 423313}, NULL) = 0
gettimeofday({1305565791, 423567}, NULL) = 0
gettimeofday({1305565791, 423818}, NULL) = 0
gettimeofday({1305565791, 424071}, NULL) = 0
gettimeofday({1305565791, 424297}, NULL) = 0
stat64("/var/www/uc/test/api.php", 0xbf8e9bfc) = -1 ENOENT (No such file or directory)
lstat64("/var", {st_mode=S_IFDIR|S_ISGID|0755, st_size=148, ...}) = 0
lstat64("/var/www", {st_mode=S_IFDIR|S_ISGID|0711, st_size=78, ...}) = 0
open("/var/www/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = -1 ENOENT (No such file or directory)
lstat64("/var/www/uc", {st_mode=S_IFDIR|S_ISGID|0755, st_size=4096, ...}) = 0
open("/var/www/uc/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 11
fcntl64(11, F_GETFD)                    = 0x1 (flags FD_CLOEXEC)
fcntl64(11, F_SETFD, FD_CLOEXEC)        = 0
fstat64(11, {st_mode=S_IFREG|0644, st_size=30, ...}) = 0
read(11, "ErrorDocument 404 /index.html\n", 4096) = 30
read(11, "", 4096)                      = 0
close(11)                               = 0
lstat64("/var/www/uc/test", {st_mode=S_IFDIR|S_ISGID|0755, st_size=48, ...}) = 0
open("/var/www/uc/test/.htaccess", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 11
fcntl64(11, F_GETFD)                    = 0x1 (flags FD_CLOEXEC)
fcntl64(11, F_SETFD, FD_CLOEXEC)        = 0
fstat64(11, {st_mode=S_IFREG|0644, st_size=72, ...}) = 0
read(11, "RewriteEngine on\nRewriteRule ^ap"..., 4096) = 72
read(11, "", 4096)                      = 0
close(11)                               = 0
lstat64("/var/www/uc/test/api.php", 0xbf8e9bfc) = -1 ENOENT (No such file or directory)
writev(10, [{"HTTP/1.1 301 Moved Permanently\r\n"..., 303}, {"\37\213\10\0\0\0\0\0\0\3", 10}, {"mP\301N\3030\f\275\367+LOpX\334\26\t!\224E\32k\21\2236\250D9p\364\32\263"..., 236}, {"\314\226,\242>\1\0\0", 8}], 4) = 557
gettimeofday({1305565791, 435764}, NULL) = 0
gettimeofday({1305565791, 435986}, NULL) = 0
read(10, 0xb934a9e8, 8000)              = -1 EAGAIN (Resource temporarily unavailable)
write(7, "2001:470:5:590::cd34 - - [16/May"..., 215) = 215
write(8, "vhost_combined\n", 15)        = 15

So, in this case we’re left with:

stat64("/var/www/uc/test/api.php", 0xbf8e9bfc) = -1 ENOENT (No such file or directory)
and
lstat64("/var/www/uc/test/api.php", 0xbf8e9bfc) = -1 ENOENT (No such file or directory)

And we’re no longer trying to open /var/www/uc/test/api.php/.htaccess, so we’ve made the process a little smoother.

Briefly, when you use mod_rewrite to redirect traffic away from an existing file, move the file out of the way to save the extra lookups.

Additionally, you can move your mod_rewrite rules into your config file and set AllowOverride None, which will prevent Apache from looking for .htaccess files in each of your directories. If you have a lot of static content being accessed, this will help considerably.
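
A rough sketch of what that could look like in the virtual host configuration (the paths and hostnames follow the examples above and are purely illustrative):

<VirtualHost *:80>
    DocumentRoot /var/www/uc
    <Directory /var/www/uc>
        AllowOverride None
    </Directory>
    RewriteEngine on
    RewriteRule ^/test/api\.php$ http://newserver.superapp.com/api.php [R=301,L,QSA]
</VirtualHost>

With AllowOverride None, Apache skips the .htaccess lookups entirely, and the rewrite is evaluated from the server configuration instead.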

Reverse Engineering Youtube Statistics Generation Algorithm

Saturday, November 27th, 2010

While surfing Youtube a while back, I noticed that you could view the statistics for a given video. While most of the videos I view are quite boring and have low view counts, I thought that might be the trigger – only popular videos have stats. However, while surfing Youtube today to see how they handled some statistics, I saw some patterns emerge that tossed that theory out the window. Videos with even a few hundred views had statistics.

Since we can assume that Google has kept track of every view and statistic possible since Youtube was merged with their platform, even old videos have data going back to late 2007, as evidenced by many different videos. Some videos mention 30 Nov 2007 as the earliest data collection date.

So, we face a quandary. We have videos from 2005 through today, stats from late 2007 through today, and stats displayed on the video page that have been rolled out since mid-2010. Old videos that don’t currently display stats are obviously still gathering stats, but must have a flag saying that the old data hasn’t been imported, as the page will only mention Honors for this Video. How do you approach the problem?

We know that the data is collected and applied in batches, and it appears that every video has statistics from a particular date forward. Recent videos all have full statistics, even with a few hundred views, no comments and no favorites. The catalyst doesn’t appear to be when someone has an interaction with a video; merely viewing a video must signal the system to backfill statistics. There is probably some weight given to popular videos, though those videos would have a lot more history. One must balance the time required to import a very popular video versus importing the history of hundreds of less popular videos. One of the benefits of BigTable – if architected properly – would be to process each video’s history in one shot, set the stats-processed flag and move on to the next video. One might surmise that Google knew to collect the view data, but may not have thought about how the data would be used.

How do you choose videos to be processed? When you process the live views, you might decide to put a video into a queue for backfill processing. But, on a very lightly viewed video, this might delay backfilling another video whose statistics might be more interesting or provocative. We can assume there is a fixed point in time after which a video doesn’t require backfilling, which makes our backfill decision a little easier.

As the logs are being processed, we might keep a list of the video_id, creation date and number of daily views. That data would be inserted into a backfill queue for our backfill process. In the backfill process, we would look at the creation date, number of daily views and number of mentions in the backfill queue. To figure out a priority list of the items to process, we might look at the velocity of hits from one day to the next – triggering a job queue entry on a video that is suddenly getting popular. We might also influence decisions based on the views and the creation date delta off the fixed point in time where stat displays started. This would allow us to take a lightly viewed video that was created just before our fixed point and prioritize that in the backfill queue. Now we’ve got a dual priority system that would allow us to tackle two problems at the same time, and intersect in the middle. Each day, new entries are inserted into the queue, altering priority of existing and current entries which would allow the stats to be backfilled in a manner that would appear to be very proactive.
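
A rough sketch of the kind of priority scoring described above; the field names, weights and cutoff date are assumptions for illustration only, not anything Google has published:

import heapq
from datetime import date

STATS_CUTOFF = date(2010, 6, 1)    # assumed fixed point when stat display was rolled out

def priority(video):
    # videos suddenly getting popular float toward the front of the queue
    velocity = video['views_today'] - video['views_yesterday']
    # videos created just before the cutoff get a boost, so the two priorities meet in the middle
    age_bonus = max(0, 90 - abs((STATS_CUTOFF - video['created']).days))
    return -(velocity + age_bonus)    # heapq is a min-heap, so negate for highest-first

def build_queue(videos):
    queue = []
    for v in videos:
        heapq.heappush(queue, (priority(v), v['video_id']))
    return queue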

At some point, videos that haven’t been viewed that were created prior to the fixed point in time could be added to the cleanup queue. Since they weren’t viewed, generating the statistics for them isn’t as important. And, if a video has been viewed, it was already in the queue. Since the queue could dispatch the jobs to as many machines as Google wanted, stats could be added to Youtube videos based on the load of their distributed computer.

What do you think?

How would you backfill log data from an estimated 900 million videos serving 2 billion video views a week?

Using Redis (or Memcached) as a buffer for SQL resulting in near-realtime stats

Saturday, October 23rd, 2010

While I’ve used memcached for a number of things where MySQL’s query cache just isn’t quick enough, the nature of a key-value store without set operations such as unions didn’t work for this particular project. While it would have been easy enough to run memcached alongside Redis, running two software stacks to solve the same problem wasn’t appealing.

What I’ve come up with will work for either memcached or Redis and the theory is simple:

Create a unique key for your counter, increment the key, and store the key in a set. Have a separate process iterate through the set, write the summarized data to your database, and reinsert the key into the set if it is for the current hour.

Using r as our Redis object, the pseudocode looks like:

import time
import redis

# r is our Redis object; decode_responses=True gives us strings rather than bytes
r = redis.Redis(decode_responses=True)

dayhour_key = time.strftime('%Y%m%d%H', time.localtime())

# copy the 'log' set into a working set so keys added while we run are left alone
r.sinterstore('processlog', ['log'])
numitems = r.scard('processlog')    # return number of items in our working set

for _ in range(numitems):
    logkey = r.spop('processlog')    # grab an item from the working set and delete it from the set
    (prefix, item_id, dayhour) = logkey.split(':')    # key format is stat:id:datehour
    count = int(r.get(logkey) or 0)    # get the count from our key
    if count == 0:
        # if the count is 0, delete the key (leftover from a same-hour decrement)
        r.delete(logkey)
    else:
        if dayhour < dayhour_key:
            # past hour: do our atomic update/insert here, incrementing the table by count,
            # then drop the key since it can no longer change
            r.srem('log', logkey)
            r.delete(logkey)
        else:
            # current hour: decrement by count in case another process incremented
            # the value while we were working, and keep the key queued
            r.decrby(logkey, count)    # decrement the key by count
            r.sadd('log', logkey)      # add the key back to our set for processing

The concept is to use a key that is as granular as the data you want to keep. In this case we append a datehour stamp of yyyymmddHH (year, month, day, hour) to our unique id and end up with a key of stat:id:datehour. We use stat: to signify that the entry is for statistics. For Zone 1 we end up with a key of stat:1:2010102314 (assuming 2pm), which is incremented and added to our 'log' set. When our log daemon handles a key from the current hour, we decrement the key by the count and re-add it to our set. If the log daemon handles a key from a past hour, we know that it cannot receive any further updates, so we are free to delete the key. Since we have used pop to remove the item from the working set, any key prior to the current hour is automatically dropped, but we need to re-add any key from the current hour.
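
For completeness, the increment side might look something like this (a sketch; bump_stat and zone_id are illustrative names, not part of the original code):

def bump_stat(r, zone_id):
    # build the stat:<id>:<yyyymmddHH> key described above, bump the counter,
    # and record the key in the 'log' set so the daemon can find it later
    dayhour = time.strftime('%Y%m%d%H', time.localtime())
    logkey = 'stat:%s:%s' % (zone_id, dayhour)
    r.incr(logkey)
    r.sadd('log', logkey)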

We decrement the key by the count in the current hour just in case something else updated that key while we were working. If, in the next hour, the count is 0, our routine skips it but still deletes the key.

In preliminary testing, we've saved roughly 280 transactions per second, and stats are rarely more than a minute or two behind realtime. It also allowed us to move from daily to hourly statistics. The same theory could be applied to per-minute statistics as well.
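
Per-minute granularity would simply mean a finer timestamp in the key, for example:

dayminute_key = time.strftime('%Y%m%d%H%M', time.localtime())    # yyyymmddHHMM instead of yyyymmddHH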
