XFS Filesystem Benchmarking Thoughts from a Real-World Situation

November 15th, 2011

I suspect I’m going to learn something about XFS today that is going to create a ton of work.

I’m writing a benchmark script (in bash) that uses bonnie++ alongside two other real-world scenarios. I’ve thought a lot about how to replicate real-world conditions, but, benchmarks usually stress a worst-case scenario and rarely reproduce what production actually sees.

Benchmarks shouldn’t be published as the be-all and end-all of testing, and you have to make sure you’re measuring the piece you’re looking at, not your benchmark tool.

I’ve tried to benchmark Varnish, Tux, and Nginx multiple times, and I’ve seen numerous benchmarks claiming one is insanely faster than the others for this workload or that, but, there are always potential issues in the tests. A benchmark should be published with every bit of information possible so that others can replicate it the same way and potentially point out configuration issues.

One benchmark I read recently showed Nginx serving a static file at 35k requests/sec while Varnish flatlined at 8k. My first thought was: is Varnish serving from cache or talking to the backend? A snapshot of varnishstat supported the notion that it was indeed cached, but, was the shmlog mounted on a RAM-backed tmpfs? Was varnishstat running while the benchmark was?
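One way to answer those questions is to capture a before/after counter delta around the run. A minimal sketch (counter names vary between Varnish versions, and the URL and load generator here are just placeholders):

varnishstat -1 | egrep 'cache_hit|cache_miss|backend_conn' > before.txt
ab -n 100000 -c 50 http://localhost/static.html
varnishstat -1 | egrep 'cache_hit|cache_miss|backend_conn' > after.txt
diff before.txt after.txt   # cache_hit should climb; backend_conn should stay flat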

Benchmarks test particular workloads – workloads you may not see. What you learn from a benchmark is how this load is affected by that setting – so when you start to see symptoms, your benchmarking has taught you what knob to turn to fix things.

Based on the filesystem issues we’re running into on a few production boxes, I am convinced lazy-count=0 is a problem. While my benchmarks of it produced mixed results, logic dictates that lazy-count=1 should be enabled for almost all workloads. Another value I’m looking at is -i size=256, which is the default for XFS; I believe this should be larger, which would really help directories holding tens of thousands of files. -b 8192 might be a good compromise since many of these sites serve small files, but, the average file size is 5120 bytes – slightly over one 4096-byte block – meaning each file written spans two blocks and incurs an extra extent and metadata update. The log size should also be increased on write-heavy machines; I believe the default is too low even for normal workloads.
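To make those knobs concrete, an mkfs.xfs invocation exercising them might look like the following (illustrative values, not a recommendation – and note that Linux can’t mount an XFS filesystem whose block size exceeds the kernel page size, 4096 bytes on this i686 box, so -b size=8192 may not be usable everywhere):

mkfs -t xfs -f -b size=8192 -i size=512 -l size=128m,lazy-count=1 /dev/sda5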

With that in mind, I’ve got 600 permutations of filesystem creation options, each of which needs to be run under four sets of mount options, and each of those under three IO schedulers.
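The harness itself is straightforward; a rough sketch of the outer loops follows (the device, the option lists, and the bonnie++ user are placeholders, not the production script):

DEV=/dev/sda5; MNT=/mnt
mkfs_list=("-i size=256" "-i size=512 -l size=128m,lazy-count=1")   # ...the 600 permutations
for sched in noop deadline cfq; do
    echo $sched > /sys/block/sda/queue/scheduler
    for mnt_opts in noatime noatime,nobarrier noatime,logbufs=8 noatime,logbufs=8,logbsize=262144; do
        for mkfs_opts in "${mkfs_list[@]}"; do
            mkfs -t xfs -f $mkfs_opts "$DEV"
            mount -o "$mnt_opts" "$DEV" "$MNT" && chmod 1777 "$MNT"
            bonnie++ -d "$MNT" -u nobody -s 8g -n 512 >> "results-$sched.csv"
            umount "$MNT"
        done
    done
done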

I’ll use the same methodology to test ext4 which is going to be a lot easier due to fewer knobs, but, I believe XFS is still going to win based on some earlier testing I did.

In one quick test that took a little more than six minutes to run, tuning increased deletes from about 5.3k/sec to 13.4k/sec. I suspect this machine will be running tests for the next few days once I’ve written the test script.

Startup or Bustup?

November 6th, 2011

Let’s set the stage.

Joe has an idea but lacks the funding to bring the idea to reality. An investor brings money to the table to help Joe bring his idea to fruition. A split of 55/45 is negotiated, with 45% going to the investor. Fast forward a few years and we have a startup that has a developed product, a few thousand users but no revenue and very little cash left to keep the startup running.

There are three choices:

Fold – accept that the idea didn’t gain traction, convince the investor to take the loss, and move on. Sometimes, even though your heart may not find this an acceptable resolution, it might be the right answer. Unless you’ve got a lot of charisma and a well-defined business model, you might find it difficult to obtain funding from that investor and his acquaintances for new projects. Investors understand that ideas fail, and knowing when to cut the cord is a valuable skill.

Accept more investment capital – while the idea hasn’t gained enough traction yet, another investment will dilute the existing holdings based on the valuation of the company and the amount of cash received.

Bootstrap – hunker down, define your business model, hustle, and do it.

Let’s assume that you don’t fold because the idea has merit; it just didn’t have enough time to gain traction.

In this situation, taking another investment carries two problems. Your investor’s percentage gets diluted, as does the founder’s. However, due to the equity split, the founder faces the additional problem of losing majority control of the startup. This is where it gets interesting.

As a founder you have to ask yourself, are you an entrepreneur with other ideas that you can develop if you are ousted, or, is this idea your baby?

A new investor may not be privy to the reasons why the company didn’t gain traction or why you ran out of money, but, in the back of your mind, you might think that since you’ve lost the majority vote, you could be replaced at any time. If you’re doing a good job and you have the vision, why would your investors opt to replace you? Perhaps they will bring in someone to push you, but, they invested in you and your idea. If you have the passion, very few people they could bring in to replace you would have that gift. If they do remove you, you’ve got other ideas that you can now pursue. If you aren’t replaced and the company later becomes very successful, you might not have made as much money as you could have, but a successful exit for your investors raises your clout – making it much easier to find investment for your next idea.

Bootstrap. If you know the idea has the ability to start producing income shortly, by all means, deploy. If you’re confident in the idea, find bridge financing to get you to sustainability. Once you’ve worked out a revenue model and start collecting revenue from clients, your ability to find investors goes up dramatically – if you need investors at that point.

Only you can decide which path is the right one.

Ext4, XFS and BtrFS benchmarks and testing

November 1st, 2011

Recently I talked about versioning filesystems available for OSS systems. While most of our server farms use XFS, we have been moving to Ext4 on a number of machines. This wasn’t done as a precursor to BtrFS but because of problems we’ve been having with XFS on very large filesystems. The fact that we can migrate Ext4 to BtrFS in place is just a coincidental bonus.

While ZFS is still a consideration if we move to FreeBSD (I wasn’t impressed enough with Debian’s K*BSD project to consider it stable for production), I felt BtrFS was worth a look. There is also CephFS, but, that requires more infrastructure since you need to run a cluster, and it isn’t really made for single-machine deployments.

We’re also going to make some assumptions and do things you might not want to do on a home system. Since the data center we’re in has a 100% power SLA, we can be reasonably sure we won’t lose power and can be a little more aggressive. We disable atime, which may negatively impact clients if the disk handles your mailspool. Also, recent versions of XFS handle atime updates much differently, so the performance boost from noatime is negligible.

Command used to test:

/usr/sbin/bonnie++ -s 8g -n 512
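# -s 8g: 8GB of file I/O, four times this box's 2GB of RAM, so the page
# cache can't mask the disk; -n 512: 512*1024 small files for the
# create/read/delete phases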

Ext4

mkfs -t ext4 /dev/sda5
mount -o noatime /dev/sda5 /mnt

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   332  99 54570  12 23512   5  1615  98 62905   6 131.8   3
Latency             24224us     471ms     370ms   13739us     110ms    5257ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 24764  69 267757  98  2359   6 25084  67 388005  98  1403   3
Latency              1258ms    1402us   11767ms    1187ms      66us   11682ms

1.96,1.96,version,1,1320193244,8G,,332,99,54570,12,23512,5,1615,98,62905,6,131.8,3,512,,,,,24764,69,267757,98,2359,6,25084,67,388005,98,1403,3,24224us,471ms,370ms,13739us,110ms,5257ms,1258ms,1402us,11767ms,1187ms,66us,11682ms

Ext4 with journal conversion and mount options

mkfs -t ext4 /dev/sda5
tune2fs -o journal_data_writeback /dev/sda5
mount -o rw,noatime,data=writeback,barrier=0,nobh,commit=60 /dev/sda5 /mnt
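# data=writeback and barrier=0 trade crash-safety for speed (nobh and the
# 60s commit interval lean the same way) - acceptable here only because of
# the data center's 100% power SLA mentioned above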

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   335  99 53396  11 25240   6  1619  99 62724   6 130.9   5
Latency             23955us     380ms     231ms   15962us     143ms   16261ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 24253  65 266963  98  2341   6 24567  65 389243  98  1392   3
Latency              1232ms    1405us   11500ms    1232ms     130us   11543ms

1.96,1.96,version,1,1320192213,8G,,335,99,53396,11,25240,6,1619,99,62724,6,130.9,5,512,,,,,24253,65,266963,98,2341,6,24567,65,389243,98,1392,3,23955us,380ms,231ms,15962us,143ms,16261ms,1232ms,1405us,11500ms,1232ms,130us,11543ms

XFS:

mkfs -t xfs -f /dev/sda5
mount -o noatime /dev/sda5 /mnt

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   558  98 55174   9 26660   6  1278  96 62598   6 131.4   5
Latency             14264us     227ms     253ms   77527us   85140us     773ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512  2468  19 386301  99  4311  25  2971  22 375624  99   546   3
Latency              1986ms     346us    1341ms    1580ms      82us    5904ms
1.96,1.96,version,1,1320194740,8G,,558,98,55174,9,26660,6,1278,96,62598,6,131.4,5,512,,,,,2468,19,386301,99,4311,25,2971,22,375624,99,546,3,14264us,227ms,253ms,77527us,85140us,773ms,1986ms,346us,1341ms,1580ms,82us,5904ms

XFS, mount options:

mkfs -t xfs -f /dev/sda5
mount -o noatime,logbsize=262144,logbufs=8 /dev/sda5 /mnt
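# logbufs=8 in-memory log buffers of 256KB each (logbsize=262144 bytes),
# keeping more of the journal in RAM between commits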

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   563  98 55423   9 26710   6  1328  99 62650   6 129.5   5
Latency             14401us     345ms     298ms   20328us     119ms     357ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512  3454  26 385552 100  5966  35  4459  34 375917  99   571   3
Latency              1625ms     360us    1323ms    1243ms      67us    5060ms

1.96,1.96,version,1,1320196498,8G,,563,98,55423,9,26710,6,1328,99,62650,6,129.5,5,512,,,,,3454,26,385552,100,5966,35,4459,34,375917,99,571,3,14401us,345ms,298ms,20328us,119ms,357ms,1625ms,360us,1323ms,1243ms,67us,5060ms

XFS, file system creation options and mount options:

mkfs -t xfs -d agcount=32 -l size=64m -f /dev/sda5
mount -o noatime,logbsize=262144,logbufs=8 /dev/sda5 /mnt

Results:

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   561  97 54674   9 26502   6  1235  95 62613   6 131.4   5
Latency             14119us     346ms     247ms   94238us   76841us     697ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512  9576  73 383305 100 14398  85  9156  70 373557  99  2375  14
Latency              1110ms     375us     301ms     850ms      36us    5772ms
1.96,1.96,version,1,1320198613,8G,,561,97,54674,9,26502,6,1235,95,62613,6,131.4,5,512,,,,,9576,73,383305,100,14398,85,9156,70,373557,99,2375,14,14119us,346ms,247ms,94238us,76841us,697ms,1110ms,375us,301ms,850ms,36us,5772ms

BtrFS:

mkfs -t btrfs /dev/sda5
mount -o noatime /dev/sda5 /mnt

Also, make sure CONFIG_CRYPTO_CRC32C_INTEL is set in the kernel, or loaded as a module and use an Intel CPU.
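A quick way to confirm the accelerated driver is actually in use (the mainline module name is crc32c-intel):

modprobe crc32c-intel
grep -A1 'name.*crc32c' /proc/crypto   # the driver line should show crc32c-intel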

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
version          8G   254  99 54778   9 23070   8  1407  92 59932  13 131.2   5
Latency             31553us     264ms     826ms   94269us     180ms   17963ms
Version  1.96       ------Sequential Create------ --------Random Create--------
version             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                512 17034  83 256486 100 13485  97 15282  76 38942  73  1472  23
Latency               126ms    2162us   11295us   71992us   20713us   28647ms

1.96,1.96,version,1,1320204006,8G,,254,99,54778,9,23070,8,1407,92,59932,13,131.2,5,512,,,,,17034,83,256486,100,13485,97,15282,76,38942,73,1472,23,31553us,264ms,826ms,94269us,180ms,17963ms,126ms,2162us,11295us,71992us,20713us,28647ms

Analysis

Ext4 is considerably better than Ext3 was the last time we ran these tests. Even against XFS with the allocation group tweaks and mount options we use, Ext4 isn’t a bad alternative and shows some improvements over XFS in places. BtrFS, however, even with the Intel CRC hardware acceleration, shows a significant drop on the Random Create Read benchmark.

I believe our recent conversion to Ext4 isn’t negatively impacting things, based on the typical workload these machines see.

I’ll continue to work with BtrFS and see if I can figure out why that one benchmark performs so poorly, but, some of the other features BtrFS offers as a versioning filesystem will be quite useful.

Machine specs:

* Linux version 3.1.0 #3 SMP Tue Nov 1 16:23:42 EDT 2011 i686 GNU/Linux
* P4 3.0GHz, 2GB RAM
* Western Digital SATA2 320GB 7200 RPM drive

Versioning Filesystem choices using OSS

November 1st, 2011

One of the clusters we have uses DRBD between two machines with GFS2 mounted on DRBD in dual primary. I’ve played around with Gluster, Lustre, OCFS2, AFS, and many others, and I’ve used NetApps in the past, but, I’ve never been extremely happy with any of the distributed and clustered filesystems.

My recent thinking about using SetUID or SetGID modes to deal with particular problems led me to look at versioning filesystems. Currently that leaves ZFS and BtrFS.

I’ve used ZFS in the past on Solaris, and it is supported natively in FreeBSD. Since we use Debian, there is Debian’s K*BSD project, which puts the Debian userland on the BSD kernel – making most of our in-house management processes easy to convert. Using ZFS under Linux requires FUSE, which could introduce performance issues.

The other option we have is BtrFS. BtrFS is less mature, but, it also has the ability to handle in-place migrations from ext3/ext4. While this doesn’t help much right now since we primarily run XFS, future machines could use ext4 until BtrFS is deemed stable enough, at which point they could be converted in place.
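For reference, the in-place migration uses btrfs-convert from btrfs-progs; a minimal sketch (back up first – the tool keeps the original filesystem image around for rollback, and the device name is just an example):

umount /dev/sda5
btrfs-convert /dev/sda5        # convert the ext4 filesystem in place
mount -t btrfs /dev/sda5 /mnt
# if unhappy, roll back to the original ext4 image:
#   umount /mnt && btrfs-convert -r /dev/sda5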

In testing, XFS and Ext4 have similar performance when well tuned, which means we shouldn’t see any significant difference with either. Granted, this disagrees with some published benchmarks, but, those benchmarks didn’t appear to set the filesystem up correctly and didn’t adjust the mount parameters to allow more buffers to be used. When dealing with small files, XFS needs a little more RAM, and the journal log buffers need to be increased – keeping more of the log in RAM before it is replayed and committed. Large-file performance is usually deemed superior with XFS, but, by properly tuning Ext3 (and by extension Ext4) we can change its performance characteristics and get about 95% of XFS’s large-file performance.
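As an illustration, the tuning referred to here amounts to something like the following (the same knobs as the benchmarks above; device and mount point are examples):

# XFS: larger, more numerous in-core log buffers for small-file churn
mount -o noatime,logbufs=8,logbsize=262144 /dev/sdb1 /srv
# Ext3/4: writeback journaling and a longer commit interval
tune2fs -o journal_data_writeback /dev/sdb1
mount -o noatime,data=writeback,commit=60 /dev/sdb1 /srv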

Currently we keep two generations of weekly machine backups. While this wouldn’t change, we could also do checkpointing and more frequent snapshots so that a file that is uploaded and then modified or deleted has a much better chance of being restorable. One of the attractions of a versioning filesystem is the ability to take hourly or daily snapshots, which should reduce data loss if a site is exploited or catastrophically damaged by mistake.
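Under BtrFS, a hypothetical hourly snapshot job could be as simple as a cron entry (paths are examples, /srv must be a btrfs subvolume with a .snapshots directory, cron needs % escaped, and old snapshots would still need pruning with btrfs subvolume delete):

0 * * * * /sbin/btrfs subvolume snapshot /srv /srv/.snapshots/$(date +\%F-\%H)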

So, we’ve got three potential solutions in order of confidence that the solution will work:

* FreeBSD ZFS
* Debian/K*BSD ZFS
* Debian BtrFS

This weekend I’ll start putting the two Debian solutions through their paces to see if I feel comfortable with either. I’ve got a chassis swap to do this week and we’ll probably switch that machine from XFS to Ext4 in preparation as well. Most of the new machines we’ve been putting online now use Ext4 due to some of the issues I’ve had with XFS.

Ideally, I would like to start using BtrFS on every machine, but, if I need to move things over to FreeBSD, I’ll face some very tough decisions and migrations.

Never a dull moment.

GFS2 Kernel Oops

October 30th, 2011

For a few years I’ve run a system using DRBD replication between two machines with GFS2 running in dual primary mode to test a theory on a particular type of web hosting I’ve been developing.

For months the system will run fine, then, out of the blue, one of the nodes will drop from the cluster and reboot, and we’ve never seen anything in the logs. It’ll run another 120-180 days without incident and then reboot again with no real indication of the problem. We knew it was a kernel panic or oops, but, the logs were never flushed to disk before the machine rebooted.

Imagine our luck when two days in a row, at roughly the same time of day, the node rebooted. Even though we have remote syslog set up, we’ve never caught it.

/etc/sysctl.conf was changed so that panic_on_oops was set to 0, a number of terminal sessions tailing various logs were opened from another machine, and we hoped the problem would occur again.

/etc/sysctl.conf:

kernel.panic=5
kernel.panic_on_oops=0
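
These settings can also be applied to the running kernel without a reboot:

sysctl -p /etc/sysctl.conf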

At 6:25am, coincidentally during log rotation, GFS2 withdrew and unmounted the partition, but, the machine didn’t reboot. Checking our terminals, we still had access to dmesg, and we had some logs:

GFS2: fsid=gamma:gfs1.0: fatal: invalid metadata block
GFS2: fsid=gamma:gfs1.0:   bh = 1211322 (magic number)
GFS2: fsid=gamma:gfs1.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 401
GFS2: fsid=gamma:gfs1.0: about to withdraw this file system
GFS2: fsid=gamma:gfs1.0: telling LM to unmount
GFS2: fsid=gamma:gfs1.0: withdrawn
Pid: 18047, comm: gzip Not tainted 3.0.0 #1
Call Trace:
 [] ? gfs2_lm_withdraw+0xd9/0x10a
 [] ? gfs2_meta_check_ii+0x3c/0x48
 [] ? gfs2_meta_indirect_buffer+0xf0/0x14a
 [] ? gfs2_block_map+0x1a3/0x9fe
 [] ? drive_stat_acct+0xf3/0x12e
 [] ? do_mpage_readpage+0x160/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
------------[ cut here ]------------
WARNING: at fs/buffer.c:1188 gfs2_block_map+0x2be/0x9fe()
Hardware name: PDSMi
VFS: brelse: Trying to free free buffer
Modules linked in:
Pid: 18047, comm: gzip Not tainted 3.0.0 #1
Call Trace:
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? warn_slowpath_common+0x78/0x8c
 [] ? warn_slowpath_fmt+0x45/0x4a
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? drive_stat_acct+0xf3/0x12e
 [] ? do_mpage_readpage+0x160/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
---[ end trace 54fad1a4877f173c ]---
BUG: unable to handle kernel paging request at ffffffff813b8f0f
IP: [] __brelse+0x7/0x26
PGD 1625067 PUD 1629063 PMD 12001e1 
Oops: 0003 [#1] SMP 
CPU 0 
Modules linked in:

Pid: 18047, comm: gzip Tainted: G        W   3.0.0 #1 Supermicro PDSMi/PDSMi+
RIP: 0010:[]  [] __brelse+0x7/0x26
RSP: 0018:ffff880185d85800  EFLAGS: 00010286
RAX: 00000000e8df8948 RBX: ffff8801f3fb6c18 RCX: ffff880185d857d0
RDX: 0000000000000010 RSI: 000000000002ccee RDI: ffffffff813b8eaf
RBP: 0000000000000000 R08: ffff880185d85890 R09: ffff8801f3fb6c18
R10: 00000000000029e0 R11: 0000000000000078 R12: ffff880147986000
R13: ffff880147986140 R14: 00000000000029c1 R15: 0000000000001000
FS:  00007fb7d320f700(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff813b8f0f CR3: 0000000212a73000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process gzip (pid: 18047, threadinfo ffff880185d84000, task ffff880147dc4ed0)
Stack:
 ffffffff811537f6 0000000000000000 00000000000022a1 0000000000000001
 000000000000ffff 0000000000000000 ffffffff8102a7ce 0000000000000001
 0000000100000008 00000000fffffffb 0000000000000001 ffff8801f3fb6c18
Call Trace:
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? warn_slowpath_common+0x7d/0x8c
 [] ? printk+0x43/0x48
 [] ? alloc_page_buffers+0x62/0xba
 [] ? block_read_full_page+0x141/0x260
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? do_mpage_readpage+0x49b/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
Code: 31 00 45 31 f6 fe 85 88 00 00 00 48 89 df e8 a2 1a fc ff eb 03 45 31 f6 5b 4c 89 f0 5d 41 5c 41 5d 41 5e c3 8b 47 60 85 c0 74 05  ff 4f 60 c3 48 c7 c2 96 7b 4d 81 be a4 04 00 00 31 c0 48 c7 
RIP  [] __brelse+0x7/0x26
 RSP 
CR2: ffffffff813b8f0f
---[ end trace 54fad1a4877f173d ]---

As I suspected, log rotation appeared to trigger the problem and handed us the trace above. Running fsck.gfs2 resulted in:


# fsck.gfs2 -y /dev/drbd1
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Error: resource group 7339665 (0x6ffe91): free space (64473) does not match bitmap (64658)
The rgrp was fixed.
Error: resource group 7405179 (0x70fe7b): free space (64249) does not match bitmap (64299)
(50 blocks were reclaimed)
The rgrp was fixed.
Error: resource group 7470693 (0x71fe65): free space (65456) does not match bitmap (65464)
(8 blocks were reclaimed)
The rgrp was fixed.

...snip...

Ondisk and fsck bitmaps differ at block 133061348 (0x7ee5ae4) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 133061349 (0x7ee5ae5) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
RG #133061031 (0x7ee59a7) free count inconsistent: is 65232 should be 65508
Inode count inconsistent: is 37 should be 0
Resource group counts updated
Inode count inconsistent: is 1267 should be 1266
Resource group counts updated
Pass5 complete      
The statfs file is wrong:

Current statfs values:
blocks:  188730628 (0xb3fcd04)
free:    176443034 (0xa844e9a)
dinodes: 644117 (0x9d415)

Calculated statfs values:
blocks:  188730628 (0xb3fcd04)
free:    177426468 (0xa935024)
dinodes: 493059 (0x78603)
The statfs file was fixed.
Writing changes to disk
gfs2_fsck complete 

The filesystem was remounted after a seven-minute fsck, and we’ll see if it happens again tomorrow.
