Posts Tagged ‘drbd’

Versioning Filesystem choices using OSS

Tuesday, November 1st, 2011

One of our clusters uses DRBD between two machines with GFS2 mounted on top of DRBD in dual-primary mode. I'd played around with Gluster, Lustre, OCFS2, AFS and many others, and I've used NetApps in the past, but I've never been entirely happy with any of the distributed and clustered filesystems.
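For context, dual primary in DRBD 8.3 comes down to a couple of resource options plus a cluster-aware filesystem on top. A minimal sketch of the relevant parts of the resource definition follows; the hostnames, devices and addresses are made up, and the split-brain policies are just one sane starting point, not necessarily what this cluster runs:

resource r1 {
    protocol C;
    net {
        allow-two-primaries;                  # both nodes may hold Primary at once
        after-sb-0pri discard-zero-changes;   # split-brain recovery policies
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
    startup {
        become-primary-on both;               # promote both nodes at boot
    }
    on alpha {
        device /dev/drbd1; disk /dev/sda3; address 10.0.0.1:7789; meta-disk internal;
    }
    on beta {
        device /dev/drbd1; disk /dev/sda3; address 10.0.0.2:7789; meta-disk internal;
    }
}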

My recent thinking on using SetUID or SetGID modes to deal with particular problems led me to look at versioning filesystems. Currently, that leaves ZFS and BtrFS.

I’ve used ZFS in the past on Solaris and it is supported natively within FreeBSD. Since we use Debian, there is Debian’s K*BSD project, which puts the Debian userland on the FreeBSD kernel – making most of our in-house management processes easy to convert. Using ZFS under Linux requires FUSE, which could introduce performance issues.
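If we do test the FUSE route first, the setup on Debian is at least simple. A rough sketch, assuming the zfs-fuse package and two spare disks (the pool and dataset names and the device names are placeholders):

# apt-get install zfs-fuse
# zpool create tank mirror /dev/sdb /dev/sdc
# zfs create tank/backups
# zfs set compression=on tank/backups
# zfs snapshot tank/backups@2011-11-01
# zfs list -t snapshot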

The other option we have is BtrFS. BtrFS is less mature, but it has the ability to handle in-place migrations from ext3/ext4. While this doesn’t really help much since we primarily run XFS, future machines could use Ext4 until BtrFS is deemed stable enough, at which point they could be converted in place.
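For reference, the in-place migration is handled by btrfs-convert, which keeps the original filesystem around as a rollback image. A minimal sketch, assuming the filesystem is cleanly unmounted first; the device and mount point are placeholders:

# umount /home
# fsck.ext4 -f /dev/sda5        # conversion wants a clean filesystem
# btrfs-convert /dev/sda5       # convert ext3/ext4 to BtrFS in place
# mount -t btrfs /dev/sda5 /home
# btrfs-convert -r /dev/sda5    # roll back to ext4 if needed (unmounted again)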

In testing, XFS and Ext4 have similar performance when well tuned, which means we shouldn’t see any significant difference with either. Granted, this disagrees with some current benchmarks, but those benchmarks didn’t appear to set the filesystems up correctly and didn’t modify the mount parameters to allow more buffers to be used. When dealing with small files, XFS needs a little more RAM and the number of journal log buffers needs to be increased – keeping more of the log in RAM before it is committed to disk. Large file performance is usually deemed superior with XFS, but by properly tuning Ext3 (and by extension Ext4) we can change its performance characteristics and get about 95% of XFS’s large file performance.
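The XFS tuning I have in mind is mostly mount options. A minimal sketch of the sort of fstab entry involved; the device and mount point are placeholders, and the logbufs/logbsize values are starting points rather than gospel:

# /etc/fstab
/dev/sda5  /var/www  xfs  noatime,nodiratime,logbufs=8,logbsize=256k  0  0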

Currently we keep two generations of weekly machine backups. While this wouldn’t change, we could also do checkpointing and more frequent snapshots, so that a file uploaded and then modified or deleted would have a much better chance of being restorable. One of the benefits of versioning filesystems is the ability to take hourly or daily snapshots, which should reduce the data loss if a site is exploited or catastrophically damaged through a mistake.
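On either filesystem, hourly snapshots reduce to a one-line cron job. A rough sketch of both flavours, assuming a BtrFS filesystem mounted at /home and a ZFS dataset named tank/home (both names are made up, and the binary paths may differ by install):

# /etc/cron.d/snapshots
# BtrFS: snapshot into a .snapshots subvolume on the same filesystem
0 * * * *  root  /sbin/btrfs subvolume snapshot /home /home/.snapshots/$(date +\%Y\%m\%d-\%H00)
# ZFS: named snapshot of the dataset
0 * * * *  root  /usr/sbin/zfs snapshot tank/home@$(date +\%Y\%m\%d-\%H00)

Pruning old snapshots would need a companion job, but that’s the shape of it.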

So, we’ve got three potential solutions in order of confidence that the solution will work:

* FreeBSD ZFS
* Debian/K*BSD ZFS
* Debian BtrFS

This weekend I’ll start putting the two Debian solutions through their paces to see if I feel comfortable with either. I’ve got a chassis swap to do this week and we’ll probably switch that machine from XFS to Ext4 in preparation as well. Most of the new machines we’ve been putting online now use Ext4 due to some of the issues I’ve had with XFS.

Ideally, I would like to start using BtrFS on every machine, but if I need to move things over to FreeBSD, I would face some very tough decisions and migrations.

Never a dull moment.

GFS2 Kernel Oops

Sunday, October 30th, 2011

For a few years I’ve run a system using DRBD replication between two machines with GFS2 running in dual primary mode to test a theory on a particular type of web hosting I’ve been developing.

For months the system will run fine, then, out of the blue, one of the nodes will drop from the cluster and reboot, leaving nothing in the logs. It’ll run another 120-180 days without incident and then reboot again with no real indication of the problem. We knew it was a kernel panic or kernel oops, but the logs were never flushed to disk before the machine rebooted.

Imagine our luck when two days in a row, at roughly the same time of day, the node rebooted. Even though we have remote syslog set up, we’ve never caught it.
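The remote syslog setup is nothing exotic; a minimal sketch of the forwarding rule on the cluster node, assuming rsyslog on Debian and a log host reachable as loghost (a placeholder name):

# /etc/rsyslog.d/remote.conf
*.*    @loghost:514        # UDP; use @@loghost:514 for TCP

The trouble is that an oops which takes out the filesystem or panics the box rarely makes it over the wire.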

We changed /etc/sysctl.conf so that panic_on_oops was set to 0, opened a number of terminal sessions from another machine tailing various logs, and waited for the problem to occur again.

/etc/sysctl.conf:

kernel.panic=5
kernel.panic_on_oops=0
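Those settings can also be applied to the running kernel without a reboot (which would have defeated the purpose here). kernel.panic=5 still reboots the box five seconds after a true panic, while panic_on_oops=0 leaves the machine up after an oops long enough to read the logs:

# sysctl -w kernel.panic_on_oops=0
# sysctl -w kernel.panic=5
# sysctl -p        # or reload everything from /etc/sysctl.conf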

At 6:25am, coincidentally during log rotation, the GFS2 partition unmounted itself, but the machine didn’t reboot. Checking our terminals, we still had access to dmesg, and we had some logs:

GFS2: fsid=gamma:gfs1.0: fatal: invalid metadata block
GFS2: fsid=gamma:gfs1.0:   bh = 1211322 (magic number)
GFS2: fsid=gamma:gfs1.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 401
GFS2: fsid=gamma:gfs1.0: about to withdraw this file system
GFS2: fsid=gamma:gfs1.0: telling LM to unmount
GFS2: fsid=gamma:gfs1.0: withdrawn
Pid: 18047, comm: gzip Not tainted 3.0.0 #1
Call Trace:
 [] ? gfs2_lm_withdraw+0xd9/0x10a
 [] ? gfs2_meta_check_ii+0x3c/0x48
 [] ? gfs2_meta_indirect_buffer+0xf0/0x14a
 [] ? gfs2_block_map+0x1a3/0x9fe
 [] ? drive_stat_acct+0xf3/0x12e
 [] ? do_mpage_readpage+0x160/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
------------[ cut here ]------------
WARNING: at fs/buffer.c:1188 gfs2_block_map+0x2be/0x9fe()
Hardware name: PDSMi
VFS: brelse: Trying to free free buffer
Modules linked in:
Pid: 18047, comm: gzip Not tainted 3.0.0 #1
Call Trace:
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? warn_slowpath_common+0x78/0x8c
 [] ? warn_slowpath_fmt+0x45/0x4a
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? drive_stat_acct+0xf3/0x12e
 [] ? do_mpage_readpage+0x160/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
---[ end trace 54fad1a4877f173c ]---
BUG: unable to handle kernel paging request at ffffffff813b8f0f
IP: [] __brelse+0x7/0x26
PGD 1625067 PUD 1629063 PMD 12001e1 
Oops: 0003 [#1] SMP 
CPU 0 
Modules linked in:

Pid: 18047, comm: gzip Tainted: G        W   3.0.0 #1 Supermicro PDSMi/PDSMi+
RIP: 0010:[]  [] __brelse+0x7/0x26
RSP: 0018:ffff880185d85800  EFLAGS: 00010286
RAX: 00000000e8df8948 RBX: ffff8801f3fb6c18 RCX: ffff880185d857d0
RDX: 0000000000000010 RSI: 000000000002ccee RDI: ffffffff813b8eaf
RBP: 0000000000000000 R08: ffff880185d85890 R09: ffff8801f3fb6c18
R10: 00000000000029e0 R11: 0000000000000078 R12: ffff880147986000
R13: ffff880147986140 R14: 00000000000029c1 R15: 0000000000001000
FS:  00007fb7d320f700(0000) GS:ffff88021fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffff813b8f0f CR3: 0000000212a73000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process gzip (pid: 18047, threadinfo ffff880185d84000, task ffff880147dc4ed0)
Stack:
 ffffffff811537f6 0000000000000000 00000000000022a1 0000000000000001
 000000000000ffff 0000000000000000 ffffffff8102a7ce 0000000000000001
 0000000100000008 00000000fffffffb 0000000000000001 ffff8801f3fb6c18
Call Trace:
 [] ? gfs2_block_map+0x2be/0x9fe
 [] ? warn_slowpath_common+0x7d/0x8c
 [] ? printk+0x43/0x48
 [] ? alloc_page_buffers+0x62/0xba
 [] ? block_read_full_page+0x141/0x260
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? do_mpage_readpage+0x49b/0x49f
 [] ? pagevec_lru_move_fn+0xab/0xc1
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? mpage_readpages+0xd0/0x12a
 [] ? gfs2_unstuff_dinode+0x383/0x383
 [] ? bit_waitqueue+0x14/0x63
 [] ? gfs2_readpages+0x67/0xa8
 [] ? sd_prep_fn+0x2c1/0x902
 [] ? gfs2_readpages+0x3b/0xa8
 [] ? __do_page_cache_readahead+0x11b/0x1c0
 [] ? ra_submit+0x19/0x1d
 [] ? generic_file_aio_read+0x2b4/0x5e0
 [] ? do_sync_read+0xab/0xe3
 [] ? vfs_read+0xa3/0x10f
 [] ? sys_read+0x45/0x6e
 [] ? system_call_fastpath+0x16/0x1b
Code: 31 00 45 31 f6 fe 85 88 00 00 00 48 89 df e8 a2 1a fc ff eb 03 45 31 f6 5b 4c 89 f0 5d 41 5c 41 5d 41 5e c3 8b 47 60 85 c0 74 05  ff 4f 60 c3 48 c7 c2 96 7b 4d 81 be a4 04 00 00 31 c0 48 c7 
RIP  [] __brelse+0x7/0x26
 RSP 
CR2: ffffffff813b8f0f
---[ end trace 54fad1a4877f173d ]---

As I suspected, log rotation appeared to trigger the problem and handed us the stack trace above. Running fsck.gfs2 resulted in:


# fsck.gfs2 -y /dev/drbd1
Initializing fsck
Validating Resource Group index.
Level 1 RG check.
(level 1 passed)
Error: resource group 7339665 (0x6ffe91): free space (64473) does not match bitmap (64658)
The rgrp was fixed.
Error: resource group 7405179 (0x70fe7b): free space (64249) does not match bitmap (64299)
(50 blocks were reclaimed)
The rgrp was fixed.
Error: resource group 7470693 (0x71fe65): free space (65456) does not match bitmap (65464)
(8 blocks were reclaimed)
The rgrp was fixed.

...snip...

Ondisk and fsck bitmaps differ at block 133061348 (0x7ee5ae4) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 133061349 (0x7ee5ae5) 
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
RG #133061031 (0x7ee59a7) free count inconsistent: is 65232 should be 65508
Inode count inconsistent: is 37 should be 0
Resource group counts updated
Inode count inconsistent: is 1267 should be 1266
Resource group counts updated
Pass5 complete      
The statfs file is wrong:

Current statfs values:
blocks:  188730628 (0xb3fcd04)
free:    176443034 (0xa844e9a)
dinodes: 644117 (0x9d415)

Calculated statfs values:
blocks:  188730628 (0xb3fcd04)
free:    177426468 (0xa935024)
dinodes: 493059 (0x78603)
The statfs file was fixed.
Writing changes to disk
gfs2_fsck complete 

The filesystem was remounted after a 7-minute fsck and we’ll see if it happens again tomorrow.
