Finding my XFS Bug
Recently one of our servers suffered filesystem corruption, and not for the first time. We rely heavily on hard links through rsync’s --link-dest option, so I’m reasonably sure the corruption is triggered by the massive number of hard links created and deleted on that system.
I’ve written a small script to repeatedly test things and started it running a few minutes ago. My guess is that the problem should show up in a few days.
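#!/bin/bash
# Stress test: repeatedly create hard-linked snapshots of a kernel tree
# with rsync --link-dest and prune the oldest ones.
RSYNC=/usr/bin/rsync
REVISIONS=10

function rsync_kernel () {
    DATE=`date +%Y%m%d%H%M%S`
    # Existing snapshots, oldest first (the names are timestamps)
    BDATES=()
    loop=0
    for f in `ls -d1 /tmp/2011*`
    do
        BDATES[$loop]=$f
        loop=$(($loop+1))
    done
    CT=${#BDATES[*]}
    if (( $CT > 0 ))
    then
        RECENT=${BDATES[$(($CT-1))]}
        LINKDEST=" --link-dest=$RECENT"
    else
        # First run: seed from the unpacked kernel source
        RECENT="/tmp/linux-3.0.3"
        LINKDEST=" --link-dest=/tmp/linux-3.0.3"
    fi
    # $DATE/ is a relative path, so the loop above only finds the new
    # snapshots when the script is run from /tmp
    $RSYNC -aplxo $LINKDEST $RECENT/ $DATE/
    # Prune so that at most $REVISIONS snapshots remain
    if (( ${#BDATES[*]} >= $REVISIONS ))
    then
        DELFIRST=$(( ${#BDATES[*]} - $REVISIONS ))
        loop=0
        for d in ${BDATES[*]}
        do
            if (( $loop <= $DELFIRST ))
            then
                rm -rf $d
            fi
            loop=$(($loop+1))
        done
    fi
}

while true
do
    rsync_kernel
    echo .
    sleep 1
done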
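To confirm that the snapshots really do share inodes, and so exercise the hard-link code paths, something like the following sketch could be run against two snapshot directories. The helper name count_shared_links is made up for this illustration and is not part of rsync. It assumes both directories are on the same filesystem, are passed without a trailing slash, and contain no filenames with newlines.

```shell
# Count regular files under snapshot $2 that are hard links to the file
# at the same relative path under snapshot $1.
# count_shared_links is a hypothetical helper name.
count_shared_links () {
    old=$1
    new=$2
    find "$new" -type f | {
        shared=0
        while IFS= read -r f
        do
            # Path of the file relative to the new snapshot
            rel=${f#"$new"/}
            # -ef is true when both paths refer to the same device and inode
            if [ "$old/$rel" -ef "$f" ]
            then
                shared=$((shared + 1))
            fi
        done
        echo "$shared"
    }
}
```

For a tree copied with --link-dest where nothing changed, the count should equal the number of regular files in the newer snapshot.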
October 6th, 2011 at 1:39 pm
After 12 hours, no corruption yet.
I wonder whether the problem is in the inode64 code, in which case it won’t surface with the 32-bit inodes on this 1 GB partition.
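To check which mode a given XFS mount is using, one option is to look for inode64 in the options column of /proc/mounts. This is only a sketch: has_inode64 is a hypothetical helper name, and the optional second argument (a file in /proc/mounts format) exists only so it can be tried without a real XFS mount.

```shell
# Succeeds if the XFS filesystem mounted at $1 lists inode64 among its
# mount options. $2 defaults to /proc/mounts.
# has_inode64 is a hypothetical helper name.
has_inode64 () {
    awk -v mp="$1" '
        $2 == mp && $3 == "xfs" {
            # Field 4 is the comma-separated option list
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "inode64")
                    found = 1
        }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}
```

Something like `has_inode64 /srv/backup && echo inode64` would then report on a mount point (the path here is a placeholder).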
October 7th, 2011 at 1:35 am
23.5 hours later:
Message from syslogd@test at Oct 7 01:31:56 …
kernel:Oops: 0000 [#1] SMP
Message from syslogd@test at Oct 7 01:31:56 …
kernel:Process rsync (pid: 15871, ti=e9d34000 task=f4a691a0 task.ti=e9d34000)
Message from syslogd@test at Oct 7 01:31:56 …
kernel:Stack:
Message from syslogd@test at Oct 7 01:31:56 …
kernel:Call Trace:
Message from syslogd@test at Oct 7 01:31:56 …
kernel:Code: dd 60 00 00 89 d8 e8 87 5d 00 00 8b 54 24 34 c7 02 00 00 00 00 bd 05 00 00 00 89 e8 83 c4 10 5b 5e 5f 5d c3 57 56 53 89 d3 85 c0 <8b> b2 8c 00 00 00 75 14 85 f6 74 72 81 7e 1c 3c 12 00 00 75 69
Message from syslogd@test at Oct 7 01:31:56 …
kernel:EIP: [] xfs_trans_brelse+0x7/0x9a SS:ESP 0068:e9d35ce4
Message from syslogd@test at Oct 7 01:31:56 …
kernel:CR2: 000000001001008c
Write failed: Broken pipe
tsavo:~ mcd$
October 26th, 2011 at 12:59 pm
Ran into this issue again. The xfs_repair took 3 hours, and we lost about 90 files.
I need to compile a kernel with debugging enabled and send the console to another machine, because the bug appears to hang the machine and the log files never get written, even though they are on a different filesystem.
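Besides a serial console, one way to get kernel messages onto another machine is the kernel's netconsole module, which sends them over UDP. I haven't set this up here, so treat it as a sketch. The helper name netconsole_param is made up, and all ports, addresses and the interface name are placeholders.

```shell
# Assemble the netconsole module parameter in the form
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# netconsole_param is a hypothetical helper name.
netconsole_param () {
    src_port=$1 src_ip=$2 dev=$3
    tgt_port=$4 tgt_ip=$5 tgt_mac=$6
    echo "netconsole=${src_port}@${src_ip}/${dev},${tgt_port}@${tgt_ip}/${tgt_mac}"
}

# Example with placeholder values (run as root on the test machine):
#   modprobe netconsole "$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)"
```

The receiving machine then needs a UDP listener on the target port, for example netcat. The exact flags depend on the netcat variant installed.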