Posts Tagged ‘Adaptec 31205’

btrfs gets very slow, metadata almost full

Thursday, March 7th, 2013

One of our storage servers has had problems in the past. Originally it seemed that XFS was struggling with the large filesystem, so we gambled and switched to btrfs. After eight days of running, disk I/O on the machine had become extremely slow, to the point where backups that should take minutes were taking hours.

Switching the disk scheduler from cfq to noop and then to deadline gave only short-term benefits before the machine bogged down again.

We’re running an Adaptec 31205 with eleven Western Digital 2.0 terabyte drives in hardware RAID 5, with roughly 19 terabytes accessible on our filesystem. During the first few days of backups we would easily hit 800mb/sec inbound, but after a few machines had been backed up to the server, 100mb/sec was optimistic, with 20-40mb/sec being more normal. We originally attributed this to rsync handling thousands of smaller files rather than the large files moved from some of the earlier machines. Once we started overlapping machines to get their second-generation backups, the problem became much more evident.
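As a rough sanity check on those capacity figures, here is a minimal sketch of the RAID 5 arithmetic. It assumes each "2.0 terabyte" drive holds 2 * 10**12 bytes (the usual marketing size) and that the controller reserves exactly one drive's worth of space for parity; the function name is just for illustration.

```python
# Usable capacity of a RAID 5 array: one drive's worth of space goes to parity.
# Assumption: each drive holds 2 * 10**12 bytes (decimal terabytes).

def raid5_usable_bytes(drive_count: int, drive_bytes: int) -> int:
    """Return usable bytes for a RAID 5 set of identical drives."""
    if drive_count < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drive_count - 1) * drive_bytes

usable = raid5_usable_bytes(11, 2 * 10**12)
print(f"{usable / 10**12:.1f} TB (decimal)")  # 20.0
print(f"{usable / 2**40:.2f} TiB (binary)")   # 18.19
```

About 18.19 TiB for the whole array is in line with the 18.14TB that btrfs reports for /dev/sda8 once the smaller system partitions are taken out, and with df rounding that up to 19T.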

The Filesystem:

# df -h /colobk1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8        19T  8.6T  9.6T  48% /colobk1

# btrfs fi show
Label: none  uuid: 3cd405c7-5d7d-42bd-a630-86ec3ca452d7
	Total devices 1 FS bytes used 8.44TB
	devid    1 size 18.14TB used 8.55TB path /dev/sda8

Btrfs Btrfs v0.19

# btrfs filesystem df /colobk1
Data: total=8.34TB, used=8.34TB
System, DUP: total=8.00MB, used=940.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=106.25GB, used=104.91GB
Metadata: total=8.00MB, used=0.00

The machine

# uname -a
Linux st1 3.8.0 #1 SMP Tue Feb 19 16:09:18 EST 2013 x86_64 GNU/Linux

# btrfs --version
Btrfs Btrfs v0.19

As it stands, we appear to be running out of metadata space. Since our used metadata space (104.91GB) is well over 75% of our total allocated metadata space (106.25GB), updates are taking forever. The initial filesystem was not created with any special inode or leaf parameters, so it is using the defaults.
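To put a number on how full the metadata allocation is, the following sketch parses the `btrfs filesystem df` output shown above and computes used/total for each line. The parser, its names, and the assumption that v0.19 prints binary-based units (KB = 1024 bytes, and so on) are mine; the ratio itself does not depend on the unit base.

```python
import re

# Assumed multipliers for the units printed by btrfs-progs v0.19.
UNITS = {"": 1, "KB": 2**10, "MB": 2**20, "GB": 2**30, "TB": 2**40}

LINE_RE = re.compile(
    r"^(?P<kind>[^:]+):\s*total=(?P<total>[\d.]+)(?P<tu>[KMGT]B)?,\s*"
    r"used=(?P<used>[\d.]+)(?P<uu>[KMGT]B)?\s*$"
)

def parse_fi_df(text):
    """Return {block group label: (total_bytes, used_bytes)}."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a "total=..., used=..." line
        total = float(m["total"]) * UNITS[m["tu"] or ""]
        used = float(m["used"]) * UNITS[m["uu"] or ""]
        result[m["kind"]] = (total, used)
    return result

def usage_percent(total, used):
    """Percentage of the allocated space that is in use."""
    return 100.0 * used / total if total else 0.0

SAMPLE = """\
Data: total=8.34TB, used=8.34TB
System, DUP: total=8.00MB, used=940.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=106.25GB, used=104.91GB
Metadata: total=8.00MB, used=0.00
"""

groups = parse_fi_df(SAMPLE)
total, used = groups["Metadata, DUP"]
print(f"Metadata, DUP: {usage_percent(total, used):.1f}% used")  # 98.7% used
```

On a live system the same function could be fed the captured output of `btrfs filesystem df /colobk1` instead of the pasted sample.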

The btrfs wiki points to this particular tuning option which seems like it might do the trick. Since you can run the balance while the filesystem is in use and check its status, we should be able to see whether it is making a difference.

I don’t believe it is going to make a difference as we have only a single device exposed to btrfs, but, here’s the command we’re told to use:

btrfs fi balance start -dusage=5 /colobk1

After a while, the box returned with:

# btrfs fi balance start -dusage=5 /colobk1
Done, had to relocate 0 out of 8712 chunks

# btrfs fi df /colobk1
Data: total=8.34TB, used=8.34TB
System, DUP: total=8.00MB, used=940.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=107.25GB, used=104.95GB
Metadata: total=8.00MB, used=0.00

So the metadata allocation grew by 1GB and nothing was relocated. At first glance, it is still taking considerable time to back up a single machine of 9.7GB: over 2 hours and 8 minutes, when the first backup took under 50 minutes. I would say that the balance didn’t do anything positive, as we have a single device. I suspect that the leafsize and nodesize might be the difference here, which would require reformatting and backing up 8.6 terabytes of data again. It took two and a half minutes to unmount the partition after it had bogged down and after running the balance.
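For scale, here is a small sketch that turns those two backup times into an effective rate. It assumes the full 9.7GB is transferred in both runs and that 1 GB = 1000 MB; since rsync may skip unchanged files, treat the result as an effective rate for the job rather than raw disk throughput.

```python
def mb_per_sec(gigabytes, hours=0, minutes=0):
    """Average rate in MB/s, taking 1 GB = 1000 MB (an assumption)."""
    seconds = hours * 3600 + minutes * 60
    return gigabytes * 1000 / seconds

slow = mb_per_sec(9.7, hours=2, minutes=8)  # after the filesystem bogged down
fast = mb_per_sec(9.7, minutes=50)          # 50 minutes is an upper bound on the first run
print(f"bogged down: {slow:.2f} MB/s, first run: at least {fast:.2f} MB/s")
# bogged down: 1.26 MB/s, first run: at least 3.23 MB/s
```

Either way the same job is taking more than two and a half times as long as it did on the fresh filesystem.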

mkfs -t btrfs -l 32768 -n 32768 /dev/sda8

# btrfs fi df /colobk1
Data: total=8.00MB, used=0.00
System, DUP: total=8.00MB, used=32.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.00GB, used=192.00KB
Metadata: total=8.00MB, used=0.00

# df -h /colobk1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8        19T   72M   19T   1% /colobk1

XFS took 52 minutes to back up the machine. XFS properly tuned took 51 minutes. Btrfs with the leaf size and node size set took 51 minutes. I suspect I need to run things for a week to get the extents close to full again and then recheck. In any case, it is a lot faster than it was with the default settings.

* Official btrfs wiki

Adaptec 31205 under Debian

Saturday, September 25th, 2010

We have a storage server with eleven 2TB drives in RAID 5. During a recent visit we heard the alarm, but no red light was visible on any drive, nor was the light on the front of the chassis lit. Knowing it was a problem waiting to happen, but unable to see which drive had caused the array to fail, we scheduled a maintenance window that happened to coincide with a kernel upgrade.

In the meantime, we attempted to install the RPM and Java management system, to no avail, so we weren’t able to read the controller status to find out what the problem was.

When we rebooted the machine, the array status was degraded and the controller prompted us to hit Enter to accept the configuration or Control-A to enter the admin utility. We entered the admin utility and chose Manage Array: all drives were present and working. Immediately the array status changed to rebuilding, with no indication of which drive had failed and was being re-added.

After we exited the admin utility and saved the config, the client told us to pull the machine offline until it was fixed. This started what seemed like an endless process. We figured we would let it rebuild while it was online, but disable it from the cluster. We installed a new kernel, 2.6.36-rc5, and rebooted, and this is where the trouble started. On boot, the new kernel got an I/O error, the channel hung, it forced a reset and then sat there for about 45 seconds. After it continued, it panicked because it was unable to read /dev/sda1.

Rebooting and entering the admin, we’re faced with an array that is marked offline. After identifying each of the drives through disk utils to make sure that they are recognized, we forced the array back online and rebooted into the old kernel. As it turns out, something in our 2.6.36-rc5 disables the array and sets it offline. It takes 18 hours to rebuild the array and return it to optimal status.

After the machine came up, we knew we had a problem in one of the directories on the system, and this seemed like an opportune time to run xfs_repair. About 40 minutes into it, we ran into an I/O error with a huge block number and, bam, the array was offline again.

In Disk Util in the ROM we start the test on the first drive. It takes 5.5 hours to run through the first disk which puts us at an estimated 60+ hours to check all 11 drives in the array. smartctl doesn’t allow us to independently check the drives, so, we fire up a second machine and mount each of the drives looking for any possible telltale signs in the S.M.A.R.T. data stored on the drives. Two drives show some abnormal numbers and we have an estimated 11 hours to check those disks. 5.5 hours later, the first disk is clean, less than 30 minutes later, we have our culprit. Relocating a number of bad sectors results in the controller hanging again, yet, no red fault light anywhere to be seen, no indication in the Adaptec manager that this drive is bad.

Replacing the drive and going back into the admin shows us a greyed out drive which immediately starts reconstructing. We reboot the system into the older kernel and start xfs_repair again. After two hours, it has run into a number of errors, but no I/O Errors.

It is obvious we’ve had some corruption for quite some time. We had a directory we couldn’t delete because it claimed it had files, however, no files were in the directory. We had 2 directories with files that we couldn’t do anything with and couldn’t even mv them to an area outside our working directories. We figured it was an xfs bug that we had hit due to the 18 terabyte size of the partition, but guessed that an xfs_repair would fix this. It was a minor annoyance to the client until we could get to a maintenance interval so we waited. In reality, this should have been a sign that we had some issues and we should have pushed the client harder to allow us to diagnose this much earlier. There is some data corruption, but, this is the second in a pair of backup servers for their cluster. Resyncing the data to a known good source will fix this without too much difficulty.

After four hours, xfs_repair is reporting issues like:

bad directory block magic # 0 in block 0 for directory inode 21491241467
corrupt block 0 in directory inode 21491241467
        will junk block
no . entry for directory 21491241467
no .. entry for directory 21491241467
problem with directory contents in inode 21491241467
cleared inode 21491241467
        - agno = 6
        - agno = 7
        - agno = 8
bad directory block magic # 0 in block 1947 for directory inode 34377945042
corrupt block 1947 in directory inode 34377945042
        will junk block
bad directory block magic # 0 in block 1129 for directory inode 34973370147
corrupt block 1129 in directory inode 34973370147
        will junk block
bad directory block magic # 0 in block 3175 for directory inode 34973370147
corrupt block 3175 in directory inode 34973370147
        will junk block

It appears that we have quite a bit of data corruption due to a bad drive, which is precisely the failure we use RAID to protect against.

The array failed, so why didn’t the Adaptec on-board manager know which drive had failed? Had we gotten the Java application to run, I’m still not convinced it would have told us which drive was throwing the array into degraded status. Obviously the card knew something was wrong, as the alarm was on. Each drive has a fault light and an activity light, but all of the drives allowed the array to be rebuilt and claimed the status was Optimal. During initialization, the Adaptec does light the fault and activity lights for each drive, so it seems reasonable that when the drive encountered errors, it could have lit the fault light so we knew which drive to replace. When xfs_repair hit the I/O error where the block couldn’t be relocated, why didn’t the Adaptec controller immediately fail the drive?

All in all, I’m not too happy with Adaptec right now. A 2TB hard drive failed, which cost us roughly 60 hours to diagnose and put back into service. The failing drive should have been tagged, marked, and removed from the RAID set immediately. Even with the array running in degraded mode we shouldn’t have seen any corruption, yet xfs_repair is finding a considerable number of errors.

The drives report roughly 5600 hours online, which corresponds to the eight months we’ve had the machine online. Based on the number of bad files xfs_repair is finding, I believe that drive had been failing for quite some time and Adaptec has failed us. While we have a considerable number of Adaptec controllers, we’ve never seen a failure like this.

unable to mount root fs on unknown-block(0,0)

Sunday, January 31st, 2010

We built a system for the new backup servers around an Adaptec 31205 controller. As always, I prefer to run a kernel that we’ve tuned in-house.

Upon booting into the kernel I had built, I received:

unable to mount root fs on unknown-block(0,0)

Since the drive size on the array was very large, the Debian Installer automatically created an EFI GUID Partition table, which my kernel was not set up for.

In the kernel’s make menuconfig, under File Systems, Partition Types, enable Advanced partition selection. Near the bottom is EFI GUID Partition support. Enable that, recompile your kernel and you should be set.
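Those two menu entries correspond to the `CONFIG_PARTITION_ADVANCED` and `CONFIG_EFI_PARTITION` symbols in the kernel's `.config`. As a hedged sketch, a few lines of Python can confirm both are built in before you reboot; the helper name and the sample config text are illustrative only.

```python
# Kernel options needed for GPT (EFI GUID) partition tables to be recognised.
REQUIRED = ("CONFIG_PARTITION_ADVANCED", "CONFIG_EFI_PARTITION")

def missing_options(config_text, required=REQUIRED):
    """Return the required options that are not set to 'y' in a .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set"
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]

# Illustrative sample, not the actual config from this machine.
SAMPLE_CONFIG = """\
CONFIG_PARTITION_ADVANCED=y
# CONFIG_EFI_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
"""

print(missing_options(SAMPLE_CONFIG))  # ['CONFIG_EFI_PARTITION']
```

To check a real build, pass in the contents of the `.config` file from the top of your kernel source tree instead of the sample string.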

One reboot later and voila:

st1:/colobk1# uname -a
Linux st1 #1 SMP Fri Jan 29 21:43:32 EST 2010 x86_64 GNU/Linux
st1:/colobk1# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             462M  232M  207M  53% /
tmpfs                 2.0G     0  2.0G   0% /lib/init/rw
udev                   10M   60K   10M   1% /dev
tmpfs                 2.0G     0  2.0G   0% /dev/shm
/dev/sda8              19T  305G   18T   2% /colobk1
/dev/sda5             1.9G   55M  1.8G   3% /home
/dev/sda4             949M  4.2M  945M   1% /tmp
/dev/sda6             2.4G  204M  2.2G   9% /usr
/dev/sda7             9.4G  237M  9.1G   3% /var
