btrfs gets very slow, metadata almost full

One of our storage servers has had problems in the past. Originally it seemed that XFS was struggling with the large filesystem, so we gambled and switched to btrfs. After eight days of running, disk I/O on the machine had become extremely slow, to the point where backups that should take minutes were taking hours.

Switching the disk scheduler from cfq to noop and then to deadline brought only short-term benefits before the machine bogged down again.

We’re running an Adaptec 31205 with 11 Western Digital 2.0 terabyte drives in hardware RAID 5, with roughly 19 terabytes accessible on our filesystem. During the first few days of backups we would easily hit 800mb/sec inbound, but after a few machines had been backed up to the server, 100mb/sec was optimistic, with 20-40mb/sec being more typical. We originally attributed this to rsync handling thousands of smaller files rather than the large files moved from some of the earlier machines. Once we started overlapping machines to get their second-generation backups, the problem became much more evident.

The Filesystem:

# df -h /colobk1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8        19T  8.6T  9.6T  48% /colobk1

# btrfs fi show
Label: none  uuid: 3cd405c7-5d7d-42bd-a630-86ec3ca452d7
	Total devices 1 FS bytes used 8.44TB
	devid    1 size 18.14TB used 8.55TB path /dev/sda8

Btrfs Btrfs v0.19

# btrfs filesystem df /colobk1
Data: total=8.34TB, used=8.34TB
System, DUP: total=8.00MB, used=940.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=106.25GB, used=104.91GB
Metadata: total=8.00MB, used=0.00

The machine:

# uname -a
Linux st1 3.8.0 #1 SMP Tue Feb 19 16:09:18 EST 2013 x86_64 GNU/Linux

# btrfs --version
Btrfs Btrfs v0.19

As it stands, we appear to be running out of metadata space. Used metadata (104.91GB) is about 99% of allocated metadata (106.25GB), well past the 75% mark, and updates are taking forever. The filesystem was not created with any special leaf or node size parameters, so it is using the defaults.
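To put a number on how full the metadata allocation is, the `btrfs filesystem df` output above can be parsed and turned into a percentage. This is only a rough sketch in Python: the function names are made up for illustration, and it assumes the old v0.19 output format shown here, with binary KB/MB/GB/TB suffixes.

```python
import re

# Output copied from the post (btrfs-progs v0.19 format).
SAMPLE = """\
Data: total=8.34TB, used=8.34TB
System, DUP: total=8.00MB, used=940.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=106.25GB, used=104.91GB
Metadata: total=8.00MB, used=0.00
"""

# Assumption: the suffixes are binary multiples; a bare "0.00" has no suffix.
UNITS = {"": 1, "B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

LINE_RE = re.compile(
    r"^(?P<kind>[^:]+): total=(?P<total>[\d.]+)(?P<tu>[A-Z]*), "
    r"used=(?P<used>[\d.]+)(?P<uu>[A-Z]*)$"
)


def parse_size(number, unit):
    """Convert a number and its suffix into bytes."""
    return float(number) * UNITS[unit]


def usage_by_kind(text):
    """Map each block group type to used/total as a percentage."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a "kind: total=, used=" line
        total = parse_size(match["total"], match["tu"])
        used = parse_size(match["used"], match["uu"])
        result[match["kind"]] = 100.0 * used / total if total else 0.0
    return result


if __name__ == "__main__":
    for kind, percent in usage_by_kind(SAMPLE).items():
        print(f"{kind:15s} {percent:6.1f}% of allocated space used")
```

Note that this measures used space against what has been allocated so far, not against the whole device, so a high figure by itself does not prove the filesystem is out of room.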

The btrfs wiki points to this particular tuning option which seems like it might do the trick. Since you can run the balance while the filesystem is in use and check its status, we should be able to see whether it is making a difference.

I don’t believe it is going to make a difference as we have only a single device exposed to btrfs, but, here’s the command we’re told to use:

btrfs fi balance start -dusage=5 /colobk1

After a while, the box returned with:

# btrfs fi balance start -dusage=5 /colobk1
Done, had to relocate 0 out of 8712 chunks

# btrfs fi df /colobk1
Data: total=8.34TB, used=8.34TB
System, DUP: total=8.00MB, used=940.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=107.25GB, used=104.95GB
Metadata: total=8.00MB, used=0.00

So the balance relocated nothing, and the allocated metadata merely grew by 1GB. At first glance, it is still taking considerable time to back up a single 9.7GB machine: over 2 hours and 8 minutes, when the first backup took under 50 minutes. I would say the balance didn’t do anything positive, as we have a single device. I suspect the leafsize and nodesize might make the difference here, which means reformatting and backing up 8.6 terabytes of data again. After the machine had bogged down and the balance had run, it took two and a half minutes to unmount the partition.
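One possible reading of the "0 out of 8712 chunks" result: the `-dusage=5` filter only considers data block groups whose usage is under 5%, and with Data showing total=8.34TB, used=8.34TB, the data chunks are presumably all close to full. The Python sketch below only illustrates that selection rule. The per-chunk usage figures are hypothetical, since the post does not show them, and `select_for_balance` is an invented name, not part of btrfs.

```python
def select_for_balance(chunk_usage_percent, usage_filter):
    """Return indexes of chunks whose usage is below the filter percentage."""
    return [i for i, pct in enumerate(chunk_usage_percent) if pct < usage_filter]


# Hypothetical layout: 8712 data chunks, all essentially full.
full_chunks = [99.0] * 8712

# Hypothetical layout: same count, but a dozen chunks are nearly empty.
mixed_chunks = [99.0] * 8700 + [3.0] * 12

if __name__ == "__main__":
    print(len(select_for_balance(full_chunks, 5)), "chunks picked when all are full")
    print(len(select_for_balance(mixed_chunks, 5)), "chunks picked with 12 nearly empty")
    print(len(select_for_balance(mixed_chunks, 100)), "chunks picked with usage=100")
```

Under that assumption, a low usage filter has nothing to work on when every data chunk is packed, which would match the balance finishing without relocating anything.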

mkfs -t btrfs -l 32768 -n 32768 /dev/sda8

# btrfs fi df /colobk1
Data: total=8.00MB, used=0.00
System, DUP: total=8.00MB, used=32.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=1.00GB, used=192.00KB
Metadata: total=8.00MB, used=0.00

# df -h /colobk1
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda8        19T   72M   19T   1% /colobk1

XFS took 52 minutes to back up the machine. XFS properly tuned took 51 minutes. Btrfs with the larger leaf and node sizes took 51 minutes. I suspect I need to run things for a week to get the extents close to full again and then recheck. In any case, it is a lot faster than it was with the default settings.
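For a sense of scale, the timings can be converted into effective throughput. The sketch below is Python and rests on two assumptions that the post does not confirm: that "9.7gb" means 9.7 GB in decimal units, and that all of the timings refer to backing up that same machine. The labels are my own shorthand.

```python
# Assumption: 9.7 GB (decimal) is the size of the backup for every timing.
BACKUP_SIZE_MB = 9.7 * 1000

timings_minutes = {
    "btrfs defaults, degraded": 128,  # "over 2 hours and 8 minutes"
    "XFS": 52,
    "XFS tuned": 51,
    "btrfs 32K leaf/node, fresh": 51,
}


def throughput_mb_per_s(size_mb, minutes):
    """Average transfer rate for a backup of size_mb that took `minutes`."""
    return size_mb / (minutes * 60)


def slowdown(slow_minutes, fast_minutes):
    """How many times longer the slow run took."""
    return slow_minutes / fast_minutes


if __name__ == "__main__":
    for label, minutes in timings_minutes.items():
        rate = throughput_mb_per_s(BACKUP_SIZE_MB, minutes)
        print(f"{label:28s} {minutes:4d} min  {rate:.2f} MB/s")
    print(f"degraded vs fresh btrfs: {slowdown(128, 51):.2f}x slower")
```

If those assumptions hold, even the good runs move only a few MB/s, which suggests the workload is dominated by small-file and metadata operations rather than raw transfer speed.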

* Official btrfs wiki

Tags: Adaptec 31205, btrfs, performance, tuning, xfs

This entry was posted on Thursday, March 7th, 2013 at 11:29 pm and is filed under Scalability.

cd34 Says:
March 12th, 2013 at 8:35 pm

So far, using the larger leaf and node size arguments has helped considerably.

mkfs -t btrfs -l 32768 -n 32768 /dev/sda8

# btrfs fi df /colobk1
Data: total=8.10TB, used=8.10TB
System, DUP: total=8.00MB, used=928.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=83.00GB, used=82.80GB
Metadata: total=8.00MB, used=0.00

cd34 Says:
March 15th, 2013 at 11:43 am

And at some point last night, the machine got very sluggish again.

# uptime
 11:39:56 up 16 days, 13:14,  1 user,  load average: 104.95, 103.67, 100.58

Gunther Piez Says:
May 2nd, 2013 at 5:07 am

The problem isn’t that you are running out of metadata space. Metadata is dynamically allocated in 256 MiB chunks, and as long as there is 256 MiB of free space on your drive, all is well.
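That claim can be checked against the `btrfs fi show` line from the post, where "size" is the device size and "used" is the space already allocated to chunks. A rough Python sketch follows. The 256 MiB figure is taken from this comment rather than verified, the parsing assumes the v0.19 output format with binary suffixes, and the helper names are invented.

```python
import re

# The devid line copied from the post.
SHOW_LINE = "devid    1 size 18.14TB used 8.55TB path /dev/sda8"

UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

DEVID_RE = re.compile(
    r"devid\s+\d+\s+size\s+([\d.]+)([KMGT]?B)\s+used\s+([\d.]+)([KMGT]?B)\s+path\s+(\S+)"
)

# Chunk size as stated in the comment above (an assumption here).
METADATA_CHUNK_BYTES = 256 * 1024**2


def unallocated_bytes(line):
    """Device size minus space already allocated to chunks."""
    match = DEVID_RE.search(line)
    if match is None:
        raise ValueError("not a devid line")
    size = float(match.group(1)) * UNITS[match.group(2)]
    allocated = float(match.group(3)) * UNITS[match.group(4)]
    return size - allocated


def can_allocate_metadata_chunk(line):
    """True if at least one more metadata chunk would fit."""
    return unallocated_bytes(line) >= METADATA_CHUNK_BYTES


if __name__ == "__main__":
    free_tb = unallocated_bytes(SHOW_LINE) / 1024**4
    print(f"unallocated: {free_tb:.2f}TB")
    print("room for another metadata chunk:", can_allocate_metadata_chunk(SHOW_LINE))
```

On the figures in the post, more than half of the device is still unallocated, which is consistent with the comment's point that allocation itself is not the bottleneck.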

The problem with btrfs is that its performance degrades rapidly because of fragmentation (free space, metadata and data), which is quite heavy on btrfs because of its design. Creating the fs with leafsize and nodesize 16384 helps a lot, and so does mounting with compress=lzo,noatime, but in the long run it still loses performance.

I have a similar problem on my backup system, which after a year of incremental backups (snapshots are really great) has slowed to a crawl.

BTW, the “defragmentation” command given at the btrfs wiki only defragments your data, which usually isn’t a big problem at all; it is the directories that need to be defragmented.

Ladislav Jech Says:
July 14th, 2013 at 5:00 am

Same here :-), the system was almost unable to respond. On the Arch Linux wiki there is a command to defragment an entire system or a mounted btrfs folder:
andromeda / # find /devel -xdev -type f -print -exec btrfs filesystem defrag '{}' \;

Here the -xdev option keeps find from descending into other filesystems, so the command is not applied to non-btrfs mounts or to other btrfs subvolumes inside the volume being defragmented.

Running a balance with a higher usage threshold can also help, but a higher percentage can make for a long-running process, depending on how many block groups match and on the system’s I/O performance.
andromeda / # btrfs balance start -dusage=50 /devel

This made my system fast again. I didn’t run any performance tests before or after, but I can see the difference :-) I think the defragmentation can be run every day or week, but I will run the balance about once a month, since all the data (after the usage filter) needs to be accessed, if I understand the functionality correctly.
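For anyone wanting to script that routine, here is a small Python sketch that only builds and prints the two command lines from this comment, so they can be reviewed before being handed to cron or run by hand. The function names are hypothetical, and nothing here executes btrfs.

```python
import shlex


def defrag_command(mount_point):
    """The find/defrag invocation quoted in the comment, as an argv list."""
    return ["find", mount_point, "-xdev", "-type", "f", "-print",
            "-exec", "btrfs", "filesystem", "defrag", "{}", ";"]


def balance_command(mount_point, data_usage_percent):
    """The balance invocation quoted in the comment, with a data usage filter."""
    if not 0 <= data_usage_percent <= 100:
        raise ValueError("usage filter must be between 0 and 100")
    return ["btrfs", "balance", "start",
            f"-dusage={data_usage_percent}", mount_point]


def render(argv):
    """Shell-quoted form of an argv list, safe to paste into a terminal."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Suggested cadence from the comment: defrag daily or weekly, balance monthly.
    print("defrag :", render(defrag_command("/devel")))
    print("balance:", render(balance_command("/devel", 50)))
```

Keeping the commands as argv lists means they could later be passed to `subprocess.run` without going through a shell, should the dry run look right.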

Random Musings of an Insane Mind

This is my blog, there are many others like it but this one is mine.