KVM guest extremely slow, Bug in Host Linux 3.2.2 kernel

Friday, March 22nd, 2013

A client upgraded a KVM instance today and rebooted it, and the machine is now extremely slow.

The instance is a Debian system running the 3.1.0-1-amd64 kernel, which appears to have a timekeeping bug. As a result, the machine responds to packets only sporadically, so nothing can be done without long delays. To make matters worse, he’s using a filesystem that is not supported on the host, so we can’t just mount the LVM partition and put an older kernel on the machine.

Transferring the 22 MB kernel package stalls at 55%-66%, and rsync --partial runs into timeouts and never gets the file transferred. So we’re stuck trying to move files around some other way.

Enter the split command

split -b 1m linux-image-3.2.0-2-amd64_3.2.17-1_amd64.deb

which results in a bunch of files named xaa through xaw. Now we can transfer these 1 MB at a time, which takes quite a while, but we get them moved over. Once they are all on the guest, we put them back together and checksum the result:

cat xa* > linux-image-3.2.0-2-amd64_3.2.17-1_amd64.deb
md5sum linux-image-3.2.0-2-amd64_3.2.17-1_amd64.deb
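The split-and-reassemble round trip above can be wrapped in two small helpers. This is only a sketch: the function names make_pieces and join_pieces and the ".part." and ".md5" suffixes are made up for illustration, and it assumes md5sum and split are available on both ends.

```shell
# Sender side: record a checksum, then cut the file into 1 MB pieces.
# Run from the directory that holds the file and pass a bare file name,
# so the name stored in the .md5 file also matches on the receiver.
make_pieces() {
    file=$1
    md5sum "$file" > "$file.md5" || return 1
    split -b 1m "$file" "$file.part."
}

# Receiver side: glue the pieces back together (the shell expands the
# glob in sorted order: aa, ab, ac, ...) and compare against the
# checksum recorded by the sender.
join_pieces() {
    file=$1
    cat "$file".part.* > "$file" || return 1
    md5sum -c "$file.md5"
}
```

Copy the pieces and the .md5 file across one at a time, then run join_pieces with the same file name on the guest. It returns non-zero if a piece is missing or damaged.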

After verifying the checksum is correct:

dpkg -i linux-image-3.2.0-2-amd64_3.2.17-1_amd64.deb
reboot

However, this didn’t fix the issue. Even a fresh installation doesn’t get working networking. I was, however, able to mount the partition in another VM (one running on ext3), so I could copy the ext4 filesystem over and mount it. For now, I probably need to pull the other VMs off that machine and get to the root of the issue, as I suspect rebooting any of them will result in the same problem.

Networking on the bare metal works fine, and networking on each of the still-running VMs is working. On the VM I restarted and the one I just created, however, networking is not working properly, even though both use the same scripts that had been used before.

As it turns out, the kernel issue was on the host. A new kernel was compiled, the instances were moved off, and the host was rebooted into the new kernel. Everything appears to be working fine, and the machine came right up on reboot. I’m not 100% happy with the kernel config, but things are working. It’s amazing that the bug hadn’t been hit in the 480 days the host had been up, but now that it has been identified and fixed, I was also able to apply a few tweaks using some of the enhanced virtio drivers, which should speed things up a bit.

Make sure your KVM host machine has loop device support and support for every filesystem you expect a client might use. While we did have backups that were seven days old, there was still some data worth retrieving.
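A quick way to check a host ahead of time is to look at /proc/filesystems, which lists the filesystems the running kernel has registered. The sketch below is an assumption about how one might script that check. The name fs_supported is made up, and a filesystem built as a module that has not been loaded yet will not show up until it is loaded with modprobe.

```shell
# fs_supported NAME [LIST_FILE]
# Succeeds if NAME appears in the kernel's filesystem list.
# LIST_FILE defaults to /proc/filesystems and is only a parameter so
# the check can be tried against a saved copy from another host.
fs_supported() {
    fs=$1
    list=${2:-/proc/filesystems}
    # Lines look like "nodev<TAB>proc" or "<TAB>ext4"; the filesystem
    # name is always the last field on the line.
    awk -v fs="$fs" '$NF == fs { found = 1 } END { exit !found }' "$list"
}

# Usage (not run automatically):
#   for fs in ext3 ext4 xfs; do
#       fs_supported "$fs" || echo "$fs missing, try: modprobe $fs"
#   done
#   [ -b /dev/loop0 ] || echo "no loop device, try: modprobe loop"
```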

Seagate Drive Fails right out of the shrink wrap

Friday, October 8th, 2010

We keep a number of drives on hand to replace failures. Yesterday, a drive started failing, as evidenced by the smartd logs and sudden load spikes on the machine for no apparent reason. The logs also showed the drive being hard reset and reconnected.

So, we installed the following drive:

Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3250820AS

Within 12 hours, the smartd log showed the following:

  7 Seek_Error_Rate         0x000f   085   062   030    Pre-fail  Always       -       352398337
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1545
195 Hardware_ECC_Recovered  0x001a   069   060   000    Old_age   Always       -       181870471

The drive was purchased from our supplier perhaps a year ago, when we bought a large batch of these in bulk. It was previously unopened, in the original sealed static bag, and it already registers 1545 power-on hours. I trust our hardware supplier, as we’ve been buying from them for almost 11 years, but either they or Seagate rewrapped a drive to make it appear new.
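A spare can be checked for this before it goes into service by pulling the Power_On_Hours raw value out of the smartctl -A attribute table. The helpers below are a sketch, not anything we actually ran. The names power_on_hours and is_fresh_spare and the hour limit are made up, and the code assumes the usual ten-column attribute layout shown above, with the raw value in the last column.

```shell
# Read `smartctl -A` output on stdin and print the raw Power_On_Hours
# value. Adding 0 reduces raw values such as "1545h+12m" to 1545.
power_on_hours() {
    awk '$2 == "Power_On_Hours" { print $10 + 0 }'
}

# is_fresh_spare MAX_HOURS
# Succeeds only if the attribute is present on stdin and its raw value
# does not exceed MAX_HOURS (a hypothetical limit, pick your own).
is_fresh_spare() {
    hours=$(power_on_hours)
    [ -n "$hours" ] && [ "$hours" -le "$1" ]
}

# Usage (not run automatically; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | is_fresh_spare 24 || echo "drive is not new"
```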

The drive that it replaced was an older Western Digital:

  9 Power_On_Hours          0x0032   060   060   000    Old_age   Always       -       29872
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       39
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       53

Almost 30,000 hours and only 39 power cycles. There’s a reason we usually buy Western Digital.
