Archive for the ‘Web Infrastructure’ Category

Planning to fail

Thursday, June 19th, 2008

Years ago, I was taught that to avoid downtime, you bought the best router, the best switches, the best multi-thousand-dollar computers, the best storage servers, put everything together… and then hoped.

Somewhere along the line, during a 3am phone call with a NetApp tech about a shelf failure, it became clear that the strategy had failed. NetApp dispatched before we even knew the shelf had died: their system emailed out the status, NetApp couriered a replacement shelf to the facility, and the phone call I received was from the facility asking whether they should let the NetApp technician in.

NetApp handled the problem exceedingly well, but we didn't have 100% availability: the shelf failure knocked out multiple disks in the array. NetApp had suggested the solution would give us 100% uptime, but given how it failed, I don't believe 100% was ever possible. As it stood, we ended up with 99.95%.

The data center we were located in at the time had two unplanned power outages of 30+ minutes each: multiple power feeds, but single-vault construction. Another hosting company recently had a multi-day outage because their data center also had a single power vault. Ironically, a transformer failure three years earlier should have taught them that lesson.

So, what does one do?

HP and Sun both used to chase five nines, 99.999%, by selling really expensive hardware. Google took a different approach, using COTS (Commercial Off The Shelf) hardware.

Google planned to fail.  In fact, their entire distributed computer design handles failures very gracefully.  When a node crashes, the data it held is replicated elsewhere on the network, and its tasks are handed to other nodes.  This gives Google a distinct advantage.  Because their network is designed with cheaper hardware, they can put more of it online.  With more equipment comes more CPU capacity, and that allows Google to do expensive calculations that other companies can only dream of.
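
To make that concrete, here's a minimal sketch in Python of the plan-to-fail idea: when a node dies, its tasks get handed to survivors and any under-replicated data gets copied back up to full strength. The Node and Cluster classes and the three-copy REPLICAS policy are illustrative assumptions, not Google's actual design.

    REPLICAS = 3  # keep every data block on three nodes (assumed policy)

    class Node:
        def __init__(self, name):
            self.name = name
            self.alive = True
            self.tasks = set()   # work assigned to this node
            self.blocks = set()  # data blocks stored on this node

    class Cluster:
        def __init__(self, nodes):
            self.nodes = nodes

        def survivors(self):
            return [n for n in self.nodes if n.alive]

        def handle_failure(self, dead):
            dead.alive = False
            live = self.survivors()
            # Hand the dead node's tasks to other nodes, round-robin.
            for i, task in enumerate(dead.tasks):
                live[i % len(live)].tasks.add(task)
            # Re-replicate any block that fell below REPLICAS copies.
            for block in dead.blocks:
                holders = [n for n in live if block in n.blocks]
                spares = [n for n in live if block not in n.blocks]
                for n in spares[:REPLICAS - len(holders)]:
                    n.blocks.add(block)  # copy from a surviving holder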

You can buy one expensive, reliable dual quad-core Xeon machine for $4500, or you can buy three dual quad-core Xeon machines for the same $4500. On Google's distributed computer, those three machines provide a primary and two backups, and all three stay online contributing CPU and disk, all for the price of the one expensive machine.
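
The availability side of that trade is simple math. Assuming, purely for illustration, that each machine is up 99% of the time and fails independently, three cheap machines beat one expensive one by a wide margin:

    per_machine = 0.99  # assumed availability of one machine, for illustration

    # The cluster is down only when all three copies are down at once.
    cluster = 1 - (1 - per_machine) ** 3

    print(f"one machine:       {per_machine:.2%}")  # 99.00%
    print(f"three-machine set: {cluster:.4%}")      # 99.9999%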

Because Google engineered their network expecting failure, their solution is very robust and approaches the holy grail of five nines.  A geographically dispersed, fault-tolerant network running a fault-tolerant distributed computer ensures you get the results you're looking for quickly… every time.

Did Google start their mission thinking this way, or did they hire the right thinkers?  It's certainly helped our thought processes as we engineer solutions for our clients.

Power Math

Wednesday, June 18th, 2008

Greenness aside, it's difficult to build truly dense computing when you're limited to x watts in a cabinet or cage.  Many data centers talk about engineering x watts per square foot.  With air cooling, 450-550 watts per square foot is supposedly the max, though one data center claims 1500 watts per square foot on air cooling.  Power converts to heat: more power means more heat.  While a facility may have the electrical capacity, most are really engineering for the cooling capacity.

At one particular data center, 4800 watts is the limit for a single cabinet.  Remember volts, amps, and watts?

Watts = Volts × Amps

120V × 20A service = 2400 watts.

208V × 20A service = 4160 watts.

So, if you have a cabinet with two 120V@20A circuits (4800 watts total), and dual quad-core Xeon servers come with 500-watt power supplies, ten such machines draw 5000 watts and exceed the power available in the cabinet.  And exceeding 80% of a circuit's rating is considered bad form; some companies specify a 75% maximum.
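
Here's a quick sketch of that cabinet math with the 80% derating applied (the 500-watt figure is the nameplate rating from above):

    def usable_watts(volts, amps, derate=0.80):
        # 80% rule: don't load a circuit past 80% of its rating.
        return volts * amps * derate

    cabinet = 2 * usable_watts(120, 20)  # two 120V@20A circuits
    nameplate = 500                      # dual quad-core Xeon supply rating

    print(f"usable cabinet power: {cabinet:.0f} watts")      # 3840 watts
    print(f"servers that fit: {int(cabinet // nameplate)}")  # 7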

Supermicro blade servers are more efficient: ten dual quad-core Xeon blades use 2000 watts, leaving room for support equipment.  The same blade chassis uses 1500 watts if you run it at 208V.

Yes, I know a server doesn't draw its maximum wattage all the time, but the higher the CPU utilization, the closer to that max you get, and startup power draw also approaches it.  There are power strips that do staggered power-on, set a maximum amp draw with shutoff, and so on.  These aren't problems you run into in the bread-rack and mini-tower data centers, where you just plug machines in until the circuit blows, then remove the last one.

How much power does a device actually use?

  • 48-port 10/100/1000 switch with 4 fiber uplinks: 45 watts (peak 0.4 amps)
  • P4 3.0GHz, 2GB RAM, 320GB SATA hard drive: 73.2 watts (peak 1.2 amps)
  • Core2Duo E6600, 2GB RAM, 320GB SATA hard drive: 114.8 watts (peak 1.0 amps)
  • Core2Quad Q6600 in 64-bit mode, 2GB RAM, 2 × 320GB SATA hard drives: 103.8 watts (peak 1.2 amps)
  • Core2Quad Q6600 in 32-bit mode, 2GB RAM, 320GB SATA hard drive: 110.9 watts (peak 1.0 amps)

So, even though most of these machines have 260-320 watt power supplies, their in-use wattage is less than half the stated maximum, which allows quite a bit more density.  Without that metering, if you were to spec everything by the book, you'd be seriously underutilizing your racks.
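
Here's the same cabinet math redone with the measured figures above instead of nameplate ratings. Planning around the hungriest measured box (and ignoring startup peaks, which staggered power-on mitigates) nearly triples the density:

    budget = 3840  # usable watts from the two-circuit cabinet above

    measured = {              # in-use watts from the list above
        "P4 3.0GHz": 73.2,
        "Core2Duo E6600": 114.8,
        "Core2Quad Q6600 64-bit": 103.8,
        "Core2Quad Q6600 32-bit": 110.9,
    }
    nameplate = 300  # midpoint of the 260-320 watt supplies

    by_nameplate = int(budget // nameplate)
    by_metering = int(budget // max(measured.values()))
    print(f"servers by nameplate: {by_nameplate}")  # 12
    print(f"servers by metering:  {by_metering}")   # 33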

Raritan and APC both make power strips that let you monitor power; the Raritan is a little pricier, but it meters down to the individual outlet.

Granted, you still need to plan ahead: without knowing the true power draw and the efficiency of your power supplies, you could severely overbudget or underbudget your power needs.
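
Supply efficiency is the piece that bites people. A hypothetical example: the same internal load looks very different at the wall depending on how efficient the supply is (the 70% and 90% figures below are assumptions, not measurements):

    dc_load = 100  # hypothetical watts drawn by the components themselves

    for efficiency in (0.70, 0.90):
        wall = dc_load / efficiency  # what the circuit actually sees
        print(f"{efficiency:.0%} efficient supply: {wall:.0f} watts at the wall")
    # 70% efficient supply: 143 watts at the wall
    # 90% efficient supply: 111 watts at the wall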

What does my day have in store?  Doing a ton more power math for our next expansion.
