Planning to fail
Years ago, I was taught that to avoid downtime, you bought the best router, the best switches, the best computers (multi-thousand-dollar machines), the best storage servers, put everything together… and then hoped.
Somewhere along the line, during a 3am phone call with a NetApp tech about a shelf failure, it became clear that the strategy had failed. NetApp dispatched before we even knew the shelf had gone down: their system sent out an email with the status, NetApp had a replacement shelf couriered to the facility, and the phone call I received was from the facility asking whether they should let the NetApp technician in.
NetApp handled the problem exceedingly well, but we didn't have 100% availability: the shelf failure knocked out multiple disks in the array. NetApp had suggested the solution would give us 100% uptime, but based on the facts, I don't believe there was any way it could have. As it stood, we ended up at 99.95%.
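For context, 99.95% availability still allows 0.05% of a year in downtime, which works out to roughly 4.4 hours, more than enough room for a single shelf failure and its recovery.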
The data center we were in at the time had two unplanned power outages of 30+ minutes each: multiple power feeds, but single-vault construction. Another hosting company recently had a multi-day outage because they, too, had a single power vault for their data center. More ironic still, a transformer failure three years earlier should have taught them that lesson.
So, what does one do?
When chasing five nines (99.999% availability), HP and Sun wanted to sell you really expensive hardware. Google took a different approach, using COTS (Commercial Off The Shelf) hardware.
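Five nines is a brutally tight budget: 0.001% of a year is about 5.26 minutes of allowed downtime, total, across every failure you have.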
Google planned to fail. In fact, their entire distributed computer design handles failures very gracefully: when a node crashes, the data stored on that machine is already replicated elsewhere on the network, and its tasks are handed off to other nodes. This gives Google a distinct advantage. Because their network is designed around cheaper hardware, they can put more of it online, and with more equipment comes more CPU capacity, allowing Google to run expensive calculations that other companies can only dream of.
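Here is a minimal sketch of the idea in Python. This is not Google's actual implementation (their real systems, GFS and MapReduce, are far more sophisticated); the Cluster class, the node names, and the 3-way replication factor are all illustrative assumptions:

```python
import random

REPLICATION_FACTOR = 3  # assumed; every block lives on 3 distinct nodes


class Cluster:
    def __init__(self, node_names):
        # Map each node to the set of block IDs it currently holds.
        self.nodes = {name: set() for name in node_names}

    def store(self, block_id):
        # Place the block on REPLICATION_FACTOR distinct nodes.
        for node in random.sample(list(self.nodes), REPLICATION_FACTOR):
            self.nodes[node].add(block_id)

    def fail(self, dead):
        # A node crashed: drop it, then re-replicate every block it held
        # onto a surviving node that doesn't already have a copy.
        lost_blocks = self.nodes.pop(dead)
        for block_id in lost_blocks:
            candidates = [n for n, blocks in self.nodes.items()
                          if block_id not in blocks]
            self.nodes[random.choice(candidates)].add(block_id)

    def replicas(self, block_id):
        return [n for n, blocks in self.nodes.items() if block_id in blocks]


cluster = Cluster([f"node{i}" for i in range(6)])
cluster.store("block-A")
print("before failure:", cluster.replicas("block-A"))

# Kill a node holding a copy; the cluster heals itself.
cluster.fail(cluster.replicas("block-A")[0])
print("after failure: ", cluster.replicas("block-A"))  # still 3 copies
```

The point of the sketch is that no single machine matters: a failure triggers repair, not an outage.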
You can buy one expensive, highly reliable dual quad-core Xeon machine for $4,500, or three commodity dual quad-core Xeon machines for the same $4,500. On Google's distributed computer, those three machines provide a primary and two backups, and all three are simultaneously available to contribute CPU and disk space, all for the price of the one expensive machine.
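The availability math favors the cheap machines, too. As a rough illustration (the per-machine figures here are assumptions, not measured data): suppose the expensive machine is 99.9% available and each cheap machine only 99%. Three independent cheap machines are all down at once only (1 − 0.99)³ = 0.000001 of the time, which is 99.9999% availability for the group, far better than the single expensive box.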
Because Google engineered their network expecting failure, the resulting solution is remarkably robust and approaches the holy grail of five nines. A geographically dispersed, fault-tolerant network running a fault-tolerant distributed computer ensures you get the results you're looking for quickly… every time.
Did Google start their mission thinking this way, or did they hire the right thinkers? Either way, it's certainly helped our thought processes as we engineer solutions for our clients.