Documentation Redux

June 27th, 2008

156 hours.

That's what it took to track down a solution to a problem with some Open Source software. The software was written in the early 2000s; the last documentation update was in 2006. The scenario we were designing was documented on a page written in 2004.

The issue we ran into must be something someone else has stumbled into, because it involves a very basic part of how this software operates. Yet after perusing all of the available documentation, using Google to find any possible references, and looking through the FAQs, committed code, mailing lists, etc., the solution presented itself on a page last updated in January of 2000.

A three line mention.

That’s it.

The FAQ written in 2004 describes the process and documents every step save for one very important part. The three-line mention was in another FAQ; coincidentally, it came from an email the same FAQ author had sent to a mailing list, which was then included in someone else's FAQ.

This is the inherent cost in Open Source.

We're not using this software in an odd manner; in fact, the feature we were trying to use is one of its three fundamental uses. The software hasn't changed much in the last 4 years, but it just goes to show that documentation is easily forgotten in the Open Source world.

Would I have it any other way?  No.  I prefer open source because we can develop solutions that give us a competitive edge, and, if we need to, we can change the code to fix problems that the developers won’t fix.

Often, our requirements are driven by a business case, and that conflicts with the views of some of the purist open source coders.

Open Source Documentation, what?

June 22nd, 2008

The technology is there to make documentation very easy for an open source project: wikis, blogs, web access to revision control software. Yet documentation is usually done as an afterthought or, worse, left to people who may not completely understand the product.

I have written technical documentation many times, and I evaluate a lot of open source projects to see whether they fit into our organization. I can tell you that a large percentage of the documentation for open source projects is extremely bad.

A recent project we started to evaluate had configs in its wiki-powered web site documentation that directly contradicted the company's responses on their mailing list. It took me thirty seconds to correct the documentation to reflect the right information, but the mailing list post, three months old and directly referencing the incorrect wiki page, had never prompted an update.

Open Source project maintainers: if you want people to use your product, you MUST provide good documentation. Sample config files with quick comments about usage are a start if you're not going to completely document the required configuration, but a project with little to no documentation will not get adopted by the masses.

While I appreciate the fact that the coders don’t like to write documentation, if you are going to publish a project and expect people to use it, take some time to write some documentation.  When someone suggests changes or makes modifications to the wiki, be receptive rather than adversarial.

Your project will succeed much more quickly.

Also, if you offer a commercial support package, be aware that while I'm beta testing your software, I'm also testing your support team's attitude. I know the guys hanging out on IRC, monitoring the mailing list, and responding to bug and feature inquiries are the same people I'll be contacting for support. Treat me wrong and I'll find another solution.

Monetizing GPLed software isn’t easy — I know that.  But make it easy for those of us that will end up relying on your solution and are willing to pay for a support contract to make sure we get the support we need.

Planning to fail

June 19th, 2008

Years ago, I was taught that to avoid downtime, you bought the best router, the best switches, the best computers (multi-thousand-dollar machines), the best storage servers, put everything together… and then hoped.

Somewhere along the line, during a 3am phone call with a NetApp tech about a shelf failure, it became clear that the strategy had failed. NetApp dispatched before we even knew the shelf had failed: their system sent out an email with the status, NetApp had a replacement shelf couriered to the facility, and the phone call I received was from the facility asking whether they should let the NetApp technician in.

NetApp handled the problem exceedingly well, but we didn't have 100% availability, because the shelf failure knocked out multiple disks in the array. While NetApp did suggest the solution would give us 100% uptime, based on the facts I don't believe there was any way we could have had it. As it stood, we ended up with 99.95%.

The data center we were located in at the time had two 30+ minute unplanned power outages: multiple power feeds, but single-vault construction. Another hosting company recently had a multi-day outage because they had a single power vault for their data center. More ironic still, a transformer failure 3 years earlier should have taught them that lesson.

So, what does one do?

When chasing five nines (99.999% uptime), HP and Sun both wanted to sell you really expensive hardware. Google took a different approach, using COTS (Commercial Off The Shelf) hardware.

Google planned to fail. In fact, their entire distributed computer design handles failures very gracefully. When a node crashes, the disk storage from that machine is replicated elsewhere on the network and its tasks are handed to other nodes. This gives Google a distinct advantage: because their network is designed around cheaper hardware, they can put more of it online. With more equipment comes more CPU capacity, and that allows Google to do expensive calculations that other companies can only dream of.

You can buy one expensive, reliable dual quad-core Xeon machine for $4500, or you can buy three dual quad-core Xeon machines for the same $4500. On Google's distributed computer, those three machines provide a primary and two backups, yet all three are also available to contribute CPU and disk space, all for the price of the single more expensive machine.

Because Google engineered their network expecting failure, their solution is very robust and approaches the holy grail of five nines. A geographically dispersed, fault-tolerant network running a fault-tolerant distributed computer ensures you get the results you're looking for quickly… every time.

Did Google start their mission thinking this way, or did they hire the right thinkers? It's certainly helped our thought processes as we engineer solutions for our clients.

Power Math

June 18th, 2008

Greenness aside, it's difficult to develop truly dense computing when you're limited to x amount of power in a cabinet or cage. Many data centers talk about engineering x watts per square foot. With air cooling, 450-550 watts per square foot is supposedly the maximum, though one data center claims 1500 watts per square foot with air cooling. Power converts to heat; more power means more heat. While a facility may have the electrical capacity, most are really engineering for the cooling capacity.

4800 watts is the limit for one cabinet at a particular data center. Remember watts, volts, and amps?

Watts = Volts * Amps

120Volt 20Amp service = 2400 Watts.

208Volt 20Amp service = 4160 Watts.

So, take a cabinet with two 120V@20A circuits, and consider that dual quad-core Xeon servers come with 500-watt power supplies: one cabinet with ten 500-watt machines already exceeds the 4800 watts available. On top of that, exceeding 80% of a circuit's rating is considered bad form, and some companies specify a 75% maximum.
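A minimal sketch of that budget math, using only the figures above (the 80% derating and the 500-watt nameplate are this post's assumptions, not universal values):

```python
# Cabinet power budget sketch, using the figures from this post.
# Assumptions: two 120V@20A circuits, 80% derating, 500 W nameplate per server.

def usable_watts(volts, amps, circuits=1, derate=0.80):
    """Usable watts after applying the derating rule of thumb."""
    return volts * amps * circuits * derate

cabinet = usable_watts(120, 20, circuits=2)   # 3840 W usable out of 4800 W raw
per_server = 500                              # dual quad-core Xeon nameplate

print("Usable cabinet power: %d W" % cabinet)
print("500 W servers that fit: %d" % (cabinet // per_server))   # 7
```

At the 80% rule, the cabinet really only supports seven 500-watt servers, not ten.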

Supermicro blade servers are slightly more efficient, with 10 dual quad-core Xeon blades drawing 2000 watts, which leaves room for support equipment. The same blade chassis draws 1500 watts if you run it at 208V.

Yes, I know a server doesn't draw its maximum wattage all the time, but the higher the CPU utilization, the closer to that maximum you get. Startup power draw also approaches that maximum. There are power strips that do staggered power-on, set a maximum amp draw with shutoff, and so on. These aren't problems you run into in the bread-rack-and-minitower data centers, where you just plug machines in until the circuit blows and then remove the last one.

How much power does a device actually use? Some measured examples:

  • 10/100/1000 48-port switch with 4 fiber uplinks: 45 watts (peak 0.4 amps)
  • P4 3.0GHz, 2GB RAM, 320GB SATA hard drive: 73.2 watts (peak 1.2 amps)
  • Core2Duo E6600, 2GB RAM, 320GB SATA hard drive: 114.8 watts (peak 1.0 amps)
  • Core2Quad Q6600 (64-bit mode), 2GB RAM, 2 x 320GB SATA hard drives: 103.8 watts (peak 1.2 amps)
  • Core2Quad Q6600 (32-bit mode), 2GB RAM, 320GB SATA hard drive: 110.9 watts (peak 1.0 amps)

So, even though most of these machines have 260-320 watt power supplies, their in-use wattage is less than half the stated maximum, which allows a little more density. Without that metering, if you were to spec everything based on the spec sheets, you'd be seriously underutilizing your racks.
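To put a number on the difference, here is a hedged sketch comparing spec-sheet and measured density for the Core2Quad box above (the 290-watt nameplate is just an assumed midpoint of the 260-320 watt range):

```python
# Density by nameplate vs. measured draw, using the figures above.
# Assumption: 290 W nameplate, the midpoint of the 260-320 W range mentioned.
# Ignores peak startup draw; staggered power-on handles that.

usable = 2 * 120 * 20 * 0.80   # two 120V@20A circuits, 80% derated = 3840 W
nameplate = 290.0              # assumed PSU rating
measured = 103.8               # Core2Quad Q6600 (64-bit mode), measured above

print("Servers per cabinet by the book: %d" % (usable // nameplate))   # 13
print("Servers per cabinet as measured: %d" % (usable // measured))    # 36
```

The measured figures nearly triple the density, which is the point of metering in the first place.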

Raritan and APC both make power strips that let you monitor power; the Raritan is a little more pricey, but it allows metering down to the individual outlet.

Granted, you still need to plan ahead: without knowing the true power draw and efficiency of your power supplies, you could severely overbudget or underbudget your power needs.

What does my day have in store?  Doing a ton more power math for our next expansion.

If you could have it all….

June 17th, 2008

I'm a bit of a web performance nut. I like technology when it's used to solve real challenges, and I won't use technology for technology's sake. When you look at the scalability problems of today's Web 2.0 shops, one real generalization emerges.

What is the failing point of today's sites? How many stories have you read in the media about some rising star that gets mentioned on Yahoo, Digg, or Slashdot? Generally, their site crashes under the crushing load (I've had sites slashdotted; it's not as big a deal as they would have you believe). But the problem we face is multifaceted.

Developer learns PHP. Developer discovers MySQL. Developer stumbles across a concept. Developer cobbles together code and buys hosting, sometimes on a virtual/shared hosting environment, sometimes on a VPS, sometimes on a dedicated server. But the software that performs well for a few friends hitting the site as beta testers is never really pushed. While the pages look nice, the engine behind them is usually poorly conceived or, worse, designed on the assumption that a single server or a dual-server web/MySQL combination is going to keep the site alive.

95% of the software designed and distributed under open source licenses isn't written with any awareness of the difference between a site that needs to handle 20 visitors per hour and one that needs to handle 20,000. Tuning Apache for high traffic, tuning MySQL's indexes and configuration, and writing applications designed for high traffic is not easy. Debugging and repairing those applications after they've been deployed is even harder. Repairing them while maintaining backwards compatibility adds a whole new level of complexity.

Design with scalability in mind.  I saw a blog the other day where someone was replacing a 3 server setup behind a load balancer with a single machine because the complexity of 100% uptime made their job harder.  Oh really?

What happens when your traffic needs outgrow that one server?  Whoops, I’m back to that load balanced solution that I just left.

What are the phases that you need to look for?

Is your platform ready for 100,000 users a day? If not, what do you need to do to make sure it is? Where are your bottlenecks? Where does your software break down? What is your expansion plan? When do you split your MySQL writers and readers? Where does your application boundary start and end? What do you think breaks next? Where is your next bottleneck?
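On the writer/reader split in particular, here's a minimal sketch of the idea in Python with MySQLdb; the hostnames, credentials, and round-robin replica choice are illustrative assumptions, not a recommendation for any particular stack:

```python
# Minimal read/write split: writes go to the master, reads rotate across replicas.
# Hostnames, credentials, and the round-robin policy are illustrative assumptions.
import itertools
import MySQLdb

MASTER = dict(host="db-master", user="app", passwd="secret", db="site")
READERS = [dict(host="db-read1", user="app", passwd="secret", db="site"),
           dict(host="db-read2", user="app", passwd="secret", db="site")]
_readers = itertools.cycle(READERS)

def write_conn():
    """INSERT/UPDATE/DELETE traffic always hits the master."""
    return MySQLdb.connect(**MASTER)

def read_conn():
    """SELECT traffic is spread across the replicas, round-robin."""
    return MySQLdb.connect(**next(_readers))

# Usage:
#   cur = read_conn().cursor()
#   cur.execute("SELECT id, title FROM posts ORDER BY id DESC LIMIT 10")
```

The application has to be written to tolerate replication lag on the readers, which is exactly the kind of decision that's painful to retrofit later.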

What happens with a Digg or Slashdot hit that crushes a site? Usually, it's a site with all sorts of dynamic content and ill-conceived MySQL queries generated in realtime on every pageload. I can remember a CMS framework that ran 54 SQL queries to display the front page. That is just ridiculous, and I dumped that framework 5 minutes after seeing it. Pity; they did have a good concept.

So, with scalability in mind, how does one engineer a solution?  LAMP isn’t the answer.

You pick a framework that doesn't follow the usual paradigms of an application. Why should you worry about a protocol? Design the application divorced from the protocol. You develop an application that faces the web rather than talking directly to the web, because other applications might talk to your application too. When it comes time to scale, you add machines without having to worry about task distribution. Google does it; you should too.
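Not Mantissa itself, just a rough sketch of what "divorced from the protocol" can look like: the application core is plain objects, and a thin adapter exposes it over HTTP (or anything else).

```python
# Sketch only: the application core knows nothing about HTTP or any other protocol.
# The WSGI adapter below is one of many possible front ends (HTTP, XML-RPC,
# a queue worker, ...); none of this is Mantissa's actual API.

class CommentStore(object):
    """Core application logic: plain Python, no protocol details."""
    def __init__(self):
        self._comments = []

    def add(self, author, body):
        self._comments.append((author, body))

    def latest(self, n=10):
        return self._comments[-n:]

store = CommentStore()

def wsgi_app(environ, start_response):
    """Thin web-facing adapter; it only translates between HTTP and the core."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    lines = ["%s: %s" % (author, body) for author, body in store.latest()]
    return ["\n".join(lines).encode("utf-8")]
```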

Mantissa solves that problem by being a framework that encompasses all of this. If some of these Web 2.0 sites thought about their deployment the way Google does, expansion wouldn't create much turmoil. To grow, you just add more machines to the network.
