Professional MySQL Developer?

March 9th, 2009

Once in a while I get to troubleshoot problems on other people's machines, specifically with MySQL.  Sometimes that task is difficult because the vendor-written code has been Zend Encoded: perhaps in the name of performance, possibly to protect the code, or to prevent me from seeing just how bad it is.  Diagnosing MySQL issues forensically requires a few scripts, patience, and time.

This time, I modified /etc/mysql/my.cnf on two machines and added the following:

log        = /var/log/mysql/mysql.log
log-slow-queries= /var/log/mysql/mysql-slow.log
log-queries-not-using-indexes

This gave me three things: a full log of all queries executed (I have written a small program to normalize the queries and sort them so I can focus on the queries that take 90% of the time), a log of queries that take more than 10 seconds (adjustable via long_query_time), and a log of any query that doesn't use an index.  Usually I will focus on the unindexed queries, as those generally cause the problems.  Very few people understand the impact of table locks, and that is usually where most problems occur.
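
The normalizer I mentioned boils each query down to a fingerprint by stripping out literal values, then tallies which fingerprints dominate.  My actual script does more than this, but a simplified sketch of the idea looks like:

```python
import re
from collections import Counter

def normalize(query):
    """Reduce a query to a fingerprint by replacing literals with ?."""
    q = query.strip().lower()
    q = re.sub(r"'[^']*'", "?", q)   # quoted string literals -> ?
    q = re.sub(r"\b\d+\b", "?", q)   # bare numeric literals -> ?
    q = re.sub(r"\s+", " ", q)       # collapse whitespace
    return q

def top_queries(log_lines, n=10):
    """Tally normalized queries so the hot spots float to the top."""
    counts = Counter(normalize(line) for line in log_lines if line.strip())
    return counts.most_common(n)
```

With this, a thousand variations of the same SELECT collapse into one line with a count next to it, which makes the 90% obvious.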

The first example comes from an application which claims to be able to support 100,000 visitors per day.  Perhaps with enough hardware behind it, it can; but in its current state, on a quad-core Xeon, roughly 40,000 per day is all it can push.

Where do we start?  So few lines, so many errors.

CREATE TABLE `iplog` (
`ip` varchar(255) NOT NULL default '',
`pid` bigint(20) NOT NULL default '0',
`viewed` varchar(255) NOT NULL default '',
`locked` tinyint(1) NOT NULL default '0',
PRIMARY KEY  (`ip`,`pid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The above schema replicates functionality in the prior version of the software.  The old version of the system didn't do things this way, so I am unsure why the developer chose to alter his habits and break things in so many ways.

ip varchar(255)

An IPv4 address takes at most 15 characters.  char(15) would have been more efficient, storing the full dotted address, '.'s included, in 15 bytes.  Even varchar(15) would have done it in 16 bytes, one extra byte for the length.  But varchar(255) just takes the cake.  The proper way to store an IPv4 address is as an unsigned int.  Even barring the fact that varchar(15) or char(15) would have been more space efficient, using an unsigned int keeps the index B-tree far more compact and balanced.
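
The conversion is trivial; MySQL even does it server-side with INET_ATON() and INET_NTOA().  The same thing in Python, for scripts that massage the data outside the database:

```python
import socket
import struct

def ip_to_uint(ip):
    """Pack a dotted-quad IPv4 address into a 32-bit unsigned integer."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def uint_to_ip(n):
    """Unpack a 32-bit unsigned integer back into dotted-quad form."""
    return socket.inet_ntoa(struct.pack("!I", n))
```

Four bytes per address instead of up to 256, and the keys sort numerically.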

Then, let's add an index to the IP field.  Since indexes space-pad keys to their full length, the MySQL index file on disk is actually larger than the data file itself.

I might object to the use of InnoDB on a table of questionable value.  The overhead of InnoDB for data that is valuable only for a few minutes seems like overkill, but we'll give some credit for data integrity.

That really wasn’t the worst offender.  Another application written by another professional programmer included roughly 80 queries to display one page.  Once you dig deeper into the code, you find queries like this:

# Query_time: 0  Lock_time: 0  Rows_sent: 1  Rows_examined: 984
SELECT count(VID) AS total_videos FROM video WHERE type='public' AND ( keyword LIKE '%keyword%' OR keyword LIKE '%keyword%' OR keyword LIKE '%keyword%' OR keyword LIKE '%keyword%');

First, VID is auto_increment and NOT NULL.  count(VID) forces MySQL to scan the matching rows and check VID for NULL on each one.  Due to the table construction, it can never be NULL, so SELECT count(*) would have been faster.
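
The semantic difference is that count(column) skips NULLs while count(*) counts rows.  A quick illustration using SQLite, which follows the same counting semantics (the table and data here are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE video (VID INTEGER, keyword TEXT)")
conn.executemany("INSERT INTO video VALUES (?, ?)",
                 [(1, "cats"), (2, "dogs"), (None, "orphan")])

# count(*) counts every row; count(VID) skips rows where VID IS NULL.
rows, non_null = conn.execute(
    "SELECT count(*), count(VID) FROM video").fetchone()
print(rows, non_null)  # 3 2
```

When the column is declared NOT NULL the two answers are identical, so the per-row NULL check in count(VID) is pure waste.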

Second, they disassembled the input and wrapped each term in % so they could do wildcard substring matches, rather than using a fulltext index, which would have been much faster.  A leading % defeats any index on the column, so every row must be examined.  As the table only has 984 rows now, once they start growing the site, they'll start to see more performance issues.  A fulltext index would allow for quite a bit more growth.

While the following query isn’t optimizable, the results could be cached.

# Query_time: 0  Lock_time: 0  Rows_sent: 5  Rows_examined: 155
SELECT * from channel ORDER BY rand() LIMIT 5;

As a result of the ORDER BY rand(), MySQL will need to do a full table scan every time this query is called.  In this case, twice for each page load.
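
One cheap fix is to cache the channel list for a short window and pick the random rows in application code.  A minimal sketch of the approach (the names and the fetch hook are hypothetical, standing in for whatever query layer the app uses):

```python
import random
import time

_cache = {"rows": None, "expires": 0.0}
CACHE_TTL = 60  # seconds; slightly stale "random" picks are fine here

def random_channels(fetch_all, k=5, now=time.time):
    """Return k random rows, hitting the database at most once per TTL.

    fetch_all is the caller-supplied function that runs the real
    SELECT * FROM channel query (hypothetical stand-in).
    """
    if _cache["rows"] is None or now() >= _cache["expires"]:
        _cache["rows"] = fetch_all()
        _cache["expires"] = now() + CACHE_TTL
    return random.sample(_cache["rows"], min(k, len(_cache["rows"])))
```

Two full table scans per page load become one scan per minute, shared across all visitors.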

Another particularly interesting query:

select tradeId from trades2 where tradeDomain = 'noref';

Taken on its own, this just looks like tradeDomain was omitted from the indexes.  However, the table name gives you some indication of the mess.  Here's another query that ran without an index:

select tradeId from trades28 where tradeDomain = 'noref';

There are a total of 50 trades tables, corresponding to the number of 'sites' in the sites table.  It seems our developer couldn't quite figure out how to index on two fields.

In the last 4 hours, the unindexed queries from this one site have resulted in the following:

-rw-r—–  1 mysql adm  10869491 2009-03-09 10:53 mysql-slow.log

Roughly 54000 queries in 4 hours that had no index.  Compared to yesterday’s logs when I added a number of indices, it is a world of difference.

However, fixing the issues in the code is still more expensive than throwing hardware at the problem.  The trouble with using hardware as a crutch is that eventually you reach an impasse where the problem must be fixed, or you run into other bottlenecks.

Based on a few changes made by adding indexes, the load on those two machines has dropped considerably.  Each of the vendors that wrote the systems I analyzed claims to handle far more traffic than either machine is currently handling.  Those claims were posted on their web sites roughly three years ago, and the machines they are currently running on are much newer and much more powerful than those the software was originally installed on.

One of the software vendors is out of the business due to too many support issues.  One software vendor continues churning out bad code and getting people to pay $600 for it.  The third developer disappeared and was replaced by a team of two programmers who appear to know what they are doing and are releasing version 5.0 (the current, latest version is 3.1) ‘soon’.  I know it will be soon because their web site, last updated in July 2008, says they are in limited beta.

I guess it is time to order some hardware.

Hiding Data in Plain Sight

March 4th, 2009

I had a breakfast meeting today with a company involved in forensic reconstruction of data after a possible crime had been committed.  Somehow the conversation shifted, we talked about the process, and one of the people said, "You know, it wouldn't be so bad if we didn't have to wade through all that spam only to find nothing worthwhile in the email messages that showed how the person communicated."

At this point I asked, "Have you ever thought that they could be using spam steganography?"  Eyebrows were raised, the conversation paused, and I was met with blank stares for about 30 seconds.

The assumption is that encrypted data needs to look like encrypted data: an unintelligible string of numbers and letters.  While this system doesn't produce especially well-hidden data, the premise is valid.

Dear Friend ; Thank-you for your interest in our publication
. If you no longer wish to receive our publications
simply reply with a Subject: of “REMOVE” and you will
immediately be removed from our club ! This mail is
being sent in compliance with Senate bill 1816 ; Title
3 ; Section 304 . This is not multi-level marketing
. Why work for somebody else when you can become rich
within 45 days . Have you ever noticed more people
than ever are surfing the web & people love convenience
! Well, now is your chance to capitalize on this .
We will help you SELL MORE and use credit cards on
your website . You are guaranteed to succeed because
we take all the risk ! But don’t believe us ! Ms Ames
of Montana tried us and says “I was skeptical but it
worked for me” ! We are licensed to operate in all
states . We implore you – act now . Sign up a friend
and you’ll get a discount of 80% ! Best regards . Dear
E-Commerce professional , Especially for you – this
breath-taking news . We will comply with all removal
requests . This mail is being sent in compliance with
Senate bill 1626 ; Title 1 ; Section 301 . This is
different than anything else you’ve seen ! Why work
for somebody else when you can become rich in 38 weeks
. Have you ever noticed most everyone has a cellphone
plus people love convenience ! Well, now is your chance
to capitalize on this . We will help you decrease perceived
waiting time by 200% plus use credit cards on your
website ! You are guaranteed to succeed because we
take all the risk . But don’t believe us . Mr Jones
of Georgia tried us and says “Now I’m rich many more
things are possible” ! This offer is 100% legal ! So
make yourself rich now by ordering immediately ! Sign
up a friend and you’ll get a discount of 60% . Best
regards !

The above message decodes to: This is a test message
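
The real encoder behind messages like that uses a full mimic-function grammar, but the core idea fits in a few lines: each bit of the secret selects one of two innocuous phrasings, and the decoder reads the choices back.  The sketch below is a toy illustration of the principle, not the actual algorithm, and the phrase pairs are invented for the demo:

```python
# Toy spam steganography: each bit chooses between two phrasings.
# A real grammar has hundreds of such choice points per sentence.
PAIRS = [
    ("Dear Friend ;", "Dear E-Commerce professional ,"),
    ("This is not multi-level marketing .", "This offer is 100% legal !"),
    ("act now .", "order immediately ."),
]

def encode(bits):
    """Emit one sentence per bit: 0 picks the first variant, 1 the second."""
    return [PAIRS[i % len(PAIRS)][b] for i, b in enumerate(bits)]

def decode(sentences):
    """Recover the bits by checking which variant appears at each position."""
    return [PAIRS[i % len(PAIRS)].index(s) for i, s in enumerate(sentences)]
```

To a filter, or a forensic examiner, the output is indistinguishable from the millions of genuine spam messages it sits next to.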

I now wonder if they will be redoing that investigation looking for steganographically encoded spam.

Legacy Code Fix versus Code Rewrite

February 28th, 2009

Python frameworks embrace DRY: Don't Repeat Yourself.  That is a powerful mantra to follow when developing applications.

I am faced with a quandary.  I have an application that consists of roughly 40,000 lines.  That code was written in PHP many years ago, with a hand-built web framework implemented with Smarty.  There are a number of issues with the existing code, including inline functions duplicated across multiple files, poorly validated input, and bad structure.  Outsourcing some code development appeared to be cost effective; in the end, the code quality churned out by that vendor was sub-par.  Maintenance of that codebase easily costs twice as much as it should.

This week, a few requirements cropped up which raised an interesting question.  Knowing that the code consists of 40,000 lines of poorly written, difficult-to-maintain code, I debated whether fixing the existing code would be quicker than rewriting the relevant portion and coding the addendum.  TurboGears, a Python web application framework, would shrink the code considerably, since it is thin middleware on top of a WSGI-compliant server.

Where it took 45 lines of code to do a halfway decent job of validating a few input fields in PHP with Smarty, the equivalent in TurboGears consists of a model definition containing the validator segment and a few lines of code to define the page.  Updating the database becomes one line of code, replacing 8-12 lines.
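
The win comes from the declarative style: you state what a field must be, and the framework does the checking.  A plain-Python sketch of the idea (this imitates the shape of a validator schema; it is not actual TurboGears or formencode code, and the field names are invented):

```python
class Int:
    """Validator: coerce to int and enforce optional bounds."""
    def __init__(self, min=None, max=None):
        self.min, self.max = min, max
    def to_python(self, value):
        n = int(value)  # raises ValueError on non-numeric input
        if self.min is not None and n < self.min:
            raise ValueError("must be >= %d" % self.min)
        if self.max is not None and n > self.max:
            raise ValueError("must be <= %d" % self.max)
        return n

class String:
    """Validator: strip whitespace, optionally reject empty values."""
    def __init__(self, not_empty=False):
        self.not_empty = not_empty
    def to_python(self, value):
        s = str(value).strip()
        if self.not_empty and not s:
            raise ValueError("must not be empty")
        return s

# One schema replaces dozens of lines of hand-rolled checks.
signup_schema = {"age": Int(min=0, max=120), "name": String(not_empty=True)}

def validate(schema, form):
    """Run every field through its validator, returning cleaned values."""
    return {field: v.to_python(form[field]) for field, v in schema.items()}
```

Each rule is stated once, next to the field it governs, instead of being re-implemented in every page that touches the form.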

I had planned to convert the application over to TurboGears eventually, but, the scope of this current project gives me an opportunity to convert a piece of the application over while adding the new features, and leaving the legacy code running as is.

The features I need to add will take roughly 150 lines of Python/TurboGears code, or perhaps 1,500-2,000 lines of PHP to achieve the same functionality.  I have debated using another PHP framework as a stopgap, but I have yet to find a decent form library for it that works well.

If I had to pick a favorite, ad_form from openacs would top the list.  TurboGears and Genshi with formencode come in as a close second.

I believe rewriting the portions of the app I need to touch will take roughly the same amount of time as patching the existing code.  The investment in time will put me closer to finishing the complete rewrite of the existing system.

An added advantage is that I can fix architectural issues with the existing package that couldn’t easily be reworked without considerable effort.  If the code you are maintaining is over five years old, you owe it to yourself to check out some of the other frameworks out there.  Prior to settling on TurboGears, I looked at Django, Catalyst, Mason and a number of other frameworks.  I even searched the PHP frameworks but didn’t find anything with the strengths I saw with TurboGears.

Concurrent processing

February 5th, 2009

It has been a while since I've written parallel or concurrent processing code.  Threaded programming is something that even the experts who wrote Apache and PHP have problems with, yet writing this type of code is somewhat enjoyable.

It started a few years ago when I replaced 112k lines of C code and libraries that never quite did what the design document specified.  I communicated the idea to the coders, the coders wrote the design document and delivered a product that didn’t even do what the design document specified.

The code was scrapped and rewritten in Perl, comprising about 1,200 lines not including the CPAN libraries used.  The new code is faster, more reliable, and largely agnostic about its workload.  It has more capabilities, but leaves more of the work to the tasks it passes around, which lets the core concentrate on communications and dispatch.

Initial testing was rather thorough, driven by the bugs and issues encountered during the previous version's reign.  While we've run into minor glitches with the new code, it is considerably more reliable, to the point where it is now tasked to do more.  But although the dispatch method was rewritten, concurrent task collision wasn't tested nearly enough.

And therein lies the problem.  The previous system accepted a task, opened a connection and waited until the task completed.  Collisions couldn’t occur because each task would open a connection to the remote machine and wait until the task completed.  For short tasks, this wasn’t a real issue.  Longer tasks risked the socket timing out.  If 15 tasks were sent, 15 connections remained open until the tasks completed.

The replacement system hands off a task but doesn't wait for it to complete; the remote machine handles its packet and returns the task results.  The problem is that multiple tasks queued for the same machine result in occasional collisions.  Task order isn't important, but sometimes a task is fetched twice, or a task is missed and left in the queue.  A task left in the queue is redispatched, but the double-fetch issue has been difficult to debug.  Put in the slightest amount of debugging code and voila, tasks are dispatched properly under every test that can be thrown at it; remove the debugging code and the error returns.  That disappearing act is the classic sign of a timing-dependent race, since the extra I/O from the debug statements changes the scheduling.
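
The Perl system uses file locks, but the shape of the bug is the same in any language: a check-then-act fetch where the check and the claim aren't atomic.  A minimal Python sketch of the fetch path (illustrative names, not the actual dispatcher), where holding a lock across both steps makes the double fetch impossible:

```python
import threading
from collections import Counter

class TaskQueue:
    """Fetch is check-then-act; it is safe only because _lock covers
    both steps.  Without the lock, two workers can both observe the
    same task as unclaimed and each fetch it -- the double fetch."""

    def __init__(self, tasks):
        self._tasks = list(tasks)
        self._claimed = set()
        self._lock = threading.Lock()

    def fetch(self):
        with self._lock:  # make test-and-claim one atomic step
            for t in self._tasks:
                if t not in self._claimed:
                    self._claimed.add(t)
                    return t
        return None

def drain(queue, results):
    while (t := queue.fetch()) is not None:
        results.append(t)

q = TaskQueue(range(100))
results = []
workers = [threading.Thread(target=drain, args=(q, results)) for _ in range(8)]
for w in workers:
    w.start()
for w in workers:
    w.join()

dupes = [t for t, n in Counter(results).items() if n > 1]
```

With the lock in place, all 100 tasks come back exactly once no matter how the eight workers interleave; comment out the `with self._lock:` line and the duplicates reappear, but only under load and rarely when you're watching.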

The task is left in the queue for processing, and the file locking for the state machine has been double- and triple-checked, but I'm sure once I dig into it, I'll find some logic error that leaves a stale lock or incorrectly clears one.

I remember I used to love writing code like this, though, I always dreaded debugging it.

This Word, “Scaling”

June 30th, 2008

It seems that everyone on the blogosphere, including Divmod, is talking about “scaling” these days.  I’d like to talk a bit about what we mean — and by “we” I mean both the Twisted community and Divmod, Inc. — when we talk about “scaling”.

[more]

One of the issues that I continually deal with is the scaling of an application and its platform.  I believe this author makes the point rather well.
