mysql-python and Snow Leopard

September 1st, 2009

After the upgrade to Snow Leopard, mysql-python cannot be installed through easy_install.

* Install MySQL’s x86_64 version from the .dmg file. I had a problem doing this while booted into the 64-bit kernel (a 32-bit kernel MacBook had no problem): the system reported ‘no mountable file systems’ when trying to mount the .dmg. A reboot into the 32-bit kernel allowed the .dmg to be mounted and installed.
* Activate your virtual environment if desired: source bin/activate
* Fetch MySQL-python-1.2.3c1, then build and install it:

tar xzf MySQL-python-1.2.3c1.tar.gz
cd MySQL-python-1.2.3c1
ARCHFLAGS='-arch x86_64' python setup.py build
ARCHFLAGS='-arch x86_64' python setup.py install

If everything works, you should see:

$ python
Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
>>>

Some of the problems you may encounter:

After python setup.py build:

ld: warning: in build/temp.macosx-10.6-universal-2.6/_mysql.o, file is not of required architecture
ld: warning: in /usr/local/mysql/lib/libmysqlclient.dylib, file is not of required architecture
ld: warning: in /usr/local/mysql/lib/libmygcc.a, file is not of required architecture

This means that you have the i386 version of MySQL installed, or, if you do have the x86_64 version, that you didn’t include the proper ARCHFLAGS setting.
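You can check which architectures a library was actually built for with the file command (the path below assumes the default location used by the .dmg installer); the 64-bit build should report x86_64:

file /usr/local/mysql/lib/libmysqlclient.dylib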

ImportError: dynamic module does not define init function (init_mysql)

This means that easy_install, or a build/install run without ARCHFLAGS, tried to create a combined ppc/i386/x86_64 build.
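The fix is to remove the stale build directory and rebuild with ARCHFLAGS set, as above:

rm -rf build
ARCHFLAGS='-arch x86_64' python setup.py build
ARCHFLAGS='-arch x86_64' python setup.py install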

If you see messages like:

$ python
Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import MySQLdb
/Users/xxxxx/Python/django/lib/python2.6/site-packages/MySQL_python-1.2.3c1-py2.6-macosx-10.6-universal.egg/_mysql.py:3: UserWarning: Module _mysql was already imported from /Users/xxxxx/Python/django/lib/python2.6/site-packages/MySQL_python-1.2.3c1-py2.6-macosx-10.6-universal.egg/_mysql.pyc, but /Users/xxxxx/Python/django/MySQL-python-1.2.3c1 is being added to sys.path
>>> 

Then you are still in the build directory. cd .. and try again.

MySQL Query Optimization

August 28th, 2009

I heard a comment from a developer the other day:

You don’t need indexes on small tables.

So I asked what the definition of a small table was. He said, anything with a few hundred rows. So I said, 2300 rows? Well….. 24000 rows? Well….. 292000 rows? That’s large. I showed him unindexed queries in his application dealing with tables that had 2300, 24000 and 292000 rows.

Avoid tablescans

When MySQL deals with an unindexed query, it does a full tablescan, checking each record in the table against the criteria specified. On a small table, if the query is executed frequently, the MySQL query cache might be able to serve the query. On a larger table, or a table with large rows, it must read every row, check the fields, possibly create a temporary table in RAM or on disk, and return the results. On a small site you might not notice it, but on a large system, forcing tablescans on tables with even a few thousand rows will slow things down considerably:

Uptime: 60016 Threads: 11 Questions: 105460332 Slow queries: 197769 Opens: 5819 Flush tables: 1 Open tables: 1320 Queries per second avg: 1757.204

Slow queries are sometimes unavoidable, but, often, slow queries are missing an index.
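EXPLAIN is a quick way to spot them: a join type of ALL with nothing in the key column means a full tablescan (mytable and mycolumn are placeholder names):

EXPLAIN SELECT * FROM mytable WHERE mycolumn = 1;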

Use the slow-query log to find potential issues

When analyzing a system to find problems, putting:

log-queries-not-using-indexes

in the my.cnf file and restarting MySQL will log the unindexed queries to the slow query log.
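A minimal my.cnf fragment might look like this (the log file path is an example; log-slow-queries is the option name in the MySQL 5.0/5.1 era):

[mysqld]
log-slow-queries = /var/log/mysql/mysql-slow.log
log-queries-not-using-indexes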

What can be indexed?

The rule of thumb when writing indexes is to write your query in such a way that you reduce the result set as quickly as possible, putting the highest-cardinality columns first. What does this mean?

If you are collecting the IP address and the date of each hit, an index on date,ip will actually be worse than ip,date. Imagine receiving 40000 hits to your site on the same date. If you were looking for the number of hits a particular IP had made, with ip,date you would search the 41 hits that IP has made over time and then find the 8 from today. If you queried by date,ip, you would search 40000 rows to find the same 8.

Each index you add carries extra overhead, and an index file should be as small as possible. An IP address can be represented as an unsigned int, which takes much less space than the varchar(15) usually used. Remember that when you index a varchar field, the key is space-padded to the full length. If you have a variable-length field you want indexed, you might be able to figure out the significant portion of that field by finding the average length, adding a few characters for good measure, and indexing fieldname(15) rather than the entire field. If a value is longer than those 15 characters, you have still created a significant reduction in the number of rows that must be checked.
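As a sketch, a hypothetical hit-tracking table applying these points (table and column names are illustrative):

CREATE TABLE hits (
  ip       INT UNSIGNED NOT NULL,  -- store INET_ATON('192.0.2.1') rather than a varchar(15)
  hit_date DATE NOT NULL,
  url      VARCHAR(255) NOT NULL,
  KEY ip_date (ip, hit_date),      -- higher-cardinality column first
  KEY url_prefix (url(15))         -- index only the significant prefix
);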

Cardinality refers to the uniqueness of the data. The more unique the data, the lower the chance that you’ll have thousands of records matching the first criterion. Since MyISAM and InnoDB use a B-Tree index (or an R-Tree if you use a spatial index), data that is very similar when inserted can create an imbalanced tree on disk, which leads to slower lookups. An OPTIMIZE TABLE will re-sort and rebuild the index to eliminate this, but you can’t do that on an extremely large, active table without impacting response times.
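When the table can be locked briefly, the rebuild is a single statement (mytable as a placeholder):

OPTIMIZE TABLE mytable;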

# Query_time: 0 Lock_time: 0 Rows_sent: 1 Rows_examined: 3323
SELECT * FROM websites_geo where (zoneid = '5135') LIMIT 1;

In this case, zoneid is not indexed on the table websites_geo. Adding an index on zoneid eliminates the tablescan on this query.
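The fix is a one-line ALTER using the table and column from the query above:

ALTER TABLE websites_geo ADD INDEX (zoneid);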

Check for equality, not inequality.

An index can be used for equality and range checks; a query checking that values are not equal cannot use an index.

# Query_time: 0 Lock_time: 0 Rows_sent: 5 Rows_examined: 2548
SELECT * FROM websites where (id = '1056692' && status != 'd' && status != 'n') order by rand() LIMIT 5;

# Query_time: 0 Lock_time: 0 Rows_sent: 10 Rows_examined: 2544
SELECT * FROM websites where (status != 'n' && status != 'd' && traffic > 3000) order by added desc LIMIT 10;

These two queries show two different issues but stem from the same fundamental problem. First, id is not indexed, which would have at least limited the result set to 9 records rather than 2548. The status check isn’t able to use an index at all. In the second query, status is checked followed by traffic, and there are other queries issued that check status,traffic,clicks_high. When we look at status (which should be an enum or char(1) rather than varchar(1)), we find that only 4 values are used. By indexing on id,status and status,traffic,clicks_high, we could alter the queries as follows:

SELECT * FROM websites where (id = '1056692' && status in ('g',' ')) order by rand() LIMIT 5;

SELECT * FROM websites where (status in ('g',' ') && traffic > 3000) order by added desc LIMIT 10;

which would result in both queries using an index.

Choose your data types intelligently.

As a secondary point, id (though it is numeric) happens to be a text field. If you index id in this case, you would have to specify a key length.

mysql> select max(length(id)) from websites;
+-----------------+
| max(length(id)) |
+-----------------+
|              22 |
+-----------------+
1 row in set (0.02 sec)

mysql> select avg(length(id)) from websites;
+-----------------+
| avg(length(id)) |
+-----------------+
|          8.3315 |
+-----------------+
1 row in set (0.00 sec)

mysql>

Based on this, we might decide to set the key length to 22, as it is a relatively small number and allows room to grow. Personally, I would have opted to make the id an unsigned int, which would be much smaller, but the application developer uses alphanumeric ids which are exposed externally. With sharding, you could use the id throughout the various tables, or you could map the text id to a numeric id internally for all of the various tables.
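Putting the pieces together, the two composite indexes described earlier would look something like this, with the 22-character key length applied to the text id (the index names are illustrative):

ALTER TABLE websites ADD INDEX id_status (id(22), status);
ALTER TABLE websites ADD INDEX status_traffic (status, traffic, clicks_high);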

There are a number of possible solutions to help any SQL engine perform better. And your data set will dictate some of the things that you can do to make data access quicker.

Helping MySQL Help You

If you do select * from table where condition_a=1 and condition_b=2 in one place, and select * from table where condition_b=2 and condition_a=1 in another, setting up a single index on condition_a,condition_b and adjusting your second query, reversing the conditions to the same order as the keys on the index, will increase performance.
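In other words, one index serves both call sites once the column order matches (mytable as a placeholder for the generic table above):

ALTER TABLE mytable ADD INDEX a_b (condition_a, condition_b);
SELECT * FROM mytable WHERE condition_a = 1 AND condition_b = 2;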

Limit your results

Another thing that will help considerably is using a limit clause. So many times a programmer will do select * from table where condition_a=1, which returns 2300 rows of which only the first few are used. A limit clause prevents a lot of data from being fetched by MySQL and buffered while waiting for the response. select * from table where condition_a=1 limit 20 would hand you just the first 20 records.

Avoid reading the data file, do all your work from the index

Additionally, if you have a table and only need three of the columns from the result, select fielda,fieldb,fieldc from table where condition_a=1 will return only those three fields. As an added boost, if every field you select and check can be answered from the index, the query never hits the actual data file and is answered entirely from the index. Many times I’ve added a field that wasn’t strictly needed to an index just to eliminate the lookup of the key in the index followed by the corresponding read of the data file.
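A sketch with the same placeholder names; once condition_a and the three fields all live in one index, the query below is answered entirely from the index:

ALTER TABLE mytable ADD INDEX covering (condition_a, fielda, fieldb, fieldc);
SELECT fielda, fieldb, fieldc FROM mytable WHERE condition_a = 1;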

Let MySQL do the work

MySQL reads tables, filters results, and can do some calculations. Going through 40000 records to pick the best 100 is still faster in MySQL than letting PHP fetch all 40000 rows and do the calculations and sorts to come up with those 100 rows. Index, optimize, and let MySQL do the database work.

Summary

Making MySQL work more efficiently goes a long way towards making your database driven site work better. Adding six indexes to the system resulted in quicker response times and an increase in the transactions per second.

Uptime: 32405 Threads: 1 Questions: 58729705 Slow queries: 64122 Opens: 2911 Flush tables: 1 Open tables: 295 Queries per second avg: 1812.366

Previously, MySQL was generating 3.3 slow queries per second. Now we’re just beneath 2 slow queries per second and our system is processing 55 more transactions per second. There is still a bit more analysis to do to identify the slow queries that remain and to alter the queries to reverse the inequality checks, but even just adding indexes to a few tables has helped noticeably. Once the developer is able to make some changes to the application, I’m sure we’ll see an additional speedup.

Rapid Application Development using Turbogears and Django

August 8th, 2009

For the last 14 months we’ve been developing an application to replace 90000 lines of PHP code. Rewriting the application from scratch to support I18N and many of the enhancements it needed was deemed to be a better long term solution.

When that project was first started, I spent a month with Django and a month with Turbogears writing the same test application so that I could compare the development cycle. Both have matured and I needed to do a rapid turnaround on another project. I decided to give Django another look since it had hit version 1.0 and had added quite a few features that were missing in my preliminary evaluation. What follows is a discussion of the major points from both of the frameworks.

Turbogears has excellent form handling. Except for the Forms Wizard in Django, working with forms is much easier in Turbogears. Validation of the form and the resulting database update methods are much cleaner in Turbogears. Django, while slightly more complex in handling form input, does handle things with a single function which might enhance readability in a large project.

Database handling through SQL Alchemy in Turbogears is much better than the database methods in Django. Yes, you can use SQL Alchemy in Django now, but, their default ORM has plenty of room for improvement.

Turbogears is true MVC. Their terminology and methods are true to the paradigm set forth by Smalltalk. Django is MVC, but they call it MTV, Model, Template, View. The differences are slight and the developers of both have made good decisions. Adapting to either project’s methods is quick and not a hindrance.

Django definitely wins with Authentication/Authorization. Methods to handle registration, user creation, login and forgotten passwords are built in and wrapped with very nice templates that can be overridden. For Turbogears, repoze.who and repoze.what have been pulled from Plone and put into place. While Turbogears works with repoze, the decisions made and the lack of full support behind it make it difficult to implement.

Django feels faster. Comparing 14 months of development in Turbogears on an application to an application written in Django this week, the template engine, database access and pageload time seemed faster. Django is a lighter weight framework and you are closer to the data. Turbogears puts a little more insulation in which makes some coding easier at the expense of performance.

Maintainability of code would have to go to Turbogears. IBM once stated that the maximum number of bug-free lines of code that could be written was 23. With Turbogears, much of the heavy lifting is handled by widgets, decorators and your model, resulting in a smaller codebase. Django requires more code to do the same task unless you utilize some of the snippets. Turbogears makes certain assumptions and has wrapped many of the libraries that make development easy into the default installation. Django’s default installation lacks those decisions, but you are not prevented from establishing your own middleware. If you were developing a number of Django projects, you would pick and choose snippets to replicate the decisions that Turbogears has already made.

URL Mapping is much easier to implement with Django. While routes support in Turbogears is present, Django’s regexp mapping is much easier to manipulate.

Community, hands down, Django wins. With a much larger installed base, bugs are found and fixed much more quickly. While Turbogears was founded on loftier principles, execution and follow through are lacking. Development is done when it is needed by a client project in the core group of developers. There is a definite air of condescension when the project developers handle questions from potential developers. With Django, there are people of all experience levels willing to help on groups.google, IRC, and thorough documentation that far exceeds most of the open source documentation out there.

Documentation, again, Django. Well organized, well thought out and well documented examples on Django’s part show dedication to having a framework that is well understood and well used. Turbogears recently moved to Sphinx, but, documentation generated from poorly documented code still means poor documentation. The tutorials and examples have been improving, but, they have a long way to go.

Genshi and Mako are supported fairly well with Turbogears and are both very good engines. Jinja, which is a bit faster than Genshi, is also supported, and is powerful and very easy to work with. Django’s template language is also very flexible, powerful and easy to work with. Django has some definite advantages with a simpler language, but neither Django nor Turbogears stood out as a clear winner.

If you were looking to write an extremely heavy database or form involved site, I think Turbogears would be a good solution. If you choose Turbogears, be prepared to delve into the code when you are faced with problems. Bugs are not dealt with very promptly even though upgrades are pushed out. Be prepared to patch upgraded packages or hold certain packages once you move into production.

On the other hand, if you are writing a less complicated application, Django would be a good choice.

All told, the application developed this week in Django took about 12 hours; had I been working with Django for the last 14 months, I estimate the project would have taken roughly 8 hours. Developed in Turbogears, I probably could have written it in 6 hours. In PHP, mimicking all of the functionality would have taken 14-16 hours even using one of the numerous frameworks.

There is a definite advantage to using a framework and Python frameworks do appear to offer Rapid Application Development even over most of the PHP frameworks. For me, new development will probably be done in Django unless there is a strong case to use Turbogears.

* Turbogears 2.0
* Django
* Django Snippets

ESI Widget Issues in the Varnish, ESI, WordPress experiment

July 26th, 2009

The administration interface is quite simple: once the widget is installed, drag it to the Sidebar, then drag any widgets that you want displayed into the ESI Widget Sidebar.

[screenshot: esi-widget admin interface]

Current issues:
* When a logged-in user comments on a post, their ‘login’ information is left on the page if they are the first person to hit the page when Varnish caches it. That is, if someone is logged in and visits a post page that hasn’t previously been cached, the HTML showing their login status is cached, and new visitors then see that information even though they lack the credentials.

Addons that don’t work properly:
* Any poll application (possible solution to wrap widget in an ESI block)
* Any stat application (unless they convert to a webbug tracker, this probably cannot be fixed easily)
* Any advertisement/banner rotator that runs internally. OpenX will work, as will most non-plugin rotators.
* Any postcount/postviews addon
* CommentLuv?
* ExecPHP (will cache the output, but does work)
* Manageable

Any plugin that does something at post or comment time and isn’t dependent on logged-in data should work without a problem. Anything that requires a login, or uses the IP address to determine whether a visitor has performed an action, will have a problem due to the aggressive caching. For sites where content needs to be served quickly and there aren’t many comments, ESI Widget should work well.

Because of the way Varnish works, you wouldn’t necessarily have to run Varnish on the server running WordPress. Point the DNS at the Varnish server and set the backend for the host to your WordPress server’s IP address and you can have a Varnish server across the country caching your blog.
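A minimal sketch of that split setup; only the backend block of the VCL changes (203.0.113.10 is a placeholder for your WordPress server’s IP):

backend default {
  .host = "203.0.113.10";
  .port = "80";
}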

WordPress, Varnish and Edge Side Includes

July 22nd, 2009

While talking with a client about WordPress and its abysmal performance in high-traffic situations, we started looking back at Varnish and other solutions to keep their machine responsive. Since most of the caching solutions generate a page, serve it and cache it, posts and comments tend to lag behind the cache. db-cache does work around this by caching the query objects so that pages can be generated more quickly, and it does expire the cache when tables are updated, but its performance is still lacking. Using APC’s opcode cache or memcached just seemed to add complexity to the overall solution.

Sites like perezhilton.com appear to run behind multiple servers running Varnish, use wp-cache, and move the images off to a CDN, which results in a 3-requests-per-second site with an 18-second pageload. Varnish’s cache always shows an age of 0, meaning Varnish is acting more as a load balancer than a front-end cache.

Caching isn’t without its downsides. Your weblogs will not represent the true traffic: since Varnish intercepts and serves requests before they get to the backend, those hits never reach the log. Forget pageview/postview stats (even with addons), because the addon won’t get loaded except while the page is being cached. Certain widgets that rely on cookies or IP addresses will need to be modified; a workaround is to use a Text Box Widget and do an ESI include of the widget. For this client, we needed only some of the basic widgets. The hits in the apache logs will come from an IP of 127.0.0.1, so adjust your apache configuration to log the X-Forwarded-For IP address. If you truly need statistics, you’ll need to use something like Google Analytics; put their code outside your page elements so that waiting for that javascript to load doesn’t slow down the rendering in the browser.
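For Apache, that means logging %{X-Forwarded-For}i in place of the client address; a minimal example (the format name and log path are arbitrary):

LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" varnish_combined
CustomLog /var/log/apache2/access.log varnish_combined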

The test site, http://varnish.cd34.com/, is running Varnish 2.0.4, Apache2-mpm-prefork 2.2.11, Debian/Testing, and WordPress 2.8.2. I loaded the default template-testing .xml import so that there were posts with varied dates and construction on the site. To replicate the client’s site, the following widgets were added to the sidebar: Search, Archives, Categories, Pages, Recent Posts, Tag Cloud, Calendar. Calendar isn’t in the existing site, but since it is a very ‘expensive’ SQL query to run, it made for a good benchmark.

The demo site is running on:

model name	: Intel(R) Celeron(R) CPU 2.40GHz
stepping	: 9
cpu MHz		: 2400.389
cache size	: 128 KB

with a Western Digital 80GB 7200RPM IDE drive. Since all of the benchmarking was done on the same machine with no config changes between tests, our benchmarks should represent as even a test base as we can expect.

Regrettably, our underpowered machine couldn’t run the benchmark with 50 concurrent requests, nor could it complete with the Calendar Widget enabled. In order to get apachebench to run, we had to reduce both the number of requests and the concurrency.

These results are from Apache without Varnish.

Server Software:        Apache
Server Hostname:        varnish.cd34.com
Server Port:            80

Document Path:          /
Document Length:        43903 bytes

Concurrency Level:      10
Time taken for tests:   159.210 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      4408200 bytes
HTML transferred:       4390300 bytes
Requests per second:    0.63 [#/sec] (mean)
Time per request:       15921.022 [ms] (mean)
Time per request:       1592.102 [ms] (mean, across all concurrent requests)
Transfer rate:          27.04 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   7.0      0      25
Processing: 14785 15863 450.2  15841   17142
Waiting:     8209 8686 363.4   8517    9708
Total:      14785 15865 451.4  15841   17142

Percentage of the requests served within a certain time (ms)
  50%  15841
  66%  15975
  75%  16109
  80%  16153
  90%  16628
  95%  16836
  98%  17001
  99%  17142
 100%  17142 (longest request)

Normally we would have run the Varnish-enabled test without the Calendar Widget, but I felt confident enough to run the test with the widget in the sidebar. Varnish was configured with a 12-hour cache (yes, I know; I’ll address that later) and the ESI Widget was loaded.

Server Software:        Apache
Server Hostname:        varnish.cd34.com
Server Port:            80

Document Path:          /
Document Length:        45544 bytes

Concurrency Level:      50
Time taken for tests:   18.607 seconds
Complete requests:      10000
Failed requests:        0
Write errors:           0
Total transferred:      457980000 bytes
HTML transferred:       455440000 bytes
Requests per second:    537.44 [#/sec] (mean)
Time per request:       93.034 [ms] (mean)
Time per request:       1.861 [ms] (mean, across all concurrent requests)
Transfer rate:          24036.81 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.8      0      42
Processing:     1   92  46.2    105     451
Waiting:        0   91  45.8    104     228
Total:          2   93  46.0    105     451

Percentage of the requests served within a certain time (ms)
  50%    105
  66%    117
  75%    123
  80%    128
  90%    142
  95%    155
  98%    171
  99%    181
 100%    451 (longest request)

As you can see, even with the aging hardware, we went from 0.63 requests per second to 537.44 requests per second.

But, more about that 12-hour cache. The ESI Widget uses an Edge Side Include to pull the sidebar into the template. Rather than just cache the entire page, we instruct Varnish to cache the page and include the sidebar. As a result, when a person surfs from the front page to a post page, the sidebar doesn’t need to be regenerated for the second page. With wp-cache, the sidebar widgets would have been regenerated and the resulting page cached. Obviously, a 12-hour cache is going to affect the usability of the site, so ESI Widget purges the sidebar, front page and post page any time a post is updated, deleted or commented on. Voila: even with a long cache time, we are presented with a site that is dynamic and not delayed until wp-cache’s page cache expires. As this widget is a proof of concept, I’m sure a little intelligence could be added to prevent excessive purging in some cases, but it handles things reasonably well. There are some issues not currently handled, including users that are logged in for comments; with some template modifications, I think those pieces can be handled with ESI to provide a lightweight method for the authentication portion.
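With the PURGE handler in the VCL at the end of this post, invalidating a cached page is just an HTTP PURGE request for its URL, for example with curl:

curl -X PURGE http://varnish.cd34.com/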

While I have seen other sites mention Varnish and other methods to keep your WordPress installation alive under high traffic, I believe this approach is a step in the right direction. With the ESI Widget, you can focus on your site and let the server do the hard work. This methodology is based on a CMS that I have contemplated writing for many years, though using Varnish rather than static files.

The concept was developed in roughly four hours, including the time to write the widget and do the benchmarking. It isn’t perfect, but it does address the immediate needs of the one client. I think we can consider the concept a success.

If you don’t have the ability to modify your system to run Varnish, then you would be limited to running wp-cache and db-cache. If you can connect to a memcached server, you might consider running Memcached for WordPress as it will make quite a difference as well.

This blog site, http://cd34.com/blog/ is not running behind Varnish. To see the Varnish enabled site with ESI Widget, go to http://varnish.cd34.com/

Software Mentioned:

* Varnish ESI and Purge and Varnish’s suggestions for helping WordPress
* WordPress
* wp-cache
* db-cache

Sites used for reference:

* Supercharge WordPress
* SSI, Memcached and Nginx (with mentions of a Varnish/ESI configuration)

Varnish configuration used for ESI-Widget:

backend default {
.host = "127.0.0.1";
.port = "81";
}

sub vcl_recv {
 if (req.request == "PURGE") {
     purge("req.url == " req.url);
 }

 if (req.url ~ "\.(png|gif|jpg|ico|jpeg|swf|css|js)$") {
    unset req.http.cookie;
  }
  if (!(req.url ~ "wp-(login|admin)")) {
    unset req.http.cookie;
  }
}

sub vcl_fetch {
   set obj.ttl = 12h;
   if (req.url ~ "\.(png|gif|jpg|ico|jpeg|swf|css|js)$") {
      set obj.ttl = 24h;
   } else {
      esi;  /* Do ESI processing */
   }
}
