mysql 5.1’s query optimizer
Wednesday, October 7th, 2009While debugging an issue with an application that relies heavily on MySQL, an issue was brought up regarding the cardinality of the keys selected, and, the order in which the keys were indexed. With any relational database, in order to get the fastest performance, your query should reduce the result set as quickly as possible. Your data should have a high cardinality or variation in the data so that the B-Tree (or R-Tree) is more balanced. If your data consists of:
One thousand records with the date 2009-01-01
One thousand records with the date 2009-01-02
…
One thousand records with the date 2009-12-31
The cardinality or uniqueness of that column is low given the fact that you’ll have 365000 rows with blocks of one thousand having the same key. If you consider 125 different IP addresses per day generating those same thousand records, the cardinality or uniqueness of the IP addresses will be very high.
In order to show the performance differences in multiple indexing schemes and representations, a table has been created with an Unsigned Int column for the IP address, a varchar(15) for the IP address, a date column, and a varchar(80) for some text data. Because of the way the MySQL query processor works, it is possible to construct your query so that the results are answered from the index and the data file is never hit. A test sample was created that will be used for all of the tests. The file will be indexed, optimized, and the test run five times with the cumulative time used. The sample data that generates the queries against the database include 48000 of the ten million rows, plus 2000 randomly generated queries. Those results are then shuffled and written to a file for the tests. Testing hits versus misses emulates real world situations a little more accurately. All of the code used to run these tests is included in this post.
Test Setup
Creation of the table:
CREATE TABLE `querytest` ( `iip` int(10) unsigned DEFAULT NULL, `ipv` varchar(15) DEFAULT NULL, `date` date DEFAULT NULL, `randomtext` text ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Filling the table with data:
#!/usr/bin/python import MySQLdb import random import datetime import time lipsum = """ Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi gravida congue nisi, nec auctor leo placerat nec. In hac habitasse platea dictumst. In rutrum blandit velit et varius. Integer commodo ipsum ut diam placerat feugiat. Curabitur viverra erat ut felis cursus mollis. Sed tempus tempor faucibus. Etiam eget arcu massa, eget dictum sapien. Nullam euismod purus vitae risus ultrices tempus. Mauris semper rhoncus lectus, sit amet laoreet mauris tincidunt et. Duis ut mauris massa. Nam semper, enim id fermentum tristique, ligula velit suscipit lacus, vitae ultrices mi arcu sit amet felis. Ut sit amet tellus eget lorem gravida malesuada. Integer nec massa quis mauris porta laoreet. Curabitur tincidunt nunc at mauris porttitor auctor. Mauris auctor faucibus tortor dignissim sodales. Sed ut tellus nisi, laoreet malesuada tortor. Vivamus blandit neque et nunc fringilla quis dignissim felis tincidunt. Nam nec varius orci. Duis pretium magna id urna fermentum commodo. Aliquam sollicitudin imperdiet leo eget ullamcorper. Quisque id mauris nec purus pulvinar bibendum. Fusce nunc metus, viverra in iaculis id, tempus nec neque. Aenean ac diam arcu, vitae condimentum lectus. Vivamus cursus iaculis tortor eget bibendum. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Aenean elementum odio et nisl ornare at sodales eros porta. Duis mollis tincidunt neque, sed pulvinar enim ultrices a. Sed laoreet nunc ut nisl luctus a egestas quam luctus. Pellentesque non dui et neque ullamcorper condimentum ac ut turpis. Etiam a lectus odio, vitae bibendum arcu. Nulla egestas dolor ligula. Quisque rhoncus neque ultrices mi lacinia tempus. Sed scelerisque libero dui, quis vulputate leo. Phasellus nibh ante, viverra sed cursus ac, dictum et lectus. Suspendisse potenti. Ut dapibus augue vitae sem convallis in iaculis nibh bibendum. Mauris eu sapien in lacus pharetra fermentum. Etiam eleifend vulputate velit, a tempor augue ultrices vitae. Vestibulum varius orci ac justo adipiscing quis dignissim odio porttitor. Nam ac metus leo. Ut a porttitor lectus. Nunc accumsan ante non eros feugiat suscipit. Nulla facilisi. Nam molestie dignissim purus sed lacinia. Etiam tristique, eros vel condimentum fermentum, ipsum justo vulputate erat, sed faucibus nunc nisl id tellus. Aliquam a tempus leo. Nullam et sem nunc. Suspendisse potenti. Quisque ante lorem, aliquam sed aliquet vel, malesuada sit amet nisl. Vestibulum tristique velit pellentesque sapien ultrices non gravida ante blandit. Donec luctus nunc dictum felis feugiat sollicitudin. Nam lectus mi, porttitor sed adipiscing ac, pharetra a orci. Ut vitae eros vitae metus. """ db = MySQLdb.connect(host="localhost", user="querytest", passwd="qt1qt1", db="querytest") cursor = db.cursor() length = len(lipsum) jan_1_2009 = time.mktime((2009, 1, 1, 0, 0, 0, 0, 0, 0)) for i in range (1, 10000001): # generate a random IP address rand_ip = random.randint(1,4294967295) # pull a random piece of text from lipsum with a random length start_pos = random.randint(1,length) end_pos = start_pos + random.randint(200,2000) random_text = lipsum[start_pos:end_pos] # pick a random date in 2009 rand_date = time.strftime("%Y-%m-%d",time.gmtime(jan_1_2009 + random.randint(1,365*60*60*24))) cursor.execute("insert into querytest (iip,ipv,date,randomtext) values (%s,inet_ntoa(%s),%s,%s)", (rand_ip, rand_ip, rand_date, random_text)) cursor.close () db.close ()
Generate test set:
#!/usr/bin/python import MySQLdb import random import datetime import time import socket import struct db = MySQLdb.connect(host="localhost", user="querytest", passwd="qt1qt1", db="querytest") cursor = db.cursor() jan_1_2009 = time.mktime((2009, 1, 1, 0, 0, 0, 0, 0, 0)) cursor.execute("select iip,ipv,date from querytest order by rand() limit 48000") data = list(cursor.fetchall()) for i in range (1, 2001): # generate a random IP address rand_ip = random.randint(1,4294967295) # pick a random date in 2009 rand_date = time.strftime("%Y-%m-%d",time.gmtime(jan_1_2009 + random.randint(1,365*60*60*24))) data.append((rand_ip, socket.inet_ntoa(struct.pack('L',rand_ip)), rand_date)) random.shuffle(data) for datum in data: print "%s,%s,%s" % (datum[0], datum[1], datum[2]) cursor.close () db.close ()
At this point we have created the table, filled it with ten million rows, and generated a fifty thousand row query set to run against the table. Now, we need to categorize the theories to see whether cardinality plays as large a role as it used to.
The following tests will be performed
Index of iip,date
* Use the unsigned int representation of the IP address and the date
* Use the text representation of the IP address passed through inet_aton() and the date
Index of ipv, date
* Use the text representation of the IP address and the date
* Use the unsigned int representation of the IP address passed through inet_ntoa() and the date
Index of date,iip
* Use date and the unsigned int representation of the IP address
* Use date and the text representation of the IP address passed through inet_aton()
Index of date,ipv
* Use date and the unsigned int representation of the IP address
* Use date and the text representation of the IP address passed through inet_aton()
Each of the above tests will be run twice, once with select * and once with select ipv,date.
Benchmark Code
#!/usr/bin/python import MySQLdb import random import datetime import time import socket import struct import array def run_query(query, data, columna, columnb): for datum in data: cursor.execute(query, (datum[columna], datum[columnb])) result = cursor.fetchall() query_tests = [ ['create index querytest on querytest (iip,date)', 'select * from querytest where iip=%s and date=%s', 0, 2 ], ['create index querytest on querytest (iip,date) using HASH', 'select * from querytest where iip=%s and date=%s', 0, 2 ], ['create index querytest on querytest (iip,date)', 'select iip,date from querytest where iip=%s and date=%s', 0, 2 ], ['create index querytest on querytest (iip,date) using HASH', 'select iip,date from querytest where iip=%s and date=%s', 0, 2 ], ['create index querytest on querytest (iip,date)', 'select * from querytest where iip=inet_aton(%s) and date=%s', 1, 2 ], ['create index querytest on querytest (iip,date) using HASH', 'select * from querytest where iip=inet_aton(%s) and date=%s', 1, 2 ], ['create index querytest on querytest (iip,date)', 'select iip,date from querytest where iip=inet_aton(%s) and date=%s', 1, 2 ], ['create index querytest on querytest (iip,date) using HASH', 'select iip,date from querytest where iip=inet_aton(%s) and date=%s', 1, 2 ], ['create index querytest on querytest (ipv,date)', 'select * from querytest where ipv=%s and date=%s', 1, 2 ], ['create index querytest on querytest (ipv,date) using HASH', 'select * from querytest where ipv=%s and date=%s', 1, 2 ], ['create index querytest on querytest (ipv,date)', 'select ipv,date from querytest where ipv=%s and date=%s', 1, 2 ], ['create index querytest on querytest (ipv,date) using HASH', 'select ipv,date from querytest where ipv=%s and date=%s', 1, 2 ], ['create index querytest on querytest (ipv,date)', 'select * from querytest where ipv=inet_ntoa(%s) and date=%s', 0, 2 ], ['create index querytest on querytest (ipv,date) using HASH', 'select * from querytest where ipv=inet_ntoa(%s) and date=%s', 0, 2 ], ['create index querytest on querytest (ipv,date)', 'select ipv,date from querytest where ipv=inet_ntoa(%s) and date=%s', 0, 2 ], ['create index querytest on querytest (ipv,date) using HASH', 'select ipv,date from querytest where ipv=inet_ntoa(%s) and date=%s', 0, 2 ], ['create index querytest on querytest (date,iip)', 'select * from querytest where date=%s and iip=%s', 2, 0 ], ['create index querytest on querytest (date,iip) using HASH', 'select * from querytest where date=%s and iip=%s', 2, 0 ], ['create index querytest on querytest (date,iip)', 'select iip,date from querytest where date=%s and iip=%s', 2, 0 ], ['create index querytest on querytest (date,iip) using HASH', 'select iip,date from querytest where date=%s and iip=%s', 2, 0 ], ['create index querytest on querytest (date,iip)', 'select * from querytest where date=%s and iip=inet_aton(%s)', 2, 1 ], ['create index querytest on querytest (date,iip) using HASH', 'select * from querytest where date=%s and iip=inet_aton(%s)', 2, 1 ], ['create index querytest on querytest (date,iip)', 'select iip,date from querytest where date=%s and iip=inet_aton(%s)', 2, 1 ], ['create index querytest on querytest (date,iip) using HASH', 'select iip,date from querytest where date=%s and iip=inet_aton(%s)', 2, 1 ], ['create index querytest on querytest (date,ipv)', 'select * from querytest where date=%s and ipv=%s', 2, 1 ], ['create index querytest on querytest (date,ipv) using HASH', 'select * from querytest where date=%s and ipv=%s', 2, 1 ], ['create index querytest on querytest (date,ipv)', 'select ipv,date from querytest where date=%s and ipv=%s', 2, 1 ], ['create index querytest on querytest (date,ipv) using HASH', 'select ipv,date from querytest where date=%s and ipv=%s', 2, 1 ], ['create index querytest on querytest (date,ipv)', 'select * from querytest where date=%s and ipv=inet_ntoa(%s)', 2, 0 ], ['create index querytest on querytest (date,ipv) using HASH', 'select * from querytest where date=%s and ipv=inet_ntoa(%s)', 2, 0 ], ['create index querytest on querytest (date,ipv)', 'select ipv,date from querytest where date=%s and ipv=inet_ntoa(%s)', 2, 0 ], ['create index querytest on querytest (date,ipv) using HASH', 'select ipv,date from querytest where date=%s and ipv=inet_ntoa(%s)', 2, 0 ], ] db = MySQLdb.connect(host="localhost", user="querytest", passwd="qt1qt1", db="querytest") cursor = db.cursor() queries = open('testquery.txt').readlines() query_array = [] for query_data in queries: query_array.append(query_data.rstrip('\n').split(',')) for test in query_tests: try: cursor.execute('alter table querytest drop index querytest') except: pass cursor.execute(test[0]) cursor.execute('optimize table querytest') print "Test: %s\n with Index: %s" % (test[1], test[0]) start_time = time.time() for loop in range (1,6): run_query(test[1], query_array, test[2], test[3]) end_time = time.time() print "Duration: %f seconds\n" % (end_time - start_time) cursor.close () db.close ()
Miscellaneous notes
P4/3.0ghz, 2gb RAM, Debian 3/Squeeze, Linux 2.6.31.1, WD 7200RPM SATA drive, SuperMicro P4SCI Motherboard
There are multiple tests that could have been run without dropping the index, recreating the index and optimizing the table. When testing a more limited set, results were a little sporadic due to a smaller initial test set and portions of the table and index being cached in the kernel cache. To ensure more consistent test results, every test was run in a consistent manner.
Benchmark Results
Test: select * from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) Duration: 679.169198 seconds Test: select * from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) using HASH Duration: 692.634291 seconds Test: select iip,date from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) Duration: 179.039791 seconds Test: select iip,date from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) using HASH Duration: 178.993962 seconds Test: select * from querytest where iip=inet_aton(%s) and date=%s with Index: create index querytest on querytest (iip,date) Duration: 672.836734 seconds Test: select * from querytest where iip=inet_aton(%s) and date=%s with Index: create index querytest on querytest (iip,date) using HASH Duration: 606.268787 seconds Test: select iip,date from querytest where iip=inet_aton(%s) and date=%s with Index: create index querytest on querytest (iip,date) Duration: 195.253512 seconds Test: select iip,date from querytest where iip=inet_aton(%s) and date=%s with Index: create index querytest on querytest (iip,date) using HASH Duration: 195.222058 seconds Test: select * from querytest where ipv=%s and date=%s with Index: create index querytest on querytest (ipv,date) Duration: 741.876227 seconds Test: select * from querytest where ipv=%s and date=%s with Index: create index querytest on querytest (ipv,date) using HASH Duration: 639.109309 seconds Test: select ipv,date from querytest where ipv=%s and date=%s with Index: create index querytest on querytest (ipv,date) Duration: 167.049333 seconds Test: select ipv,date from querytest where ipv=%s and date=%s with Index: create index querytest on querytest (ipv,date) using HASH Duration: 167.016152 seconds Test: select * from querytest where ipv=inet_ntoa(%s) and date=%s with Index: create index querytest on querytest (ipv,date) Duration: 578.565762 seconds Test: select * from querytest where ipv=inet_ntoa(%s) and date=%s with Index: create index querytest on querytest (ipv,date) using HASH Duration: 655.869390 seconds Test: select ipv,date from querytest where ipv=inet_ntoa(%s) and date=%s with Index: create index querytest on querytest (ipv,date) Duration: 181.555567 seconds Test: select ipv,date from querytest where ipv=inet_ntoa(%s) and date=%s with Index: create index querytest on querytest (ipv,date) using HASH Duration: 181.230911 seconds Test: select * from querytest where date=%s and iip=%s with Index: create index querytest on querytest (date,iip) Duration: 655.928799 seconds Test: select * from querytest where date=%s and iip=%s with Index: create index querytest on querytest (date,iip) using HASH Duration: 637.146124 seconds Test: select iip,date from querytest where date=%s and iip=%s with Index: create index querytest on querytest (date,iip) Duration: 181.637912 seconds Test: select iip,date from querytest where date=%s and iip=%s with Index: create index querytest on querytest (date,iip) using HASH Duration: 181.512190 seconds Test: select * from querytest where date=%s and iip=inet_aton(%s) with Index: create index querytest on querytest (date,iip) Duration: 603.553238 seconds Test: select * from querytest where date=%s and iip=inet_aton(%s) with Index: create index querytest on querytest (date,iip) using HASH Duration: 605.363284 seconds Test: select iip,date from querytest where date=%s and iip=inet_aton(%s) with Index: create index querytest on querytest (date,iip) Duration: 196.680399 seconds Test: select iip,date from querytest where date=%s and iip=inet_aton(%s) with Index: create index querytest on querytest (date,iip) using HASH Duration: 194.746056 seconds Test: select * from querytest where date=%s and ipv=%s with Index: create index querytest on querytest (date,ipv) Duration: 657.619028 seconds Test: select * from querytest where date=%s and ipv=%s with Index: create index querytest on querytest (date,ipv) using HASH Duration: 686.560066 seconds Test: select ipv,date from querytest where date=%s and ipv=%s with Index: create index querytest on querytest (date,ipv) Duration: 172.222691 seconds Test: select ipv,date from querytest where date=%s and ipv=%s with Index: create index querytest on querytest (date,ipv) using HASH Duration: 172.079220 seconds Test: select * from querytest where date=%s and ipv=inet_ntoa(%s) with Index: create index querytest on querytest (date,ipv) Duration: 726.031732 seconds Test: select * from querytest where date=%s and ipv=inet_ntoa(%s) with Index: create index querytest on querytest (date,ipv) using HASH Duration: 678.099808 seconds Test: select ipv,date from querytest where date=%s and ipv=inet_ntoa(%s) with Index: create index querytest on querytest (date,ipv) Duration: 185.415666 seconds Test: select ipv,date from querytest where date=%s and ipv=inet_ntoa(%s) with Index: create index querytest on querytest (date,ipv) using HASH Duration: 185.280880 seconds
Conclusions
Based on the data, I think we can say that the argument of B-Tree versus Hash doesn’t seem to make much difference. Neither is consistently better, and since the data and query test is identical, the results don’t really point to a clear winner. Avoiding Select * and pulling only the required fields makes a difference and if your result can be answered from the index rather than the data file, there is a substantial boost. Analysis of the results suggests that cardinality isn’t as important as it used to be. I am devising a method to further test cardinality as I do believe that live data will have somewhat different results from data after an optimize table has been run.
The winner in this case is:
Test: select ipv,date from querytest where ipv=%s and date=%s with Index: create index querytest on querytest (ipv,date) Duration: 167.049333 seconds Test: select ipv,date from querytest where ipv=%s and date=%s with Index: create index querytest on querytest (ipv,date) using HASH Duration: 167.016152 seconds
I had actually expected int represented as unsigned int would be the fastest. However, there is probably a reasonable explanation why these two queries are slower:
Test: select iip,date from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) Duration: 179.039791 seconds Test: select iip,date from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) using HASH Duration: 178.993962 seconds
Data in MySQL is represented as binary. The IP stored as an unsigned int takes 4 bytes, and the date takes 3. The key length in this case would be 7 bytes versus the index on IP stored as varchar(15) and the date taking 18 bytes. Even though the index in the second case is almost three times the size of the unsigned int IP, the MySQL client library converts all binary data to ASCII when communicating to avoid endian issues. That extra conversion results in a slightly slower result — measurable when you do 250000 queries against a 10 million record database.
A quick modification of the test shows the results of select *, versus select keyvaluea,keyvalueb and select data,keyvalueb. As you can see from the results below, MySQL will answer queries from the index if it doesn’t need to hit the data file.
Test: select * from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) Duration: 637.420786 seconds Test: select iip,date from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) Duration: 178.434477 seconds Test: select ipv,date from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) Duration: 690.804990 seconds Test: select inet_ntoa(iip) as iip,date from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) Duration: 183.817643 seconds
If you can structure your data well, there are significant performance gains to be had.
What does this mean?
Do you store IPs as unsigned int in the database? If you use varchar(15) or char(15), you’re talking about an eleven or ten byte savings per record at the expense of some conversion time. varchar uses 1 character to store the length of the stored data plus the length of the data. char is a fixed length based on the column length you specify.
Make sure you return only the columns that you need in your calculations — especially if you are running MySQL over a network.
Try to create your index to match the conditions that you are looking for, and, when possible, if you are searching for the result from a particular column, consider adding it to the index as well.
Always use count(*) rather than count(column) unless there is a valid reason for that column to contain NULL.
The Effect of count(*) versus count(date)
count(*) gives you the number of rows in the set that match the criteria you have set. count(date) counts the number of rows in the set that match the criteria where the date is not null. Many times, you’ll see someone do a count(id), and id by definition is a primary key, auto_increment and cannot be null. Because count(column) must read the table to ensure that the column specified is not null, it is forced to check every key, or, read the table for all of the matching rows to make sure the column retrieved doesn’t contain a null value. If the column being counted is one of the keys in the index, the performance change won’t be as dramatic. By counting a column that isn’t in the key and having to read the data, count(column) is considerably slower.
Results when the counted column is within the key and only 1 or 0 rows are expected:
Test: select count(*) as ct from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) Duration: 175.727338 seconds Test: select count(iip) as ct from querytest where iip=%s and date=%s with Index: create index querytest on querytest (iip,date) Duration: 176.495198 seconds
When count returns more than one row, you can see the effect is much more detrimental. The first iteration of this test took so long that I shortened it to do five iterations of 100 queries. After 4 hours, and 18% complete, I shortened the test to do one iteration of ten queries. The results clearly demonstrate the issue without taking 20+ hours to run a single simple benchmark. Simply stated, unless you really have a valid reason to check your results to see if the column is null, DON’T!
Test: select count(*) as ct from querytest where date=%s with Index: create index querytest on querytest (date,iip) Duration: 0.408268 seconds Test: select count(ipv) as ct from querytest where date=%s with Index: create index querytest on querytest (date,iip) Duration: 3085.770998 seconds
The Fine Print
* Index columns used in your where conditions
* B-Tree versus Hash doesn’t appear to materially affect results
* storing IP as char(15) if the data is being returned to the client can be faster than storing an IP as an unsigned int. If the IP is not fetched but only used in comparisons, unsigned int is probably the better choice.
* Consider adding that extra column to your index to prevent MySQL from having to read the data file. Answering your query from the index is significantly faster.
* count(*) rather than count(column)
Live data will not act precisely as the benchmark — what live scenario ever does? But, I believe the tests above should show some of the performance gains available by structuring your tables and queries.
While MySQL 4, 5.0 and 5.1 will reorder conditions to match the index key, there are some significant performance gains from 4.x to 5.0. MySQL 5.1 didn’t show considerable gains from MySQL 5.0, but, there are some minor speed increases.