Posts Tagged ‘indices’

Designing MySQL Indexes (Indices)

Tuesday, February 7th, 2012

I’ve imported some data, I know how I want to query the data, but, using the provided schema, we end up with a table that is 8.1 million rows and we know very little about the composition of the data.

We know our query will be against two fields, name and timezone.

The two fields we’re dealing with:

name              | varchar(200)
timezone          | varchar(60)

The first thing we are faced with is the fact that our name is 200 characters. We could do:

create index name_timezone on geo_city (name,timezone);

But, that’s not going to be very efficient. Lets do a little checking to see what our data really looks like:

mysql> select avg(length(name)) from geo_city;
+-------------------+
| avg(length(name)) |
+-------------------+
|           13.6786 |
+-------------------+
1 row in set (13.53 sec)
mysql> select max(length(name)) from geo_city;
+-------------------+
| max(length(name)) |
+-------------------+
|               154 |
+-------------------+
1 row in set (12.71 sec)

So, now that we’ve got an idea of what our data looks like, we can adjust our index. You might be thinking, lets just create our index:

create index name_timezone on geo_city (name(20),timezone);

Note: When an index is created in MySQL, any key value is space padded to the full length of the key specified when the index was created. An index on a varchar(255) will have 255 bytes taken up per row, even if the data you are storing only has 35 bytes. Smaller index files mean faster lookups, especially if we can get the index to a size that can be cached in memory.

That might be fine in most cases, but, when we are testing for ranges, we throw any equality conditions after the range condition out of the key lookup. For example:

select * from tablename where condition_a>=1 and condition_a< =2 and condition_b=3;

In the above case, an index on condition_a,condition_b would ignore the fact that the index contained condition_b. Range checks must go after equality checks. In the above case, we want an index on condition_b,condition_a.

Note: As stated above, indexes are space padded to the length of the key. On a range check, using a BTree index, the query plan will only look at the conditions that match the index order until the first ranged query is hit, then, will do a memory or disk temp table based on the size of the results. Remember, any time you use a TEXT, BLOB field, any temporary table created is ALWAYS created on disk. Make sure your /tmp spool is not on a software raid partition and mount it with noatime.

Since we intend to use this table for JQuery Autocomplete lookups, our query will look something like:

select * from geo_city where name like 'wel%';
select * from geo_city where substr(name,0,3)='wel';

Note: when doing this, make sure you set your index collation to a _ci. You can do: show table status like ‘geo_city’; to see that your collation is set to utf8_general_ci. If the collation is not set to something that is Case Insensitive, condition checks for Wel and wel will return different results.

Rule number one of a good index is to make sure you have a high cardinality or uniqueness. A city name has a much higher cardinality than our timezone, but, since we’ll be doing range checks, the cardinality of the timezone+city will make our index lookups quicker.

Some basics on mysql query plan execution

select * from geo_city where name like 'wel%';

select * from geo_city where substr(name,0,3)='wel';

The first does a range check which we can verify with an explain, but the second one uses no index. You might be inclined to believe that the second condition would be faster, but, when you have a calculated field on the left hand side of the condition, MySQL will be forced to do a tablescan to calculate all rows. This will certainly cause performance problems.

Now we’re faced with the Timezone. Timezones have created problems for ages. The names aren’t consistent, but the Olson Database contains a fairly consistent table of names by using the largest populated cities in each of the Timezones as a marker. While this generally works, it is a little confusing for people in some timezones as they wouldn’t associate a city in China with their location in Australia. Moreover, the timezone names are somewhat inconsistent, so, we will convert the Humanized timezone names to a timezone offset. Our offsets now span a maximum of 6 characters, (-05:00, +00:00, +08:30, etc.) and we can now create an index of tz_offset,name(20) which should give us a 26 character index, by 8.1 million rows which results in an index of roughly 260mb. With our primary key index, the index on the geoname_id (for data integrity during upgrades):

-rw-rw---- 1 mysql mysql     25868 Feb  7 13:46 geo_city.frm
-rw-rw---- 1 mysql mysql 968730120 Feb  7 13:49 geo_city.MYD
-rw-rw---- 1 mysql mysql 392612864 Feb  7 13:51 geo_city.MYI

Now, our query:

select * from geo_city where tz_offset='-05:00' and name like 'wel%';

will use an index and should return results very quickly.

A quick test:

select name,admin1_code,latitude,longitude from geo_city where tz_offset='-05:00' and name like 'Wel%';
1036 rows in set (0.04 sec)

mysql> explain select name,admin1_code,latitude,longitude from geo_city where tz_offset='-05:00' and name like 'Wel%';
+----+-------------+----------+-------+---------------+--------+---------+------+------+-------------+
| id | select_type | table    | type  | possible_keys | key    | key_len | ref  | rows | Extra       |
+----+-------------+----------+-------+---------------+--------+---------+------+------+-------------+
|  1 | SIMPLE      | geo_city | range | nt,tzname     | tzname | 82      | NULL | 4231 | Using where |
+----+-------------+----------+-------+---------------+--------+---------+------+------+-------------+
1 row in set (0.00 sec)

Now, on to the next scaling issue – fixing GIS to not drill through the planet to determine distances.

Entries (RSS) and Comments (RSS).
Cluster host: li