<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Random Musings of an Insane Mind &#187; mysql</title>
	<atom:link href="http://cd34.com/blog/tag/mysql/feed/" rel="self" type="application/rss+xml" />
	<link>http://cd34.com/blog</link>
	<description>This is my blog, there are many others like it but this one is mine.</description>
	<lastBuildDate>Tue, 29 Jun 2010 04:22:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Why do you use an Object Relational Mapping (ORM) System in Development?</title>
		<link>http://cd34.com/blog/programming/python/why-do-you-use-an-object-relational-mapping-orm-system-in-development/</link>
		<comments>http://cd34.com/blog/programming/python/why-do-you-use-an-object-relational-mapping-orm-system-in-development/#comments</comments>
		<pubDate>Mon, 12 Oct 2009 17:51:54 +0000</pubDate>
		<dc:creator>cd34</dc:creator>
				<category><![CDATA[Python]]></category>
		<category><![CDATA[django]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[orm]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[pylons]]></category>
		<category><![CDATA[smarty]]></category>
		<category><![CDATA[sqlalchemy]]></category>

		<guid isPermaLink="false">http://cd34.com/blog/?p=819</guid>
		<description><![CDATA[Here&#8217;s a programmer that is saying goodbye to ORMs at Hatful of Hollow. And another site offering a tutorial of sorts dealing with ORMs Why should you use an ORM. While both have their points, both have missed a fundamental benefit that an ORM hands you. Most of my development is in Pylons. Django&#8217;s ORM [...]]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a programmer that is saying goodbye to ORMs at <a href="http://www.hatfulofhollow.com/posts/code/farewell-to-orms.html">Hatful of Hollow</a>.  </p>
<p>And another site offering a tutorial of sorts dealing with ORMs <a href="http://karwin.blogspot.com/2009/01/why-should-you-use-orm.html">Why should you use an ORM</a>.</p>
<p>While both have their points, both have missed a fundamental benefit that an ORM hands you.</p>
<p>Most of my development is in Pylons.  Django&#8217;s ORM and template language can do the same thing.  A programmer who has used PHP/Smarty to develop large-scale systems will likely resist ORMs.  After working with a team to develop 90k+ lines of PHP/Smarty over a six-year period, I found that moving to an ORM required a real change in thinking.</p>
<p>Let&#8217;s consider the following structure.  We have a cp_ticket table and a cp_ticket_detail table.  A Ticket can have multiple detail records.  The output we wish to have is:</p>
<pre>
ticket id, ticket header information
         ticket detail line
         ticket detail line #2
ticket id, ticket header information
         ticket detail line
         ticket detail line #2
         ticket detail line #3
ticket id, ticket header information
         ticket detail line
         ticket detail line #2
</pre>
<p>Our model:</p>
<pre>
class cp_ticket(DeclarativeBase):
    __tablename__ = 'cp_ticket'

    ticket_id = Column(mysql.MSBigInteger(20, unsigned = True), primary_key=True, autoincrement = True)
    priority = Column(mysql.MSEnum('1','2','3','4','5'), default = '3')

    ticket_detail = relation('cp_ticket_detail', order_by='cp_ticket_detail.ticket_detail_id')

class cp_ticket_detail(DeclarativeBase):
    __tablename__ = 'cp_ticket_detail'

    ticket_id = Column(mysql.MSBigInteger(20, unsigned = True), ForeignKey('cp_ticket.ticket_id'), default = '0')
    ticket_detail_id = Column(mysql.MSBigInteger(20, unsigned = True), primary_key=True, autoincrement = True)
    stamp = Column(mysql.MSTimeStamp, PassiveDefault('CURRENT_TIMESTAMP'))
    detail = Column(mysql.MSLongText, default = '')
</pre>
<p>Our query to pass to our template:</p>
<pre>
        tickets = meta.Session.query(cp_ticket).filter(cp_ticket.client_id==1).all()
</pre>
<p>Compared with the query as you would write it without an ORM:</p>
<pre>
select * from cp_ticket,cp_ticket_detail where client_id=1 and cp_ticket.ticket_id=cp_ticket_detail.ticket_id;
</pre>
<p>Both fetch the same data, but the ORM returns the results in a shape that is almost identical to the way we want to display them.  This makes template design easy.</p>
<p>Using Mako, we use the following code to display the results:</p>
<pre>
&lt;table border="1">
 &lt;tr>&lt;th>Ticket ID&lt;/th>&lt;th>Status&lt;/th>&lt;th>Detail&lt;/th>&lt;/tr>
%for ticket in tmpl_context.tickets:
  &lt;tr>
    &lt;td>&lt;strong>${ticket.ticket_id}&lt;/strong>&lt;/td>
    &lt;td>&lt;strong>${ticket.priority}&lt;/strong>&lt;/td>
  &lt;/tr>
  %for detail in ticket.ticket_detail:
  &lt;tr>
    &lt;td>&lt;/td>
    &lt;td>${detail.stamp}&lt;/td>
    &lt;td>${detail.detail}&lt;/td>
  &lt;/tr>
  % endfor
% endfor
&lt;/table>
</pre>
<p>To do the same thing without using an ORM, you need to revert to a control break structure similar to the following:</p>
<pre>
current_ticket=0
for ticket in tickets:
  if (current_ticket != ticket.ticket_id):
    #new row, print the header
    print "&lt;tr>&lt;td>first piece&lt;/td>&lt;/tr>"
    current_ticket = ticket.ticket_id
  # print our detail row
  print "&lt;tr>&lt;td>&lt;/td>&lt;td>stamp and detail&lt;/td>&lt;/tr>"
</pre>
<p>Control Break structures require you to be able to set a variable within your template language.  Some template languages don&#8217;t allow that.  If your template language (in any language) can&#8217;t do variable assignments in the template, guess where your html generation logic needs to go?  </p>
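<p>If the template language cannot hold state, one way out is to do the grouping in code before the template sees the data.  The following is a minimal sketch of that idea in plain Python 3, not what SQLAlchemy does internally: the row layout and the nest_tickets helper are assumptions made for illustration.  It turns flat joined rows into the same nested shape the ORM relation provides.</p>

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flat rows, as a join of cp_ticket and cp_ticket_detail would
# return them: (ticket_id, priority, stamp, detail), ordered by ticket_id.
rows = [
    (1, "3", "2009-10-01 09:00:00", "first report"),
    (1, "3", "2009-10-01 09:30:00", "follow-up"),
    (2, "1", "2009-10-02 14:00:00", "server down"),
]


def nest_tickets(flat_rows):
    """Group joined rows into one dict per ticket with a list of details."""
    tickets = []
    # groupby only merges adjacent rows, so the input must be sorted by ticket_id.
    for (ticket_id, priority), group in groupby(flat_rows, key=itemgetter(0, 1)):
        details = [{"stamp": stamp, "detail": detail} for _, _, stamp, detail in group]
        tickets.append({"ticket_id": ticket_id, "priority": priority, "ticket_detail": details})
    return tickets


tickets = nest_tickets(rows)
```

<p>With the data nested this way, the template only needs two plain loops and no variable assignment, which is the same position the ORM puts you in without the extra helper.</p>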
<p>With an ORM, the template contains your display logic.  Your webmaster/design team can modify the template without having to modify html contained within your code.  The loops are simple to understand and designers usually have little problem avoiding the lines that start with %.</p>
<p>Sure, you could wrap much of this logic in your template to do the control-break structure, but, as you get more complex data, deciding how to display the data requires a define or some other functionality.</p>
<p>An ORM adds a layer of insulation to the process, but the result is a much simpler page structure when displaying related data.  Granted, there is some performance cost, and SQLAlchemy appears to generate some queries that are not optimal.  Unless that cost is severe, I think the benefits of an ORM for developing a web application are tremendous.</p>
<p>Once you move into an environment with multiple developers, having a defined schema with comments is much easier than using reflection to figure out the meaning of a status field defined as enum('U','A','P','C','R','S').  </p>
<p>However, as the original poster mentions, you can do raw SQL within SQLAlchemy and do all of your work with reflection as he has done with his ORM^H^H^H, abstraction.  If he&#8217;s still using SQLAlchemy, he can selectively decide when to use it and when to avoid it.</p>
]]></content:encoded>
			<wfw:commentRss>http://cd34.com/blog/programming/python/why-do-you-use-an-object-relational-mapping-orm-system-in-development/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>mysql 5.1&#8217;s query optimizer</title>
		<link>http://cd34.com/blog/scalability/mysql-5-1s-query-optimizer/</link>
		<comments>http://cd34.com/blog/scalability/mysql-5-1s-query-optimizer/#comments</comments>
		<pubDate>Wed, 07 Oct 2009 06:28:51 +0000</pubDate>
		<dc:creator>cd34</dc:creator>
				<category><![CDATA[Scalability]]></category>
		<category><![CDATA[benchmark]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://cd34.com/blog/?p=781</guid>
		<description><![CDATA[While debugging an issue with an application that relies heavily on MySQL, an issue was brought up regarding the cardinality of the keys selected, and, the order in which the keys were indexed. With any relational database, in order to get the fastest performance, your query should reduce the result set as quickly as possible. [...]]]></description>
			<content:encoded><![CDATA[<p>While I was debugging an application that relies heavily on MySQL, a question came up about the cardinality of the indexed keys and the order in which the columns were indexed.  With any relational database, the fastest queries are those that reduce the result set as quickly as possible.  A column with high cardinality, meaning many distinct values, lets a B-Tree (or R-Tree) index narrow the search to a handful of rows.   If your data consists of:</p>
<p>One thousand records with the date 2009-01-01<br />
One thousand records with the date 2009-01-02<br />
&#8230;<br />
One thousand records with the date 2009-12-31</p>
<p>The cardinality, or uniqueness, of that column is low: you&#8217;ll have 365000 rows in blocks of one thousand that share the same key.  If 125 different IP addresses per day generate those same thousand records, the cardinality of the IP address column will be much higher.</p>
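<p>To put rough numbers on that, here is a small sketch of the arithmetic.  The selectivity helper is a hypothetical name, and the sketch assumes that the 125 addresses seen each day never repeat on another day, which the example above does not state.</p>

```python
def selectivity(distinct_values, total_rows):
    """Fraction of rows that carry a distinct key; 1.0 means every key is unique."""
    return distinct_values / float(total_rows)


DAYS = 365
ROWS_PER_DAY = 1000
IPS_PER_DAY = 125  # assumed not to repeat across days

total_rows = DAYS * ROWS_PER_DAY
date_selectivity = selectivity(DAYS, total_rows)
ip_selectivity = selectivity(DAYS * IPS_PER_DAY, total_rows)

# Average number of rows an equality match on each column has to consider.
rows_per_date = total_rows // DAYS
rows_per_ip = total_rows // (DAYS * IPS_PER_DAY)
```

<p>Under those assumptions a date matches a thousand rows on average while an IP address matches eight, which is why the order of the columns in a compound index is worth testing.</p>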
<p>To show the performance differences between indexing schemes and representations, a table was created with an unsigned int column for the IP address, a varchar(15) column for the same IP address as text, a date column, and a text column for some filler data.  Because of the way the MySQL query processor works, it is possible to construct your query so that the results are answered from the index and the data file is never hit.  One test sample was created and used for all of the tests.  For each test the table is indexed and optimized, then the sample is run five times and the cumulative time recorded.  The sample consists of 48000 of the ten million rows plus 2000 randomly generated queries, shuffled and written to a file.  Testing hits versus misses emulates real world situations a little more accurately.  All of the code used to run these tests is included in this post.</p>
<h3>Test Setup</h3>
<p>Creation of the table:</p>
<pre>
CREATE TABLE `querytest` (
  `iip` int(10) unsigned DEFAULT NULL,
  `ipv` varchar(15) DEFAULT NULL,
  `date` date DEFAULT NULL,
  `randomtext` text
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
</pre>
<p>Filling the table with data:</p>
<pre>
#!/usr/bin/python

import MySQLdb
import random
import datetime
import time

lipsum = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Morbi gravida congue nisi, nec auctor leo placerat nec. In hac habitasse platea dictumst. In rutrum blandit velit et varius. Integer commodo ipsum ut diam placerat feugiat. Curabitur viverra erat ut felis cursus mollis. Sed tempus tempor faucibus. Etiam eget arcu massa, eget dictum sapien. Nullam euismod purus vitae risus ultrices tempus. Mauris semper rhoncus lectus, sit amet laoreet mauris tincidunt et. Duis ut mauris massa. Nam semper, enim id fermentum tristique, ligula velit suscipit lacus, vitae ultrices mi arcu sit amet felis. Ut sit amet tellus eget lorem gravida malesuada.

Integer nec massa quis mauris porta laoreet. Curabitur tincidunt nunc at mauris porttitor auctor. Mauris auctor faucibus tortor dignissim sodales. Sed ut tellus nisi, laoreet malesuada tortor. Vivamus blandit neque et nunc fringilla quis dignissim felis tincidunt. Nam nec varius orci. Duis pretium magna id urna fermentum commodo. Aliquam sollicitudin imperdiet leo eget ullamcorper. Quisque id mauris nec purus pulvinar bibendum. Fusce nunc metus, viverra in iaculis id, tempus nec neque. Aenean ac diam arcu, vitae condimentum lectus. Vivamus cursus iaculis tortor eget bibendum. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Aenean elementum odio et nisl ornare at sodales eros porta. Duis mollis tincidunt neque, sed pulvinar enim ultrices a. Sed laoreet nunc ut nisl luctus a egestas quam luctus. Pellentesque non dui et neque ullamcorper condimentum ac ut turpis. Etiam a lectus odio, vitae bibendum arcu. Nulla egestas dolor ligula.

Quisque rhoncus neque ultrices mi lacinia tempus. Sed scelerisque libero dui, quis vulputate leo. Phasellus nibh ante, viverra sed cursus ac, dictum et lectus. Suspendisse potenti. Ut dapibus augue vitae sem convallis in iaculis nibh bibendum. Mauris eu sapien in lacus pharetra fermentum. Etiam eleifend vulputate velit, a tempor augue ultrices vitae. Vestibulum varius orci ac justo adipiscing quis dignissim odio porttitor. Nam ac metus leo. Ut a porttitor lectus. Nunc accumsan ante non eros feugiat suscipit.

Nulla facilisi. Nam molestie dignissim purus sed lacinia. Etiam tristique, eros vel condimentum fermentum, ipsum justo vulputate erat, sed faucibus nunc nisl id tellus. Aliquam a tempus leo. Nullam et sem nunc. Suspendisse potenti. Quisque ante lorem, aliquam sed aliquet vel, malesuada sit amet nisl. Vestibulum tristique velit pellentesque sapien ultrices non gravida ante blandit. Donec luctus nunc dictum felis feugiat sollicitudin. Nam lectus mi, porttitor sed adipiscing ac, pharetra a orci. Ut vitae eros vitae metus.
"""

db = MySQLdb.connect(host="localhost", user="querytest", passwd="qt1qt1", db="querytest")
cursor = db.cursor()

length = len(lipsum)
jan_1_2009 = time.mktime((2009, 1, 1, 0, 0, 0, 0, 0, 0))

for i in range (1, 10000001):

  # generate a random IP address
  rand_ip = random.randint(1,4294967295)

  # pull a random piece of text from lipsum with a random length
  start_pos = random.randint(1,length)
  end_pos = start_pos + random.randint(200,2000)
  random_text = lipsum[start_pos:end_pos]

  # pick a random date in 2009; date arithmetic avoids the local time versus UTC mismatch of mktime() and gmtime()
  rand_date = (datetime.date(2009, 1, 1) + datetime.timedelta(days=random.randint(0, 364))).isoformat()

  cursor.execute("insert into querytest (iip,ipv,date,randomtext) values (%s,inet_ntoa(%s),%s,%s)", (rand_ip, rand_ip, rand_date, random_text))

cursor.close ()
db.close ()
</pre>
<p>Generate test set:</p>
<pre>
#!/usr/bin/python

import MySQLdb
import random
import datetime
import time
import socket
import struct

db = MySQLdb.connect(host="localhost", user="querytest", passwd="qt1qt1", db="querytest")
cursor = db.cursor()

jan_1_2009 = time.mktime((2009, 1, 1, 0, 0, 0, 0, 0, 0))

cursor.execute("select iip,ipv,date from querytest order by rand() limit 48000")

data = list(cursor.fetchall())

for i in range (1, 2001):

  # generate a random IP address
  rand_ip = random.randint(1,4294967295)

  # pick a random date in 2009; date arithmetic avoids the local time versus UTC mismatch of mktime() and gmtime()
  rand_date = (datetime.date(2009, 1, 1) + datetime.timedelta(days=random.randint(0, 364))).isoformat()

  # '!L' packs exactly 4 bytes in network byte order, matching MySQL's inet_ntoa()
  data.append((rand_ip, socket.inet_ntoa(struct.pack('!L',rand_ip)), rand_date))

random.shuffle(data)
for datum in data:
  print "%s,%s,%s" % (datum[0], datum[1], datum[2])

cursor.close ()
db.close ()
</pre>
<p>At this point we have created the table, filled it with ten million rows, and generated a fifty-thousand-row query set to run against the table.  Now we need to lay out the theories to test, to see whether cardinality plays as large a role as it used to.</p>
<h3>The following tests will be performed</h3>
<p>Index of iip,date</p>
<p>* Use the unsigned int representation of the IP address and the date<br />
* Use the text representation of the IP address passed through inet_aton() and the date</p>
<p>Index of ipv, date</p>
<p>* Use the text representation of the IP address and the date<br />
* Use the unsigned int representation of the IP address passed through inet_ntoa() and the date</p>
<p>Index of date,iip</p>
<p>* Use date and the unsigned int representation of the IP address<br />
* Use date and the text representation of the IP address passed through inet_aton()</p>
<p>Index of date,ipv</p>
<p>* Use date and the text representation of the IP address<br />
* Use date and the unsigned int representation of the IP address passed through inet_ntoa()</p>
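<p>Each of the above tests will be run with select * and with a select of only the two indexed columns (iip,date or ipv,date), and each of those against both a default index and an index declared using HASH.</p>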
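<p>The inet_aton() and inet_ntoa() calls in these tests are MySQL functions.  For reference, the same conversion can be sketched on the Python side with the standard socket and struct modules.  The helper names ip_to_int and int_to_ip are made up for this illustration and are not used by the benchmark.</p>

```python
import socket
import struct


def ip_to_int(dotted):
    """Dotted-quad string to unsigned 32-bit integer, like MySQL's inet_aton()."""
    # "!L" is a 4-byte unsigned integer in network (big-endian) byte order.
    return struct.unpack("!L", socket.inet_aton(dotted))[0]


def int_to_ip(number):
    """Unsigned 32-bit integer to dotted-quad string, like MySQL's inet_ntoa()."""
    return socket.inet_ntoa(struct.pack("!L", number))
```

<p>For example, ip_to_int("192.168.0.1") gives 3232235521, and int_to_ip(3232235521) gives the dotted form back.</p>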
<h3>Benchmark Code</h3>
<pre>
#!/usr/bin/python

import MySQLdb
import random
import datetime
import time
import socket
import struct
import array

def run_query(query, data, columna, columnb):
    for datum in data:
      cursor.execute(query, (datum[columna], datum[columnb]))
      result = cursor.fetchall()

query_tests = [
               ['create index querytest on querytest (iip,date)',
                'select * from querytest where iip=%s and date=%s',
                0,
                2
               ],
               ['create index querytest on querytest (iip,date) using HASH',
                'select * from querytest where iip=%s and date=%s',
                0,
                2
               ],
               ['create index querytest on querytest (iip,date)',
                'select iip,date from querytest where iip=%s and date=%s',
                0,
                2
               ],
               ['create index querytest on querytest (iip,date) using HASH',
                'select iip,date from querytest where iip=%s and date=%s',
                0,
                2
               ],
               ['create index querytest on querytest (iip,date)',
                'select * from querytest where iip=inet_aton(%s) and date=%s',
                1,
                2
               ],
               ['create index querytest on querytest (iip,date) using HASH',
                'select * from querytest where iip=inet_aton(%s) and date=%s',
                1,
                2
               ],
               ['create index querytest on querytest (iip,date)',
                'select iip,date from querytest where iip=inet_aton(%s) and date=%s',
                1,
                2
               ],
               ['create index querytest on querytest (iip,date) using HASH',
                'select iip,date from querytest where iip=inet_aton(%s) and date=%s',
                1,
                2
               ],
               ['create index querytest on querytest (ipv,date)',
                'select * from querytest where ipv=%s and date=%s',
                1,
                2
               ],
               ['create index querytest on querytest (ipv,date) using HASH',
                'select * from querytest where ipv=%s and date=%s',
                1,
                2
               ],
               ['create index querytest on querytest (ipv,date)',
                'select ipv,date from querytest where ipv=%s and date=%s',
                1,
                2
               ],
               ['create index querytest on querytest (ipv,date) using HASH',
                'select ipv,date from querytest where ipv=%s and date=%s',
                1,
                2
               ],
               ['create index querytest on querytest (ipv,date)',
                'select * from querytest where ipv=inet_ntoa(%s) and date=%s',
                0,
                2
               ],
               ['create index querytest on querytest (ipv,date) using HASH',
                'select * from querytest where ipv=inet_ntoa(%s) and date=%s',
                0,
                2
               ],
               ['create index querytest on querytest (ipv,date)',
                'select ipv,date from querytest where ipv=inet_ntoa(%s) and date=%s',
                0,
                2
               ],
               ['create index querytest on querytest (ipv,date) using HASH',
                'select ipv,date from querytest where ipv=inet_ntoa(%s) and date=%s',
                0,
                2
               ],
               ['create index querytest on querytest (date,iip)',
                'select * from querytest where date=%s and iip=%s',
                2,
                0
               ],
               ['create index querytest on querytest (date,iip) using HASH',
                'select * from querytest where date=%s and iip=%s',
                2,
                0
               ],
               ['create index querytest on querytest (date,iip)',
                'select iip,date from querytest where date=%s and iip=%s',
                2,
                0
               ],
               ['create index querytest on querytest (date,iip) using HASH',
                'select iip,date from querytest where date=%s and iip=%s',
                2,
                0
               ],
               ['create index querytest on querytest (date,iip)',
                'select * from querytest where date=%s and iip=inet_aton(%s)',
                2,
                1
               ],
               ['create index querytest on querytest (date,iip) using HASH',
                'select * from querytest where date=%s and iip=inet_aton(%s)',
                2,
                1
               ],
               ['create index querytest on querytest (date,iip)',
                'select iip,date from querytest where date=%s and iip=inet_aton(%s)',
                2,
                1
               ],
               ['create index querytest on querytest (date,iip) using HASH',
                'select iip,date from querytest where date=%s and iip=inet_aton(%s)',
                2,
                1
               ],
               ['create index querytest on querytest (date,ipv)',
                'select * from querytest where date=%s and ipv=%s',
                2,
                1
               ],
               ['create index querytest on querytest (date,ipv) using HASH',
                'select * from querytest where date=%s and ipv=%s',
                2,
                1
               ],
               ['create index querytest on querytest (date,ipv)',
                'select ipv,date from querytest where date=%s and ipv=%s',
                2,
                1
               ],
               ['create index querytest on querytest (date,ipv) using HASH',
                'select ipv,date from querytest where date=%s and ipv=%s',
                2,
                1
               ],
               ['create index querytest on querytest (date,ipv)',
                'select * from querytest where date=%s and ipv=inet_ntoa(%s)',
                2,
                0
               ],
               ['create index querytest on querytest (date,ipv) using HASH',
                'select * from querytest where date=%s and ipv=inet_ntoa(%s)',
                2,
                0
               ],
               ['create index querytest on querytest (date,ipv)',
                'select ipv,date from querytest where date=%s and ipv=inet_ntoa(%s)',
                2,
                0
               ],
               ['create index querytest on querytest (date,ipv) using HASH',
                'select ipv,date from querytest where date=%s and ipv=inet_ntoa(%s)',
                2,
                0
               ],
              ]

db = MySQLdb.connect(host="localhost", user="querytest", passwd="qt1qt1", db="querytest")
cursor = db.cursor()

queries = open('testquery.txt').readlines()

query_array = []
for query_data in queries:
  query_array.append(query_data.rstrip('\n').split(','))

for test in query_tests:
  try:
    cursor.execute('alter table querytest drop index querytest')
  except MySQLdb.Error:
    # the index does not exist yet on the first pass
    pass
  cursor.execute(test[0])
  cursor.execute('optimize table querytest')

  print "Test: %s\n with Index: %s" % (test[1], test[0])
  start_time = time.time()

  for loop in range (1,6):
    run_query(test[1], query_array, test[2], test[3])

  end_time = time.time()
  print "Duration: %f seconds\n" % (end_time - start_time)

cursor.close ()
db.close ()
</pre>
<h3>Miscellaneous notes</h3>
<p>P4/3.0GHz, 2GB RAM, Debian Squeeze, Linux 2.6.31.1, WD 7200RPM SATA drive, SuperMicro P4SCI motherboard</p>
<p>There are multiple tests that could have been run without dropping the index, recreating the index and optimizing the table.  When testing a more limited set, results were a little sporadic due to a smaller initial test set and portions of the table and index being cached in the kernel cache.  To ensure more consistent test results, every test was run in a consistent manner.</p>
<h3>Benchmark Results</h3>
<pre>
Test: select * from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 679.169198 seconds

Test: select * from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date) using HASH
Duration: 692.634291 seconds

Test: select iip,date from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 179.039791 seconds

Test: select iip,date from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date) using HASH
Duration: 178.993962 seconds

Test: select * from querytest where iip=inet_aton(%s) and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 672.836734 seconds

Test: select * from querytest where iip=inet_aton(%s) and date=%s
 with Index: create index querytest on querytest (iip,date) using HASH
Duration: 606.268787 seconds

Test: select iip,date from querytest where iip=inet_aton(%s) and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 195.253512 seconds

Test: select iip,date from querytest where iip=inet_aton(%s) and date=%s
 with Index: create index querytest on querytest (iip,date) using HASH
Duration: 195.222058 seconds

Test: select * from querytest where ipv=%s and date=%s
 with Index: create index querytest on querytest (ipv,date)
Duration: 741.876227 seconds

Test: select * from querytest where ipv=%s and date=%s
 with Index: create index querytest on querytest (ipv,date) using HASH
Duration: 639.109309 seconds

Test: select ipv,date from querytest where ipv=%s and date=%s
 with Index: create index querytest on querytest (ipv,date)
Duration: 167.049333 seconds

Test: select ipv,date from querytest where ipv=%s and date=%s
 with Index: create index querytest on querytest (ipv,date) using HASH
Duration: 167.016152 seconds

Test: select * from querytest where ipv=inet_ntoa(%s) and date=%s
 with Index: create index querytest on querytest (ipv,date)
Duration: 578.565762 seconds

Test: select * from querytest where ipv=inet_ntoa(%s) and date=%s
 with Index: create index querytest on querytest (ipv,date) using HASH
Duration: 655.869390 seconds

Test: select ipv,date from querytest where ipv=inet_ntoa(%s) and date=%s
 with Index: create index querytest on querytest (ipv,date)
Duration: 181.555567 seconds

Test: select ipv,date from querytest where ipv=inet_ntoa(%s) and date=%s
 with Index: create index querytest on querytest (ipv,date) using HASH
Duration: 181.230911 seconds

Test: select * from querytest where date=%s and iip=%s
 with Index: create index querytest on querytest (date,iip)
Duration: 655.928799 seconds

Test: select * from querytest where date=%s and iip=%s
 with Index: create index querytest on querytest (date,iip) using HASH
Duration: 637.146124 seconds

Test: select iip,date from querytest where date=%s and iip=%s
 with Index: create index querytest on querytest (date,iip)
Duration: 181.637912 seconds

Test: select iip,date from querytest where date=%s and iip=%s
 with Index: create index querytest on querytest (date,iip) using HASH
Duration: 181.512190 seconds

Test: select * from querytest where date=%s and iip=inet_aton(%s)
 with Index: create index querytest on querytest (date,iip)
Duration: 603.553238 seconds

Test: select * from querytest where date=%s and iip=inet_aton(%s)
 with Index: create index querytest on querytest (date,iip) using HASH
Duration: 605.363284 seconds

Test: select iip,date from querytest where date=%s and iip=inet_aton(%s)
 with Index: create index querytest on querytest (date,iip)
Duration: 196.680399 seconds

Test: select iip,date from querytest where date=%s and iip=inet_aton(%s)
 with Index: create index querytest on querytest (date,iip) using HASH
Duration: 194.746056 seconds

Test: select * from querytest where date=%s and ipv=%s
 with Index: create index querytest on querytest (date,ipv)
Duration: 657.619028 seconds

Test: select * from querytest where date=%s and ipv=%s
 with Index: create index querytest on querytest (date,ipv) using HASH
Duration: 686.560066 seconds

Test: select ipv,date from querytest where date=%s and ipv=%s
 with Index: create index querytest on querytest (date,ipv)
Duration: 172.222691 seconds

Test: select ipv,date from querytest where date=%s and ipv=%s
 with Index: create index querytest on querytest (date,ipv) using HASH
Duration: 172.079220 seconds

Test: select * from querytest where date=%s and ipv=inet_ntoa(%s)
 with Index: create index querytest on querytest (date,ipv)
Duration: 726.031732 seconds

Test: select * from querytest where date=%s and ipv=inet_ntoa(%s)
 with Index: create index querytest on querytest (date,ipv) using HASH
Duration: 678.099808 seconds

Test: select ipv,date from querytest where date=%s and ipv=inet_ntoa(%s)
 with Index: create index querytest on querytest (date,ipv)
Duration: 185.415666 seconds

Test: select ipv,date from querytest where date=%s and ipv=inet_ntoa(%s)
 with Index: create index querytest on querytest (date,ipv) using HASH
Duration: 185.280880 seconds
</pre>
<h3>Conclusions</h3>
<p>Based on the data, B-Tree versus Hash makes no consistent difference.  Neither is consistently better, and since the data and the query set are identical for both, the results don&#8217;t point to a clear winner.  That is expected here: the table uses MyISAM, which supports only B-Tree indexes for ordinary columns, so an index created with using HASH is built as a B-Tree anyway and the paired tests are effectively repeats of each other.  Avoiding select * and pulling only the required fields makes a difference, and if your result can be answered from the index rather than the data file, there is a substantial boost.  Analysis of the results suggests that cardinality isn&#8217;t as important as it used to be.  I am devising a method to test cardinality further, as I believe that live data will behave somewhat differently from data on which optimize table has just been run.</p>
<p>The winner in this case is:</p>
<pre>
Test: select ipv,date from querytest where ipv=%s and date=%s
 with Index: create index querytest on querytest (ipv,date)
Duration: 167.049333 seconds

Test: select ipv,date from querytest where ipv=%s and date=%s
 with Index: create index querytest on querytest (ipv,date) using HASH
Duration: 167.016152 seconds
</pre>
<p>I had actually expected the IP stored as an unsigned int to be the fastest.  However, there is probably a reasonable explanation for why these two queries are slower:</p>
<pre>
Test: select iip,date from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 179.039791 seconds

Test: select iip,date from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date) using HASH
Duration: 178.993962 seconds
</pre>
<p>Data in MySQL is stored in binary form.  The IP stored as an unsigned int takes 4 bytes and the date takes 3, so the key length is 7 bytes, versus 18 bytes for the index on the IP stored as varchar(15) plus the date.  Even though the second index is almost three times the size of the first, it was faster, because the MySQL client library converts all binary data to ASCII when communicating to avoid endian issues.  That extra conversion makes the unsigned int queries slightly slower, which becomes measurable when you run 250000 queries against a 10 million record database.</p>
<p>A quick modification of the test shows the results of select *, versus select keyvaluea,keyvalueb and select data,keyvalueb.  As you can see from the results below, MySQL will answer queries from the index if it doesn&#8217;t need to hit the data file.</p>
<pre>
Test: select * from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 637.420786 seconds

Test: select iip,date from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 178.434477 seconds

Test: select ipv,date from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 690.804990 seconds

Test: select inet_ntoa(iip) as iip,date from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 183.817643 seconds
</pre>
<p>If you can structure your data well, there are significant performance gains to be had.</p>
<h3>What does this mean?</h3>
<p>Do you store IPs as unsigned int in the database?  If you use varchar(15) or char(15) instead, switching to unsigned int saves roughly ten or eleven bytes per record at the expense of some conversion time.  varchar uses 1 byte to store the length of the stored data, plus the data itself.  char is a fixed length based on the column length you specify.</p>
<p>Make sure you return only the columns that you need in your calculations &#8212; especially if you are running MySQL over a network.</p>
<p>Try to create your index to match the conditions that you are looking for, and, when possible, if you are searching for the result from a particular column, consider adding it to the index as well.</p>
<p>Always use count(*) rather than count(column) unless you specifically need to exclude the rows where that column is NULL.</p>
<h3>The Effect of count(*) versus count(date)</h3>
<p>count(*) gives you the number of rows in the set that match the criteria you have set.  count(date) counts the number of rows in the set that match the criteria where the date is not null.  Many times, you&#8217;ll see someone do a count(id), where id by definition is a primary key, auto_increment and cannot be null.  Because count(column) must confirm that the column is not null for each matching row, it is forced to check every key or, if the column is not part of the index, read the data file for all of the matching rows.  If the column being counted is one of the keys in the index, the performance change won&#8217;t be as dramatic.  When the counted column isn&#8217;t in the key and the data has to be read, count(column) is considerably slower.</p>
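<p>To make the difference in meaning concrete, here is a minimal sketch in Python.  It assumes the bundled sqlite3 module as a stand-in for MySQL; the NULL rule for count() is standard SQL and behaves the same way in both, but none of the timings in this post apply to SQLite.  The three sample rows are made up.</p>

```python
import sqlite3

# In-memory SQLite stands in for MySQL here; only the NULL-handling rule
# of count() is being illustrated, not performance.
conn = sqlite3.connect(":memory:")
conn.execute("create table querytest (iip integer, date text)")
conn.executemany(
    "insert into querytest (iip, date) values (?, ?)",
    [(1, "2009-08-01"), (2, "2009-08-01"), (3, None)],
)

# count(*) counts rows; count(date) skips the row where date is NULL.
all_rows = conn.execute("select count(*) from querytest").fetchone()[0]
dated_rows = conn.execute("select count(date) from querytest").fetchone()[0]
print(all_rows, dated_rows)
```

<p>If the two numbers can never differ for your column, count(*) gives the same answer without the extra check.</p>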
<p>Results when the counted column is within the key and only 1 or 0 rows are expected:</p>
<pre>
Test: select count(*) as ct from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 175.727338 seconds

Test: select count(iip) as ct from querytest where iip=%s and date=%s
 with Index: create index querytest on querytest (iip,date)
Duration: 176.495198 seconds
</pre>
<p>When the count covers more than one row, the effect is much more detrimental.  The first iteration of this test took so long that I shortened it to five iterations of 100 queries.  After 4 hours it was only 18% complete, so I shortened the test again to one iteration of ten queries.  The results clearly demonstrate the issue without taking 20+ hours to run a single simple benchmark.  Simply stated, unless you really have a valid reason to check whether the column is null, DON&#8217;T!</p>
<pre>
Test: select count(*) as ct from querytest where date=%s
 with Index: create index querytest on querytest (date,iip)
Duration: 0.408268 seconds

Test: select count(ipv) as ct from querytest where date=%s
 with Index: create index querytest on querytest (date,iip)
Duration: 3085.770998 seconds
</pre>
<h3>The Fine Print</h3>
<p>* Index columns used in your where conditions<br />
* B-Tree versus Hash doesn&#8217;t appear to materially affect results<br />
* storing IP as char(15) if the data is being returned to the client can be faster than storing an IP as an unsigned int.  If the IP is not fetched but only used in comparisons, unsigned int is probably the better choice.<br />
* Consider adding that extra column to your index to prevent MySQL from having to read the data file.  Answering your query from the index is significantly faster.<br />
* count(*) rather than count(column)</p>
<p>Live data will not act precisely as the benchmark &#8212; what live scenario ever does?  But, I believe the tests above should show some of the performance gains available by structuring your tables and queries.</p>
<p>While MySQL 4, 5.0 and 5.1 will reorder conditions to match the index key, there are some significant performance gains from 4.x to 5.0.  MySQL 5.1 didn&#8217;t show considerable gains from MySQL 5.0, but, there are some minor speed increases.</p>
]]></content:encoded>
			<wfw:commentRss>http://cd34.com/blog/scalability/mysql-5-1s-query-optimizer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Mysql Query Optimization</title>
		<link>http://cd34.com/blog/scalability/mysql-query-optimization/</link>
		<comments>http://cd34.com/blog/scalability/mysql-query-optimization/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 04:16:30 +0000</pubDate>
		<dc:creator>cd34</dc:creator>
				<category><![CDATA[Scalability]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://cd34.com/blog/?p=745</guid>
		<description><![CDATA[I heard a comment from a developer the other day: You don&#8217;t need indexes on small tables. So I asked what the definition of a small table was. He said, anything with a few hundred rows. So I said, 2300 rows? Well&#8230;.. 24000 rows? Well&#8230;.. 292000 rows? That&#8217;s large. I showed him unindexed queries in [...]]]></description>
			<content:encoded><![CDATA[<p>I heard a comment from a developer the other day:</p>
<blockquote><p>You don&#8217;t need indexes on small tables.</p></blockquote>
<p>So I asked what the definition of a small table was.  He said, anything with a few hundred rows.  So I said, 2300 rows?  Well&#8230;.. 24000 rows? Well&#8230;.. 292000 rows?  That&#8217;s large.  I showed him unindexed queries in his application dealing with tables that had 2300, 24000 and 292000 rows.</p>
<h3>Avoid tablescans</h3>
<p>When MySQL deals with a query that is unindexed, it does a full tablescan to see if each record in the table meets the criteria specified.  On a small table, if the query is executed frequently, the MySQL query cache might be able to serve the query.  However, on a larger table, or a table with large rows, it must read every row, check the fields, possibly create a temporary table in RAM or on disk, and return the results.  On a small site, you might not notice it, but on a large system, forcing tablescans on tables with even a few thousand rows will slow things down considerably:</p>
<blockquote><p>Uptime: 60016  Threads: 11  Questions: 105460332  Slow queries: 197769  Opens: 5819  Flush tables: 1  Open tables: 1320  Queries per second avg: 1757.204</p></blockquote>
<p>Slow queries are sometimes unavoidable, but, often, slow queries are missing an index.</p>
<h3>Use the slow-query log to find potential issues</h3>
<p>When analyzing a system to find problems, putting:</p>
<blockquote><p>log-queries-not-using-indexes</p></blockquote>
<p>in the my.cnf file and restarting mysql will log the unindexed queries to the slow query log.</p>
<h3>What can be indexed?</h3>
<p>The rule of thumb when writing indexes is to write your query in such a way that you reduce the result set as quickly as possible, with the highest cardinality possible.  What does this mean?</p>
<p>If you are collecting the IP address and the date, an index on date,ip will actually perform worse than one on ip,date.  Imagine receiving 40000 hits to your site on the same date.  If you were looking for the number of hits that a particular IP had today, an index on ip,date would search the 41 hits that IP has made over time and then find the 8 made today.  An index on date,ip would search the 40000 rows for today before arriving at those same 8.</p>
<p>Each index you have adds extra overhead, and an index file should be as small as possible.  IP addresses can be represented as an unsigned int, which takes much less space than the varchar(15) usually used.  Remember that when you index a varchar field, indexing will space-pad the key to the full length.  If you have a variable length field you want indexed, you might be able to figure out the significant portion of that field by finding the average length, adding a few characters for good measure, and indexing fieldname(15) rather than the entire field.  If a value is longer than those 15 characters, the index has still significantly reduced the number of rows that must be checked.</p>
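<p>For the unsigned int representation, the conversion can be done in the application before the value ever reaches the database.  The sketch below uses only the Python standard library; the function names ip_to_int and int_to_ip are my own for illustration.  MySQL offers INET_ATON() and INET_NTOA() if you would rather convert inside the query.</p>

```python
import socket
import struct

def ip_to_int(dotted):
    """Pack a dotted-quad IPv4 string into a 4-byte unsigned integer."""
    return struct.unpack("!I", socket.inet_aton(dotted))[0]

def int_to_ip(value):
    """Reverse of ip_to_int: unsigned integer back to dotted-quad text."""
    return socket.inet_ntoa(struct.pack("!I", value))

n = ip_to_int("192.168.1.10")
print(n, int_to_ip(n))
```

<p>The integer fits in 4 bytes no matter how long the dotted form is, which is where the smaller key comes from.</p>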
<p>Cardinality refers to the uniqueness of the data.  The more unique the data, the lower the chance that you&#8217;ll have thousands of records that match the first criteria.  When the data is very similar, the index as built on disk will become imbalanced, resulting in slower queries.  Since MyISAM and InnoDB use a B-Tree index (or R-Tree if you use a spatial index), data that is similar when inserted can create a very imbalanced tree, which leads to slower lookups.  An optimize table can re-sort and reindex the table to eliminate this, but you can&#8217;t do that on an extremely large, active table without impacting response times.</p>
<blockquote><p># Query_time: 0  Lock_time: 0  Rows_sent: 1  Rows_examined: 3323<br />
SELECT * FROM websites_geo where (zoneid = '5135') LIMIT 1;</p></blockquote>
<p>In this case, zoneid is not indexed on the table websites_geo.  Adding an index on zoneid eliminates the tablescan on this query.</p>
<h3>Check for equality, not inequality.</h3>
<p>An index works best when checking equality.  A condition that checks whether values are not equal generally cannot make use of an index.</p>
<blockquote><p># Query_time: 0  Lock_time: 0  Rows_sent: 5  Rows_examined: 2548<br />
SELECT * FROM websites where (id = '1056692' &#038;&#038; status != 'd' &#038;&#038; status != 'n') order by rand() LIMIT 5;</p>
<p># Query_time: 0  Lock_time: 0  Rows_sent: 10  Rows_examined: 2544<br />
SELECT * FROM websites where (status != 'n' &#038;&#038; status != 'd' &#038;&#038; traffic > 3000) order by added desc LIMIT 10;</p></blockquote>
<p>These two queries show two different symptoms of the same fundamental issue.  First, id is not indexed; an index there would have at least limited the result set to 9 records rather than 2548.  The status check isn&#8217;t able to use an index.  In the second query, status is checked followed by traffic.  There are other queries issued that check status,traffic,clicks_high.  When we look at status (which should be an enum or char(1) rather than varchar(1)), we find that there are only 4 values used.  By indexing on id,status and status,traffic,clicks_high, we could alter the queries as such:</p>
<blockquote><p>SELECT * FROM websites where (id = '1056692' &#038;&#038; status in ('g',' ')) order by rand() LIMIT 5;</p>
<p>SELECT * FROM websites where (status in ('g',' ') &#038;&#038; traffic > 3000) order by added desc LIMIT 10;</p></blockquote>
<p>which would result in both queries using an index.  </p>
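<p>The rewrite from inequality checks to an in() list is mechanical once you know every value the column can hold.  Here is a rough Python sketch of that step.  The helper name in_clause is made up, and the four status values are assumed from the queries above; if the column can hold anything else, the two forms are no longer equivalent.</p>

```python
def in_clause(column, all_values, excluded):
    """Turn 'column != x and column != y' into an equivalent IN list.

    Only valid when all_values really is every value the column can hold.
    """
    allowed = sorted(set(all_values) - set(excluded))
    placeholders = ",".join(["%s"] * len(allowed))
    return "%s in (%s)" % (column, placeholders), allowed

# The four status values are an assumption based on the queries in this post.
clause, params = in_clause("status", ["g", " ", "d", "n"], ["d", "n"])
print(clause, params)
```

<p>The clause comes back with placeholders so the values can be passed to the database driver separately rather than pasted into the SQL.</p>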
<h3>Choose your data types intelligently.</h3>
<p>As a secondary point, id (though it is numeric) happens to be a text field.  If you index id in this case, you would have to specify a key length.</p>
<blockquote><p>mysql> select max(length(id)) from websites;<br />
+-----------------+<br />
| max(length(id)) |<br />
+-----------------+<br />
|              22 |<br />
+-----------------+<br />
1 row in set (0.02 sec)</p>
<p>mysql> select avg(length(id)) from websites;<br />
+-----------------+<br />
| avg(length(id)) |<br />
+-----------------+<br />
|          8.3315 |<br />
+-----------------+<br />
1 row in set (0.00 sec)</p>
<p>mysql> </p></blockquote>
<p>Based on this, we might decide to set the key length to 22 as it is a relatively small number and allows room to grow.  Personally, I would have opted to have the id be an unsigned int which would be much smaller, but, the application developer uses alphanumeric id&#8217;s which are exposed externally.  With sharding, you could use the id throughout the various tables, or, you could map the text id to a numeric id internally for all of the various tables.</p>
<p>There are a number of possible solutions to help any SQL engine perform better.  And your data set will dictate some of the things that you can do to make data access quicker.</p>
<h3>Helping MySQL Help You</h3>
<p>If you do <strong>select * from table where condition_a=1 and condition_b=2</strong> in one place, and <strong>select * from table where condition_b=2 and condition_a=1</strong> in another, setting up a single index on condition_a,condition_b and reversing the conditions in your second query to match the order of the keys in the index will increase performance.</p>
<h3>Limit your results</h3>
<p>Another thing that will help considerably is using a limit clause.  So many times a programmer will do:  <strong>select * from table where condition_a=1</strong> which returns 2300 rows but only the first few rows are used.  A limit clause will prevent a lot of data from being fetched by MySQL and buffered waiting for the response.  <strong>select * from table where condition_a=1 limit 20</strong> would hand you the first 20 records.</p>
<h3>Avoid reading the data file, do all your work from the Index</h3>
<p>Additionally, if you have a table and only need three of the columns from the result, <strong>select fielda,fieldb,fieldc from table where condition_a=1</strong> will return only the three fields.  As an added boost, if the fields you are checking can be answered from the index, the query will never hit the actual data file and will be answered from the index.  Many times I&#8217;ve added a field that wasn&#8217;t needed in the index, just to eliminate the lookup of the key in the index then the corresponding read of the data file.</p>
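<p>A rough way to reason about whether a query can be answered from the index alone is to check that every column it touches is in the index.  The Python sketch below does only that.  The helper name answered_from_index is made up, and it deliberately ignores everything else the optimizer considers, such as the order of the columns in the index.</p>

```python
def answered_from_index(index_columns, select_columns, where_columns):
    """True when every column the query touches is part of the index."""
    needed = set(select_columns) | set(where_columns)
    return needed.issubset(index_columns)

# Index on condition_a only: the selected fields force a read of the data file.
narrow = answered_from_index(
    ["condition_a"], ["fielda", "fieldb", "fieldc"], ["condition_a"]
)

# Adding the selected fields to the index lets the index answer the query.
wide = answered_from_index(
    ["condition_a", "fielda", "fieldb", "fieldc"],
    ["fielda", "fieldb", "fieldc"],
    ["condition_a"],
)
print(narrow, wide)
```

<p>The wider index costs more disk space and slower writes, so it is a trade you make for the queries that run most often.</p>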
<h3>Let MySQL do the work</h3>
<p>MySQL reads tables, filters results, can do some calculations.  Going through 40000 records to pick the best 100 is still faster in MySQL than allowing PHP to fetch 40000 rows and do calculations and sorts to come up with that 100 rows.  Index, optimize, and allow MySQL to do the database work.</p>
<h3>Summary</h3>
<p>Making MySQL work more efficiently goes a long way towards making your database driven site work better.  Adding six indexes to the system resulted in quicker response times and an increase in the transactions per second.</p>
<blockquote><p>Uptime: 32405  Threads: 1  Questions: 58729705  Slow queries: 64122  Opens: 2911  Flush tables: 1  Open tables: 295  Queries per second avg: 1812.366</p></blockquote>
<p>Previously, MySQL was generating roughly 3.3 slow queries per second.  Now we&#8217;re just beneath 2 slow queries per second and our system is processing 55 more transactions per second.  There is still a bit more analysis to do to identify the slow queries that are still running and to alter the queries to reverse the inequality checks, but even just adding indexes to a few tables has helped noticeably.  Once the developer is able to make some changes to the application, I&#8217;m sure we&#8217;ll see an additional speedup.</p>
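<p>The per-second figures come straight from the two status lines quoted in this post: divide the counter by the uptime.  A small Python sketch of that arithmetic follows; the helper names field and per_second are my own.</p>

```python
import re

before = ("Uptime: 60016  Threads: 11  Questions: 105460332  "
          "Slow queries: 197769  Opens: 5819  Flush tables: 1  "
          "Open tables: 1320  Queries per second avg: 1757.204")
after = ("Uptime: 32405  Threads: 1  Questions: 58729705  "
         "Slow queries: 64122  Opens: 2911  Flush tables: 1  "
         "Open tables: 295  Queries per second avg: 1812.366")

def field(status_line, name):
    """Pull one numeric counter out of a status line by its label."""
    return float(re.search(name + r": ([\d.]+)", status_line).group(1))

def per_second(status_line, name):
    """Average rate of a counter over the server uptime in seconds."""
    return field(status_line, name) / field(status_line, "Uptime")

for label, line in (("before", before), ("after", after)):
    print(label,
          round(per_second(line, "Slow queries"), 2),
          round(per_second(line, "Questions"), 1))
```

<p>Both counters are averages since the server started, so a restart resets the baseline.</p>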
]]></content:encoded>
			<wfw:commentRss>http://cd34.com/blog/scalability/mysql-query-optimization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>User Interface Design</title>
		<link>http://cd34.com/blog/infrastructure/user-interface-design/</link>
		<comments>http://cd34.com/blog/infrastructure/user-interface-design/#comments</comments>
		<pubDate>Wed, 24 Jun 2009 05:46:26 +0000</pubDate>
		<dc:creator>cd34</dc:creator>
				<category><![CDATA[Framework]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Web Infrastructure]]></category>
		<category><![CDATA[formencode]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[smarty]]></category>
		<category><![CDATA[sqlalchemy]]></category>
		<category><![CDATA[toscawidgets]]></category>
		<category><![CDATA[turbogears]]></category>

		<guid isPermaLink="false">http://cd34.com/blog/?p=676</guid>
		<description><![CDATA[Programmers are not designers. Technical people should not design User Interfaces. * 810 source files * 90658 lines of code * 10213 lines of html For an internal project tasked to a series of programmers throughout the years without enough oversight, it is a mass of undocumented code with multiple programming styles. PHP allowed lazy [...]]]></description>
			<content:encoded><![CDATA[<p>Programmers are not designers.  Technical people should not design User Interfaces.</p>
<p>* 810 source files<br />
* 90658 lines of code<br />
* 10213 lines of html </p>
<p>For an internal project tasked to a series of programmers throughout the years without enough oversight, it is a mass of undocumented code with multiple programming styles.  PHP allowed lazy programming, Smarty didn&#8217;t have some of the finesse required, so, the User Interface suffered.  Functional but confusing to anyone that hadn&#8217;t worked intimately with the interface or been walked through it.</p>
<p>The truest statement is that it is easier for me to do things through the MySQL command line than through the application.  While this does have a tendency to introduce possible typos, it has altered SQL practices here.</p>
<p><code>update table set value=123 where othervalue=246;</code></p>
<p>could have an accidental typo of </p>
<p><code>update table set value=123 where othervalue-=246;</code></p>
<p>which would have completely unintended consequences.  One typo altered the DNS entries for 48000 records.  Shortly after that typo, it became company policy that a query like that is never executed on the command line, regardless of how simple the command.</p>
<p>Even within code, the above command would be entered as:</p>
<p><code>update table set value=123 where othervalue in (246);</code></p>
<p>This prevented a number of potential typos.  Even limit clauses with deletions were enforced to make sure things didn&#8217;t go too haywire in an update.</p>
<p>With Python, indenting is mandatory, which results in multiple programmers&#8217; code looking similar and being easier to troubleshoot.  Utilizing SQLAlchemy, which enforces bind variables when talking with the database engine, we&#8217;ve eliminated the potential for a typo updating too many records.  Cascade deletes are enforced in SQLAlchemy even when running on top of MyISAM.  With MVC, our data model is much better defined and we&#8217;re not tied down to remembering the relationship between two tables and possible dependencies.  Conversion from the existing MySQL database to a DeclarativeBase model hasn&#8217;t been without issues, but a simple python program allowed the generation of a simple model that took care of most of the issues.  Hand tweaking the database model while developing the application has allowed for quite a bit of insight into issues that had been worked around rather than making adjustments to the database.</p>
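<p>The bind variable idea mentioned above can be sketched with nothing but the Python standard library.  This uses sqlite3 as a stand-in for SQLAlchemy on MySQL, and the dns table and its rows are hypothetical.  The point is only that the values travel separately from the SQL text.</p>

```python
import sqlite3

# sqlite3 is a stand-in here; the dns table and its rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("create table dns (value integer, othervalue integer)")
conn.executemany(
    "insert into dns (value, othervalue) values (?, ?)",
    [(1, 246), (2, 247), (3, 248)],
)

# The values are bound to the ? markers, so a slipped keystroke while typing
# a value cannot change the shape of the where clause.
cursor = conn.execute(
    "update dns set value = ? where othervalue in (?)", (123, 246)
)
print(cursor.rowcount)
```

<p>Checking the row count before committing is a cheap second line of defense against an update that touched more than expected.</p>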
<p>Fundamental design issues in the database structure were worked around with code rather than fixed.  Data that should have been retained was not, and relationships between tables were defined in code rather than in the database, leading to a painful conversion.</p>
<p>When it was decided to rewrite the application in Python using TurboGears, I wasn&#8217;t that familiar with the codebase nor the user interface.  Initially it was envisioned that the templates would be copied and the backend engine would be written to power those templates.  After a few hours running through the application, and attempting the conversion on a number of templates, I realized the application was functional but it was extremely difficult to use in its current state.  So much for having a programmer design an interface.</p>
<p>Some functionality from the existing system was needed so I peered into the codebase and was unprepared for that surprise.  At this point it became evident that a non-programmer had designed the interface.  While Smarty was a decent template language, it was not a formtool, so, methods were designed to give a consistent user experience when dealing with error handling.  A single php file was responsible for display, form submission and validation and writing to the database for each &#8216;page&#8217; in the application.  The code inside should have been straightforward.</p>
<p>* Set up default CSS classes for each form field for an &#8216;ok&#8217; result<br />
* Validate any passed values and set the CSS class as &#8216;error&#8217; for any value that fails validation<br />
* Insert/Update the record if the validation passes<br />
* Display the page</p>
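<p>The four steps above can be sketched in a few lines.  This is Python rather than the original PHP, and the function name process_form, the field names and the validators are all hypothetical.  It only shows the shape the code should have had, with validators defined once and passed in.</p>

```python
def process_form(values, validators, save):
    """Run the four steps: default classes, validate, save, report."""
    # Step 1: every field starts out with the 'ok' CSS class.
    css = {name: "ok" for name in validators}
    # Step 2: flag any value that fails its validator.
    for name, is_valid in validators.items():
        if not is_valid(values.get(name, "")):
            css[name] = "error"
    # Step 3: only write to the database when everything passed.
    saved = "error" not in css.values()
    if saved:
        save(values)
    # Step 4: the caller renders the page using the CSS classes.
    return css, saved

# Hypothetical validators, defined once and reused by every page.
validators = {
    "hostname": lambda v: v != "" and " " not in v,
    "ttl": lambda v: v.isdigit(),
}
stored = []  # stands in for the database write
css, saved = process_form({"hostname": "www", "ttl": "abc"}, validators, stored.append)
print(css, saved)
```

<p>Because the validators live in one dictionary, changing a rule means changing it in one place.</p>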
<p>Some validation takes place numerous times throughout the application, and, for some reason one of the &#8216;coders&#8217; decided that copy and paste of another function that used that same validation code was better than writing a function to do the validation.  Of course when that validation method needed to be changed, it needed to be changed in eight places.</p>
<p>So, what should have been somewhat simple has changed considerably:</p>
<p>* Evaluate each page<br />
* Redesign each page to make the process understandable<br />
* Adjust terminology to make it understandable to the application&#8217;s users<br />
* modify the database model<br />
* rewrite the form and validation</p>
<p>A process that should have been simple has turned into quite a bit more work than anticipated.  Basically, development boils down to looking at the page, figuring out what it should be, pushing the buttons to see what they do and rewriting from scratch.</p>
<p>TurboGears has added a considerable amount of efficiency to the process.  One page that dealt with editing a page of information was reduced from 117 lines of code to 12 lines of code.  Since TurboGears uses ToscaWidgets and Formencode, validation and form presentation are removed from the code, resulting in a controller that contains only the code that modifies the tables in the database with validated input.  Since Formencode already has 95% of the validators that are needed for this project, we can rest assured that someone else has done the work to make sure each field will be properly validated.  Other validation methods can be maintained and self-tested locally, but defined in such a manner that they are reused throughout the application rather than being cut and pasted into each model that is validating data.  In addition, bugs should be much less frequent as a result of a much-reduced codebase.</p>
<p>Due to the MVC framework and the libraries selected by the developers at TurboGears, I wouldn&#8217;t be surprised if the new codebase is 10%-15% the size of the existing application with greater functionality.  The code should be more maintainable as python enforces some structure which will increase readability.</p>
<p>While I am not a designer, even using ToscaWidgets and makeform, the interface is much more consistent.  Picking the right words, adding the appropriate help text to the fields and making sure things work as expected has resulted in a much cleaner, understandable interface.</p>
<p>While there are some aspects of ToscaWidgets that are a little too structured for some pages, our current strategy is to develop the pages using ToscaWidgets or makeform to make things as clear as possible making notes to overload the Widget class for our special forms at a later date.</p>
<p>While it hasn&#8217;t been a seamless transition, it did provide a good opportunity to rework the site and see a number of the problems that the application has had for a long time.</p>
]]></content:encoded>
			<wfw:commentRss>http://cd34.com/blog/infrastructure/user-interface-design/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Embedded indexing versus Client/Server</title>
		<link>http://cd34.com/blog/programming/embedded-indexing-versus-clientserver/</link>
		<comments>http://cd34.com/blog/programming/embedded-indexing-versus-clientserver/#comments</comments>
		<pubDate>Sat, 28 Mar 2009 06:27:27 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Programming Languages]]></category>
		<category><![CDATA[berkeleydb]]></category>
		<category><![CDATA[file_db]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[sqlite3]]></category>

		<guid isPermaLink="false">http://cd34.com/blog/?p=604</guid>
		<description><![CDATA[For a particular application, I require temporary persistent storage of some data.  That data consists of a key value and a payload.  That key value can be a dupe, which is what causes the problem. File_DB in perl handles duplicates and I can delete a key/value pair without too much difficulty.  However, file locking is [...]]]></description>
			<content:encoded><![CDATA[<p>For a particular application, I require temporary persistent storage of some data.  That data consists of a key value and a payload.  That key value can be a dupe, which is what causes the problem.</p>
<p>File_DB in perl handles duplicates and I can delete a key/value pair without too much difficulty.  However, file locking is not handled very well with File_DB which created concurrency issues with the threaded daemon.</p>
<p>Sqlite3 had no problem with duplicates, and could be compiled with the delete from/limit clause to easily handle duplicate keys.  Rather than recompile the packaged Sqlite3 in Debian, I made a slight modification to the code on my side so that I could do further testing.  Due to a few issues with threading and a potential issue with storing binary data and retrieving it in perl, I needed to reevaluate.</p>
<p>BerkeleyDB solves a few problems.  It supports concurrency, it supports proper file locking, but, a minor limitation is that duplicate keys are not handled well when you want to delete a key.  It&#8217;ll require a rewrite of some functionality to use BerkeleyDB, but, I believe that solution will provide the least potential for failures.</p>
<p>I could have used MySQL, which I am very comfortable with, but the storage of the data really only needs to be there for a few minutes in most cases, and the amount of data stored is 10-20K at most.  With MySQL&#8217;s client timeout, I couldn&#8217;t really guarantee everything would work every time without writing in considerable error checking.  While MySQL would handle everything perfectly, it was overkill for the task at hand.</p>
<p>I&#8217;m rewriting the File_DB methods to use BerkeleyDB and modifying the saved data slightly to work around the key delete issue.</p>
<p>It should work and should raise the reliability of this process from 99.2% to 99.9% which will be a considerable improvement.</p>
]]></content:encoded>
			<wfw:commentRss>http://cd34.com/blog/programming/embedded-indexing-versus-clientserver/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Professional MySQL Developer?</title>
		<link>http://cd34.com/blog/scalability/professional-mysql-developer/</link>
		<comments>http://cd34.com/blog/scalability/professional-mysql-developer/#comments</comments>
		<pubDate>Mon, 09 Mar 2009 15:54:56 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Programming Languages]]></category>
		<category><![CDATA[Scalability]]></category>
		<category><![CDATA[mysql]]></category>

		<guid isPermaLink="false">http://cd34.com/blog/?p=577</guid>
		<description><![CDATA[Once in a while I get to troubleshoot problems on machines specifically with MySQL.  Sometimes that task is difficult because the vendor written code has been Zend Encoded.  Perhaps in the name of performance, possibly to protect his code, or, to prevent me from seeing just how bad his code is.  Diagnosing MySQL issues forensically [...]]]></description>
			<content:encoded><![CDATA[<p>Once in a while I get to troubleshoot problems on machines specifically with MySQL.  Sometimes that task is difficult because the vendor written code has been Zend Encoded.  Perhaps in the name of performance, possibly to protect his code, or, to prevent me from seeing just how bad his code is.  Diagnosing MySQL issues forensically requires a few scripts, patience, and time.</p>
<p>This time, I modified /etc/mysql/my.cnf on two machines and added the following:</p>
<blockquote><p>log        = /var/log/mysql/mysql.log<br />
log-slow-queries= /var/log/mysql/mysql-slow.log<br />
log-queries-not-using-indexes</p></blockquote>
<p>This gave me three things.  First, a full log of all queries executed; I have written a small program to normalize the queries and sort them so I can focus on the queries that take 90% of the time.  Second, any query that takes more than 10 seconds, a threshold which can be altered by adjusting long_query_time.  Third, any query that doesn&#8217;t use an index.  Usually I will focus on the unindexed queries as those generally cause the problems.  Very few people understand the impact of table locks, and that usually is where most problems occur.</p>
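<p>The normalizing step can be as simple as replacing literals with a placeholder and counting what is left.  The sketch below is not the program I actually use, just a minimal Python illustration of the idea; the function name normalize and the sample log lines are made up.</p>

```python
import re
from collections import Counter

def normalize(query):
    """Collapse literals so queries differing only in values group together."""
    query = re.sub(r"'[^']*'", "?", query)   # quoted strings
    query = re.sub(r"\b\d+\b", "?", query)   # bare numbers
    return re.sub(r"\s+", " ", query).strip().lower()

# Made-up sample of what a general query log might contain.
log = [
    "select tradeId from trades2 where tradeDomain = 'noref';",
    "SELECT tradeId FROM trades2 WHERE tradeDomain = 'abc';",
    "select * from channel limit 5;",
]
counts = Counter(normalize(q) for q in log)
for query, seen in counts.most_common():
    print(seen, query)
```

<p>Sorting by count shows which query shapes are worth the effort of indexing first.</p>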
<p>The first example comes from an application which claims to be able to support 100,000 visitors per day.  Perhaps with enough hardware behind it, it can, but in its current state, on a quad core xeon, about 40,000 per day is all they can push.</p>
<p>Where do we start?  So few lines, so many errors.</p>
<blockquote><p>CREATE TABLE `iplog` (<br />
`ip` varchar(255) NOT NULL default '',<br />
`pid` bigint(20) NOT NULL default '0',<br />
`viewed` varchar(255) NOT NULL default '',<br />
`locked` tinyint(1) NOT NULL default '0',<br />
PRIMARY KEY  (`ip`,`pid`)<br />
) ENGINE=InnoDB DEFAULT CHARSET=latin1</p></blockquote>
<p>The above schema replicates functionality in the prior version of the software.  The old version of the system didn&#8217;t do things this way, so, I am unsure why the developer chose to alter his habits and break things in so many ways.</p>
<blockquote><p>ip varchar(255)</p></blockquote>
<p>An IPv4 address in dotted notation takes at most 15 characters.  char(15) would have been more efficient, taking 15 bytes to store a full-length IP address including the &#8216;.&#8217; separators.  Even varchar(15) would have taken at most 16 bytes to store the same address.  But varchar(255) just takes the cake.  The proper way to store an IPv4 address is unsigned int.  Even barring the fact that varchar(15) or char(15) would have been more space efficient, using an unsigned int allows the B-Tree index to be a lot more balanced.</p>
<p>Then, let&#8217;s add an index to the IP field.  Knowing that indexes space pad keys to their full length, the MySQL index file on the disk is actually larger than the data file itself.</p>
<p>I might object to the use of InnoDB on a table with questionable value.  The overhead for InnoDB on data that is valuable only for a few minutes seems like overkill, but, we&#8217;ll give some credit for data integrity.</p>
<p>That really wasn&#8217;t the worst offender.  Another application written by another professional programmer included roughly 80 queries to display one page.  Once you dig deeper into the code, you find queries like this:</p>
<blockquote><p># Query_time: 0  Lock_time: 0  Rows_sent: 1  Rows_examined: 984<br />
SELECT count(VID) AS total_videos FROM video WHERE type='public' AND ( keyword LIKE '%keyword%'  OR keyword LIKE '%keyword%' OR keyword LIKE '%keyword%' OR keyword LIKE '%keyword%');</p></blockquote>
<p>First, VID is auto_increment, not null.  count(VID) asks MySQL to count only the rows where VID is not null, so for every row the WHERE clause matches, that column has to be considered.  Due to the table construction, it cannot be null, so SELECT count(*) would have returned the same number without the extra check.</p>
<p>Second, they split the input into words and wrapped each one in % so that they could do floating matches, rather than using a fulltext index, which would have been much faster.  A LIKE pattern with a leading % cannot use an ordinary index, so each of these conditions has to be checked against every row.  The table only has 984 rows now, but once they start growing the site, they&#8217;ll start to see more performance issues.  A fulltext index would allow for quite a bit more growth.</p>
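<p>The difference between count(column) and count(*) is standard SQL behavior, so it can be sketched with Python&#8217;s built-in sqlite3 module standing in for MySQL (an assumption made only so the example is self-contained).  The nullable title column is hypothetical; it is there to show what count(column) does when NULLs actually exist.</p>
<pre><code class="language-python">import sqlite3


def count_variants():
    """Return (count(*), count(VID), count(title)) for a tiny sample table."""
    conn = sqlite3.connect(":memory:")
    # VID mirrors the post: auto-incrementing and never NULL.
    # title is a hypothetical nullable column added for contrast.
    conn.execute(
        "CREATE TABLE video (VID INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO video (title) VALUES (?)",
        [("first",), (None,), ("third",)],
    )
    row = conn.execute(
        "SELECT count(*), count(VID), count(title) FROM video"
    ).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(count_variants())
</code></pre>
<p>On a column that can never be NULL, the two counts agree, so count(*) states the intent more directly.</p>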
<p>While the following query isn&#8217;t optimizable, the results could be cached.</p>
<blockquote><p># Query_time: 0  Lock_time: 0  Rows_sent: 5  Rows_examined: 155<br />
SELECT * from channel ORDER BY rand() LIMIT 5;</p></blockquote>
<p>As a result of the ORDER BY rand(), MySQL will need to do a full table scan each and every time this query is called.  In this case, twice for each pageload.</p>
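<p>One way the caching could look, sketched in Python: pick the five random rows once and reuse them until a time-to-live expires, so the table is read once per interval instead of twice per page load.  Everything here is hypothetical (the CachedRandomPick class, the fetch_all callable that would run something like SELECT * FROM channel, the 60 second lifetime); the application in question is not written this way.</p>
<pre><code class="language-python">import random
import time


class CachedRandomPick:
    """Serve the same random sample of rows until ttl_seconds have passed."""

    def __init__(self, fetch_all, count=5, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch_all = fetch_all  # callable returning every row
        self._count = count
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so time can be faked in tests
        self._picked = None
        self._picked_at = 0.0

    def get(self):
        now = self._clock()
        expired = self._picked is None or now - self._picked_at >= self._ttl
        if expired:
            # Only here does the full table get read.
            rows = list(self._fetch_all())
            self._picked = random.sample(rows, min(self._count, len(rows)))
            self._picked_at = now
        return self._picked
</code></pre>
<p>The trade-off is that every visitor sees the same five channels until the cache expires, which is usually acceptable for a &#8220;random&#8221; sidebar.</p>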
<p>Another particularly interesting query:</p>
<blockquote><p>select tradeId from trades2 where tradeDomain = 'noref';</p></blockquote>
<p>Taken on its own, you can see where tradeDomain may have been omitted in the indexes.  However, the table name gives you some indication of the mess.  Here&#8217;s another query that ran without an index:</p>
<blockquote><p>select tradeId from trades28 where tradeDomain = 'noref';</p></blockquote>
<p>There are a total of 50 trades tables.  This corresponds to the number of &#8216;sites&#8217; in the sites table.  It seems our developer couldn&#8217;t quite figure out how to index on two fields.</p>
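<p>For comparison, here is a minimal sketch of the single-table alternative: one trades table with a site column and a two-column index.  It uses Python&#8217;s built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the siteId column and the index name are assumptions, since the real schema isn&#8217;t shown here.</p>
<pre><code class="language-python">import sqlite3


def plan_for_site_lookup():
    """Return the query-plan text for a lookup by site and trade domain."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trades ("
        " tradeId INTEGER PRIMARY KEY,"
        " siteId INTEGER NOT NULL,"  # hypothetical: replaces the per-site tables
        " tradeDomain TEXT NOT NULL)"
    )
    # One index covering both fields the lookup filters on.
    conn.execute(
        "CREATE INDEX idx_trades_site_domain ON trades (siteId, tradeDomain)"
    )
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT tradeId FROM trades WHERE siteId = ? AND tradeDomain = ?",
        (28, "noref"),
    ).fetchall()
    conn.close()
    # The last column of each plan row is a human-readable description.
    return " ".join(row[-1] for row in rows)


if __name__ == "__main__":
    print(plan_for_site_lookup())
</code></pre>
<p>The exact wording of the plan varies between SQLite versions, but it should name idx_trades_site_domain rather than report a full table scan.  In MySQL, a CREATE INDEX statement over the same two columns would play the same role.</p>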
<p>In the last 4 hours, the unindexed queries from this one site have resulted in the following:</p>
<blockquote><p>-rw-r-----  1 mysql adm  10869491 2009-03-09 10:53 mysql-slow.log</p></blockquote>
<p>That is roughly 54000 queries in 4 hours that ran without an index.  Compared to yesterday&#8217;s logs when I added a number of indexes, it is a world of difference.</p>
<p>However, fixing the issues in the code is still more expensive than throwing hardware at the problem.  The problem you run into with using hardware as a crutch is that eventually you reach an impasse where the problem must be fixed, or, you face other bottlenecks.</p>
<p>After a few changes that amounted to adding indexes, the load on those two machines has dropped considerably.  Each of the vendors that wrote the systems I analyzed claims to handle far more traffic than either of the machines is currently handling.  The claims were written on their web sites roughly 3 years ago, and the machines the software is currently running on are much newer and much more powerful than the ones in use when the software was installed.</p>
<p>One of the software vendors is out of business due to too many support issues.  One software vendor continues churning out bad code and getting people to pay $600 for it.  The third developer disappeared and was replaced by a team of two programmers who appear to know what they are doing and are releasing version 5.0 (the current, latest version is 3.1) &#8216;soon&#8217;.  I know it will be soon because their web site, last updated in July 2008, says they are in limited beta.</p>
<p>I guess it is time to order some hardware.</p>
]]></content:encoded>
			<wfw:commentRss>http://cd34.com/blog/scalability/professional-mysql-developer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
