30 January 2008

What Do You Name Date Fields?

I have seen too many date fields named after reserved keywords, so I need to make an entry for this... It's like naming your kid "Kid". Wait a sec... that's happened many times before as well.

I will make a "Proper Naming Conventions" page for MySQL after checking a few more sources first, but this date naming issue cannot wait another day.


First Rule: Do not use a reserved word for your field name!
Second Rule: Your date-related field name can actually include the word date or datetime in it!
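To make the two rules concrete, here is a quick sketch (table and column names are made up for illustration):

```sql
-- Risky: leaning on a keyword like `date` for a column name.
-- Even where MySQL tolerates it, you end up backtick-quoting it
-- everywhere, and the name says nothing about WHICH date it is.
CREATE TABLE orders_bad (
  `date` DATETIME
);

-- Better: descriptive names that happen to include the word "date".
CREATE TABLE orders (
  order_date   DATETIME,
  ship_date    DATETIME,
  created_date DATETIME
);
```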


Yes, this was a short, simple post. But freekin' necessary.

Until next time,

23 January 2008

Sun + MySQL: One Week Later

I really like the 21st century open source technology community. 20 years ago, if one company bought out another, everyone would have their own island of opinions and responses, and you would not know if you were on track, or even if the publicized reports were.

Today, it is amazing to see the responses around the boards and blogs, which seem to keep the mainstream reports in check most of the time (including the picture on the left). I must admit I was checking to see what the "dirt" level of the buyout would end up being, but all in all, this is something most people seem to be confident about: this is going to help MySQL and Sun in the long run. MySQL will likely get a revamped service offering (the price gap between the community and enterprise versions is still sizable for small companies), and Sun can integrate MySQL into their OSS stack, giving Solaris a bit of a facelift.

But where the rubber will meet the road is how Sun identifies MySQL's existing strengths and runs with them. Continue to improve InnoDB and MyISAM; I'll give a grace period for Falcon, as that appears to have promise, and third-party engines like Brighthouse are fine too, as those are not financed by Sun.

Another place where I look forward to Sun's involvement and direction with MySQL is replication and clustering. Linux heartbeat is a great utility we use now, and I see hope in the NDB disk-based cluster once we have 5.1 in production.

I am also looking forward to a comprehensive binary backup utility (beyond MyISAM) in MySQL; in my experience, mysqldump is trumped by rsync. Currently we end up doing both rsync and mysqldump on our system, and our recovery works almost like Oracle's Recovery Manager (...and instead of depending on rman to produce the best backup recovery procedure, I use a few of my brain cells).

After reviewing the posts generated through PlanetMySql.com, tech articles written by respected journalists, and a newly released report from Gartner, the mood is very positive for the future of MySQL.

Until Next Time,

15 January 2008

An exercise in SELF JOINing

There was only so much I could do for my 1976 Mustang. I did not have money to put an Edelbrock head in, so I spent what little I had on a new stereo, a tinted back window, a new muffler that raised the neighborhood noise level a notch, and a couple of other random parts. In the end, it really did not go any faster, and I was $650 poorer.

I liken my experience with SELF JOIN to my 1976 Mustang. Sure, it was fun to use and figure out how I could make it work in my application, but I had a feeling that it might not pull ahead when the light turned green in production.

The Problem
  1. Records are inserted (using INSERT IGNORE) into a table based on DISTINCT first name, last name, address1, address2, city, state, zip, and email.
  2. We needed to select out records that had a distinct address1, address2, city, state and zip, picking up the most recent first name and last name within specific zip code areas (the first 3 digits of the zip code).
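Point 1 implies a unique key spanning those columns, so INSERT IGNORE can silently drop exact duplicates. A sketch of what that looks like (column sizes and the zcf column for the 3-digit zip prefix are my assumptions):

```sql
-- Dedup at insert time: the unique key over the identifying columns
-- makes INSERT IGNORE skip rows that match an existing record exactly.
CREATE TABLE main_table (
  first_name  VARCHAR(50),
  last_name   VARCHAR(50),
  address1    VARCHAR(80),
  address2    VARCHAR(80),
  city        VARCHAR(50),
  state       CHAR(2),
  zip         VARCHAR(10),
  zcf         CHAR(3),      -- first 3 digits of the zip code
  email       VARCHAR(100),
  signup_date DATETIME,
  UNIQUE KEY uniq_record (first_name, last_name, address1, address2,
                          city, state, zip, email)
);

INSERT IGNORE INTO main_table
  (first_name, last_name, address1, address2, city, state, zip, zcf,
   email, signup_date)
VALUES
  ('Jane', 'Doe', '1 Main St', '', 'Anytown', 'CA', '12345', '123',
   'jane@example.com', NOW());
```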
The SELF JOIN Solution

I had an "aha" solution, thinking that I could self join the distinct address from the same table with the most recent first name and last name. So off I went in my quest...
SELECT a.first_name, b.first_name, a.address1, a.address2, a.city, a.state, a.zip
FROM main_table a
INNER JOIN main_table b
  ON a.address1 = b.address1
  AND a.zcf IN (123, 456, 789)
GROUP BY a.first_name, a.last_name, a.address1, a.address2, a.city, a.state, a.zip
HAVING a.signup_date = MAX(b.signup_date)
LIMIT 10000;

Well, I was very pleased that it worked so well and quickly in my dev environment of a few thousand rows. I was ready to tell the world that SELF JOINs are fun and profitable.

But then I did an EXPLAIN on the statement and this came up:
Extra: Using temporary; Using filesort
My dev table was only 10,000 rows, but my test environment had several million. Not good!

My Table Hack

Maybe someone could help me improve my SELF JOIN skills, but given the business requirement that I would be picking out zipcode-based data sets on a daily basis, I did not want to wait every time one of these requests ran.
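For the record, the textbook way to grab the most recent row per address is a self-exclusion join. Column names here mirror my table, but treat this as a sketch, not something I have tested at scale:

```sql
-- Classic "groupwise maximum" pattern: keep row m only if NO newer row
-- exists for the same full address. Still a self join, so it can be
-- just as heavy, and ties on signup_date would need a tie-breaker.
SELECT m.first_name, m.last_name,
       m.address1, m.address2, m.city, m.state, m.zip
FROM main_table m
LEFT JOIN main_table newer
  ON  newer.address1 = m.address1
  AND newer.address2 = m.address2
  AND newer.city     = m.city
  AND newer.state    = m.state
  AND newer.zip      = m.zip
  AND newer.signup_date > m.signup_date
WHERE newer.address1 IS NULL
  AND m.zcf IN (123, 456, 789)
LIMIT 10000;
```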

So... I created a table that holds the distinct address information with the most recent first name / last name, and a trigger that populates this table based on those very criteria. It took a little over two hours to insert several million rows, but now I have the joy of performing this statement instead:
SELECT fname, lname, address1, address2, city, state, zip FROM summary_table WHERE zcf IN (123,456,789) LIMIT 10000;

Using the EXPLAIN statement produced the following predictable result:
Extra: Using where; Using index
It is another worker bee script that I don't have to worry about on a daily basis. It produces results quickly, and everyone gets what they need. For fun, I did try the SELF JOIN on my test table and it took longer than a day to complete. Oh well...
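For anyone curious, the table hack boils down to something like this. The names, column sizes, and trigger logic are my reconstruction of the idea, not the exact DDL from our system:

```sql
-- Summary table: one row per distinct address, carrying the most
-- recent first/last name seen for that address.
CREATE TABLE summary_table (
  fname       VARCHAR(50),
  lname       VARCHAR(50),
  address1    VARCHAR(80),
  address2    VARCHAR(80),
  city        VARCHAR(50),
  state       CHAR(2),
  zip         VARCHAR(10),
  zcf         CHAR(3),
  signup_date DATETIME,
  PRIMARY KEY (address1, address2, city, state, zip),
  KEY idx_zcf (zcf)
);

DELIMITER //
CREATE TRIGGER trg_summary AFTER INSERT ON main_table
FOR EACH ROW
BEGIN
  -- Insert the address, or overwrite the name when this signup is newer.
  INSERT INTO summary_table
    (fname, lname, address1, address2, city, state, zip, zcf, signup_date)
  VALUES
    (NEW.first_name, NEW.last_name, NEW.address1, NEW.address2,
     NEW.city, NEW.state, NEW.zip, NEW.zcf, NEW.signup_date)
  ON DUPLICATE KEY UPDATE
    fname = IF(VALUES(signup_date) >= signup_date, VALUES(fname), fname),
    lname = IF(VALUES(signup_date) >= signup_date, VALUES(lname), lname),
    signup_date = GREATEST(signup_date, VALUES(signup_date));
END//
DELIMITER ;
```

The ON DUPLICATE KEY UPDATE clause is what keeps the name columns pointed at the latest signup without a second pass over the table.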

I now drive the family SUV, with no modifications, all standard equipment. And I know it would smoke my old 1976 Mustang off the line anytime.

Til next time,

07 January 2008

How to Send a Report on Most Common Domain Names without Grinding Your DB

This is a continuation of my previous entry on reversing an email address to search for a domain. Minutes after I ran the following query,
SELECT COUNT(1) FROM rv_emails WHERE rv_email LIKE REVERSE('%msn.com');
I was asked to do a count of the top 50 domains in our table of email addresses.

So, I simply extended my previous approach by performing the following steps:

1) Add a 'domain' column to my table:
ALTER TABLE emails ADD COLUMN domain VARCHAR(255);
2) Populate the domain column:
UPDATE emails SET domain = RIGHT(email, LENGTH(email) - INSTR(email,'@'));
3) Select the top 50 domains by doing the following:
SELECT domain, COUNT(1) FROM emails GROUP BY domain ORDER BY COUNT(1) DESC LIMIT 50;
There are a couple of items to note here:
  • No need to use the word DISTINCT as this is implied by GROUP BY;
  • ORDER BY COUNT(1) DESC does add "Using temporary; Using filesort" to the EXPLAIN output, but on a 5-million-row table this took 7.78 sec. Another way is to add "HAVING COUNT(1) > 5000" (or some other arbitrary cutoff) and write those rows to a temp table for sorting the final results
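The temp-table variation in that second bullet would look roughly like this (the 5000 cutoff is arbitrary, as noted, and the temp table name is just for illustration):

```sql
-- Pre-aggregate with a HAVING cutoff so the final ORDER BY
-- only has to sort the handful of domains that survive.
CREATE TEMPORARY TABLE tmp_domains
  SELECT domain, COUNT(1) AS cnt
  FROM emails
  GROUP BY domain
  HAVING cnt > 5000;

SELECT domain, cnt
FROM tmp_domains
ORDER BY cnt DESC
LIMIT 50;
```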
This solution turned out to be very good for our reporting purposes, so I now have a trigger that updates the domain column every time an email address is inserted. I presume a better way would be to have PHP calculate the domain name at insert time, but we have dozens of pages to go through, and a programmer might unknowingly skip this step on a page developed in the future. So, I think I will be keeping the trigger in place forever.
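The trigger amounts to something like this (trigger name assumed; the expression is the same RIGHT/LENGTH/INSTR math as the UPDATE in step 2):

```sql
DELIMITER //
CREATE TRIGGER trg_emails_domain BEFORE INSERT ON emails
FOR EACH ROW
BEGIN
  -- Derive the domain from everything after the '@' sign.
  SET NEW.domain = RIGHT(NEW.email, LENGTH(NEW.email) - INSTR(NEW.email, '@'));
END//
DELIMITER ;
```

Using a BEFORE INSERT trigger means the row lands with its domain already set, so no page of PHP ever has to remember to do it.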

Until next time,