15 January 2008

An exercise in SELF JOINing

There was only so much I could do for my 1976 Mustang. I did not have the money to put an Edelbrock head in, so I spent what little I had on a new stereo, a tinted back window, a new muffler that increased the neighborhood noise level a notch, and a couple of other random parts. In the end, it really did not go any faster, and I was $650 poorer.

I liken my experience with SELF JOIN to my 1976 Mustang. Sure, it was fun to use and figure out how I could make it work in my application, but I had a feeling that it might not pull ahead when the light turned green in production.

The Problem
  1. Records are inserted (using INSERT IGNORE) into a table based on DISTINCT first name, last name, address1, address2, city, state, zip, and email.
  2. We needed to select records with a distinct address1, address2, city, state, and zip, picking up the most recent first name and last name, within specific zip code areas (identified by the first three digits of the zip code).
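
To make the two requirements above concrete, here is a minimal sketch of what such a table could look like. The column names first_name, last_name, address1, address2, city, state, zip, zcf, and signup_date come from the queries in this post; the column types and sizes, the id and email columns, the key name, and the sample row are all assumptions made for illustration.

```sql
-- Hypothetical schema: column names follow the queries in this post,
-- but the types, sizes, and key names are assumptions.
CREATE TABLE main_table (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  first_name  VARCHAR(50)  NOT NULL,
  last_name   VARCHAR(50)  NOT NULL,
  address1    VARCHAR(100) NOT NULL,
  address2    VARCHAR(100) NOT NULL DEFAULT '',  -- NOT NULL so the unique key can dedupe
  city        VARCHAR(50)  NOT NULL,
  state       CHAR(2)      NOT NULL,
  zip         VARCHAR(10)  NOT NULL,
  zcf         SMALLINT UNSIGNED NOT NULL,        -- first three digits of zip
  email       VARCHAR(100) NOT NULL,
  signup_date DATETIME     NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_person_address (first_name, last_name, address1, address2,
                                city, state, zip, email)
);

-- With the unique key in place, a row that duplicates an existing one
-- is skipped instead of raising a duplicate-key error.
INSERT IGNORE INTO main_table
  (first_name, last_name, address1, address2, city, state, zip, zcf, email, signup_date)
VALUES
  ('Pat', 'Example', '1 Main St', '', 'Springfield', 'IL', '12345', 123,
   'pat@example.com', NOW());
```

The unique key is what gives INSERT IGNORE something to be "distinct" against. Columns that allow NULL would weaken it, since MySQL treats NULLs in a unique index as never equal to each other.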

The SELF JOIN Solution

I had an "aha" solution, thinking that I could self join the distinct address from the same table with the most recent first name and last name. So off I went in my quest...
SELECT a.first_name, b.first_name, a.address1, a.address2, a.city, a.state, a.zip
FROM main_table a
INNER JOIN main_table b
  ON a.address1 = b.address1 AND a.zcf IN (123, 456, 789)
GROUP BY a.first_name, a.last_name, a.address1, a.address2, a.city, a.state, a.zip
HAVING a.signup_date = MAX(b.signup_date)
LIMIT 10000;
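
For comparison, a common way to write this kind of "latest row per group" lookup is to compute the newest signup_date per address in a derived table and join back to it on the full address. The sketch below is an alternative under the same column names, not the query that was actually run; the aliases m and latest and the column alias latest_signup are made up for illustration.

```sql
-- Alternative sketch: find the newest signup_date per distinct address,
-- then join back to pick up the name stored on that newest row.
SELECT m.first_name, m.last_name,
       m.address1, m.address2, m.city, m.state, m.zip
FROM main_table AS m
INNER JOIN (
    SELECT address1, address2, city, state, zip,
           MAX(signup_date) AS latest_signup
    FROM main_table
    WHERE zcf IN (123, 456, 789)
    GROUP BY address1, address2, city, state, zip
) AS latest
   ON  m.address1    = latest.address1
   AND m.address2    = latest.address2
   AND m.city        = latest.city
   AND m.state       = latest.state
   AND m.zip         = latest.zip
   AND m.signup_date = latest.latest_signup
LIMIT 10000;
```

Two caveats: if two rows for the same address share the same signup_date, both come back, and the GROUP BY in the derived table can still need a temporary table on a large data set, so this form is not guaranteed to be any faster.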

Results

Well, I was very pleased that it worked so well and quickly in my dev environment of a few thousand rows. I was ready to tell the world that SELF JOINs are fun and profitable.

But then I did an EXPLAIN on the statement and this came up:
Extra: Using temporary; Using filesort
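My dev table was only 10,000 rows, but my test environment had several million. Not good! "Using temporary" means MySQL has to build a temporary table to hold the result, and "Using filesort" means it needs an extra pass to sort the rows. Both get more expensive as the table grows.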

My Table Hack

Maybe someone could help me improve my SELF JOIN skills, but given the business requirement that I would be picking out zip-code-based data sets on a daily basis, I did not want to wait every time one of these requests ran.

So... I created a table that holds each distinct address with the most recent first name / last name, and a trigger that populates this table based on these very criteria. It took a little over two hours to insert several million rows, but now I can have the joy of performing this statement instead:
SELECT fname, lname, address1, address2, city, state, zip FROM main_table WHERE zcf IN (123,456,789) LIMIT 10000;
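
The post does not show the summary table or the trigger, so the following is only a sketch of how they could be built. The table name address_latest, the trigger name, and the column types are assumptions; fname and lname match the query above, and the NEW.* columns match main_table as used earlier.

```sql
-- Hypothetical summary table: one row per distinct address,
-- carrying the most recent name seen for that address.
CREATE TABLE address_latest (
  address1    VARCHAR(100) NOT NULL,
  address2    VARCHAR(100) NOT NULL DEFAULT '',
  city        VARCHAR(50)  NOT NULL,
  state       CHAR(2)      NOT NULL,
  zip         VARCHAR(10)  NOT NULL,
  zcf         SMALLINT UNSIGNED NOT NULL,
  fname       VARCHAR(50)  NOT NULL,
  lname       VARCHAR(50)  NOT NULL,
  signup_date DATETIME     NOT NULL,
  PRIMARY KEY (address1, address2, city, state, zip)
);

-- Hypothetical trigger: every new row in main_table either creates the
-- address row or, if it is at least as recent, overwrites the stored name.
-- signup_date is assigned last so the IF() tests still see the old value.
CREATE TRIGGER main_table_after_insert
AFTER INSERT ON main_table
FOR EACH ROW
  INSERT INTO address_latest
    (address1, address2, city, state, zip, zcf, fname, lname, signup_date)
  VALUES
    (NEW.address1, NEW.address2, NEW.city, NEW.state, NEW.zip, NEW.zcf,
     NEW.first_name, NEW.last_name, NEW.signup_date)
  ON DUPLICATE KEY UPDATE
    fname       = IF(VALUES(signup_date) >= signup_date, VALUES(fname), fname),
    lname       = IF(VALUES(signup_date) >= signup_date, VALUES(lname), lname),
    signup_date = GREATEST(signup_date, VALUES(signup_date));
```

The trigger only fires for rows that are actually inserted, so duplicates skipped by INSERT IGNORE never touch the summary table. The same upsert, written as INSERT ... SELECT ... ON DUPLICATE KEY UPDATE, could be used for the initial backfill from main_table. Newer MySQL releases deprecate VALUES() in this position in favor of a row alias, so treat the syntax as era-appropriate rather than current.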


Using the EXPLAIN statement produced the following predictable result:
Extra: Using where; Using index
It is another worker bee script that I don't have to worry about on a daily basis. It produces results quickly, and everyone gets what they need. For fun, I did try the SELF JOIN on my test table, and it took longer than a day to complete. Oh well...
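
In EXPLAIN output, "Using index" means the query was answered from an index alone, without reading the table rows. The post does not say which index made that possible, so the statement below is only one guess at it: a covering index that leads with zcf and includes every selected column. The table name address_latest and the index name are hypothetical.

```sql
-- Hypothetical covering index: zcf first for the WHERE clause,
-- followed by every column the SELECT reads.
ALTER TABLE address_latest
  ADD INDEX idx_zcf_covering
    (zcf, fname, lname, address1, address2, city, state, zip);

-- Re-check the plan after adding the index.
EXPLAIN
SELECT fname, lname, address1, address2, city, state, zip
FROM address_latest
WHERE zcf IN (123, 456, 789)
LIMIT 10000;
```

An index this wide can run into key-length limits depending on the storage engine and character set, so the column sizes may need trimming before it will build.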

I now drive the family SUV, with no modifications, all standard equipment. And I know it would smoke my old 1976 Mustang off the line anytime.

Til next time,
David
