07 January 2008

How to Send a Report on Most Common Domain Names without Grinding Your DB

This is a continuance from my previous entry on reversing an email address to search for a domain. Minutes after I ran the following query,
SELECT COUNT(1) FROM rv_emails WHERE rv_email LIKE REVERSE('%msn.com');
I was asked to do a count of the top 50 domains in our table of email addresses.

So, I simply changed my previous route by performing the following steps:

1) Add a 'domain' column to my table:
ALTER emails ADD COLUMN domain CHAR(30) NOT NULL;
2) Populate the domain column:
UPDATE emails SET DOMAIN = RIGHT(email, LENGTH(email) - INSTR(email,'@'));
3) Select the top 50 domains by doing the following:
SELECT domain, COUNT(1) FROM emails GROUP BY domain ORDER BY COUNT(1) DESC LIMIT 50;
There are a couple of items to note here:
  • No need to use the word DISTINCT as this is implied by GROUP BY;
  • ORDER BY COUNT(1) DESC does add "Using temporary; Using filesort" to the EXPLAIN statement, but on a 5 million record table, this took 7.78 sec. Another way to do this is to add "HAVING COUNT(1) > 5000" or some other arbitrary number and write these to a temp table for sorting the final results
This solution turned out to be very good for our reporting purposes, so I now have a trigger that updates the domain column every time an email address is inserted. I would presume a better way is to have php calculate the domain name at the time of the insert, but we have dozens of pages to go through, and a programmer might unknowingly skip this step on a page that would be developed in the future. So, I think I will be keeping the trigger in place forever.

Until next time,
David

No comments: