19 November 2008

My Top 10 List For Developers

An excellent question was posted in a MySQL group on LinkedIn the other day, and I had to share it here.

What is your "Top 10 List for Programmers Working with MySQL"?

I started mine... Let me know your thoughts on my list, and feel free to add other items. I interface with programmers on a daily basis, and our experience equips us to help them in certain ways, but I think a "Top 10" list will enable us to be vigilant in our day-to-day development interactions.

Oh yes - I have 14 items in my "Top 10 List", and I have no problem adding a few more... :)

Drum roll please... [cue David Letterman]

  1. DO know your tools, and do your testing on your own LAMP environment at your desk;
  2. DO document;
  3. DO ensure your queries use indexes and use them properly (there *is* a difference);
  4. DO understand the EXPLAIN statement - its benefits and shortcomings alike;
  5. DON'T use comma joins (you must explicitly help MySQL with the execution plan on joins);
  6. DO normalize your tables for transaction processing; (I like 3rd normal, but it depends on the app);
  7. DO denormalize your tables for analytical processing; (the key is how much you denormalize!)
  8. DO NOT create indexes on columns with few unique values (like gender);
  9. DON'T open large recordsets (in a dev environment, this can be easily overlooked);
  10. DO always test with large datasets (that's why we have Dev, Test, and Prod servers!);
  11. DO always set numeric columns to NOT NULL unless NULL is absolutely required;
  12. DON'T use cursors (or better yet, skip stored procedures altogether and use an API based on php, perl, C, or python);
  13. DO understand the benefits and limitations of database abstraction and when to use it;
  14. DON'T oversize your VARCHAR and CHAR columns, and don't use the TEXT type when other types will do.
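To make item 3 concrete, here is a small sketch using Python's built-in sqlite3 as a stand-in for MySQL (table and column names are invented); it shows how adding an index changes the query plan, which is the same check you would do in MySQL with EXPLAIN:

```python
import sqlite3

# A toy table to show the difference an index makes. SQLite's
# EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
cur.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                [(i % 100, i * 1.5) for i in range(1000)])

query = "SELECT * FROM orders WHERE customer_id = 42"

# Without an index, the optimizer has no choice but a full table scan.
plan_before = cur.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]

cur.execute("CREATE INDEX idx_customer ON orders (customer_id)")
plan_after = cur.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]

print(plan_before)  # a SCAN of the whole table
print(plan_after)   # a SEARCH using idx_customer
```

The point of item 3 is to run this kind of check on every query you ship, not just the slow ones.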

15 October 2008

KickFire is Back

After receiving an email about talking with Robert David, Director of Sales at KickFire, I checked out recent news on KickFire, as there had been little written about this company aside from the big splash they made at the MySQL Users Conference back in April 2008.

And, lo and behold, there was a piece of news, posted on October 14 - Kickfire Enters Into MySQL Enterprise Agreement With Sun Microsystems. It does not seem to have garnered any attention from the MySQL community yet, and the only source that picked up on it was "The Financial". Interestingly, the tagline of "The Financial" is "Open Source Information", which does not mean the same thing to those of us working with open source tools. I don't know who is doing the PR for KickFire, but there might have been a skipped number on the speed dial somewhere.

This obviously was not as exciting as a headline like "Kickfire Machine Now in Production, Already Saving Dozens of Companies Thousands of Dollars". Or "Extended Data Warehouse Testing Beyond TPC-H headquarters Proves the Robustness of the First MySQL Data Appliance". Perhaps a review from Percona, Pythian, The 451 Group, or others listed along the panel of PlanetMySQL.org could add some fury to the data warehouse fire that is burning in the news lately.

Anyway, tomorrow I will find out the latest news on this; after all, since April 2008, it does seem like a race of turtles between MySQL 5.1, the KickFire Appliance, and the introduction of an affordable electric car with a decent range. I understand that product quality takes precedence over time to market, so I am willing to deal with that. But after the last 2+ years, I have learned not to hold my breath.

10 October 2008

Fixing InnoDB IMPORT TABLESPACE Error: ERROR 1030 (HY000): Got error -1 from storage engine

We use one InnoDB file per table on our database, set with the innodb_file_per_table option.

This allows me to transport InnoDB files portably at the file-system level with minimal hassle, and it is much faster than mysqldump, as these tables are several GB each.
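For reference, this is the one-line my.cnf entry (it only affects tables created after the option is enabled):

```ini
[mysqld]
innodb_file_per_table
```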

When transporting an .ibd file from one server to another, I ran into the following error:
ERROR 1030 (HY000): Got error -1 from storage engine
How did I get to this point? I did the following:
1) created a dummy table in the database;
2) discarded its tablespace;
3) moved the .ibd file into the database folder on the system;
4) imported the tablespace back into the table

This is how these actions were accomplished:
mysql> CREATE TABLE new_innodb_table (value1 INT UNSIGNED NOT NULL, value2 CHAR(5) NOT NULL);

mysql> ALTER TABLE new_innodb_table DISCARD TABLESPACE;

[sys_prompt]# mv new_innodb_table.ibd /var/lib/mysql/db_name/

mysql> ALTER TABLE new_innodb_table IMPORT TABLESPACE;
ERROR 1030 (HY000): Got error -1 from storage engine
The next thing you do when you see an error in your mysql client is check your error log immediately. If this log is not enabled, please stop reading this post and go enable it first.

Here is what the log recorded:
081009 14:21:02 InnoDB: Error: tablespace id in file './db_name/new_innodb_table.ibd' is 26, but in the InnoDB
InnoDB: data dictionary it is 28.
InnoDB: Have you moved InnoDB .ibd files around without using the
InnoDB: commands DISCARD TABLESPACE and IMPORT TABLESPACE?
InnoDB: Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.0/en/innodb-troubleshooting.html
InnoDB: for how to resolve the issue.
081009 18:21:02 InnoDB: cannot find or open in the database directory the .ibd file of
InnoDB: table `stbHealth/sub_stb_health_associate`

From the error log we see that the error concerned mismatched tablespace ids.

If the tablespace id in the data dictionary had also been 26, the import would have succeeded. Likewise, if the tablespace id in the data dictionary had been 25, all I would need to do is create a dummy InnoDB table, thus incrementing the data dictionary's tablespace id to 26, and then try importing the tablespace again. Tricky, but it works (especially if that number is several steps lower and you create a series of dummy InnoDB tables).

However, the easier option is this: restart mysql, then repeat the same four steps listed near the beginning of the post. This time, the tablespace id in the data dictionary matched the one in the file, and importing the tablespace succeeded.

This can give you greater confidence in dealing with some of the InnoDB "gotchas" during the recovery process or even file transfers. It did for me!

04 September 2008

Case Sensitive Fields

Yes, it has been too long since I last posted on this blog (three months); my apologies. Let's move forward.

Too many interesting MySQL bits and pieces still cross my desk for me not to post these fantastic learning opportunities to my blog. Obviously, you can read about all things interesting on PlanetMySQL.org, but I hope this blog will help people sift through specific issues that pertain directly to them while working with MySQL.

Oh yes... I am now located in Denver and I work for a big startup that uses MySQL for data mining. Most excellent!

Querying Case Sensitive Fields

Issue: users need to be able to query for case sensitive field values. For example, 700g is not the same as 700G.

Answer: COLLATE latin1_general_cs

Example in action:
mysql> CREATE TABLE test (non_case_sensitive char(1) NOT NULL, case_sensitive char(1) COLLATE latin1_general_cs) ENGINE=MyISAM charset=latin1;

mysql> insert into test values ('x','x');
mysql> insert into test values ('X','X');
mysql> insert into test values ('y','Y');
mysql> insert into test values ('Y','y');
mysql> select non_case_sensitive, count(*) from test group by non_case_sensitive;
+--------------------+----------+
| non_case_sensitive | count(*) |
+--------------------+----------+
| x                  |        2 |
| y                  |        2 |
+--------------------+----------+
2 rows in set (0.00 sec)

mysql> select case_sensitive, count(*) from test group by case_sensitive;
+----------------+----------+
| case_sensitive | count(*) |
+----------------+----------+
| X              |        1 |
| x              |        1 |
| Y              |        1 |
| y              |        1 |
+----------------+----------+
4 rows in set (0.00 sec)


Performance Considerations

There does not appear to be much of a performance hit from this change. On a 4.5-million-row table, there was no measurable difference; in fact, many of our comparison queries (returning the same number of results) ran slightly faster on the case sensitive table than on the case insensitive one.
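If you want to play with the idea without a MySQL server handy, here is a sketch using Python's built-in sqlite3. Note that the defaults are mirrored: SQLite compares text case-sensitively (BINARY) unless you ask for NOCASE, whereas MySQL's latin1 default is case-insensitive until you ask for latin1_general_cs:

```python
import sqlite3

# SQLite stand-in for the latin1_general_cs example above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE test (
    non_case_sensitive CHAR(1) COLLATE NOCASE,
    case_sensitive     CHAR(1) COLLATE BINARY)""")
cur.executemany("INSERT INTO test VALUES (?, ?)",
                [("x", "x"), ("X", "X"), ("y", "Y"), ("Y", "y")])

# NOCASE folds x/X and y/Y together; BINARY keeps all four distinct.
ci = cur.execute("SELECT non_case_sensitive, COUNT(*) FROM test "
                 "GROUP BY non_case_sensitive").fetchall()
cs = cur.execute("SELECT case_sensitive, COUNT(*) FROM test "
                 "GROUP BY case_sensitive").fetchall()

print(len(ci), len(cs))  # 2 4
```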

Have fun! I hope to post more frequently in the future as I continue to find out interesting tidbits along the way.

Until next time,

27 June 2008

Things Not Replicating Correctly? Part 2

My last post on June 10 was about when MySQL received queries from php scripts that looked like this:
mysql_select_db('database_I_DO_want_to_replicate', $link);
$sql = "INSERT INTO db_I_DO_NOT_want_to_replicate.repl_table (a,b,c) VALUES (1,2,3);";

-- and thus the statement could not be filtered properly on the slave unless I set the Replicate_Wild_Ignore_Table value IN ADDITION to Replicate_Ignore_DB in my configuration file, like so:
replicate_ignore_DB = db_I_DO_NOT_want_to_replicate
replicate_wild_ignore_table = db_I_DO_NOT_want_to_replicate.%

News Flash

Three things:
  1. As it happens, we do not need to use replicate_ignore_DB for this to work properly;
  2. Replicate_do_DB and Replicate_ignore_DB do not need to be used at all;
  3. In fact, Replicate_do_DB is somewhat evil.

Firstly, replicate_wild_ignore_table is all you need to ignore databases, and it will properly handle all kinds of php riff-raff where the scripts do not necessarily select the proper database and instead write statements in db.table style.

Secondly, replicate_wild_do_table is all you need to include databases, and it will properly replicate all kinds of php fudgerama, blah blah blah.

Thirdly, if you use replicate_do_DB in addition to replicate_wild_do_table to try to cover your bases, it will indeed NOT cover your bases. We have strict replication here, and it WILL break if the php code does this:
mysql_select_db('database_I_DO_NOT_want_to_replicate', $link);
$sql = "INSERT INTO database_I_DO_want_to_replicate.repl_table (a,b,c) VALUES (1,2,3);";

So, my advice, once again, is:

DO NOT USE replicate_ignore_db and replicate_do_db!

Instead, use replicate_wild_ignore_table and replicate_wild_do_table.
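In other words, the slave's configuration file only needs the wild table rules (the database names here are placeholders):

```ini
# my.cnf on the slave
replicate_wild_do_table     = db_I_DO_want_to_replicate.%
replicate_wild_ignore_table = db_I_DO_NOT_want_to_replicate.%
```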

Have fun! Try it out and see for yourself how this can work for and against you! (Watch out, too, for MySQL loading my.cnf twice, which causes your list of replicated tables to appear twice; a fix is proposed for MySQL 5.1.)

Until next time,

10 June 2008

An Interesting Replication Tip

We recently moved some databases to different servers, and I altered our MySQL slave configuration files to account for the databases we wanted to replicate (Replicate-do-db) and those we wanted to ignore (Replicate-ignore-db) -- the two cases were mutually exclusive on each server.

All went well, until I found a peculiar error:
Error 'Table 'db101.table101' doesn't exist' on query. Default database: 'db201'. Query: 'UPDATE `db101`.`table101` SET `value101`=`value101`+1 WHERE `gender`='F' AND `site`='177' AND `date`=CURDATE()'
I did not want to replicate anything from database db101, but I did want to replicate every table from db201.

How to Proceed

These are my choices:
  1. Get our team of php programmers to properly specify the database each script uses, even if it requires 1,000 updates. And it needs to be done 10 minutes ago because our slave is stopped;
  2. File a bug report with MySQL and wait (months or years);
  3. Use a proxy to rewrite the database name (but there are better purposes for my lua scripts);
  4. Use the Replicate-Wild-Ignore-Table option in the configuration file (very fast).
The Solution
Replicate-Do-Db=db201 # This line was already here
Replicate-Wild-Ignore-Table=db101.% # This line was added and fixed the error

In Conclusion

There exist many solutions to a problem, but we aim for the simplest, most elegant one, the solution with the smallest footprint in terms of overhead and maintenance.

To further this last point, we must try our best to keep the overhead of our systems to a minimum, otherwise things can get out of control when we least expect it, as described in this post by John Mark Walker.

Until next time,

07 May 2008

How to Group Customers by Number of Purchases Quickly!

This is a great little sql statement that packs a nice punch for the number crunchers out there...

A disclaimer first: make sure you don't do this on a live OLTP table with millions of rows unless you want your customers to wait. Thus, run it on your slave :).

Onto business...

The Request

"I need to know the number of people who made one purchase, two purchases, three purchases, and so on. Right now."

The Answer
SELECT total, COUNT(1) FROM (SELECT customer_id, COUNT(1) AS total FROM purchases_table GROUP BY customer_id) derived_table GROUP BY total ORDER BY total;

And would you know it, the result was there in seconds....
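The same shape of query - an inner GROUP BY wrapped in an outer GROUP BY - can be tried anywhere; here is a sketch with Python's built-in sqlite3 and a tiny invented data set:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE purchases_table (customer_id INT)")
# Invented data: customer 1 bought once, 2 bought twice, 3 bought three times.
cur.executemany("INSERT INTO purchases_table VALUES (?)",
                [(1,), (2,), (2,), (3,), (3,), (3,)])

# Inner query: purchases per customer. Outer query: customers per
# purchase count -- the histogram the request asked for.
histogram = cur.execute("""
    SELECT total, COUNT(1)
      FROM (SELECT customer_id, COUNT(1) AS total
              FROM purchases_table
             GROUP BY customer_id) derived_table
     GROUP BY total
     ORDER BY total""").fetchall()

print(histogram)  # [(1, 1), (2, 1), (3, 1)]
```

One customer made one purchase, one made two, one made three; on real data the second column is the interesting one.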

Until next time, (I do hope to put up some juicy warehousing how-to's soon)

22 April 2008

MySQL Conference: The Week After

Wow... I had a great time at the annual MySQL Users Conference in Santa Clara.

  • Received my CMDBA designation. Woohoo!
  • Sun announced the possibility of an open source mobile phone;
  • All day mysql proxy tutorial with Jan Kneschke and Giuseppe Maxia;
  • Amazon's Power of Infrastructure as a Service - great premise; glitches remain;
  • Great food and a nifty side trip to Santa Cruz;
  • MySQL Workbench - nifty tool for database development;
  • Use multiple sockets to run queries in parallel to speed up web pages (from Digg.com);
  • Best Keynote: Who is the Dick on My Site? by Dick Hardt
  • A chant that Monty Widenius required us to repeat regarding MySQL's development future: "We don't want crippleware";
  • Discovered how Flickr scaled heavy concurrent writes in real time;
  • Intriguing discussion panel on database warehousing: Sun/MySQL, Kickfire, Infobright, NitroSecurity, and a data warehouse contractor;
  • Chocolate covered strawberries and ice cream goodies at the exhibit hall.
I hope to be a more responsible blogger and share some very insightful ideas with those of you who were not able to make it to the conference.

Until next time,

14 March 2008

Data Warehousing 101: Pitfalls to Avoid

Sometimes technology gets in the way of getting things done, and we techies must do what we can to make our data warehouse work for its intended purpose. Ralph Kimball makes 10 good points about the common pitfalls to avoid when implementing a data warehouse, as listed in his book "The Data Warehouse Toolkit", published by Wiley.

  • Become immersed with technology rather than the requirements and the goals of business;
  • Fail to recruit an influential and accessible management visionary to sponsor the data warehouse;
  • Turn this into a multi-year project instead of pursuing manageable iterative development;
  • Run out of budget while creating a normalized data structure, before building a viable presentation area;
  • Pay more attention to back-room ease of development over front-room ease of use and query performance;
  • Make queryable data in presentation area too complex, thus causing users to refer to developers;
  • Populate dimensional models on a standalone basis without regard to the data architecture that ties them together;
  • Load only summarized data into presentation area's dimensional tables;
  • Presume that the business, its requirements and analytics and the underlying data and technology are static;
  • Neglect to acknowledge that data warehouse success is tied directly to user acceptance.

Next: using MySQL and php to extract, transform and load data (ETL) into the warehouse

Until Next Time,

07 March 2008

Data Warehousing 101: Requirements for a Data Warehouse

Yes, it has been a while since I added an entry in my blog. I have been working on creating a dynamic data warehouse system reliant on the traditional LAMP stack (and a very nifty graphical plug-in - please comment below if you wish to know what it is!).

Firstly, I must make these qualifications before you read further:
  1. I consider Bill Inmon and Ralph Kimball the pioneers of data warehousing;
  2. Data warehousing is a relatively recent concept: Teradata introduced a database decision support system in 1983, and most of the concepts we understand about data warehousing today have taken shape over the past 10 years;
  3. I am new to this as well, but am very satisfied with the results I am obtaining.
OK... What does a data warehouse look like? Here is a succinct high-level view, as aptly described in the data warehouse entry on Wikipedia:
  • Subject-oriented - data is organized so that all data elements relating to the same real-world event or object are linked together.
  • Time-variant - changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time.
  • Non-volatile - data in the database is never over-written or deleted; once committed, the data is static, read-only, and retained for future reporting.
  • Integrated - the database contains data from most or all of an organization's operational systems, and this data is made consistent.
This premise points to what appears to be a centralized solution for a data warehouse. However, you can approach it on a smaller scale by identifying the basic bones of your database structure, enabling you to produce results in a shorter span of time. Note: the building blocks of the data are not organized by department - for example, sales, marketing, customer service, and so on.

Instead, you can establish what your basis for your data is through fact tables and dimension tables.

A fact table is the "primary table where the numerical performance measurements of the business are stored" (The Data Warehouse Toolkit, second edition, p.16). The most useful facts of this table are numeric and additive.

Daily Sales Fact Table

  • Date Key (FK)
  • Product Key (FK)
  • Store Key (FK)
  • Quantity Sold
  • Dollar Sales Amount

A dimension table, the companion to the fact table, is the user interface to the data warehouse.

Product Dimension Table

  • Product Key (PK)
  • Product Description
  • SKU Number (not to be used as a key in the warehouse)
  • Brand Description
  • Category Description
  • Department Description

Bringing together facts and dimensions becomes easier across all levels of business, as users agree that the dimensional model is their business. In our examples above, store operators use the store dimension table to relate to the number of products sold daily; marketing and product development refer to the attributes of the product dimension table to make conclusions about which products to promote and develop further.
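To make the fact/dimension split concrete, here is a sketch using Python's built-in sqlite3 (the data is invented; the columns follow the two example tables above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY,
    product_description TEXT,
    brand_description TEXT);
CREATE TABLE daily_sales_fact (
    date_key TEXT,
    product_key INTEGER REFERENCES product_dim,
    store_key INTEGER,
    quantity_sold INTEGER,
    dollar_sales_amount REAL);
""")
cur.execute("INSERT INTO product_dim VALUES (1, 'Widget', 'Acme')")
cur.execute("INSERT INTO product_dim VALUES (2, 'Gadget', 'Acme')")
cur.executemany("INSERT INTO daily_sales_fact VALUES (?,?,?,?,?)",
                [("2008-03-01", 1, 10, 3, 29.97),
                 ("2008-03-01", 2, 10, 1, 19.99),
                 ("2008-03-02", 1, 10, 2, 19.98)])

# The dimension supplies the labels; the fact supplies the additive measures.
report = cur.execute("""
    SELECT d.product_description, SUM(f.quantity_sold)
      FROM daily_sales_fact f
      JOIN product_dim d USING (product_key)
     GROUP BY d.product_description
     ORDER BY d.product_description""").fetchall()

print(report)  # [('Gadget', 1), ('Widget', 5)]
```

This is exactly the kind of question ("how many of each product sold?") that the dimensional model makes trivial to answer.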

Next I will relate common pitfalls known in the early stages of data warehousing and how to overcome them.

Until Next Time,

13 February 2008

Don't Ever Use NOW() - `datetime field`

***All point news bulletin***
Do not ever use NOW() - `datetime field`! Broadcast this from the mountaintops of open source to the valleys of companies who dabble in it.

Instead, use the following:

TIME_TO_SEC(TIMEDIFF(Now(),`datetime field`))

There are other ways to do the above, just as long as you do not use NOW() - `datetime field`.
Wanna see proof?
mysql> SELECT RIGHT(NOW(),8) `Now`, RIGHT(datetime_inserted,8) `Inserted`, NOW()-datetime_inserted Incorrect, TIME_TO_SEC(TIMEDIFF(NOW(),datetime_inserted)) Correct from my_table;
+----------+----------+-----------+---------+
| Now      | Inserted | Incorrect | Correct |
+----------+----------+-----------+---------+
| 11:16:59 | 10:59:45 | 5714.0000 |    1034 |
| 11:16:59 | 10:59:51 | 5708.0000 |    1028 |
| 11:16:59 | 10:59:55 | 5704.0000 |    1024 |
| 11:16:59 | 11:03:17 | 1342.0000 |     822 |
| 11:16:59 | 11:03:19 | 1340.0000 |     820 |
| 11:16:59 | 11:03:20 | 1339.0000 |     819 |
+----------+----------+-----------+---------+
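The arithmetic behind the "Incorrect" column can be reproduced outside MySQL; this Python sketch shows why the numeric subtraction goes wrong (timestamps taken from the first row above):

```python
from datetime import datetime

now      = datetime(2008, 2, 13, 11, 16, 59)
inserted = datetime(2008, 2, 13, 10, 59, 45)

def as_number(dt):
    # MySQL coerces a DATETIME to a number like 20080213111659 when it
    # appears in a numeric expression such as NOW() - col, so the
    # subtraction works on digits, not on seconds.
    return int(dt.strftime("%Y%m%d%H%M%S"))

incorrect = as_number(now) - as_number(inserted)
correct = int((now - inserted).total_seconds())

print(incorrect)  # 5714 -- digit arithmetic, matching the bad column
print(correct)    # 1034 -- the real elapsed seconds (17m 14s)
```

As soon as the subtraction crosses a minute, hour, or day boundary, the digit arithmetic drifts from the true number of seconds.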

Fun, eh?

Til next time,

06 February 2008

Data Warehousing 101: The purpose of a data warehouse

When your company decides that "it is time to build a data warehouse", what thoughts come to mind?

1) A magical fairy ice cream land where data is presented in chocolate shells for everyone to digest perfectly;
2) A big literal warehouse in the industrial section of town with rusty old containers;
3) Another place to put data, which means another place for you to track and monitor additional activity;
4) A place to put a pared-down representation of your OLTP database and call it OLAP.

Sorry to dash anyone's hopes, but it is none of the above.

Before anyone starts shooting from the hip about what a data warehouse is, I recommend picking up the book "The Data Warehouse Toolkit" by Ralph Kimball, a deservedly celebrated pioneer in the specialty of data warehousing.

Let's start with the goals of a data warehouse. What do you hear through the walls of business when they are talking about data?
"We have oceans of data in this company, but we can't access it."
"We need to slice, dice, and thrice the data in every conceivable manner possible. "
"You've got to make it easy for business people to get at the data directly."
"Why do we have people present the same business metrics at a meeting, but with different numbers?"
"We want people to use information to support more fact-based decision making."

Because these concerns are widespread, they have driven the requirements for the data warehouse:

The data warehouse must make an organization's information easily accessible.
The data warehouse must present the organization's information consistently.
The data warehouse must be adaptive and resilient to change.
The data warehouse must be a secure bastion that protects our information assets.
The data warehouse must serve as the foundation for improved decision making.
The business community must accept the data warehouse if it is to be deemed successful.

So what does a data warehouse look like? I am glad you asked. That will be the next thing I blog about. :)

Until next time,

30 January 2008

What Do You Name Date Fields?

I have seen too many date fields named after reserved keywords, and I need to make an entry about it... It's like naming your kid "Kid". Wait a sec... that's happened many times before as well.

I will make a "Proper Naming Conventions" page for MySQL after checking a few more sources first, but this date naming issue cannot wait another day.


First Rule: Do not use a reserved word for your field name!
Second Rule: Your date-related field name can actually include the word date or datetime in it!


Yes, this was a short simple post. But freekin' necessary.

Until next time,

23 January 2008

Sun + MySQL: One Week Later

I really like the 21st century open source technology community. 20 years ago, if a company bought out another company, everyone would have their own island of opinions and response, and you would not know if you were on-track or even if the publicized reports were on-track either.

Today, it is amazing to see the responses around the boards and blogs, which seem to keep the mainstream reports in check most of the time. I must admit I was checking to see what the "dirt" level of the buyout would end up being, but all in all, most people seem confident about one thing: this is going to help MySQL and Sun in the long run. MySQL will likely revamp its service offering (the price gap between the community and enterprise versions is still sizable for small companies), and Sun can integrate MySQL into its OSS stack, giving Solaris a bit of a facelift.

But where the rubber will meet the road is how Sun identifies MySQL's existing strengths and runs with them. Continue to improve InnoDB and MyISAM; I'll give Falcon a grace period, as it appears to have promise, and third-party engines like Brighthouse are fine too, as those are not financed by Sun.

Another place where I look forward to Sun's involvement and direction with MySQL is with replication and clustering. Linux heartbeat is a great utility we use now, and I may see hope with the NDB disk-based cluster when we have 5.1 in production.

I am also looking forward to a comprehensive binary backup utility (beyond MyISAM) in MySQL; in my experience, mysqldump is trumped by rsync. Currently we end up doing both rsync and mysqldump on our system, and our recovery works almost like Oracle's Recovery Manager (...and instead of depending on rman to produce the best backup recovery procedure, I use a few of my brain cells).

After reviewing the posts generated through PlanetMySQL.org, tech articles written by respected journalists, and a newly released report from Gartner, the mood is very positive for the future of MySQL.

Until Next Time,

15 January 2008

An exercise in SELF JOINing

There was only so much I could do for my 1976 Mustang. I did not have the money to put in an Edelbrock head, so I spent what little I had on a new stereo, a tinted back window, a new muffler that raised the neighborhood noise level a notch, and a couple of other random parts. In the end, it really did not go any faster, and I was $650 poorer.

I liken my experience with SELF JOIN to my 1976 Mustang. Sure, it was fun to use and figure out how I could make it work in my application, but I had a feeling that it might not pull ahead when the light turned green in production.

The Problem
  1. Records are inserted (using INSERT IGNORE) into a table based on DISTINCT first name, last name, address1, address2, city, state, zip, and email.
  2. We needed to select out records that had a distinct address1, address2, city, state and zip, picking up the most recent first name and last name within specific zip code areas (first 3 numbers of the zip code)
The SELF JOIN Solution

I had an "aha" solution, thinking that I could self join the distinct address from the same table with the most recent first name and last name. So off I went in my quest...
SELECT a.first_name, b.first_name, a.address1, a.address2, a.city, a.state, a.zip FROM main_table a INNER JOIN main_table b ON a.address1 = b.address1 AND a.zcf IN (123, 456,789) GROUP BY a.first_name, a.last_name, a.address1, a.address2, a.city, a.state, a.zip HAVING a.signup_date = MAX(b.signup_date) LIMIT 10000;

Well, I was very pleased that it worked so well and quickly in my dev environment of a few thousand rows. I was ready to tell the world that SELF JOINs are fun and profitable.

But then I did an EXPLAIN on the statement and this came up:
Extra: Using temporary; Using filesort
My dev table was only 10,000 rows, but my test environment had several million. Not good!

My Table Hack

Maybe someone could help me improve my SELF JOIN skills, but given the business requirement that I would be picking out zipcode-based data sets on a daily basis, I did not want to wait every time one of these requests ran.

So... I created a table that would include distinct address information with the most recent first name / last name, and a trigger that will populate this table based on this very criteria. It took a little over two hours to insert several million rows, but now I can have the joy of performing this statement instead:
SELECT fname, lname, address1, address2, city, state, zip FROM main_table WHERE zcf IN (123,456,789) LIMIT 10000;

Using the EXPLAIN statement produced the following predictable result:
Extra: Using where; Using index
It is another worker-bee script that I don't have to worry about on a daily basis. It produces results quickly, and everyone gets what they need. For fun, I did try the SELF JOIN on my test table and it took longer than a day to complete. Oh well...
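For the curious, the trigger-maintained table can be sketched with Python's built-in sqlite3 (simplified so that the latest insert wins, and with the column list trimmed down from the real table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE main_table (
    fname TEXT, lname TEXT, address1 TEXT, zip TEXT, signup_date TEXT);

-- Denormalized copy: one row per address, most recent name wins.
CREATE TABLE latest_by_address (
    address1 TEXT PRIMARY KEY, fname TEXT, lname TEXT, zip TEXT);

-- INSERT OR REPLACE keys on the address1 primary key, so each new
-- signup for an address overwrites the stored name.
CREATE TRIGGER keep_latest AFTER INSERT ON main_table
BEGIN
    INSERT OR REPLACE INTO latest_by_address (address1, fname, lname, zip)
    VALUES (NEW.address1, NEW.fname, NEW.lname, NEW.zip);
END;
""")
cur.execute("INSERT INTO main_table VALUES ('Ann','Old','1 Main St','80202','2008-01-01')")
cur.execute("INSERT INTO main_table VALUES ('Ann','New','1 Main St','80202','2008-06-01')")

rows = cur.execute("SELECT fname, lname FROM latest_by_address").fetchall()
print(rows)  # [('Ann', 'New')]
```

With the trigger doing the bookkeeping on every insert, the daily report becomes the simple indexed SELECT shown above instead of the self join.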

I now drive the family SUV, with no modifications, all standard equipment. And I know it would smoke my old 1976 Mustang off the line anytime.

Til next time,

07 January 2008

How to Send a Report on Most Common Domain Names without Grinding Your DB

This is a continuation of my previous entry on reversing an email address to search for a domain. Minutes after I ran the following query,
SELECT COUNT(1) FROM rv_emails WHERE rv_email LIKE REVERSE('%msn.com');
I was asked to do a count of the top 50 domains in our table of email addresses.

So, I simply changed my previous route by performing the following steps:

1) Add a 'domain' column to my table;
2) Populate the domain column:
UPDATE emails SET domain = RIGHT(email, LENGTH(email) - INSTR(email,'@'));
3) Select the top 50 domains:
SELECT domain, COUNT(1) FROM emails GROUP BY domain ORDER BY COUNT(1) DESC LIMIT 50;
There are a couple of items to note here:
  • No need to use the word DISTINCT as this is implied by GROUP BY;
  • ORDER BY COUNT(1) DESC does add "Using temporary; Using filesort" to the EXPLAIN statement, but on a 5 million record table, this took 7.78 sec. Another way to do this is to add "HAVING COUNT(1) > 5000" or some other arbitrary number and write these to a temp table for sorting the final results
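The whole routine can be tried end to end with Python's built-in sqlite3 standing in for MySQL (substr/instr replace MySQL's RIGHT/LENGTH/INSTR, and the addresses are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emails (email TEXT, domain TEXT)")
cur.executemany("INSERT INTO emails (email) VALUES (?)",
                [("a@msn.com",), ("b@msn.com",), ("c@gmail.com",)])

# Step 2: everything to the right of the '@' is the domain.
cur.execute("UPDATE emails SET domain = substr(email, instr(email, '@') + 1)")

# Step 3: rank the domains by how often they appear.
top = cur.execute("""
    SELECT domain, COUNT(1) AS total
      FROM emails
     GROUP BY domain
     ORDER BY total DESC
     LIMIT 50""").fetchall()

print(top)  # [('msn.com', 2), ('gmail.com', 1)]
```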
This solution turned out to be very good for our reporting purposes, so I now have a trigger that updates the domain column every time an email address is inserted. I would presume a better way is to have php calculate the domain name at the time of the insert, but we have dozens of pages to go through, and a programmer might unknowingly skip this step on a page that would be developed in the future. So, I think I will be keeping the trigger in place forever.

Until next time,