30 November 2007

A Zip Code Solution - ...WHERE LEFT(zip_code,3) vs. This Article

The Price is MySql with Jay Pipes

Quick! Pick a curtain! Do it now!

Curtain #1:
SELECT zip_code FROM table_name WHERE LEFT(zip_code,3) IN (123,456,789);

What's under curtain #2 you ask?

Well, I'm glad you asked! It is...
ALTER TABLE table_name ADD COLUMN zip3 INTEGER UNSIGNED ZEROFILL NOT NULL, ADD INDEX idx_zip3 (zip3);

What?? Another column and another index? My table already has 7,351 columns? Well, vertically partition it! OK, off topic...

In the meantime, let me show you how to save minutes, if not hours, of query execution time!

The Problem

I have a query with a three table join that is also making mysql scan 17 million records. Where is this scan coming from? It comes from this part of the query:

SELECT zip_code FROM table_name WHERE LEFT(zip_code,3) IN (123,456,789);

Because LEFT() wraps the column, mysql cannot use an index on zip_code to satisfy the condition, so it falls back to scanning every row.

The Solution

In a nutshell, it is this:
ALTER TABLE table_name ADD COLUMN zip3 INTEGER UNSIGNED ZEROFILL NOT NULL, ADD INDEX idx_zip3 (zip3);
UPDATE table_name SET zip3 = LEFT(zip_code,3);

Just do it. It may take an hour and 37 minutes as it did for me, but it might not!
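
One caveat before you run off: rows inserted after the UPDATE will not have zip3 populated by themselves. If you are on MySQL 5.0 or later, a pair of triggers can keep the column in sync (a sketch, assuming zip_code always holds at least three characters):

CREATE TRIGGER trg_zip3_insert BEFORE INSERT ON table_name
FOR EACH ROW SET NEW.zip3 = LEFT(NEW.zip_code, 3);

CREATE TRIGGER trg_zip3_update BEFORE UPDATE ON table_name
FOR EACH ROW SET NEW.zip3 = LEFT(NEW.zip_code, 3);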

The New Query

SELECT zip_code FROM table_name WHERE zip3 IN (123,456,789);
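
If you want a sanity check, EXPLAIN should now list idx_zip3 as a usable key (the exact output will depend on your data):

EXPLAIN SELECT zip_code FROM table_name WHERE zip3 IN (123,456,789);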

The Results

The query joins a 17 million row table to a 330,000 row table and a 17,500 record table, and this change cut its runtime by a factor of roughly 51:1. Instead of 45 minute runs, we're looking at under a minute. I think I will go out for lunch 4 times today. Whoops, I didn't say that. Really!

Til next time,
David


20 November 2007

Table Joins vs. php Loops

Table joins are fun and useful, but did you know that there are circumstances in which a php loop will beat out a table join?

For those of you who already know this, feel free to return to www.PlanetMySql.com. But for the rest of us, read on...

The Setup (fictional, but mimics my production environment)

I have two tables, employees and termination_info. The employees table has 150,000 records and termination_info has 65,000 records.

The Request

I have been asked to remove employee records whose email address matches one in termination_info (matching on id would be my very next project on this one).

The Choices

1) DELETE FROM a USING employees a, termination_info b WHERE a.email = b.email;
2) A php loop that walks through the termination_info records and deletes the corresponding employee record (sketched below).
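
Here is roughly what choice #2 looks like (a sketch using php's mysql_* functions; the connection details and database name are placeholders):

<?php
// Placeholders -- point these at your own server and schema.
$link = mysql_connect('localhost', 'user', 'password') or die(mysql_error());
mysql_select_db('hr', $link) or die(mysql_error());

// Walk the smaller table and delete matching employees one at a time.
$result = mysql_query("SELECT email FROM termination_info", $link) or die(mysql_error());

while ($row = mysql_fetch_assoc($result)) {
    $email = mysql_real_escape_string($row['email'], $link);
    mysql_query("DELETE FROM employees WHERE email = '$email'", $link) or die(mysql_error());
}
?>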

Which One is Faster?
#2, by a wide margin.

Why?
'EXPLAIN SELECT a.email FROM employees a INNER JOIN termination_info b ON a.email = b.email' only gives part of the picture. Sure, it shows that the email index will be used, so mysql only needs to examine 65,000 rows. However, the join needs processing space. A lot of it.

This is a great reason why the apis are made for mysql -- the php api allows us to interface with mysql and perform jobs more efficiently. The "real" sizes of the tables I used here were 230,000 records and 3.9 million records. The php script took under 10 minutes, while the mysql version had been running for well over an hour when I wrote this (and was still going).

Until next time,
David

16 November 2007

My Favorite Three Stooges Stats

Quick... You have to let everyone (boss, biz dev, customer service, and random bean counters) know why everything is moving slowly! Of course, rarely do people define what "everything" is, and what type of slowness is occurring. But, in the face of customer service agents who cannot work because their pages will not render, generally all eyes are on the famed-dba-of-the-minute.

So, with 7 people standing behind you, 3 IMs coming through (for those that bother with that), 4 more messages on the phone, and the knowledge that at least a dozen people are saying "the dba is working on it now", what do you do?

First, we all like SHOW PROCESSLIST. Nice. Gives us a quick bird's-eye view of possible locks, etc. (I usually run SHOW FULL PROCESSLIST so the query text isn't truncated.)

But... what if there are no locks, and just a lot of interesting-looking selects? Before you beat up on the programmers, you have to carry the boat over the sticks first. Then you can throw darts at the dev team.

How to do this? Well, first, how's the CPU / IO situation?

Will 'top' really work? It's good for a general view of the state of the machine, so fine. Do that. Keep the window open to reference if you wish; I get the general trend within 5 seconds and then move on. But linux gives you FAR more than just that (here's a great reference that served as an inspiration to me) -- don't stop there!

But now comes the fun part:

I use mpstat to give me immediate feedback on what the CPUs are doing, including how much of their time goes to waiting on I/O. One of my favorite uses is the case below, where I am going to see what each (ALL) processor (-P) is doing every 5 seconds for 2 rounds...

mpstat -P ALL 5 2


05:54:05 PM  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle  intr/s
05:54:10 PM  all   1.25   0.00  1.00    78.29  0.05   0.10    0.00  19.31  652.51
05:54:10 PM    0   4.61   0.00  3.21    91.58  0.00   0.60    0.00   0.00  629.86
05:54:10 PM    1   0.40   0.00  0.20    21.84  0.00   0.00    0.00  77.35   22.65
05:54:10 PM    2   0.20   0.00  0.00   100.00  0.00   0.00    0.00   0.00    0.00
05:54:10 PM    3   0.00   0.00  0.20   100.00  0.00   0.00    0.00   0.00    0.00

05:54:10 PM  CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle  intr/s
05:54:15 PM  all   1.75   0.00  1.30    79.39  0.05   0.20    0.00  17.31  982.00
05:54:15 PM    0   6.20   0.00  4.20    88.60  0.20   0.80    0.00   0.00  945.60
05:54:15 PM    1   0.80   0.00  0.80    29.20  0.00   0.00    0.00  69.20   36.40
05:54:15 PM    2   0.20   0.00  0.00    99.80  0.00   0.00    0.00   0.00    0.00
05:54:15 PM    3   0.00   0.00  0.20    99.80  0.00   0.00    0.00   0.00    0.00

Average:     CPU  %user  %nice  %sys  %iowait  %irq  %soft  %steal  %idle  intr/s
Average:     all   1.55   0.00  1.09    78.98  0.06   0.15    0.00  18.17  859.50
Average:       0   5.47   0.00  3.54    90.21  0.20   0.58    0.00   0.00  822.21
Average:       1   0.66   0.00  0.50    26.06  0.02   0.02    0.00  72.76   37.29
Average:       2   0.08   0.00  0.02   100.04  0.00   0.00    0.00   0.00    0.00
Average:       3   0.02   0.00  0.30    99.86  0.00   0.00    0.00   0.00    0.00


OK, this box looks busy, and there is a lot of I/O going on. This is where I check the processlist again, and the slow query log (I have a handy little script that parses the log into a database on a separate box -- and since I rotate my logs daily, the LOAD DATA INFILE stays small). At this point, you can see which queries are causing the box the most grief and review them with people who are willing to listen... :)
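
For what it is worth, the table on that separate box is nothing fancy. A minimal sketch, assuming the parsing script spits out one tab-separated line per slow query (the table name, columns, and file path here are all made up):

CREATE TABLE slow_query_history (
  logged_at     DATETIME     NOT NULL,
  query_time    INT UNSIGNED NOT NULL,
  lock_time     INT UNSIGNED NOT NULL,
  rows_sent     INT UNSIGNED NOT NULL,
  rows_examined INT UNSIGNED NOT NULL,
  sql_text      TEXT         NOT NULL,
  INDEX idx_logged_at (logged_at)
);

LOAD DATA INFILE '/tmp/slow_log_parsed.txt'
INTO TABLE slow_query_history
FIELDS TERMINATED BY '\t';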

In the meantime, I still have to tell you about iostat, dstat and sar!

One of my favorite uses of the iostat command is below (by this time, the server had calmed down quite a bit!):

-k = show kilobytes per second instead of blocks per second (makes it easy to assess the situation at hand)
-t = print the time for each report (I also have a utility that parses this data into a table for analysis, and it is good to know the time!)
-x = display extended statistics, particularly %util, the percentage of CPU time during which I/O requests are issued to the device (saturation occurs at 100%)

iostat -ktx 10 5


Time: 04:29:50 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.51   0.07     0.24     3.26    0.00  95.92

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        1.35   19.65  16.31  24.25  474.10  177.46     32.13      0.50  12.29   3.06  12.41

Time: 04:30:00 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.05   0.00     0.07     1.85    0.00  98.03

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00    7.91   2.90  22.52   12.01  123.32     10.65      0.14   5.33   2.96   7.54

Time: 04:30:10 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
          11.85   0.00     8.88     2.03    0.00  77.24

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   17.88   1.30  18.98    5.19  148.25     15.13      0.13   6.39   4.33   8.79

Time: 04:30:20 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.10   0.00     0.12     4.80    0.00  94.98

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   10.10   4.70  25.00   18.80  142.80     10.88      0.28   9.39   6.49  19.29

Time: 04:30:30 PM
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           0.07   0.00     0.05     1.27    0.00  98.60

Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.00   12.40   3.00  33.60   12.00  188.00     10.93      0.16   4.32   1.45   5.30


Next is dstat -- another way to look at CPU and I/O:

dstat


----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
6 2 87 5 0 0| 44k 55k| 0 0 | 0 0 | 564 248
4 0 93 2 0 0| 44k 0 | 0 0 | 0 0 | 345 266
0 0 79 21 0 0| 60k 372k| 21k 5684B| 0 0 | 494 386
2 1 90 6 0 0| 48k 0 | 0 0 | 0 0 | 435 579
0 0 75 24 0 0| 92k 1496k| 50k 13k| 0 0 | 656 346
5 2 88 5 0 0| 112k 336k| 0 0 | 0 0 | 429 467
1 2 90 6 0 0| 80k 0 | 48k 11k| 0 0 | 426 514
1 1 92 5 0 0| 48k 344k| 0 0 | 0 0 | 501 576
0 0 81 18 0 0| 72k 0 | 44k 10k| 0 0 | 364 293
0 0 89 11 0 0| 24k 2064k| 0 0 | 0 0 | 704 273
8 1 77 13 0 0| 48k 240k| 32k 6214B| 0 0 | 350 146
3 1 89 7 0 0| 36k 0 | 0 0 | 0 0 | 400 401
7 1 76 16 0 0| 84k 72k| 35k 8034B| 0 0 | 381 243
5 1 92 3 0 0| 68k 0 | 0 0 | 0 0 | 357 310
3 0 87 9 0 0| 0 1552k| 25k 5858B| 0 0 | 580 106



And lastly:

sar 5 10


05:16:09 PM  CPU  %user  %nice  %system  %iowait  %steal  %idle
05:16:14 PM  all   0.30   0.00     0.05     0.95    0.00  98.70
05:16:19 PM  all   0.30   0.00     0.15     3.65    0.00  95.90
05:16:24 PM  all   0.40   0.00     0.15     6.49    0.00  92.96
05:16:29 PM  all   0.50   0.00     0.05     4.25    0.00  95.20
05:16:34 PM  all   0.30   0.00     0.10     3.60    0.00  96.00
05:16:39 PM  all   0.35   0.00     0.15     0.90    0.00  98.60
05:16:44 PM  all   0.25   0.00     0.05     3.75    0.00  95.95
05:16:49 PM  all   0.45   0.00     0.10     0.65    0.00  98.80
05:16:54 PM  all   0.40   0.00     0.10     1.00    0.00  98.50
05:16:59 PM  all   0.05   0.00     0.10     0.55    0.00  99.30
Average:     all   0.33   0.00     0.10     2.58    0.00  96.99


Enjoy using these! If needed, write them down in a handy little spot, so that when you're in the spotlight you can show how your servers are getting utilized (hammered). This will then get you to the next step with developers, biz dev, customer support and your supervisor: improving a few queries.

Which leads to a future topic: using the slow query log. Fun!!

Until next time,
David


02 November 2007

How to Select Unique Records from a table that does not have a Unique Index

Many times DBAs work with data sets that are not the most optimal (in their opinion), and I recently had the pleasure of working with a table that did not have a unique identifier. I needed to get unique results from this table, since I would have to run this query in different forms in the future without picking up records from previous runs.

Caveat: I could not alter the table structure yet; I needed to make do with the way it was (there were a LOAD DATA INFILE script or two without qualified field names). Otherwise, the first thing I would have done is simply add an auto_increment column at the start.

I also could not add a unique index; there was no time for that -- it would have easily taken over an hour.


The Challenge

Obtain exactly 10,000 records with a unique identifier, nine times over (nine different queries).


The Half-Solution

Create table_2 with a unique constraint based on the id.
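
For illustration, table_2 could look something like this (a sketch -- the column types are assumptions; the part that matters is the unique key on id):

CREATE TABLE table_2 (
  id INT UNSIGNED NOT NULL,
  a  VARCHAR(50),
  b  VARCHAR(50),
  c  VARCHAR(50),
  UNIQUE KEY uniq_id (id)
);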

INSERT IGNORE INTO table_2 (id, a,b,c) SELECT id, a,b,c FROM table_1 WHERE column_condition = 'match this' LIMIT 10000;


The Problem

The number of affected rows comes back less than 10,000: INSERT IGNORE silently skips any row whose id already exists in table_2, so duplicates in table_1 make the insert fall short of the 10,000 goal.

I needed to do nine of these, and there was no way I was going to do this by hand in sql.


The Solution


Use a php loop and mysql_affected_rows() to rinse and repeat until the desired count is reached.

<?php
// Connection details are placeholders.
$link = mysql_connect('localhost', 'user', 'password') or die(mysql_error());
mysql_select_db('my_database', $link) or die(mysql_error());

$finished = false;
$limit = 10000; // how many unique rows we still need

while (!$finished) {
    // INSERT IGNORE skips rows whose id already exists in table_2, so
    // mysql_affected_rows() reports only the rows that actually went in.
    $sql = "INSERT IGNORE INTO table_2 (id, a, b, c)
            SELECT id, a, b, c FROM table_1
            WHERE column_condition = 'match this' LIMIT $limit";
    mysql_query($sql, $link) or die(mysql_error());
    $affected_rows = mysql_affected_rows($link);

    if ($limit == $affected_rows) {
        $finished = true; // the whole batch went in -- we have our 10,000
    } else {
        $limit = $limit - $affected_rows; # rinse and repeat!
    }
}
?>


This saved me a lot of grief, and it will tide me over until I get a window of opportunity to put a unique index on the table.

Until next time,
David


30 October 2007

mySql Utility to Set Up a Slave Server, Unattended

This utility written by Kevin Burton is worth checking out - even though it is a first version and needs to be modded for your purposes (in my case I had to alter the local mysql connection string). In any case, this is going to be needed by us DBAs in the future, as we manage and build more and more slave servers.

This is what the script does, in Kevin's words:

This is a script to perform unattended cloning of MySQL slave servers (or masters) to put a new slave online with minimal interaction.

It connects to the source node, performs a mysqldump or mysqlhotcopy, transfers the data, restores the data, and then sets up all replication parameters, on the target machine.

It then starts the slave, which then begins to catch up to the master.

For DBAs that spend hours setting up slave servers, save yourself some time and check this script out!

Til next time,
David

25 October 2007

Google + mysql = distributed fun!

Google seems to be everybody's favorite girlfriend these days. We might even replace the antiquated term "Midas touch" with "googlitis", a condition you get when google looks in your direction. Your valuation goes up even if google does not become a successful bidder (Microsoft + facebook = $15 billion valuation), and everybody wants to be your best friend, even if you are politely collecting data from them.

So the daytime soap continues on, this time in favor of mysql -- with articles like:
MySQL to get injection of Google code and
MySQL getting Google's blessing...and code
where google documents their progress here.

Add to this an increasing acceptance (not just awareness) of open-source databases, and that makes for another exciting day...

Til next time,
David

19 October 2007

IP Addresses: No more char(15)'s allowed!

OK... This is my second entry on this blog. And it is about something I want every mysql/php newbie to know at the starting gate:

DON'T USE CHAR(15) FOR IP Address fields in mysql!

Well... what about the periods in ip addresses, you say?

Well... mysql has a FANTASTIC way of solving this issue! Convert the standard ip address into an unsigned integer!

ok... at first it sounds a little cryptic, but say it at least 3 times before your morning coffee:

INET_ATON
INET_NTOA


loosely meaning....

INET_ATON = Internet address to number
INET_NTOA = Internet number to address

SO... try this:

mysql> SELECT INET_ATON('192.168.1.5');
+--------------------------+
| INET_ATON('192.168.1.5') |
+--------------------------+
|               3232235781 |
+--------------------------+
1 row in set (0.04 sec)

Now, let's reverse it!


mysql> SELECT INET_NTOA(3232235781);
+-----------------------+
| INET_NTOA(3232235781) |
+-----------------------+
| 192.168.1.5           |
+-----------------------+
1 row in set (0.01 sec)

Fun, eh?
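
To put the pair to work, here is a minimal sketch (the ip_log table and its columns are made up for illustration):

CREATE TABLE ip_log (
  id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ip_num INT UNSIGNED NOT NULL,
  INDEX idx_ip_num (ip_num)
);

INSERT INTO ip_log (ip_num) VALUES (INET_ATON('192.168.1.5'));

SELECT INET_NTOA(ip_num) FROM ip_log WHERE ip_num = INET_ATON('192.168.1.5');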

Now... what are the savings on this, you ask?

On a 17 million row table consisting of just an auto_increment id (another must; don't leave home without an auto_increment id!!) and the ip column, non-indexed exact searches are about the same. In fact, the searches on the char column sometimes perform up to 10% faster.

After indexing, exact searches are once again about the same, although they take a fraction of the time they did when they were not indexed. Where the crown jewel shines, however, is on ranges: on a count of an ip range, the cost on the int based table was 3.78 seconds, while on the char(15) based table it was 26.21 seconds.
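
In case you are wondering what such a range count looks like, here is a sketch (reusing the made-up ip_log table from above; INET_ATON keeps the endpoints readable):

SELECT COUNT(*)
FROM ip_log
WHERE ip_num BETWEEN INET_ATON('192.168.0.0')
                 AND INET_ATON('192.168.255.255');

Since ip_num is a plain integer, the BETWEEN can ride the index, and integer order is the correct order for ip ranges -- a char(15) column sorts lexicographically, where '192.168.2.1' lands after '192.168.100.1', which is not what you want.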

Also -- the size of the int based data table (MyISAM) is 148M compared to 329M on the char(15) based table.

Are you convinced now? :) If so, now you know what you're going to do!

Until next time (whenever that will be),
David