30 November 2007

A Zip Code Solution - ...WHERE LEFT(zip_code,3) vs. This Article

The Price is MySQL with Jay Pipes

Quick! Pick a curtain! Do it now!

Curtain #1:
SELECT zip_code FROM table_name WHERE LEFT(zip_code,3) IN (123,456,789);

What's under curtain #2 you ask?

Well, I'm glad you asked! It is...
ALTER TABLE table_name ADD COLUMN zip3 INTEGER UNSIGNED ZEROFILL NOT NULL, ADD INDEX idx_zip3 (zip3);

What?? Another column and another index? My table already has 7,351 columns! Well, vertically partition it! OK, off topic...

In the meantime, let me show you how to save minutes, if not hours, of query execution time!

The Problem

I have a three-table join in a query that is making mysql scan 17 million records. Where is this scan coming from? This part of the query:

SELECT zip_code FROM table_name WHERE LEFT(zip_code,3) IN (123,456,789);

The Solution

In a nutshell, it is this. Because LEFT() is applied to zip_code, mysql cannot use an index on that column and has to evaluate the function for every row. So give the query a plain, indexed column to compare against instead:
ALTER TABLE table_name ADD COLUMN zip3 INTEGER UNSIGNED ZEROFILL NOT NULL, ADD INDEX idx_zip3 (zip3);
UPDATE table_name SET zip3 = LEFT(zip_code,3);

Just do it. It may take an hour and 37 minutes as it did for me, but it might not!
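
One caveat: rows inserted after this backfill will not have zip3 set. If you are on MySQL 5.0 or later, a pair of triggers can keep the column in sync (a sketch -- the trigger names are my own invention, and you may prefer to set zip3 from the application instead):

CREATE TRIGGER trg_zip3_insert BEFORE INSERT ON table_name
FOR EACH ROW SET NEW.zip3 = LEFT(NEW.zip_code, 3);

CREATE TRIGGER trg_zip3_update BEFORE UPDATE ON table_name
FOR EACH ROW SET NEW.zip3 = LEFT(NEW.zip_code, 3);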

The New Query

SELECT zip_code FROM table_name WHERE zip3 IN (123,456,789);
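
To verify that the new index is actually being picked up (a quick sanity check; your numbers will differ):

EXPLAIN SELECT zip_code FROM table_name WHERE zip3 IN (123,456,789);

Look for idx_zip3 in the key column and a rows estimate nowhere near 17 million.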

The Results

I have shaved the runtime of a query that joins a 17 million row scan to a 330,000 row scan and a 17,500 record table by a factor of roughly 51:1. Instead of 45 minute runs, we're looking at under a minute. I think I will go out for lunch 4 times today. Whoops, I didn't say that. Really!

Til next time,
David


20 November 2007

Table Joins vs. php Loops

Table joins are fun and useful, but did you know that there are circumstances in which a php loop will beat out a table join?

For those of you who already know this, feel free to return to www.PlanetMySql.com. But for the rest of us, read on...

The Setup (fictional, but mimics my production environment)

I have two tables, employees and termination_info. The employees table has 150,000 records and termination_info has 65,000 records.

The Request

I have been asked to remove the employees whose email addresses appear in termination_info (matching on id would be my very next project on this one).

The Choices

1) DELETE FROM a USING employees a, termination_info b WHERE a.email = b.email;
2) a php loop that walks the termination_info records and deletes the corresponding employee record (sketched below).

Which One is Faster?
#2, by a wide margin.

Why?
'EXPLAIN SELECT a.email FROM employees a INNER JOIN termination_info b ON a.email = b.email' only gives part of the picture. Sure, it shows that the email index will be used, so mysql only needs to examine about 65,000 rows. However, the multi-table DELETE also needs processing space for the join. A lot of it.

This is a great example of why the apis are made for mysql -- the php api allows us to interface with mysql and break one big job into many small, efficient ones. The "real" sizes of the tables I ran this against were 230,000 records and 3.9 million records. The php script took under 10 minutes; the mysql version had been running for well over an hour when I began writing this (and was still running as I finished).
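
For reference, here is a minimal sketch of the php loop from choice #2 (assuming an open connection in $link and an index on employees.email; error handling is kept to a bare minimum):

<?php
// walk the termination list and delete each matching employee --
// each DELETE is a small, indexed operation that needs almost no working memory
$result = mysql_query("SELECT email FROM termination_info", $link)
    or die(mysql_error());

while ($row = mysql_fetch_assoc($result)) {
    $email = mysql_real_escape_string($row['email'], $link);
    mysql_query("DELETE FROM employees WHERE email = '$email'", $link)
        or die(mysql_error());
}
?>

Each pass deletes one employee through the email index, so mysql never has to build a giant join buffer.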

Until next time,
David

16 November 2007

My Favorite Three Stooges Stats

Quick... You have to let everyone (boss, biz dev, customer service, and random bean counters) know why everything is moving slowly! Of course, rarely do people define what "everything" is, or what kind of slowness is occurring. But in the face of customer service agents who cannot work because their pages will not render, generally all eyes are on the famed-dba-of-the-minute.

So, with 7 people standing behind you, 3 IMs coming through (for those that bother with that), 4 more messages on the phone, and the knowledge that at least a dozen people are saying "the dba is working on it now", what do you do?

First, we all like SHOW PROCESSLIST. Nice. It gives us a quick bird's-eye view of possible locks, etc.

But... what if there are no locks, and just a lot of interesting-looking selects? Before you beat up on the programmers, you have to carry the boat over the sticks first. Then you can throw darts at the dev team.

How to do this? Well, first, how's the CPU / IO situation?

Will 'top' really work? It's good for a general view of the state of the machine, so fine. Do that. Keep the window open for reference if you wish; I get the general trend within 5 seconds and then move on. But linux gives you FAR more than just that (here's a great reference that served as an inspiration to me) -- don't stop there!

But now comes the fun part:

I use mpstat to give me immediate per-processor feedback -- especially %iowait, which tells me how much time the CPUs spend idle while waiting on disk I/O. One of my favorite uses is the case below, where I am going to see what each (ALL) processor (-P) is doing every 5 seconds for 2 rounds...

mpstat -P ALL 5 2


05:54:05 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
05:54:10 PM all 1.25 0.00 1.00 78.29 0.05 0.10 0.00 19.31 652.51
05:54:10 PM 0 4.61 0.00 3.21 91.58 0.00 0.60 0.00 0.00 629.86
05:54:10 PM 1 0.40 0.00 0.20 21.84 0.00 0.00 0.00 77.35 22.65
05:54:10 PM 2 0.20 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00
05:54:10 PM 3 0.00 0.00 0.20 100.00 0.00 0.00 0.00 0.00 0.00

05:54:10 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
05:54:15 PM all 1.75 0.00 1.30 79.39 0.05 0.20 0.00 17.31 982.00
05:54:15 PM 0 6.20 0.00 4.20 88.60 0.20 0.80 0.00 0.00 945.60
05:54:15 PM 1 0.80 0.00 0.80 29.20 0.00 0.00 0.00 69.20 36.40
05:54:15 PM 2 0.20 0.00 0.00 99.80 0.00 0.00 0.00 0.00 0.00
05:54:15 PM 3 0.00 0.00 0.20 99.80 0.00 0.00 0.00 0.00 0.00

Average: CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
Average: all 1.55 0.00 1.09 78.98 0.06 0.15 0.00 18.17 859.50
Average: 0 5.47 0.00 3.54 90.21 0.20 0.58 0.00 0.00 822.21
Average: 1 0.66 0.00 0.50 26.06 0.02 0.02 0.00 72.76 37.29
Average: 2 0.08 0.00 0.02 100.04 0.00 0.00 0.00 0.00 0.00
Average: 3 0.02 0.00 0.30 99.86 0.00 0.00 0.00 0.00 0.00


OK, this box looks busy, and there is a lot of I/O wait going on. This is where I check the processlist again, and the slow query log (I have a handy little script that parses it into a database on a separate box -- and since I clean my logs daily, the LOAD DATA INFILE stays small). At this point, you can see which queries are causing the box the most grief and review them with people who are willing to listen... :)

In the meantime, I still have to tell you about iostat, dstat and sar!

One of my favorite uses of the iostat command is below (by this time, the server had calmed down quite a bit!).

k = report kilobytes per second instead of blocks per second (makes it easy to assess the situation at hand)
t = print the time for each report (I also have a utility that parses this data into a table for analysis, and it is good to know the time!)
x = display extended statistics, particularly %util, the percentage of CPU time during which I/O requests were issued to the device; saturation occurs at 100%

iostat -ktx 10 5


Time: 04:29:50 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.51 0.07 0.24 3.26 0.00 95.92

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 1.35 19.65 16.31 24.25 474.10 177.46 32.13 0.50 12.29 3.06 12.41

Time: 04:30:00 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.05 0.00 0.07 1.85 0.00 98.03

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 7.91 2.90 22.52 12.01 123.32 10.65 0.14 5.33 2.96 7.54

Time: 04:30:10 PM
avg-cpu: %user %nice %system %iowait %steal %idle
11.85 0.00 8.88 2.03 0.00 77.24

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 17.88 1.30 18.98 5.19 148.25 15.13 0.13 6.39 4.33 8.79

Time: 04:30:20 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.10 0.00 0.12 4.80 0.00 94.98

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 10.10 4.70 25.00 18.80 142.80 10.88 0.28 9.39 6.49 19.29

Time: 04:30:30 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.07 0.00 0.05 1.27 0.00 98.60

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 12.40 3.00 33.60 12.00 188.00 10.93 0.16 4.32 1.45 5.30


Next is dstat -- another way to look at CPU and I/O:

dstat


----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
6 2 87 5 0 0| 44k 55k| 0 0 | 0 0 | 564 248
4 0 93 2 0 0| 44k 0 | 0 0 | 0 0 | 345 266
0 0 79 21 0 0| 60k 372k| 21k 5684B| 0 0 | 494 386
2 1 90 6 0 0| 48k 0 | 0 0 | 0 0 | 435 579
0 0 75 24 0 0| 92k 1496k| 50k 13k| 0 0 | 656 346
5 2 88 5 0 0| 112k 336k| 0 0 | 0 0 | 429 467
1 2 90 6 0 0| 80k 0 | 48k 11k| 0 0 | 426 514
1 1 92 5 0 0| 48k 344k| 0 0 | 0 0 | 501 576
0 0 81 18 0 0| 72k 0 | 44k 10k| 0 0 | 364 293
0 0 89 11 0 0| 24k 2064k| 0 0 | 0 0 | 704 273
8 1 77 13 0 0| 48k 240k| 32k 6214B| 0 0 | 350 146
3 1 89 7 0 0| 36k 0 | 0 0 | 0 0 | 400 401
7 1 76 16 0 0| 84k 72k| 35k 8034B| 0 0 | 381 243
5 1 92 3 0 0| 68k 0 | 0 0 | 0 0 | 357 310
3 0 87 9 0 0| 0 1552k| 25k 5858B| 0 0 | 580 106



And lastly, sar, sampling every 5 seconds for 10 reports:

sar 5 10


05:16:09 PM CPU %user %nice %system %iowait %steal %idle
05:16:14 PM all 0.30 0.00 0.05 0.95 0.00 98.70
05:16:19 PM all 0.30 0.00 0.15 3.65 0.00 95.90
05:16:24 PM all 0.40 0.00 0.15 6.49 0.00 92.96
05:16:29 PM all 0.50 0.00 0.05 4.25 0.00 95.20
05:16:34 PM all 0.30 0.00 0.10 3.60 0.00 96.00
05:16:39 PM all 0.35 0.00 0.15 0.90 0.00 98.60
05:16:44 PM all 0.25 0.00 0.05 3.75 0.00 95.95
05:16:49 PM all 0.45 0.00 0.10 0.65 0.00 98.80
05:16:54 PM all 0.40 0.00 0.10 1.00 0.00 98.50
05:16:59 PM all 0.05 0.00 0.10 0.55 0.00 99.30
Average: all 0.33 0.00 0.10 2.58 0.00 96.99


Enjoy using these! If needed, write them down in a handy little spot, so that when you're in the spotlight you can show exactly how your servers are getting utilized (hammered). That credibility gets you to the next step with developers, biz dev, customer support and your supervisor: improving a few queries.

Which leads to a future topic: using the slow query log. Fun!!

Until next time,
David


02 November 2007

How to Select Unique Records from a table that does not have a Unique Index

DBAs often work with data sets that are less than optimal (in their opinion), and I recently had the pleasure of working with a table that did not have a unique identifier. I needed to get unique results from this table, and since I would be running variations of this query in the future, I could not afford to re-select records from earlier runs.

Caveat: I could not alter the table structure yet; I had to make do with it as it was (there was a LOAD DATA INFILE script or two that did not qualify their field names). Otherwise, the first thing I would have done is simply add an auto_increment column at the front of the table.

I also could not add a unique index; there was no time for that -- it would have easily taken over an hour.


The Challenge

Obtain exactly 10,000 records with unique identifiers, times nine different queries.


The Half-Solution

Create table_2 with a unique constraint based on the id.
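
Something like this (a sketch -- the column types are assumptions, since the real structure is not shown; the UNIQUE KEY on id is what lets INSERT IGNORE skip duplicates):

CREATE TABLE table_2 (
    id INT UNSIGNED NOT NULL,
    a VARCHAR(255),
    b VARCHAR(255),
    c VARCHAR(255),
    UNIQUE KEY uniq_id (id)
);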

INSERT IGNORE INTO table_2 (id, a,b,c) SELECT id, a,b,c FROM table_1 WHERE column_condition = 'match this' LIMIT 10000;


The Problem

Affected rows come back under 10,000 (INSERT IGNORE silently skips ids that already exist in table_2), thus falling short of the 10,000 goal.

I needed to do nine of these, and there was no way I was going to do this by hand in sql.


The Solution


Use a php loop and mysql_affected_rows to rinse and repeat until the desired result is achieved.

$finished = false;
$limit = 10000;  // how many unique rows we still need

while (!$finished) {

    // INSERT IGNORE skips ids already present in table_2, so
    // mysql_affected_rows() reports only the rows that actually made it in
    $sql = "INSERT IGNORE INTO table_2 (id, a, b, c)
            SELECT id, a, b, c FROM table_1
            WHERE column_condition = 'match this'
            LIMIT $limit";
    mysql_query($sql, $link) or die(mysql_error());
    $affected_rows = mysql_affected_rows($link);

    if ($limit == $affected_rows) {
        $finished = true;  // the whole batch was new -- we hit the target
    } elseif ($affected_rows == 0) {
        break;  // nothing new was inserted; bail out rather than loop forever
    } else {
        $limit = $limit - $affected_rows;  // rinse and repeat with the shortfall!
    }
}


This saved me a lot of grief, and it will hold me over until I get the window of opportunity to put a unique index on the table.

Until next time,
David