18 December 2007

How to Find Domain Names of Emails Without Using LIKE '%domain.com'

I like working at companies that have "problems related to success". A colleague of mine at a former company in LA used to talk about the two types of problems a business faces:
  1. Problems related to failure - the problems of having your doors open for business while few people know about it;
  2. Problems related to success - the problems of having so many people come through your doors that you are trying to figure out how to handle them all.
The world of the web is very much like this. Rarely do you hear that a company is in the "sweet spot in the middle", and if a company is in such a position, everyone keeps very quiet about it so as not to jinx the atmosphere around it. Instead, you hear "We just need to bring our numbers up by promoting 1,000 more youtube videos", or, at the other end, "We just brought in three racks of servers and we can barely keep up".

And, as it happens, "problems of success" land in the lap of the mysql DBA. As any DBA should be aware, we need to prepare for the future before people start calling us Disaster Briefing Analysts instead of Database Administrators.

One of these moments involves the LIKE operator. We all hope we never need to use LIKE with the % wildcard at the front, since a leading wildcard means the index on the column cannot be used. But, as it happens, with tables containing email addresses, sooner or later a business analyst will request a count of all msn.com-based email addresses.

How to do this without LIKE '%msn.com'?

REVERSE to the Rescue

Just reverse the email address in the table and then reverse the domain you are searching for. Here's how to create a table for the testing:
CREATE TABLE rv_emails (rv_email_id INT UNSIGNED NOT NULL AUTO_INCREMENT, email CHAR(50) NOT NULL, rv_email CHAR(50) NOT NULL, PRIMARY KEY (rv_email_id), INDEX idx_email (email), INDEX idx_rv_email (rv_email));
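For this to work, the rv_email column has to hold the reversed copy of whatever is in email. A minimal sketch of keeping it in step (the sample address is made up):
INSERT INTO rv_emails (email, rv_email) VALUES ('someone@msn.com', REVERSE('someone@msn.com'));
-- or, to backfill a table whose email column is already loaded:
UPDATE rv_emails SET rv_email = REVERSE(email);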

When I run SELECT COUNT(1) FROM rv_emails WHERE email LIKE '%msn.com' against this table with a million records in it, it takes 1.28 sec to retrieve a count of 17,594 records. Now try this on for size:
SELECT COUNT(1) FROM rv_emails WHERE rv_email LIKE REVERSE('%msn.com');
How long did that take for you? A little better, eh? (That's my Canadian vernacular coming through.) For me, it was 0.03 sec to achieve the identical result. An improvement of about 40 times.

Just for fun, here are the results when the table has 5 million records:
  • Old method: 6.69 sec
  • New method: 0.19 sec
Fun stuff!

Until next time,
David

13 December 2007

My One Night mySQL Road Trip

I like driving. It relaxes me. I especially like driving in the Carolinas versus driving the freeways in Los Angeles. In LA, you have to intentionally calm yourself down before hitting the on-ramp of the 101, the 5, or the 10. Here in South Carolina, it is possible to drive highway speeds on the highways (in LA it is possible too, but only at night or early in the morning).

When I heard that the Durham mySQL meetup had Jay Pipes over, I had to go. Thus, my Carolinas road trip, from Columbia, SC to Durham, NC. That was about the length of three CDs. Nice.

And, as always happens when you connect with like-minded individuals in the industry, I picked up a few new techniques. It's the same idea Sheeri talked about: you see another side of things when you're in the presence of other mySQL DBAs.

As a result, this has become my list of "To-Do's" at our shop here. As each one gets done, I'll blog about it and turn each of these items into a link. So, without further ado, here goes:

  • Vertically partition infrequently updated fields into one table and frequently updated fields into another. We have a couple of tables that could use this, but the programming ramifications could be more significant than anticipated;
  • Remove redundant/poor indexes;
  • Think in terms of sets instead of loops - get comfortable using derived tables;
  • Take out non-deterministic values in our reporting queries;
  • Turn MyISAM DELETEs and UPDATEs into batches (see the sketch just after this list);
  • Convert "email LIKE '%website.com'" queries into "rv_email LIKE CONCAT(REVERSE('website.com'),'%')" queries. Already did that today - worked like a charm! I'll blog about it tomorrow.
That's all for today!

Til next time,
David

06 December 2007

Creating Copies of Tables: MS SQL vs. MySql

I was reminded the other day how annoying SQL Server can be when it comes to table structure changes. All I wanted to do was copy a table (with its structure intact - keys, dependencies) along with its data.

How hard could that be?

MS SQL in Action

First, I needed to script out the complete DDL for TableOne and replace every instance of the word "TableOne" with "TableTwo". It works, but it is bulky and takes time, especially when the DDL runs 68 lines long.

The second step was not too bad.
SELECT * INTO TableTwo FROM TableOne
GO
Ok - that did not seem too hard, until you see how this can be accomplished in MySql.


MySql's Turn
CREATE TABLE table_2 LIKE table_1;
INSERT INTO table_2 SELECT * FROM table_1;
Which looks easier to you?
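One hedge worth adding: MySql also has a tempting one-liner alternative, shown below, but it copies only the columns and the rows. Indexes and keys do not come along for the ride, which is exactly why the two-statement CREATE TABLE ... LIKE form above is the one to reach for when you want the structure intact.
CREATE TABLE table_2 AS SELECT * FROM table_1;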

Here's another wrinkle: what about table-level permissions? Which side is easier to manage?

Until next time,
David

30 November 2007

A Zip Code Solution - ...WHERE LEFT(zip_code,3) vs. This Article

The Price is MySql with Jay Pipes

Quick! Pick a curtain! Do it now!

Curtain #1:
SELECT zip_code FROM table_name WHERE LEFT(zip_code,3) IN (123,456,789);

What's under curtain #2 you ask?

Well, I'm glad you asked! It is...
ALTER TABLE table_name ADD COLUMN zip3 INTEGER UNSIGNED ZEROFILL NOT NULL, ADD INDEX idx_zip3 (zip3);

What?! Another column and another index? My table already has 7,351 columns! Well, vertically partition it! OK, off topic...

In the meantime, let me show you how to save minutes, if not hours in your query execution time!

The Problem

I have a query with a three-table join that is making mysql scan 17 million records. Where is this scan coming from? It comes from this part of the query:

SELECT zip_code FROM table_name WHERE LEFT(zip_code,3) IN (123,456,789);

The Solution

In a nutshell, it is this:
ALTER TABLE table_name ADD COLUMN zip3 INTEGER UNSIGNED ZEROFILL NOT NULL, ADD INDEX idx_zip3 (zip3);
UPDATE table_name SET zip3 = LEFT(zip_code,3);

Just do it. It may take an hour and 37 minutes, as it did for me, but it might not!
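If you would rather not trust every INSERT to remember the new column, a trigger can keep zip3 in step automatically. This is only a sketch, and it assumes the server is new enough (5.0 or later) to support triggers:
CREATE TRIGGER trg_zip3_insert BEFORE INSERT ON table_name FOR EACH ROW SET NEW.zip3 = LEFT(NEW.zip_code, 3);
CREATE TRIGGER trg_zip3_update BEFORE UPDATE ON table_name FOR EACH ROW SET NEW.zip3 = LEFT(NEW.zip_code, 3);
Otherwise, just set zip3 alongside zip_code in the application's INSERTs and UPDATEs.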

The New Query

SELECT zip_code FROM table_name WHERE zip3 IN (123,456,789);

The Results

I have shaved time off a query that joins a 17-million-row scan, a 330,000-row scan, and a 17,500-record table - to the tune of roughly 51:1. Instead of 45-minute runs, we're looking at under a minute. I think I will go out for lunch 4 times today. Whoops, I didn't say that. Really!

Til next time,
David


20 November 2007

Table Joins vs. php Loops

Table joins are fun and useful, but did you know that there are circumstances in which a php loop will beat out a table join?

For those of you who already know this, feel free to return to www.PlanetMySql.com. But for the rest of us, read on...

The Setup (fictional, but mimics my production environment)

I have two tables, employees and termination_info. Employees has 150,000 records and termination_info has 65,000 records.

The Request

I have been asked to remove the employees whose email addresses match records in termination_info (matching on id would be my very next project on this one).

The Choices

1) DELETE a FROM employees a INNER JOIN termination_info b ON a.email = b.email;
2) A php loop that walks through the termination_info records and deletes each corresponding employee record.

Which One is Faster?
#2, by a wide margin.

Why?
'EXPLAIN SELECT a.email FROM employees a INNER JOIN termination_info b ON a.email = b.email' only gives part of the picture. Sure, EXPLAIN shows that the email index will be used, so mysql only needs to examine 65,000 rows. However, the join needs working space. A lot of it.

This is a great example of why the APIs exist for mysql -- the php API lets us interface with mysql and, in cases like this one, get the job done more efficiently. The "real" sizes of the tables I ran this against were 230,000 records and 3.9 million records. The php script took under 10 minutes; the pure-mysql version had already been running for well over an hour when I started writing this, and it is still going.

Until next time,
David

16 November 2007

My Favorite Three Stooges Stats

Quick... You have to let everyone (boss, biz dev, customer service, and random bean counters) know why everything is moving slowly! Of course, rarely do people define what "everything" is, or what type of slowness is occurring. But when customer service agents cannot work because their pages will not render, all eyes are generally on the famed-dba-of-the-minute.

So, with 7 people standing behind you, 3 IMs coming through (for those who bother with that), 4 more messages on the phone, and the knowledge that at least a dozen people are saying "the dba is working on it now", what do you do?

First, we all like SHOW PROCESSLIST. Nice. It gives us a quick bird's-eye view of possible locks, etc.

But... what if there are no locks, and just a lot of interesting-looking selects? Before you beat up on the programmers, you have to carry the boat over the sticks first. Then you can throw darts at the dev team.

How to do this? Well, first, how's the CPU / IO situation?

Will 'top' really work? It's good for a general view of the state of the machine, so fine. Do that. Keep the window open to reference if you wish; I get the general trend within 5 seconds and then move on. But linux gives you FAR more than just that (here's a great reference that served as an inspiration to me) -- don't stop there!

But now comes the fun part:

I use mpstat for immediate feedback on per-CPU load and, in particular, how much time is being spent waiting on I/O (%iowait). One of my favorite uses is the case below, where I look at what every processor is doing (-P ALL), sampling every 2 seconds for 10 reports...

mpstat -P ALL 2 10


05:54:05 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
05:54:10 PM all 1.25 0.00 1.00 78.29 0.05 0.10 0.00 19.31 652.51
05:54:10 PM 0 4.61 0.00 3.21 91.58 0.00 0.60 0.00 0.00 629.86
05:54:10 PM 1 0.40 0.00 0.20 21.84 0.00 0.00 0.00 77.35 22.65
05:54:10 PM 2 0.20 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00
05:54:10 PM 3 0.00 0.00 0.20 100.00 0.00 0.00 0.00 0.00 0.00

05:54:10 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
05:54:15 PM all 1.75 0.00 1.30 79.39 0.05 0.20 0.00 17.31 982.00
05:54:15 PM 0 6.20 0.00 4.20 88.60 0.20 0.80 0.00 0.00 945.60
05:54:15 PM 1 0.80 0.00 0.80 29.20 0.00 0.00 0.00 69.20 36.40
05:54:15 PM 2 0.20 0.00 0.00 99.80 0.00 0.00 0.00 0.00 0.00
05:54:15 PM 3 0.00 0.00 0.20 99.80 0.00 0.00 0.00 0.00 0.00

Average: CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
Average: all 1.55 0.00 1.09 78.98 0.06 0.15 0.00 18.17 859.50
Average: 0 5.47 0.00 3.54 90.21 0.20 0.58 0.00 0.00 822.21
Average: 1 0.66 0.00 0.50 26.06 0.02 0.02 0.00 72.76 37.29
Average: 2 0.08 0.00 0.02 100.04 0.00 0.00 0.00 0.00 0.00
Average: 3 0.02 0.00 0.30 99.86 0.00 0.00 0.00 0.00 0.00


OK, this box looks busy, and there is a lot of I/O going on. This is where I check the processlist again, along with the slow query log (I have a handy little script that parses it into a database on a separate box -- and since I rotate my logs daily, the LOAD DATA INFILE stays small). At this point, you can see which queries are causing the box the most grief and review them with the people who are willing to listen... :)
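(If the slow query log is not already turned on, it is a couple of lines in my.cnf on a 5.0-era server - shown here only as a sketch, with a made-up path and threshold - followed by a restart:
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time = 2
)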

In the meantime, I still have to tell you about iostat, dstat and sar!

One of my favorite uses of the iostat command is the one below (by this time the server had calmed down quite a bit!). The flags:
  • k = kilobytes per second instead of blocks per second, which makes it easier to assess the situation at hand;
  • t = prints the time for each report (I also have a utility that parses this data into a table for analysis, and it is good to know the time!);
  • x = displays extended statistics, particularly %util, the percentage of CPU time during which I/O requests are issued to the device, where saturation occurs at 100%.
iostat -ktx 10 5


Time: 04:29:50 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.51 0.07 0.24 3.26 0.00 95.92

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 1.35 19.65 16.31 24.25 474.10 177.46 32.13 0.50 12.29 3.06 12.41

Time: 04:30:00 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.05 0.00 0.07 1.85 0.00 98.03

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 7.91 2.90 22.52 12.01 123.32 10.65 0.14 5.33 2.96 7.54

Time: 04:30:10 PM
avg-cpu: %user %nice %system %iowait %steal %idle
11.85 0.00 8.88 2.03 0.00 77.24

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 17.88 1.30 18.98 5.19 148.25 15.13 0.13 6.39 4.33 8.79

Time: 04:30:20 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.10 0.00 0.12 4.80 0.00 94.98

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 10.10 4.70 25.00 18.80 142.80 10.88 0.28 9.39 6.49 19.29

Time: 04:30:30 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.07 0.00 0.05 1.27 0.00 98.60

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 12.40 3.00 33.60 12.00 188.00 10.93 0.16 4.32 1.45 5.30


Next is dstat -- another way to look at CPU and I/O:

dstat


----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
6 2 87 5 0 0| 44k 55k| 0 0 | 0 0 | 564 248
4 0 93 2 0 0| 44k 0 | 0 0 | 0 0 | 345 266
0 0 79 21 0 0| 60k 372k| 21k 5684B| 0 0 | 494 386
2 1 90 6 0 0| 48k 0 | 0 0 | 0 0 | 435 579
0 0 75 24 0 0| 92k 1496k| 50k 13k| 0 0 | 656 346
5 2 88 5 0 0| 112k 336k| 0 0 | 0 0 | 429 467
1 2 90 6 0 0| 80k 0 | 48k 11k| 0 0 | 426 514
1 1 92 5 0 0| 48k 344k| 0 0 | 0 0 | 501 576
0 0 81 18 0 0| 72k 0 | 44k 10k| 0 0 | 364 293
0 0 89 11 0 0| 24k 2064k| 0 0 | 0 0 | 704 273
8 1 77 13 0 0| 48k 240k| 32k 6214B| 0 0 | 350 146
3 1 89 7 0 0| 36k 0 | 0 0 | 0 0 | 400 401
7 1 76 16 0 0| 84k 72k| 35k 8034B| 0 0 | 381 243
5 1 92 3 0 0| 68k 0 | 0 0 | 0 0 | 357 310
3 0 87 9 0 0| 0 1552k| 25k 5858B| 0 0 | 580 106



And lastly:

sar 5 10


05:16:09 PM CPU %user %nice %system %iowait %steal %idle
05:16:14 PM all 0.30 0.00 0.05 0.95 0.00 98.70
05:16:19 PM all 0.30 0.00 0.15 3.65 0.00 95.90
05:16:24 PM all 0.40 0.00 0.15 6.49 0.00 92.96
05:16:29 PM all 0.50 0.00 0.05 4.25 0.00 95.20
05:16:34 PM all 0.30 0.00 0.10 3.60 0.00 96.00
05:16:39 PM all 0.35 0.00 0.15 0.90 0.00 98.60
05:16:44 PM all 0.25 0.00 0.05 3.75 0.00 95.95
05:16:49 PM all 0.45 0.00 0.10 0.65 0.00 98.80
05:16:54 PM all 0.40 0.00 0.10 1.00 0.00 98.50
05:16:59 PM all 0.05 0.00 0.10 0.55 0.00 99.30
Average: all 0.33 0.00 0.10 2.58 0.00 96.99


Enjoy using these! If needed, write them down in a handy little spot, so that when you're in the spotlight you can show exactly how your servers are being utilized (hammered). That gets you to the next step with developers, biz dev, customer support, and your supervisor: deciding which queries to improve.

Which leads to a future topic: using the slow query log. Fun!!

Until next time,
David


02 November 2007

How to Select Unique Records from a table that does not have a Unique Index

DBAs often work with data sets that are not the most optimal (in their opinion), and I recently had the pleasure of working with a table that did not have a unique identifier. I needed to get unique results from this table, and I would need to run the same query in different forms in the future without re-selecting records from past runs.

Caveat: I could not alter the table structure yet; I needed to make do with the table the way it was (there were a LOAD DATA INFILE script or two without qualified field names). Otherwise, the first thing I would have done is simply add an auto_increment column at the front of the table.

I also could not add a unique index; there was no time for that -- it would have easily taken over an hour.


The Challenge

Obtain exactly 10,000 records with unique identifiers, for each of nine different queries.


The Half-Solution

Create table_2 with a unique constraint based on the id.

INSERT IGNORE INTO table_2 (id, a,b,c) SELECT id, a,b,c FROM table_1 WHERE column_condition = 'match this' LIMIT 10000;


The Problem

The affected-row count comes back at less than 10,000 (the duplicates are silently ignored), so the result falls short of the 10,000 goal.

I needed to do nine of these, and there was no way I was going to do this by hand in sql.


The Solution


Use a php loop and mysql_affected_rows() to rinse and repeat until the desired count is reached.

$finished = false;
$limit = 10000;

while (!$finished) {
    // Grab up to $limit matching rows; INSERT IGNORE silently drops any
    // rows whose id already exists in table_2.
    $sql = "INSERT IGNORE INTO table_2 (id, a, b, c) SELECT id, a, b, c FROM table_1 WHERE column_condition = 'match this' LIMIT $limit";
    mysql_query($sql, $link) or die(mysql_error());
    $affected_rows = mysql_affected_rows();

    if ($limit == $affected_rows) {
        // Everything we asked for actually went in -- target count reached.
        $finished = true;
    } else {
        // Some rows were ignored as duplicates; go back for just the shortfall.
        $limit = $limit - $affected_rows; // rinse and repeat!
    }
}


This saved me a lot of grief, and it will tide me over until I get the window of opportunity to put a unique index on the table.
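When that window does open up, the first fix is the one mentioned above; a sketch, with a made-up surrogate-key column name:
ALTER TABLE table_1 ADD COLUMN row_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;
A true unique index on id would also need the existing duplicate ids cleaned out first, so that one waits for an even bigger maintenance window.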

Until next time,
David


30 October 2007

mySql Utility to Set Up a Slave Server, Unattended

This utility written by Kevin Burton is worth checking out - even though it is a first version and needs to be modified for your own purposes (in my case I had to alter the local mysql connection string). In any case, this is something we DBAs are going to need more and more as we build and manage more slave servers.

This is what the script does, in Kevin's words:

This is a script to perform unattended cloning of MySQL slave servers (or masters) to put a new slave online with minimal interaction.

It connects to the source node, performs a mysqldump or mysqlhotcopy, transfers the data, restores the data, and then sets up all replication parameters, on the target machine.

It then starts the slave, which then begins to catch up to the master.
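For a sense of what the script automates, the by-hand version of the final step looks roughly like this once the data is on the new slave (a sketch only - the host, credentials, and binary log coordinates are placeholders):
CHANGE MASTER TO MASTER_HOST='master.example.com', MASTER_USER='repl', MASTER_PASSWORD='repl_password', MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=98;
START SLAVE;
SHOW SLAVE STATUS\G
If Slave_IO_Running and Slave_SQL_Running both say Yes, the new slave is on its way to catching up.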

For DBAs that spend hours setting up slave servers, save yourself some time and check this script out!

Til next time,
David

25 October 2007

Google + mysql = distributed fun!

Google seems to be everybody's favorite girlfriend these days. We might even replace the antiquated phrase "Midas touch" with "googlitis", a condition you catch when google looks in your direction. Your valuation goes up even if google does not end up being the successful bidder (MSN + facebook = a $15 billion valuation), and everybody wants to be your best friend, even if you are politely collecting data from them.

So the daytime soap continues on, this time in favor of mysql -- with articles like:
MySQL to get injection of Google code and
MySQL getting Google's blessing...and code
where google documents their progress here.

Add to this an increasing acceptance (not just awareness) of open-source databases, and that makes for another exciting day...

Til next time,
David

19 October 2007

IP Addresses: No more char(15)'s allowed!

OK... This is my second entry on this blog. And it is about something I want every mysql/php newbie to know at the starting gate:

DON'T USE CHAR(15) FOR IP Address fields in mysql!

Well... what about the periods in ip addresses, you say?

Well... mysql has a FANTASTIC way of solving this issue! Convert the standard ip address into an unsigned integer!

ok... at first it sounds a little cryptic, but say it at least 3 times before your morning coffee:

INET_ATON
INET_NTOA


loosely meaning....

INET_ATON = Internet address to number
INET_NTOA = Internet number to address

SO... try this:

mysql> SELECT INET_ATON('192.168.1.5');
+--------------------------+
| INET_ATON('192.168.1.5') |
+--------------------------+
| 3232235781 |
+--------------------------+
1 row in set (0.04 sec)

Now, let's reverse it!


mysql> SELECT INET_NTOA(3232235781);
+-----------------------+
| INET_NTOA(3232235781) |
+-----------------------+
| 192.168.1.5 |
+-----------------------+
1 row in set (0.01 sec)

Fun, eh?

Now... what are the savings on this, you ask?

On a 17-million-row table consisting of just an auto_increment id (another must; don't leave home without one!) and the ip column, non-indexed exact searches are about the same. In fact, searches on the char column sometimes perform up to 10% faster.

After indexing, exact searches are once again about the same, although they take a fraction of the time they did when the columns were not indexed. Where the crown jewel really shines, however, is on ranges. A count over an ip range cost 3.78 seconds on the int-based table, versus 26.21 seconds on the char(15)-based table.
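To make the range point concrete, here is the sort of count I mean (a sketch; the table and column names are invented, and the column is declared INT UNSIGNED so the full IPv4 range fits):
SELECT COUNT(*) FROM visitor_log WHERE ip_num BETWEEN INET_ATON('192.168.0.0') AND INET_ATON('192.168.255.255');
With an index on ip_num this is a simple range read; against a CHAR(15) column the same question is much more awkward, since string order does not line up with numeric address order, so you end up with LIKE patterns or full scans.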

Also -- the size of the int based data table (MyISAM) is 148M compared to 329M on the char(15) based table.

Are you convinced now? :) If so, now you know what you're going to do!

Until next time (whenever that will be),
David

05 September 2007

The Underestimated Power of HAVING

This is the query that got me thinking I should blog about mySQL hints, tips, and ramblings. So... thinking turned into action and now I have a blog! The next step is to keep it going so that it becomes useful to mySQL database administrators who care about the performance of mySQL and also need to make sure their queries are written well.

Now... onward and forward...

The Challenge

Return, for each user_id, the sum of a column, but only where that sum is greater than 10.

The Setup

Table: sample_table
Columns: user_id BIGINT NOT NULL, date_added DATETIME NOT NULL, int_aggregate_column INT NOT NULL (user_id repeats across rows, one row per event, which is what makes the aggregation interesting)

The Expected Finish Line

Result Format: user_id, SUM(int_aggregate_column)
Criteria: where SUM(int_aggregate_column) is greater than 10

The Approach

The HAVING clause is powerful and underutilized. I cannot count the number of times I have seen php loops used where a HAVING clause would have taken care of the issue. Mind you, there are times when a php loop is better, but that is for another post.

We knew we would use the HAVING statement, but how?

The Answer

SELECT DISTINCT user_id, SUM(int_aggregate_column) AS sum FROM sample_table WHERE date_added >='2007-06-01' AND date_added < '2007-07-01' GROUP BY user_id HAVING sum > 10;

Fun, eh? (That's my Canadian accent coming through.)

As a test, you can limit the records returned to ensure that you are indeed getting the results you want.
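For instance, a quick sanity check along these lines (a sketch; I have switched the alias to total purely for illustration) shows the top qualifying users so you can eyeball the sums before turning the full result loose:
SELECT user_id, SUM(int_aggregate_column) AS total FROM sample_table WHERE date_added >= '2007-06-01' AND date_added < '2007-07-01' GROUP BY user_id HAVING total > 10 ORDER BY total DESC LIMIT 20;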