16 November 2007

My Favorite Three Stooges Stats

Quick... You have to let everyone (boss, biz dev, customer service, and random bean counters) know why everything is moving slowly! Of course, rarely do people define what "everything" is, and what type of slowness is occuring. But, in the face of customer service agents that cannot work because their pages will not render, generally all eyes are on the famed-dba-of-the-minute.

So, with 7 people standing behind you, 3 IMs coming through (for those that bother with that), and 4 more messages on the phone, and the knowledge that at least a dozen people are talking about "the dba is working on it now", what do you do?

First, we all like SHOW PROCESSLIST. Nice. Gives us a quick bird's view on possible locks, etc.

But... what if there are no locks, and just a lot of interesting-looking selects? Before you beat up on the programmers, you have to carry the boat over the sticks first. Then you can throw darts at the dev team.

How to do this? Well, first, how's the CPU / IO situation?

Will 'top' really work? It's good for a general view of the state of the machine, so fine. Do that. Keep the window open to reference if you wish; I get the general trend within 5 seconds and then move on. But linux gives you FAR more than just that (here's a great reference that served as an inspiration to me) -- don't stop there!

But now comes the fun part:

I use mpstat to give me immediate feedback on disk I/O. One of my favorite uses is in the case below, where I am going to see what each (ALL) processor (-P) is doing every 10 seconds for 2 rounds...

mpstat -P ALL 2 10


05:54:05 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
05:54:10 PM all 1.25 0.00 1.00 78.29 0.05 0.10 0.00 19.31 652.51
05:54:10 PM 0 4.61 0.00 3.21 91.58 0.00 0.60 0.00 0.00 629.86
05:54:10 PM 1 0.40 0.00 0.20 21.84 0.00 0.00 0.00 77.35 22.65
05:54:10 PM 2 0.20 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00
05:54:10 PM 3 0.00 0.00 0.20 100.00 0.00 0.00 0.00 0.00 0.00

05:54:10 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
05:54:15 PM all 1.75 0.00 1.30 79.39 0.05 0.20 0.00 17.31 982.00
05:54:15 PM 0 6.20 0.00 4.20 88.60 0.20 0.80 0.00 0.00 945.60
05:54:15 PM 1 0.80 0.00 0.80 29.20 0.00 0.00 0.00 69.20 36.40
05:54:15 PM 2 0.20 0.00 0.00 99.80 0.00 0.00 0.00 0.00 0.00
05:54:15 PM 3 0.00 0.00 0.20 99.80 0.00 0.00 0.00 0.00 0.00

Average: CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
Average: all 1.55 0.00 1.09 78.98 0.06 0.15 0.00 18.17 859.50
Average: 0 5.47 0.00 3.54 90.21 0.20 0.58 0.00 0.00 822.21
Average: 1 0.66 0.00 0.50 26.06 0.02 0.02 0.00 72.76 37.29
Average: 2 0.08 0.00 0.02 100.04 0.00 0.00 0.00 0.00 0.00
Average: 3 0.02 0.00 0.30 99.86 0.00 0.00 0.00 0.00 0.00


OK, this box looks busy, and there is a lot of I/O going on. This is where I check the processlist again, and the slow query log (which I have a handy little script that parses this into a database on a seperate box -- and since I clean my logs daily, the size of the LOAD INFILE remains small). At this point, you can see which queries are causing the box the most amount of grief and review this with people who are willing to listen... :)

In the meantime, I still have to tell you about iostat, dstat and sar!

One of my favorite uses of the iostat command is as below (by this time, the server calmed down quite a bit!). k = kilobytes per second (instead of blocks per second -- makes it easy to assess the situation at hand)
t = prints time for each report (as I also have a utility that parses this data into a table for analysis, and it is good to know the time!)
x = Displays extended statistics, particularly %util, which is the percentage of CPU time during which I/O requests are issued to the device, where saturation occurs at 100%.
iostat -ktx 10 5


Time: 04:29:50 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.51 0.07 0.24 3.26 0.00 95.92

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 1.35 19.65 16.31 24.25 474.10 177.46 32.13 0.50 12.29 3.06 12.41

Time: 04:30:00 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.05 0.00 0.07 1.85 0.00 98.03

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 7.91 2.90 22.52 12.01 123.32 10.65 0.14 5.33 2.96 7.54

Time: 04:30:10 PM
avg-cpu: %user %nice %system %iowait %steal %idle
11.85 0.00 8.88 2.03 0.00 77.24

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 17.88 1.30 18.98 5.19 148.25 15.13 0.13 6.39 4.33 8.79

Time: 04:30:20 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.10 0.00 0.12 4.80 0.00 94.98

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 10.10 4.70 25.00 18.80 142.80 10.88 0.28 9.39 6.49 19.29

Time: 04:30:30 PM
avg-cpu: %user %nice %system %iowait %steal %idle
0.07 0.00 0.05 1.27 0.00 98.60

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 12.40 3.00 33.60 12.00 188.00 10.93 0.16 4.32 1.45 5.30


Next is dstat -- another way to look at CPU and I/O:

dstat


----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
6 2 87 5 0 0| 44k 55k| 0 0 | 0 0 | 564 248
4 0 93 2 0 0| 44k 0 | 0 0 | 0 0 | 345 266
0 0 79 21 0 0| 60k 372k| 21k 5684B| 0 0 | 494 386
2 1 90 6 0 0| 48k 0 | 0 0 | 0 0 | 435 579
0 0 75 24 0 0| 92k 1496k| 50k 13k| 0 0 | 656 346
5 2 88 5 0 0| 112k 336k| 0 0 | 0 0 | 429 467
1 2 90 6 0 0| 80k 0 | 48k 11k| 0 0 | 426 514
1 1 92 5 0 0| 48k 344k| 0 0 | 0 0 | 501 576
0 0 81 18 0 0| 72k 0 | 44k 10k| 0 0 | 364 293
0 0 89 11 0 0| 24k 2064k| 0 0 | 0 0 | 704 273
8 1 77 13 0 0| 48k 240k| 32k 6214B| 0 0 | 350 146
3 1 89 7 0 0| 36k 0 | 0 0 | 0 0 | 400 401
7 1 76 16 0 0| 84k 72k| 35k 8034B| 0 0 | 381 243
5 1 92 3 0 0| 68k 0 | 0 0 | 0 0 | 357 310
3 0 87 9 0 0| 0 1552k| 25k 5858B| 0 0 | 580 106



And lastly:

sar 5 10


05:16:09 PM CPU %user %nice %system %iowait %steal %idle
05:16:14 PM all 0.30 0.00 0.05 0.95 0.00 98.70
05:16:19 PM all 0.30 0.00 0.15 3.65 0.00 95.90
05:16:24 PM all 0.40 0.00 0.15 6.49 0.00 92.96
05:16:29 PM all 0.50 0.00 0.05 4.25 0.00 95.20
05:16:34 PM all 0.30 0.00 0.10 3.60 0.00 96.00
05:16:39 PM all 0.35 0.00 0.15 0.90 0.00 98.60
05:16:44 PM all 0.25 0.00 0.05 3.75 0.00 95.95
05:16:49 PM all 0.45 0.00 0.10 0.65 0.00 98.80
05:16:54 PM all 0.40 0.00 0.10 1.00 0.00 98.50
05:16:59 PM all 0.05 0.00 0.10 0.55 0.00 99.30
Average: all 0.33 0.00 0.10 2.58 0.00 96.99


Enjoy using these! If needed, write them down on a handy little spot, so that when you're in the spotlight you can show how your servers are getting utilized (hammered). This will then get you to the next level with developers, biz dev, customer support and your supervisor (which would be how to improve a few queries).

Which goes to a future topic of using the slow query log. Fun!!

Until next time,
David

function foo() {
return "Dr Nic";
}

No comments: