24 April 2012

The Conflicted Data Analyst

Inspired by a post from Juice Analytics.

We are a conflicted people. We love our TV and movie violence but worry that it ruins our children’s minds. We want to reduce healthcare costs, but don’t want to restrict the free market.

Conflicts like these leave little room for a satisfactory answer. Basic principles are in conflict and deeply rooted desires run up against painful consequences. We don’t want to choose, and the middle ground feels like failure.

The people who practice data analysis and visualization have their own set of conflicts. Does this sound familiar?

“I want a feature-rich analytics tool, but…
My powerful analytics tool is too complex to use.”

“I want a dashboard that summarizes my business or functional area, but…
I don’t want to feel limited in what information I get to see.”

“I want a predictive model, but…
I don’t trust black box models when a result doesn’t align with my understanding.”

“I only want to focus on actionable data, but…
hey, look at that data point…that’s kind of interesting.”

“I want this report immediately, but…
now that I’ve got it, I don’t really have time to look at it.”

“I want to access data without a technical resource, but…
I don’t want to learn SQL because that’s for technical people.”

“I want to run scenario analyses on my data to make better decisions, but…
I don’t believe the analysis can account for the complexity of my business.”

“I love geographic maps and network diagrams, but…
it is hard to see patterns or find insights from these types of visualizations.”

In a sense, each of these conflicts represents an opportunity to innovate and solve a persistent problem within our data analysis solutions. Find a way to satisfy the initial desire without suffering the assumed consequence, and you’ve got a chance to delight users of data.

24 January 2012

How to Find Out if an Entire String Is Numeric

Problem: Find out if an entire string is numeric (just like the is_numeric function in PHP).

Problem, Part II: Most online resources only show how to find out if part of your string is numeric.

Solution:

SELECT str FROM tbl
  WHERE str REGEXP '^[0-9]+$';

Give it a whirl!


For those who would like to view some test results, here goes:
mysql> SELECT str FROM tbl;
+-----+
| str |
+-----+
| 10  |
| 10A |
| 10E |
| 78  |
| E78 |
+-----+
5 rows in set (0.00 sec)
mysql> SELECT str FROM tbl WHERE str REGEXP '^[0-9]+$';
+-----+
| str |
+-----+
| 10  |
| 78  |
+-----+
2 rows in set (0.00 sec)
Now for the other way around:
mysql> SELECT str FROM tbl WHERE str NOT REGEXP '^[0-9]+$';
+-----+
| str |
+-----+
| 10A |
| 10E |
| E78 |
+-----+
3 rows in set (0.00 sec)
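
Note that this pattern accepts nonnegative integers only. If your definition of "numeric" should also allow a sign and a decimal part, a sketch along the same lines (not run against the table above) would be:

SELECT str FROM tbl
  WHERE str REGEXP '^-?[0-9]+(\\.[0-9]+)?$';

The doubled backslash matters: MySQL string literals consume one level of escaping before the pattern reaches the regex engine.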

03 June 2011

How to Parse the Most Selected Tables From MySQL General Query Log in One Line

Problem: Pin down redundant queries by finding out which tables are being selected from the most.

I like general query log parsing tools, such as mk-query-digest or mysqlsla.

However, there are times when I am on a server as a DBA with limited access and can only read files in /tmp. And other times I just need to get something done quickly.

So what did I do to find out the most selected tables from a database?

1. Make sure /tmp has enough space for what I'm about to do.
2. Turn on the general query log with the file at /tmp/mysql_general.log (the SQL for this is sketched after this list). This is usually /var/log/mysqllog/mysql_general or similar, but I need access to this log without waking up the sysadmin.
3. Turn off the general query log before /tmp fills up (for some people this might take just minutes).
4. Parse using this nifty set of commands:
grep -i "SELECT " /tmp/mysql_general.log | grep -io "SELECT .*" | sed 's|\(FROM [^ ]*\) .*|\1|' | sort | uniq -c | sort -nr | head -100

That's all!

As my mother would say, "Try it... You'll like it."

27 April 2011

My Ignite Talk at MySQL Conference

My Ignite Talk (a 5-minute talk: 20 slides, 15 seconds per slide) at the MySQL Conference, about data visualization, was picked as one of three for "the Best of Ignite". I would say I barely scraped into third spot, and it was inspiring to give a talk after Baron Schwartz gave his at the keynote on the last day of the conference.

The other two talks were "Scale Fail" by Josh Berkus (hilarious, must see) and "Causes of Downtime" by Baron Schwartz.  Enjoy!

02 February 2011

Need to Replicate to a Database with a Different Name?

This is the first time this has happened to me, so I got excited about it until I noticed that this option has been with us nearly from the very beginning (i.e., MySQL 4).

The Situation
Master Server contains db1
Slave Server contains db1 and db2
Slave Server needs db1 from the Master to replicate to db2

The Solution
In my.cnf on the slave server (the leading dashes from the command-line form are dropped in the config file):
replicate-rewrite-db=from_name->to_name

Thus, this situation would dictate the following:
replicate-rewrite-db=db1->db2
Easy!

The Wrinkle
Part 1
What if the Master Server contains db1 and db2, and
the Slave needs only db1 from the Master to replicate to db2 on the Slave?

Part 2
We don't want to replicate both db1 and db2 to this slave, although we do want to replicate both databases to another slave. This means that the filter for databases should reside in my.cnf on the slave...

Assuming that we are using replicate-rewrite-db=db1->db2:
Using replicate-do-db in the my.cnf file on the slave, which database would you use with that option?

db1 or db2?  It only takes a moment in time to find out... :)
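
For the impatient, here is a sketch of the slave's my.cnf for the Wrinkle. My reading of the MySQL manual is that the database name rewrite is applied before the do-db rule is tested, so the filter should name the rewritten database; do verify this on a test slave before trusting it:

replicate-rewrite-db=db1->db2
# per the manual, the do-db filter sees the rewritten name:
replicate-do-db=db2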

11 January 2011

How to Log User Connections in MySQL

There is nothing that MySQL 5.1 explicitly provides to log user connections, but there is a combination of MySQL features that can log user connections without having to turn on the general query log. I repeat: you do not have to turn the general query log on to make this work!

Would you like to find out what that is?

With the help of Baron Schwartz's post at http://www.xaprb.com/blog/2006/07/23/how-to-track-what-owns-a-mysql-connection/ I was able to work it out as well. However, I wanted to simplify things a bit, because all I need for now is to track MySQL connections within MySQL. I'll probably increase the scope in the future, but for now, let's keep it simple.

Ingredients
init-connect
NOW()
CURRENT_USER()
CONNECTION_ID()

See it coming together?

Alrighty, here's the recipe...

Assumption: The database we will use to store this information is called admin.

1. Create the table to store the connection information in. You don't have to include the auto-increment primary key if you don't want to.

CREATE TABLE admin.connections (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  connect_time DATETIME NOT NULL,
  user_host VARCHAR(50) NOT NULL,
  connection_id INT UNSIGNED NOT NULL
);

2. Set the init_connect variable. This is a string that the server executes for each client that connects (see the init_connect entry in the MySQL manual for details).

SET GLOBAL init_connect = "INSERT INTO admin.connections (connect_time, user_host, connection_id) VALUES (NOW(), CURRENT_USER(), CONNECTION_ID());";
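
Note that SET GLOBAL does not survive a server restart. To make the setting permanent, the same string can live in my.cnf (a sketch; mind the quoting):

init_connect='INSERT INTO admin.connections (connect_time, user_host, connection_id) VALUES (NOW(), CURRENT_USER(), CONNECTION_ID());'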

3. Check the permissions of all of your users to ensure that they can insert a record into the admin.connections table; if the INSERT fails, the client's connection fails with it.

4. Log in as a user without the SUPER privilege and a row should be inserted in the admin.connections table. Note that init_connect is not executed for users with the SUPER privilege. For better or for worse, all of us DBAs know what that means.

5. Watch your connection table grow and flourish. You've just started a new pet (project).
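
Once rows start arriving, here is a quick way to peek at the newest connections (using the table defined in step 1):

SELECT user_host, connect_time, connection_id
FROM admin.connections
ORDER BY connect_time DESC
LIMIT 10;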

19 February 2010

How to Save a Lot of Time with LOAD DATA INFILE Involving Large Files

Are you tired of waiting an hour or so for a LOAD DATA INFILE statement involving a file with millions of rows to finish?

If so, and you have enough money to pay for 3 monthly installments of $5.99... (Belgian chocolate will substitute nicely) -- you TOO can be a hero in your office and go home early for the weekend. :)

How?

Disable keys. Load data infile. Enable keys.

ALTER TABLE table_i_am_going_to_load_data_into DISABLE KEYS;
LOAD DATA INFILE '/folder/five_million_row_file.csv'
  INTO TABLE table_i_am_going_to_load_data_into
  FIELDS TERMINATED BY ',' ENCLOSED BY '"';
ALTER TABLE table_i_am_going_to_load_data_into ENABLE KEYS;
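
One caveat: DISABLE KEYS suspends only nonunique indexes, and only on MyISAM tables; on InnoDB it is a no-op. A rough InnoDB-flavored sketch of the same trick, using the usual bulk-load session flags (the table and file names are the same placeholders as above):

SET unique_checks = 0;
SET foreign_key_checks = 0;
LOAD DATA INFILE '/folder/five_million_row_file.csv'
  INTO TABLE table_i_am_going_to_load_data_into
  FIELDS TERMINATED BY ',' ENCLOSED BY '"';
SET unique_checks = 1;
SET foreign_key_checks = 1;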

What are the savings?

In our situation -- we have hundreds of millions of rows, 12 columns, various data types, a compound primary key (bleck), a unique index, and 2 standard indexes, and a dash of pepper -- the original LOAD DATA INFILE statement took between 50 and 75 minutes to run.

Now, this is how long it takes:
1) Disabling keys: 0.01 seconds
2) Loading data infile: 2.45 to 2.8 minutes
3) Enabling keys: 10 to 12 minutes
4) Going home early: priceless

Thanks to Ronald Bradford for the idea... I'm still trying to find the link to his post on this, but I have lost it...

Have fun! Feel free to let me know of any other nifty LOAD DATA INFILE savings...

Cheers,
David