28 January 2016

MySQL Swap Insanity: Solved

Yes, it has been a while since my last post.  I am still in the midst of the MySQL ecosystem, and enjoying it tremendously.  I thought I would write this post because it covers a topic with varied results in the online world, and because I would like to kick start this blog again.  We'll see how that goes... :)

MySQL has been able to harness the potential of more powerful (CPU) and larger (RAM, disk space) systems in recent years like never before. With this comes the possibility that something can go awry in a big way, and that is exactly what happened to us when we upgraded from MySQL 5.1 to MySQL 5.6.

Our original 5.1 system ran on CentOS, and aside from two funky issues (one being the MySQL process getting shot in the head by cfengine), it was stable and performant. With continuous data initiatives and pressures (more data, better performance, higher availability), we upgraded to MySQL 5.6 on Ubuntu 14.04, managed by puppet.

The first day or two were smooth; the data migration was a big deal, but other than that, the cutover was almost a non-event.

Until day 3. Swap started to increase.

Not only that, swap increased in a big way: from 1% to over 80% in a matter of days. Yes, it started to plateau, in some cases at the ~70% mark and in others at the ~99% mark. And yes, we had a couple of MySQL instances restart due to the dreaded signal 11 message (i.e. a crash). At that point, we ended up restarting several MySQL instances daily so that I would not get woken up in the middle of the night with an issue, and so that I would not have to repair dozens or hundreds of tables.
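
For anyone following along at home, the trend is easy to watch from the shell. These are not necessarily the exact commands we used, just the usual suspects:

  # Current memory and swap usage, in MiB
  free -m

  # Swap-in (si) and swap-out (so) activity, sampled every 5 seconds
  vmstat 5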

Restarting MySQL in this manner was clearly not acceptable; it reminded me of my old SQL Server days, when we restarted those instances daily, a common Windows "fix" at that point in history.

So what could be the culprit in dealing with this swap issue in MySQL? 

There were a number of possibilities that we considered.  The first four were MySQL related; the remaining three were system related.  (A rough sketch of how each can be checked or applied follows the list.)

  1. key_buffer_size - reduce size 
  2. max_connections - reduce value
  3. performance_schema - turn it off 
  4. table_open_cache - reduce value
  5. swappiness - reduce value
  6. numactl --interleave=all 
  7. sysctl -q -w vm.drop_caches=3 
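
For reference, here is roughly what each knob looks like in practice. The values are placeholders for illustration, not our production settings, and paths may differ on your distribution:

  # 1-4: MySQL settings, adjusted under [mysqld] in my.cnf and applied on restart
  #      (placeholder values):
  #        key_buffer_size    = 128M
  #        max_connections    = 500
  #        performance_schema = OFF
  #        table_open_cache   = 1000

  # 5: lower the kernel's preference for swapping (takes effect immediately)
  sudo sysctl -w vm.swappiness=10

  # 6: interleave mysqld's memory across NUMA nodes; in practice this is wired
  #    into mysqld_safe or the init script rather than run by hand
  numactl --interleave=all /usr/sbin/mysqld

  # 7: drop the page cache, dentries and inodes
  sudo sysctl -q -w vm.drop_caches=3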


We (including my trusty DBA associate, Jonathan Park) tried a few things on this list, including #1, #2, #3, #6, and #7. 

We even tested on a box with the InnoDB engine turned off.

We still noticed that swap increased under our load test.  

In fact, we were very good at driving swap up quickly: starting at 1%, it would climb to about 80% in roughly 15 minutes, and I was merely running 8 client threads to make that happen.
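
Our actual load test ran against our own schema and queries, but something along these lines with mysqlslap (which ships with MySQL) reproduces the shape of it, i.e. a handful of concurrent threads hammering the server:

  # Illustration only -- 8 concurrent client threads running generated queries
  mysqlslap --user=root --password \
            --auto-generate-sql --auto-generate-sql-load-type=mixed \
            --concurrency=8 --number-of-queries=100000 --iterations=10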

What Worked?

Swappiness.

We noticed that swappiness was set to 60 by default (we are on Ubuntu 14.04), so we reduced it to 10.  We ran our tests and swap did still increase a little, but not nearly as badly as before.  So we reduced it to 1 (not zero), and swap usage held steady.
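
For anyone who wants to try it, the change itself is small. On a puppet-managed box you would push it through puppet, but by hand it looks roughly like this on Ubuntu 14.04 (file locations may differ elsewhere):

  # Apply immediately
  sudo sysctl -w vm.swappiness=1

  # Verify
  cat /proc/sys/vm/swappiness

  # Persist across reboots, then reload
  echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf
  sudo sysctl -p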

I am interested to find out what else the MySQL community has done, beyond the prominent post by Jeremy Cole, MySQL Swap Insanity and the NUMA Architecture, and I would like to hear what may have worked for you!

And if you try setting swappiness to 1, does it work for you?