This article is for database server administrators or those who plan on being one soon, who want to have an efficient setup for serving database requests. This article primarily focuses on disk arrangement, partitioning, mounting, and making HugePages available to MySQL - in essence, setting up the server to run MySQL efficiently.
my.cnf configuration will be discussed in the next article, which will be more valuable to a wider class of users. VPS users should mostly focus on the tmpfs portion of this article, below, as they can safely make use of that. Some VPSes can also use HugePages, so if you can take advantage of them, do so.
When choosing a server to purchase, you are concerned about one thing, and one thing only: Disk bandwidth. Or more appropriately, non-volatile storage bandwidth.
CPU power is plentiful. Storage space is plentiful. The server you are reading this on can hold up to 96 gigabytes of RAM if need be, even though I could barely use 24 at this point. It can have up to eight cores if need be, even though I barely ever make use of two.
My server's speed is essentially bound by how fast I can write data to disk.
In a nutshell, your database server's CPU is the least important part of the equation, despite the top-billing that server providers give to CPU speed and core count. You want enough RAM to make sure you do very few disk reads - at least if you are using platter drives. But you only need a few cores to actually run your site. When you hit a bottleneck, it is invariably going to be disk access.
When ordering a server then, get a quote for how much RAM the server can physically hold, but if you are just needing to move to a dedicated server, you obviously won't need to max it out right away. Far more important is how many hard drives you can stick in the chassis, and how fast they are. Setting these up properly will do a lot more for you in the long run.
If you keep disk reads to a minimum, you will generally be bound by your disks' write I/O. For MySQL, this comes from the following sources:
Your mileage may vary. For me, this means that my best throughput is when I divide my I/O load between at least three drives:
With binlog sync set to one, I would dedicate a single drive to that, pushing things onto four drives.
The type of drive is of course important.
Again, your mileage will vary depending on your needs.
Swap
I have a four gigabyte swap. About a tenth of this gets actually, meaningfully used, and it seems to scale roughly linearly with how much actual RAM I'm using. This would suggest that making a swap file or swap partition about 5% of your system's total, maximum amount of RAM would be ideal. It's several hundred megabytes of free RAM, at least.
Note that, if you have a large amount of RAM, and largely use InnoDB, the default swappiness value can lead to the system using swap far too aggressively. Setting swappiness to a low value can prevent your system from e.g. saving a couple gigs of free ram on your database server for no logical reason. I set
vm.swappiness = 10
in my /etc/sysctl.conf
Note that this is only really safe if you plan your memory usage - you use InnoDB (or another database that largely works in-cache) and plan accordingly. MyISAM, and Apache with php can end up with poorly constrained resource usage, and this will work less well.
ext4 versus ext3
My own benchmarking suggests that ext4 is roughly 30% faster than ext3, primarily due to extents. There is no reason not to format partitions as ext4, outside of your /boot partition since some versions of GRUB don't like it.
noatime
Another mandatory optimization for any web, file or data server. Performing a disk write on each read will absolutely murder your performance.
My /etc/fstab thus looks roughly like so:
# /etc/fstab: static file system information.
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
/dev/cciss/c0d0p3 / ext3 noatime,errors=remount-ro 0 1
/dev/cciss/c0d0p4 /boot ext3 defaults 0 2
/dev/cciss/c0d1p1 /data ext4 noatime,nodev,nosuid,noexec 0 2
/dev/cciss/c0d1p2 /home ext4 noatime,nodev,nosuid 0 2
/dev/cciss/c0d0p2 /var ext4 noatime,nodev,nosuid 0 2
/dev/cciss/c0d0p1 none swap sw 0 0
tmpfs /tmp tmpfs rw,noatime,nodev,nosuid,mode=1777,size=2g 0 0
tmpfs /var/tmp tmpfs rw,noatime,nodev,nosuid,mode=1777,size=2g 0 0
During installation Squeeze replaces the filesystem identifiers with UUIDs, but these are more human readable, so I stripped them out before copying here.
tmpfs
tmpfs, or temporary filesystem, is a filesystem that exists entirely in RAM, and can be mounted as you see above. Technically, files in/var/tmp are expected to be able to persist between reboots... but we are talking about a database server here. You generally won't be making use of software that relies on that, to say nothing of how rarely you will be rebooting. The primary benefit of mounting /tmp as tmpfs is when MySQL decides to make on-disk temporary tables, they are not actually 'on disk'.
You can also mount /var/run and /var/lock as tmpfs, but at least in Debian, this is best done through /etc/default/rcS:
# /etc/default/rcS
TMPTIME=0
SULOGIN=no
DELAYLOGIN=no
UTC=yes
VERBOSE=no
FSCKFIX=yes
RAMRUN=yes
RAMLOCK=yes
You set the effective size of tmpfs partitions in /etc/default/tmpfs:
# /etc/default/tmpfs
TMPFS_SIZE=2G
SHM_SIZE=2G
RUN_SIZE=512M
LOCK_SIZE=512M
SHM_SIZE controls /dev/shm, and so on.
The size of a memory page in Linux on x86 and AMD64 machines is 4k. This is all well and good when your applications are playing with a few megabytes of RAM, but when you allocate say, eight gigabytes to InnoDB, the kernel spends a rather disproportionate amount of time keeping track of all of these pages.
The solution, then, is HugePages. Instead of 4k on x86/AMD64, they are two megabytes in size, and the kernel handles these much more efficiently - I get one third the number of slow queries as I do with hugepages disabled.
You can track your usage of HugePages by running
cat /proc/meminfo
MemTotal: 12327200 kB
MemFree: 974616 kB
Buffers: 37868 kB
Cached: 844980 kB
SwapCached: 279252 kB
Active: 1813168 kB
Inactive: 615524 kB
Active(anon): 1589240 kB
Inactive(anon): 381732 kB
Active(file): 223928 kB
Inactive(file): 233792 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 4200956 kB
SwapFree: 3481704 kB
Dirty: 1856 kB
Writeback: 160 kB
AnonPages: 1339004 kB
Mapped: 190860 kB
Shmem: 408356 kB
Slab: 123532 kB
SReclaimable: 85592 kB
SUnreclaim: 37940 kB
KernelStack: 4848 kB
PageTables: 74116 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 6039180 kB
Committed_AS: 6577872 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 310964 kB
VmallocChunk: 34359403224 kB
HardwareCorrupted: 0 kB
HugePages_Total: 4224
HugePages_Free: 32
HugePages_Rsvd: 4
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 7552 kB
DirectMap2M: 12566528 kB
As you can see, I have twelve gigabytes of RAM at the moment, and have allocated 4224 HugePages, or 8448 megabytes of RAM as HugePages. This ends up being about 56 megabytes more than I need, but breathing room here can alleviate some headaches.
To enable HugePages, you first need to declare who can use it. This is the memlock parameter, often set in limits.conf (/etc/security/limits.conf in Debian) - here is my file:
# /etc/security/limits.conf
#<domain> <type> <item> <value>
#
* - core 0
* - maxlogins 3
* - memlock 2048
* - msgqueue 1048576
* - nofile 8192
* - nproc 256
* - sigpending 16384
* - stack 8192
mysql - maxlogins 0
mysql - memlock 16777216
root - maxlogins -
root - memlock 16777216
root - nproc -
admin - maxlogins 5
The value listed is in kilobytes, and you need to set it for -both- root and mysql, as it can sometimes decide to use root's limits instead of mysql's. As you can see I have set it to sixteen gigabytes, which is more RAM than I have - probably the best thing to do is to set it to something arbitrarily large. I don't set it unlimited on the off change that I might want something else to use it in the future - but you might argue I should burn that bridge if it ever shows up.
Next, you'll need to set up the allocations in /etc/sysctl.conf. While you can, in theory, enable HugePages on a live server, in practice, if your site gets any real amount of traffic, you will just murder it. Rebooting is faster and less painful on your users.
# Allow high shared memory values for hugepages
kernel.shmmax=17179869184
kernel.shmall=4194304
# Number of hugepages to assign.
vm.nr_hugepages = 4224
kernel.shmall is the number of pages that can be allocated to all shared memory requests - a page is typically 4k, on x86 and AMD64 systems. kernel.shmmax is the number of bytes that a single user can request. vm.nr_hugepages is the number of hugepages to create within the space defined by kernel.shmall, and thus must be less than kernel.shmall/512. And if you want one user to access this entire block, you'll need to set kernel.shmmax to be appropriately large, as well.
All that remains is enabling Large Pages in MySQL itself. This goes in the mysqld section of my.cnf, a la:
[mysqld]
large-pages
MySQL will then try to store most of MyISAM's key_buffer and all of InnoDB's innodb_buffer_pool_size in the HugePages table.
Again, if you are running a large database in active use, do not try to switch to large-pages live. Linux will scramble trying to arrange continuous blocks of 2 megabytes or more, and on an active enough server, finding even one such block can take it hours. Save yourself the headache and set your server's uptime back to zero again.