I never thought I'd find myself building file servers. As a network and security guy, NFS did not appear destined to be in my bailiwick. But the conjunction of two outside forces caused me to be in that business unexpectedly throughout much of 2012, and it's looking like I'll be doing plenty more in the years to come.
The first event was a generous donation to our Lab by Quanta Computer, the Taiwan-based contract manufacturer, of a system they called a "KT" rack. "KT" stands for "Korea Telecom", and the rack consisted of a bunch of Intel Xeon-based servers with lots of memory and four SAS disk shelves. When we got it, nobody had any idea what to do with the disk shelves, so they just sat there for six months.
The second event was the acquisition of BlueArc, the vendor of our high-end NFS storage appliance, by Hitachi. The Hitachi people didn't really seem to understand the market they had acquired—BlueArc was popular among sites like ours who thought that competing products from EMC and NetApp were grossly overpriced, and Hitachi seemed to think they were acquiring a business with EMC-like margins. So we were looking at a bill for maintenance contracts and upgrades that was well into the six figures, and we naturally wanted to find something more cost-effective.
So of course I said, "Why don't we try building something from this Quanta hardware? They gave us some nice SSDs and lots of big rotating disks, and it can probably meet the requirements that caused us to buy the BlueArc in the first place." I did, and—after a somewhat shaky start—it did. We added even more memory to the file servers, plus redundant SAS controllers and higher-performance SSD, and it really screamed. Having proved that the architecture was workable, we went back to our friends at Quanta to flesh out a production-ready design for file servers, and I'm working on making the transition to the new systems as I write this.
Physically, the final configuration of our test/development system is as follows:
| Component | Details |
|---|---|
| Quanta QSSC-S99Q chassis | 2 x Intel Xeon E5640 (2.67 GHz); 96 GB DRAM; LSI SAS1064ET Fusion-MPT SAS (for internal boot drives); 2 x Intel 82599EB 10-gigabit network controllers; 2 x LSI SAS2116 Fusion-MPT SAS-2 (for external drives) |
| 4 x Quanta DNS1700 (QSSC-JB7) disk shelves | 23 ea. Seagate ST32000444SS 2-TB SAS-2 drives; 1 ea. STec ZeusRAM 8-GB SAS-2 SSD or OCZ Talos 2 240-GB SAS-2 SSD |
The LSI SAS2116 chipset is found in a number of OEM products; the retail version of the PCI-Express card that we have is called the "SAS 9201-16e". This is a 16-port card with four SFF-8088 connectors on it; each SFF-8088 carries four SAS ports, and is wired to a single disk shelf. Each shelf has eight 6:1 SAS port expanders, allowing for 24 dual-ported drives. With four shelves, we get a total of 88 active drives (allowing for one spare and one SSD on each shelf). Each shelf has a cable to both SAS controllers, and we do not daisy-chain the shelves so as to avoid introducing another point of failure.
These servers also have an internal SAS backplane. We use this only for boot drives, never for user data, to ensure that the complete storage pool can be moved from one machine to another simply by moving the SAS cables. Since we are not running active-active redundancy (e.g., HAST), our availability strategy requires that we be able to survive a failure of the file server itself merely by relocating the drive shelves to another server.
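As a sketch of what that relocation looks like in practice (the pool name `tank` here is a placeholder, not our real pool name), the sequence on the standby server amounts to a forced import:

```shell
# Illustrative commands only: run on the standby server after the SAS
# cables have been moved over.  "tank" is a placeholder pool name.
zpool import          # scan the newly attached drives for importable pools
zpool import -f tank  # take the pool over; -f is needed because the dead
                      # server never got the chance to export it cleanly
```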
On the software side, we have been running FreeBSD 9-STABLE with various fixes (primarily to `geom_multipath` and the `mps` driver) backported. The new production servers will run our private build of FreeBSD 9.1-RELEASE, and will be part of our Puppet system administration environment.
Some other tidbits:
The easiest way to initialize `geom_multipath` is to connect a single controller, then `gmultipath label` all the drives—make sure to give them meaningful names, or you'll be sorry!—and then hook up the second controller. Note that figuring out the mapping between the device in CAM (`daXX`) and the physical shelf and drive bay is challenging; the LSI SAS2IRCU tool and some scripting can help.
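As an illustration of the sort of scripting involved (the `sas2ircu` output below is made-up sample text, and the real field layout may differ), something like this boils `sas2ircu 0 display` down to one "enclosure slot serial" line per drive, which can then be matched against the serial numbers CAM reports for each `daXX` device:

```shell
# Hypothetical sketch: summarize sas2ircu's per-drive stanzas.
# The sample text is fabricated for illustration; on a real system you
# would pipe `sas2ircu 0 display` in instead.
sample='Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 5
  Serial No                               : Z1X23ABC
Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 6
  Serial No                               : Z1X23ABD'

summary=$(printf '%s\n' "$sample" | awk -F' *: *' '
    /Enclosure #/ { enc = $2 }
    /Slot #/      { slot = $2 }
    /Serial No/   { print enc, slot, $2 }')
printf '%s\n' "$summary"
```

Matching each serial against the one `camcontrol` reports for a given `daXX` then pins down the physical bay.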
`zpool list` gives the size of this pool as 160 TB; `zfs list` puts it at 120 TB. (The discrepancy is presumably raidz parity overhead: `zpool list` reports raw capacity, parity included, while `zfs list` reports the space actually available to datasets.)
We use the kernel firewall (`ipfw`) to permit access only to hosts on our networks. We run `ipfw` anyway for `sshguard`, so having it also handle NFS access control does not add significant additional overhead. (We put things like `-maproot` in `/etc/exports` rather than letting ZFS mangle them—ZFS expects the Solaris/Linux data model for exports, and FreeBSD's data model is completely different.) There's no special security issue: we warn all our users that NFS has no security, and the access controls implemented in `ipfw` are exactly the same as the ones we would implement in `/etc/exports` anyway.

We run with the following tunables:

```
kern.ipc.nmbclusters="1048576"
kern.hwpmc.nsamples="64"
kern.hwpmc.nbuffers="32"
vfs.zfs.scrub_limit="16"
vfs.zfs.vdev.max_pending="24"
hw.mps.max_chains=4096
kern.msgbufsize=262144
net.inet.tcp.sendspace=1048576
net.inet.tcp.sendbuf_max=2097152
net.inet.tcp.recvspace=1048576
net.inet.tcp.recvbuf_max=2097152
hw.intr_storm_threshold=12000
```

In addition, we found that the NFS server needed at least 256 threads to be able to handle the load some of our users could generate.
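A hypothetical `/etc/rc.conf` fragment for that thread count (the flags are from `nfsd(8)`; adjust to taste):

```shell
# Hypothetical rc.conf fragment: serve NFS over both UDP (-u) and
# TCP (-t) with 256 nfsd service threads (-n).
nfs_server_enable="YES"
nfs_server_flags="-u -t -n 256"
```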
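A sketch of what the `ipfw` rules described above might look like (the 18.0.0.0/8 prefix and the rule numbers are placeholders; only `rpcbind` on port 111 and `nfsd` on port 2049 are shown, since `mountd` and the lock/status daemons use variable ports unless pinned):

```shell
# Hypothetical ipfw rules: allow the NFS ports only from our own
# networks (placeholder prefix), drop them from everywhere else.
ipfw add 2000 allow tcp from 18.0.0.0/8 to me dst-port 111,2049
ipfw add 2001 allow udp from 18.0.0.0/8 to me dst-port 111,2049
ipfw add 2090 deny tcp from any to me dst-port 111,2049
ipfw add 2091 deny udp from any to me dst-port 111,2049
```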
We have `munin` plugins to monitor various aspects of ZFS and NFS performance.