I never thought I'd find myself building file servers. As a network and security guy, NFS did not appear destined to be in my bailiwick. But the conjunction of two outside forces caused me to be in that business unexpectedly throughout much of 2012, and it's looking like I'll be doing plenty more in the years to come.
The first event was a generous donation to our Lab by Quanta Computer, the Taiwan-based contract manufacturer, of a system they called a "KT" rack. "KT" stands for "Korea Telecom", and the rack consisted of a bunch of Intel Xeon-based servers with lots of memory and four SAS disk shelves. When we got it, nobody had any idea what to do with the disk shelves, so they just sat there for six months.
The second event was the acquisition of BlueArc, the vendor of our high-end NFS storage appliance, by Hitachi. The Hitachi people didn't really seem to understand the market they had acquired—BlueArc was popular among sites like ours who thought that competing products from EMC and NetApp were grossly overpriced, and Hitachi seemed to think they were acquiring a business with EMC-like margins. So we were looking at a bill for maintenance contracts and upgrades that was well into the six figures, and we naturally wanted to find something more cost-effective.
So of course I said, "Why don't we try building something from this Quanta hardware? They gave us some nice SSDs and lots of big rotating disks, and it can probably meet the requirements that caused us to buy the BlueArc in the first place." I did, and—after a somewhat shaky start—it did. We added even more memory to the file servers, plus redundant SAS controllers and higher-performance SSD, and it really screamed. Having proved that the architecture was workable, we went back to our friends at Quanta to flesh out a production-ready design for file servers, and I'm working on making the transition to the new systems as I write this.
Physically, the final configuration of our test/development system is as follows:
| Component | Details |
|---|---|
| Quanta QSSC-S99Q chassis | 2 x Intel Xeon E5640 (2.67 GHz); 96 GB DRAM; LSI SAS1064ET Fusion-MPT SAS (for internal boot drives); 2 x Intel 82599EB 10-gigabit network controllers; 2 x LSI SAS2116 Fusion-MPT SAS-2 (for external drives) |
| 4 x Quanta DNS1700 (QSSC-JB7) disk shelves | 23 ea. Seagate ST32000444SS 2-TB SAS-2 drives; 1 ea. STec ZeusRAM 8-GB SAS-2 SSD or OCZ Talos 2 240-GB SAS-2 SSD |
The LSI SAS2116 chipset is found in a number of OEM products; the retail version of the PCI-Express card that we have is called the "SAS 9201-16e". This is a 16-port card with four SFF-8088 connectors on it; each SFF-8088 carries four SAS ports, and is wired to a single disk shelf. Each shelf has eight 6:1 SAS port expanders, allowing for 24 dual-ported drives. With four shelves, we get a total of 88 active drives (allowing for one spare and one SSD on each shelf). Each shelf has a cable to both SAS controllers, and we do not daisy-chain the shelves so as to avoid introducing another point of failure.
These servers also have an internal SAS backplane. We use this only for boot drives, never for user data, to ensure that the complete storage pool can be moved from one machine to another simply by moving the SAS cables. Since we are not running active-active redundancy (e.g., HAST), our availability strategy requires that we be able to survive a failure of the file server itself merely by relocating the drive shelves to another server.
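As a sketch of what that relocation looks like in practice (the pool name `tank` here is a placeholder, not our real pool name), the sequence on the standby server amounts to a forced import:

```shell
# Illustrative commands only: run on the standby server after the SAS
# cables have been moved over.  "tank" is a placeholder pool name.
zpool import          # scan the newly attached drives for importable pools
zpool import -f tank  # take the pool over; -f is needed because the dead
                      # server never got the chance to export it cleanly
```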
On the software side, we have been running FreeBSD 9-STABLE with various fixes (primarily to `geom_multipath` and the `mps` driver) backported. The new production servers will run our private build of FreeBSD 9.1-RELEASE, and will be part of our Puppet system administration environment.
Some other tidbits:
The easiest way to initialize `geom_multipath` is to connect a single controller, then `gmultipath label` all the drives—make sure to give them meaningful names, or you'll be sorry!—and then hook up the second controller. Note that figuring out the mapping between the device in CAM (`daXX`) and the physical shelf and drive bay is challenging; the LSI SAS2IRCU tool and some scripting can help.
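As an illustration of the sort of scripting involved (the `sas2ircu` output below is made-up sample text, and the real field layout may differ), something like this boils `sas2ircu 0 display` down to one "enclosure slot serial" line per drive, which can then be matched against the serial numbers CAM reports for each `daXX` device:

```shell
# Hypothetical sketch: summarize sas2ircu's per-drive stanzas.
# The sample text is fabricated for illustration; on a real system you
# would pipe `sas2ircu 0 display` in instead.
sample='Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 5
  Serial No                               : Z1X23ABC
Device is a Hard disk
  Enclosure #                             : 2
  Slot #                                  : 6
  Serial No                               : Z1X23ABD'

summary=$(printf '%s\n' "$sample" | awk -F' *: *' '
    /Enclosure #/ { enc = $2 }
    /Slot #/      { slot = $2 }
    /Serial No/   { print enc, slot, $2 }')
printf '%s\n' "$summary"
```

Matching each serial against the one `camcontrol` reports for a given `daXX` then pins down the physical bay.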
`zpool list` gives the size of this pool as 160 TB; `zfs list` puts it at 120 TB. (The discrepancy is presumably raidz parity overhead: `zpool list` reports raw capacity, parity included, while `zfs list` reports the space actually available to datasets.)
We use the kernel firewall (`ipfw`) to permit access only to hosts on our networks. We run `ipfw` anyway for `sshguard`, so having it also handle NFS access control does not add significant additional overhead. (We put things like `-maproot` in `/etc/exports` rather than letting ZFS mangle them—ZFS expects the Solaris/Linux data model for exports, and FreeBSD's data model is completely different.) There's no special security issue: we warn all our users that NFS has no security, and the access controls implemented in `ipfw` are exactly the same as the ones we would implement in `/etc/exports` anyway.

We run with the following tunables:

```
kern.ipc.nmbclusters="1048576"
kern.hwpmc.nsamples="64"
kern.hwpmc.nbuffers="32"
vfs.zfs.scrub_limit="16"
vfs.zfs.vdev.max_pending="24"
hw.mps.max_chains=4096
kern.msgbufsize=262144
net.inet.tcp.sendspace=1048576
net.inet.tcp.sendbuf_max=2097152
net.inet.tcp.recvspace=1048576
net.inet.tcp.recvbuf_max=2097152
hw.intr_storm_threshold=12000
```

In addition, we found that the NFS server needed at least 256 threads to be able to handle the load some of our users could generate.
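A hypothetical `/etc/rc.conf` fragment for that thread count (the flags are from `nfsd(8)`; adjust to taste):

```shell
# Hypothetical rc.conf fragment: serve NFS over both UDP (-u) and
# TCP (-t) with 256 nfsd service threads (-n).
nfs_server_enable="YES"
nfs_server_flags="-u -t -n 256"
```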
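A sketch of what the `ipfw` rules described above might look like (the 18.0.0.0/8 prefix and the rule numbers are placeholders; only `rpcbind` on port 111 and `nfsd` on port 2049 are shown, since `mountd` and the lock/status daemons use variable ports unless pinned):

```shell
# Hypothetical ipfw rules: allow the NFS ports only from our own
# networks (placeholder prefix), drop them from everywhere else.
ipfw add 2000 allow tcp from 18.0.0.0/8 to me dst-port 111,2049
ipfw add 2001 allow udp from 18.0.0.0/8 to me dst-port 111,2049
ipfw add 2090 deny tcp from any to me dst-port 111,2049
ipfw add 2091 deny udp from any to me dst-port 111,2049
```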
We have `munin` plugins to monitor various aspects of ZFS and NFS performance.