
Pure SSD pool or not?

Posted: 12 Feb 2013 11:23
by DigitalDaz
I'm looking to create a shared storage box for a couple of hypervisors. I'm going to chunk the hypervisors up into 1GB RAM VPSes, so across the two there will potentially be about 60. They're for VoIP, so there will be no huge demand on IO.

I've decided ZFS is the way to go and there is no great budget here so I'm trying to make the best of what I have.

I've had some bad experiences with cheap SSDs. I put 8 256GB Crucial V4s in two HP DL360 G5 boxes and 4 of them were ruined within a month.

For my storage box, I have an i3-2100 with 16GB RAM and am thinking of going with either four or even six Samsung 840 Pro 256GB drives to create a pure SSD pool.

The alternative is using, say, one (or two) small Intel SSDs as a ZIL and maybe one 840 Pro 256GB as L2ARC, with some 250GB 10K Velociraptors as the pool.

I'm new to ZFS and would like to know how the more experienced people would tackle this.

Re: Pure SSD pool or not?

Posted: 26 Feb 2013 08:42
by Darkfall
Hi, DigitalDaz,

An all-SSD storage array is awesome for IOPS, but it's expensive if you don't need the speed (you indicate that you won't see huge demand on I/O, so I'm not sure why you're looking to build a box with the most expensive drive options possible).

But, if speed is what you want, and you don't want constant disk failures, you want to use SSDs built for the job. Intel just released their DC S3700 series, which is designed for this type of task. Those SSDs are designed to stand up to high write loads; Intel is claiming write endurance in line with SLC SSDs (and while the DC S3700 series SSDs are expensive, SLC SSDs are much more expensive). SSDs like the Crucial V4s are built for desktop systems. They're not meant to meet the needs of a disk array (and really, even for desktops, I'd go with Intel SSDs. Intel SSDs are pretty much unbeatable for reliability compared to virtually every other brand).

If you want to step down performance a bit, you can go with spindles and use an SSD ZIL and L2ARC. That decision really depends on how much performance you'll need (not much, it sounds like) and whether you'll be reading the same data frequently (the more that happens, the more the L2ARC will be used and the spindles won't need to work very hard). Three things come to mind with your mention of hardware for this:

1. Velociraptors are desktop drives. They don't have appropriate TLER (time-limited error recovery) for RAID, which means that if a drive starts having a problem, it'll go away and retry the read for 30-40 seconds. By that time, the array controller (ZFS in this case) will have disabled the drive and taken it offline, and your whole array will "freeze" while the desktop drive screws around. Buy enterprise 10,000RPM drives if you want 10,000RPM. Western Digital doesn't offer a 3.5" 10,000RPM drive other than the Velociraptor, but WD does offer a 2.5" 10,000RPM drive (as you might imagine, since they have the Velociraptor), which is the XE series - but if you're installing into a 3.5" hot-swap bay, you'll need a special adapter that aligns the drive's data and power ports with the RAID backplane (that being said, perhaps you should look at a box with 2.5" bays?). If you want 3.5" drives, look at Seagate's Cheetah line, which are all 15,000RPM. A 300GB Cheetah will run you about $200 and be a fair bit quicker than the 10,000RPM drives.

2. ZIL should never be a single drive. If the drive fails, you'll lose everything that isn't yet written to the spindles. A ZIL should always be at least a mirror (or a stripe of mirrors, if you have a huge ZIL). You want a mirror because write speed is important; you don't want parity data being calculated on your writes, like a RAIDZ would require. Also, your ZIL, being a write cache, will kill SSDs unless they're SLC or perhaps the new Intel DC S3700 series drives. In a perfect world, you'd use something like a STEC ZeusRAM, which is normal RAM with a battery backup that can last for days if need be - a drive like that never suffers from write wear issues, but they're pricey. Keep in mind, too, that any extra RAM you have in the machine is also cache. ZFS loves RAM. Give it lots. It's cheap. The more you have, the more performance you'll enjoy (to a point, but don't be afraid to give it 32GB for a few extra bucks).

3. Your L2ARC is less important. An SSD is a good idea, and it's nice if it can withstand writes, but it's not going to see nearly as many writes as your ZIL. The L2ARC also doesn't need to be a mirror or any other form of RAID - if it dies, it just means that ZFS has to fetch the data from the spindles to recover and continue to do so until it has a working L2ARC again.
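To make points 2 and 3 concrete, here's a rough sketch of how a hybrid pool along those lines might be built (the /dev/sdX device names are placeholders for your actual drives, and the layout is just one example - adjust to taste):

```shell
# Main pool: six spindles in a single RAIDZ2 group (4 data + 2 parity).
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

# ZIL (SLOG): always a mirror, so a single SSD failure can't lose
# synchronous writes that haven't hit the spindles yet.
zpool add tank log mirror /dev/sdh /dev/sdi

# L2ARC: a single cache device is fine; if it dies, ZFS just falls
# back to reading from the spindles until it's replaced.
zpool add tank cache /dev/sdj
```

`zpool status tank` afterwards will show the log and cache devices listed separately from the data vdevs.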

Also, be careful about the number of disks you put into a single group. At 3 disks, you should have RAIDZ configured (2+1). At 6 disks, you should have RAIDZ2 configured (4+2); at 9 disks, you should have RAIDZ3 configured (6+3). Beyond 9 disks, you should create multiple groups of 3-9 disks and stripe them together. Of course, that all assumes that you're not going for uber performance, in which case you'd probably be heading for RAID10 or something.
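For example, twelve disks following that rule would be two 6-disk RAIDZ2 groups, striped together simply by listing both groups in the same command (device names are placeholders):

```shell
# Two 6-disk RAIDZ2 vdevs (4+2 each); ZFS stripes writes across the groups.
zpool create tank \
  raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg \
  raidz2 /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm

# Usable capacity is roughly 8 of the 12 disks; verify the layout with:
zpool status tank
```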

It sounds like your storage box is a separate box from your hypervisor boxes, in which case all of this money spent on performance won't amount to a hill of beans without bandwidth between the boxes. Look to something like 10GBASE-T (Intel X540 series NICs and Cat6a patch cables are the right answer here). If you need even more bandwidth, look at using iSCSI and multipath IO and multiple 10Gbps links.
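If you do go the iSCSI + multipath route, the rough shape on a Linux initiator looks something like this (the portal addresses are made up, and exact steps depend on your distro and target software):

```shell
# Discover the target through each 10GbE portal (hypothetical addresses,
# one per NIC/subnet so the paths are physically independent).
iscsiadm -m discovery -t sendtargets -p 10.0.1.10
iscsiadm -m discovery -t sendtargets -p 10.0.2.10

# Log in on both paths, then let multipathd coalesce them into one device.
iscsiadm -m node --login
multipath -ll   # should show a single LUN with two active paths
```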

Re: Pure SSD pool or not?

Posted: 26 Feb 2013 09:00
by b0ssman
Be aware that SSDs will most likely fail around the same time in a RAID configuration.

Once the write limit of the cells is reached, all current SSDs fail completely; none of them go into a read-only mode.
see ... nm-Vs-34nm

Re: Pure SSD pool or not?

Posted: 26 Feb 2013 09:29
by Darkfall
I can't say I agree with the idea that all of the SSDs will fail at similar times. Some of the silicon in the drives will do better than others, much like some CPUs will overclock and some won't - it depends on how exact the fabrication of that particular chip happened to be.

ZFS further changes the equation because ZFS doesn't use full-width stripes for writes like legacy RAID does. ZFS uses whatever size stripe is appropriate for the data being written, so not every drive will see the same data, which could be a factor in failure rates.

That being said, at around 500TB of writes, it's time to start keeping a close eye on the drives, and of course make sure there are solid backups to rely on if the whole thing burns to the ground, despite best efforts to prevent it.
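One way to keep that close eye on the drives is to poll the SMART wear attributes periodically. Attribute names vary by vendor (on Samsung drives the relevant ones are typically Wear_Leveling_Count and Total_LBAs_Written); /dev/sda is a placeholder:

```shell
# Dump all SMART attributes for the drive.
smartctl -A /dev/sda

# Watch the wear-related counters specifically (vendor-dependent names).
smartctl -A /dev/sda | grep -Ei 'wear|lbas_written|media_wearout'

# Overall health check: "PASSED" doesn't guarantee much, but
# "FAILED" means replace the drive now.
smartctl -H /dev/sda
```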