Sun Microsystems storage CTO Jeff Bonwick and Sun systems architect Bill Moore jointly lead the design team for Sun's ZFS file system.
ZFS is one of Sun's proudest creations, and it also helps power one of Sun's brightest storage products, the Thumper disk array, or Sun Fire X4500 server.
As well as explaining exactly what Thumper is, Bonwick and Moore highlight the challenges of using flash memory, suggest where data de-duplication might be better done, and explain why it is cheaper for customers to put up with growing data volumes than to hire somebody to delete them.
Q: What was Thumper designed for?
Moore: It is an interesting story. Thumper in its original form was conceived by Andy Bechtolsheim when he founded a company called Kealia.
Andy was working on high performance video streaming, and Thumper was originally designed as a back-end movie store, from which you would stream content. So it was designed to be extremely low cost and physically dense.
When we bought Kealia in 2004 I got a chance to talk to Andy.
He showed us around his laboratory, and explained this box to us. I took particular interest in it because I helped start 3PAR and I have a pretty heavy interest in the physical design of storage boxes.
I asked him a whole bunch of questions, and he asked me about software. I had been on the ZFS team for a while at that point, and our conversation rapidly escalated in baud rate until we were streaming information at each other about as fast as we could talk.
Andy realized that Thumper - at the time called StreamStor - was much more than just a simple storage and video streaming device, but would also be a very good fit for general purpose storage and file system workloads. Thumper was a good fit for ZFS, which incorporated RAID and data protection directly into the filesystem.
It was not that great a fit for traditional, general-purpose file systems.
Thumper's a classic Bechtolsheim design. It is very dense and very fast. Time and space fold in on themselves to fit everything in there. The board that provides everything is actually 2U high and about six or seven inches deep. That is a very small volume to fold a standard two-socket server into.
The overall enclosure is 4U high, and packs 48 SATA disks, each with its own channel, with up to 16 GB of RAM, four Ethernet ports, and two PCI slots. There is no I/O bottleneck anywhere. It is a very balanced system that delivers 1.2 to 1.3 GB per second of throughput.
Q: Sun says it is selling around 1,000 Thumper boxes every quarter, bringing in around $200m in revenue each year. [The cheapest 12 TB Thumper configuration lists at around $24,000.] How much of that business is coming from outside the niche HPC sector?
Moore: Quite a lot. Thumper is basically for anyone who needs high-density, high-performance storage, but does not need redundant failover capabilities. [Thumper is currently a single-controller, non-failover device. A dual-controller device is expected to ship later this year.]
It usually winds up being second tier storage for large companies, or first-tier for smaller companies that do not need the failover redundancy. So there is really a wide swathe of people using this.
I would say it is probably a pretty even split at this moment between HPC and non-HPC people.
Q: You say that AMD helped bring Thumper into the world. How did that happen?
Moore: The compute power in the volume CPU market has caused a lot of hardware extinction. Now the tropical jungle of hardware has disappeared, and we are just down to a handful of processors and buses, and one Ethernet technology.
For storage that means that what used to have to be done with custom-built proprietary hardware can now be done with general purpose CPUs and general purpose software at a much higher performance level and a much lower price.
Q: Thumper is far from the only storage array to be powered by an x86 processor. What makes it different?
Moore: We have got rid of a lot of extra processors. Those other boxes need silicon like RAID controllers and non-volatile cache or RAM. Thumper does not.
We do the RAID in software, using the RAID-Z inside ZFS. There are other boxes that also do RAID in software, but unlike Thumper they still have what we call a RAID write hole, and that has to be plugged with NV-RAM. RAID-Z does not have that write-hole problem, so Thumper does not need NV-RAM.
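The write hole Moore describes arises because a RAID-5-style array must update data and parity in separate steps; a crash between the two leaves them inconsistent, and a later reconstruction returns garbage silently. Here is a minimal toy illustration of that failure mode in Python (the names and the two-disk stripe are illustrative, not ZFS or RAID-Z internals):

```python
# Toy RAID-5 write-hole demonstration: a stripe holds data blocks plus
# an XOR parity block, updated non-atomically.
from functools import reduce

def xor_parity(blocks):
    """XOR all blocks together byte-by-byte (RAID-5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, cols) for cols in zip(*blocks))

# A healthy stripe: two data blocks and their parity.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_parity([d0, d1])

# Crash mid-write: d0 is rewritten on disk, but the matching parity
# update never lands.
d0 = b"CCCC"                      # new data reaches disk
# parity = xor_parity([d0, d1])  # <- lost in the crash

# Later, the disk holding d1 fails. Reconstructing d1 from d0 and the
# stale parity silently returns wrong data -- that is the write hole.
reconstructed_d1 = xor_parity([d0, parity])
print(reconstructed_d1 == b"BBBB")  # False: corrupted, with no error raised
```

RAID-Z avoids this by using copy-on-write, full-stripe writes, so data and parity on disk are never transiently inconsistent.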
The RAID controllers to handle 48 disks at 2 GBytes per second would cost several thousand dollars. NV-RAM is very expensive.
For a traditional RAID system, whether it is software or hardware, you would probably need a couple of gigabytes of it, at maybe a thousand dollars per gigabyte.
And we got rid of any proprietary interconnects. It is just Ethernet, or InfiniBand if you plug in an expansion card.
Q: Sun has talked about applications being hosted directly on Thumper, so that they have very fast access to data. The software suppliers that have worked on this are a handful of smaller companies, such as data warehousing specialist Greenplum. Will we ever see mainstream suppliers making the same move?
Moore: It depends what you consider mainstream. The people really interested in this stuff are typically people for whom the latency and network bandwidth are too slow for their applications - so they put the applications inside Thumper where they are right next to the data and can operate at much higher speeds.
It is really a general purpose solution for anybody that has those kinds of needs.
Q: Without a storage network, there is no way to cluster Thumpers for failover protection. Or to fully exploit the flexibility of virtual servers by moving them between physical hosts. Is that not a limitation?
Moore: That is why we say it is for people who do not need that level of protection.
Q: Let us move on from Thumper. What do you see as the biggest recent advances in storage?
Bonwick: ZFS and flash memory. Of course flash memory is not new, but it has just started to become economically interesting. The reliability of flash is improving, and the cost is dropping through the floor. Now it occupies a very interesting niche in price-performance.
Q: When you say reliability, do you mean life, or reliability per se?
Moore: Both. Writing into flash memory cells is like walking on a carpet. If you keep writing into the same cells, or walking on the same bit of carpet, again and again, eventually the flash cells, like the carpet, wear out.
The probability that any one cell will give you the wrong data increases as that cell gets worn. There are techniques to help mitigate this, like wear-levelling, ECC, and mirror cells. But once you run into the wall so to speak, unless you have a file system like ZFS that knows how to check for errors and how to recover from them, you are out of luck.
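The "knows how to check for errors" point is the key one: ZFS keeps a checksum for every block in the block's parent, so a worn cell that silently returns wrong bits is caught on read and can be repaired from a redundant copy. A minimal sketch of that idea (illustrative Python, not ZFS code; `write_block`/`read_block` are made-up names):

```python
# End-to-end checksum sketch: detect a block that worn flash has
# silently corrupted, instead of returning bad data to the application.
import hashlib

def write_block(data):
    # ZFS stores the checksum apart from the block (in parent metadata),
    # so a corrupted block cannot also corrupt its own checksum.
    return data, hashlib.sha256(data).digest()

def read_block(data, checksum):
    if hashlib.sha256(data).digest() != checksum:
        # In real ZFS this triggers recovery from a mirror or parity copy.
        raise IOError("checksum mismatch on read")
    return data

data, cksum = write_block(b"payload")
worn = bytes([data[0] ^ 0x01]) + data[1:]   # one bit flipped by a worn cell
try:
    read_block(worn, cksum)
except IOError as e:
    print(e)   # the wear is detected rather than silently passed through
```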
Q: That is a tidy plug for ZFS. Are you saying that unless they use ZFS, customers will be in uncharted territory with flash?
Moore: No I would not say uncharted territory. People know this about flash and they have been doing a whole bunch of engineering to mitigate the issue.
But it will be interesting to see what different vendors' take on this will be. They might just pre-emptively replace the flash, or they might build in so much extra capacity or memory cells that the drives have a 10 or 20 year lifespan.
Q: Does block-level data de-duplication not count as a major storage development?
Bonwick: There is an element of de-dupe that is important, and an element that is kind of flavour of the month. It got attention primarily because it was one of the few different value propositions to come along in a while.
But you have to ask yourself why we have so much duplicate data in the first place. In some cases it is because, say, somebody has sent the same attachment with every e-mail. The place to solve that is in the e-mail system, and Microsoft is working on that problem.
They understand the content of an Exchange e-mail database, and how to do that. It is the logical place to do it.
Another place where you get data duplication is when you have a bunch of virtual machine images. If you have a file system capable of an unlimited number of snapshots, then you can run the de-duplication not by de-duping after the fact, but by never duplicating the data in the first place.
You create a template VM, and every time you create a new one you take a clone of that snapshot, and apply the relatively small changes to create the new VM personality. All the blocks that do not change will just be shared.
You can do this with ZFS. And by the way - we are developing block-level de-dup for ZFS.
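The snapshot-and-clone approach Moore describes can be sketched as a toy copy-on-write model (illustrative Python, not ZFS data structures; the `Clone` class and block counts are made up): a clone starts by sharing every block of its origin snapshot, and only the blocks it rewrites get new storage.

```python
# Toy copy-on-write cloning: never duplicate blocks in the first place.
class Clone:
    def __init__(self, origin_blocks):
        self.origin = origin_blocks   # shared, read-only snapshot blocks
        self.delta = {}               # block number -> private rewritten copy

    def write(self, blockno, data):
        self.delta[blockno] = data    # copy-on-write: only this block diverges

    def read(self, blockno):
        # A clone reads its own copy if it has one, else the shared block.
        return self.delta.get(blockno, self.origin[blockno])

# A "golden" template VM image of 1,000 blocks.
template = {n: f"template block {n}" for n in range(1000)}

vm = Clone(template)
vm.write(7, "per-VM hostname and keys")   # small personality change

shared = sum(1 for n in template if n not in vm.delta)
print(f"{shared} of 1000 blocks still shared with the template")
# -> 999 of 1000 blocks still shared with the template
```

Every additional VM cloned from the same snapshot shares the same unchanged blocks, so the de-duplication comes for free rather than being computed after the fact.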
Q: How much of the data out in the world is actually worth saving?
Bonwick: Indeed, perhaps not that much. But here is an interesting statistic. Assume that the loaded cost of an employee is around $120,000 per year. That might be on the low side, but let us just take that figure.
There are 2,000 hours in a work year, so that is $60 an hour, or $1 per minute. Disk space costs on the order of $1 per GB per year. So your people cost a dollar a minute, and a GB of space costs a dollar a year. If you were to dedicate somebody to the task of deleting files, they would be losing you money unless they were deleting at least a GB of data every minute.
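Bonwick's break-even arithmetic, worked through explicitly using the figures he quotes ($120,000/year loaded cost, 2,000 work hours/year, roughly $1 per GB per year for disk):

```python
# Break-even for paying someone to delete data, using the interview's figures.
loaded_cost_per_year = 120_000   # dollars per employee, per year
hours_per_year = 2_000

cost_per_minute = loaded_cost_per_year / hours_per_year / 60   # $60/hr -> $1/min
disk_cost_per_gb_year = 1.0      # dollars per GB of disk, per year

# The employee breaks even only by freeing this much space every minute:
break_even_gb_per_minute = cost_per_minute / disk_cost_per_gb_year

print(f"labor cost: ${cost_per_minute:.2f} per minute")        # $1.00 per minute
print(f"break-even: {break_even_gb_per_minute:.1f} GB deleted per minute")
```

At any realistic pace of manual review, the labor costs far more than the disk, which is Bonwick's point.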
Sun is currently embroiled in a patent battle with NetApp concerning ZFS. Moore and Bonwick declined to comment on that issue during this interview.
Jeff Bonwick is a Sun Fellow. He is the chief architect of ZFS, inventor of the slab allocator, and author of many core Solaris services. Jeff holds a BS in Mathematics from the University of Delaware and an MS in Statistics from Stanford.
Moore is an architect in Sun's Systems Group. He joined Sun in 1996, and returned in 2003 after spending time at 3PAR. Moore holds a bachelor's degree in electrical engineering from Michigan Technological University and a master's degree in computer science from Michigan State University.