While this is not meant to be an exhaustive review of storage options, it will provide some information necessary for future performance optimization discussions.
Blocks or Files
Starting off with a gross simplification, the Operating System views storage as either blocks or files. Disk, tape, and solid-state devices are the foundation of storage and they present blocks to the OS, which lays a file system on top. The file system organizes the storage blocks into a form that the OS can easily access. Users only care about the files on the file system.
Direct Attached, Storage Area Networks, and Network Attached Storage
Storage typically comes in three forms. Each presents its own benefits and drawbacks, which have performance implications.
- Direct Attached Storage (DAS): This is the most common storage type. As the name implies, the block device is attached directly to the system bus of the host. Any hard drive or collection of drives on a system can be DAS, and the term also extends to JBODs (Just a Bunch of Disks) in a separate enclosure attached directly to the host. Being directly attached makes DAS fast and simple to configure. The major drawback of most DAS systems is their inflexibility: they are normally a fixed size and growing them is difficult. In addition, the nature of their file systems usually means they are tied to the host system, so concurrent access from another system is not possible.
- Storage Area Network (SAN): A SAN takes the block devices away from the host and normally aggregates them in disk arrays attached over a specialized storage network. Fibre Channel is typically the interconnect used to create the network, but recently iSCSI, which implements the SCSI protocol over IP, has gained popularity with the advent of 10 Gigabit networks. The primary benefits of SANs are the performance characteristics of a storage-optimized network and the ability to grow as requirements expand. On the flip side, SANs are historically expensive and require specialized administrators to manage the Fibre Channel infrastructure. SANs are also effectively local networks, because the signaling technologies are intolerant of latency and loss.
- Network Attached Storage (NAS): Whereas DAS and SAN are block storage systems, NAS presents a file system over the network as a protocol. Modern NAS is purely IP based, and the NAS protocols typically found in the enterprise are NFS and CIFS/SMB. Some NAS systems also implement FTP as an access protocol, but FTP was not originally designed to support file access the way a file system is. NAS systems provide the benefits of SAN over commodity IP networks without the administrative overhead and cost. The drawback is that NAS systems are not normally designed for high performance applications. As advances in networking technology provide faster interconnects, NAS has grown more viable for higher performance storage. NAS performs best on a LAN, for many of the same reasons a wide-area SAN is not realistic.
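The three forms above differ in how the host attaches to them. A rough sketch on a typical Linux host (the device names, IP addresses, and export paths here are placeholders, not values from this article):

```shell
# DAS: a directly attached disk appears as a local block device.
mkfs.ext4 /dev/sdb                          # hypothetical second drive
mount /dev/sdb /mnt/das

# SAN (iSCSI): discover and log in to the target; the LUN then
# appears as another local block device to format and mount.
iscsiadm -m discovery -t sendtargets -p 192.0.2.10
iscsiadm -m node --login
mount /dev/sdc /mnt/san                     # device name varies

# NAS (NFS): mount a file system exported over IP; there is no
# local block device at all -- the server owns the blocks.
mount -t nfs 192.0.2.20:/export/data /mnt/nas
```

Note that in the DAS and SAN cases the host lays its own file system on the blocks, while in the NAS case the file system lives on the server.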
Characteristics of High Performance Storage
High performance storage requires low access times to data. At the high end of the storage device spectrum are RAM and solid-state devices. The speed of these systems comes from the lack of moving parts and the way the data is stored. The next tier down is disks. Disks are cheap, and the storage density they provide is currently unmatched. The performance challenges of disks stem from rotating a platter and using a read head that has to seek to different positions on the disk. To address these problems, disk manufacturers have increased rotational speed: low-end laptops use 5400 rpm drives, while high performance disk arrays may use 10000-15000 rpm drives. To further increase performance, many storage systems use some form of RAID to take advantage of parallelization.
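The effect of rotational speed on access time can be estimated directly: on average, the head waits half a revolution for the target sector to come around. A small sketch of that arithmetic:

```python
# Average rotational latency: half a revolution, on average,
# before the target sector rotates under the read head.
def avg_rotational_latency_ms(rpm):
    seconds_per_revolution = 60.0 / rpm
    return (seconds_per_revolution / 2) * 1000  # in milliseconds

for rpm in (5400, 7200, 10000, 15000):
    print(f"{rpm:>5} rpm: {avg_rotational_latency_ms(rpm):.2f} ms")
```

A 5400 rpm laptop drive averages about 5.6 ms of rotational latency per access, while a 15000 rpm enterprise drive cuts that to 2 ms, before seek time is even counted.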
There are a number of RAID levels, and each has unique performance characteristics. For this discussion only RAID 0 and RAID 5 will be covered; RAID can be layered, include multiple parity disks, and so on, which would muddy the intent of this section. Future articles may go into how best to design storage to meet specific goals.
Of the two RAID levels, the faster is RAID-0. In a RAID-0 configuration all the drives are used to parallelize I/O. In a simple example, a 4-disk RAID-0 set, where each disk is 100GB, would have a total size of 400GB. When writing to this RAID set, data is written to all the disks at once; writing a 1MB file would send 250KB to disk 1, another 250KB to disk 2, and so on. These writes happen in parallel, so the effect is a near-optimal write across the disk set. This is called striping. The reason RAID-0 is not used more widely is its complete lack of redundancy: the failure of any single disk results in the loss of all data.
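The striping described above can be sketched in a few lines. This is illustrative only; real controllers stripe at the block layer, not over byte strings, and the chunk size here is an arbitrary choice:

```python
# Toy RAID-0 striping: deal out fixed-size chunks of the data
# round-robin across N "disks".
def stripe(data, num_disks, chunk_size):
    disks = [bytearray() for _ in range(num_disks)]
    for i in range(0, len(data), chunk_size):
        disks[(i // chunk_size) % num_disks] += data[i:i + chunk_size]
    return disks

# A 1 MiB write across 4 disks lands an equal share on each one,
# and each share can be written in parallel.
disks = stripe(b"x" * 1024 * 1024, num_disks=4, chunk_size=64 * 1024)
print([len(d) for d in disks])
```

Because every disk holds a quarter of the data, each one only has to perform a quarter of the I/O, which is where the speedup comes from.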
The most common RAID level used in the enterprise is RAID-5. RAID-5 combines striping with parity to get both good performance and redundancy. In a RAID-5 set, a single disk can be lost and all the data remains available. While striping provides performance gains, computing and writing the parity incurs a penalty that offsets some of those gains.
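The parity mechanism is simple XOR: the parity chunk is the XOR of the data chunks in a stripe, so any single missing chunk can be rebuilt from the survivors. A minimal sketch (real RAID-5 also rotates the parity across disks, which is omitted here):

```python
# Toy RAID-5 parity: parity = XOR of all data chunks in a stripe.
def xor_chunks(chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # stripe chunks on disks 1-3
parity = xor_chunks(data)            # parity chunk on disk 4

# Disk 2 fails: rebuild its chunk from the surviving data plus parity.
rebuilt = xor_chunks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

The write penalty mentioned above comes from this parity: a small write must read the old data and old parity, recompute, and write both back, turning one logical write into multiple physical I/Os.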
Who knew storage could be this complex? The topics covered so far are the tip of the storage puzzle. Forthcoming articles will discuss NFS tuning and other topics to create performance optimized solutions. Check back for those and other topics.