Linux RAID Performance
I won’t discuss the different RAID types here, but rather refer you to Wikipedia for that[^1]. For using software RAID under Linux, I recommend this documentation[^2]. We’ll focus on performance, since that’s our topic. RAID 0 is the most performant of all RAID types, but it obviously offers no data protection if a disk fails.
The MTBF (Mean Time Between Failures) is also important for RAID systems: it is an estimate of how long the array will function properly before one of its disks is detected as failed.
Chunk Size
The “chunk size” (or stripe size, or element size for some vendors) is the amount of data (in KiB) written to or read from each device before moving on to the next one, following a round-robin pattern. The chunk size must be an integer multiple of the block size. The larger the chunk size, the faster the writes for very large data, but the slower they are for small data. If the average size of IO requests is smaller than the chunk size, each request lands on a single disk of the RAID, cancelling all the advantages of RAID. Reducing the chunk size splits large files into smaller pieces spread across more disks, which improves throughput, but more time is then spent positioning over each chunk. Some hardware does not write until a stripe is complete, which eliminates this positioning latency.
A good rule for choosing the chunk size is to divide the average size of your IO operations by the number of disks in the RAID (minus the parity disks for RAID 5 or 6); see the example after the reminder below.
Notes: quick reminder on parity:
- RAID 0: no parity
- RAID 1: no parity
- RAID 5: 1 parity disk
- RAID 6: 2 parity disks
- RAID 10: no parity disks
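For example, applying this rule to a hypothetical 4-disk RAID 5 (so 3 data disks) with an average IO size of around 384 KiB would suggest a 128 KiB chunk; the numbers are purely illustrative:
> echo "384/(4-1)" | bc -l
128.00000000000000000000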
If you have no idea about your IO, choose a value between 32KB and 128KB, taking a multiple of 2KB (or 4KB if you have larger block sizes). The chunk size (stripe size) is an important factor in your RAID’s performance. If the stripe is too wide, the RAID can develop a “hot spot”: the one disk that receives most of the IO, which limits the whole array’s performance. Obviously the best performance is obtained when data is spread across all disks. The correct formula is therefore:
Chunk size = average request IO size (avgrq-sz) / number of disks
To get the average request size, I invite you to check the Sysstat documentation[^3], where we talk about iostat and sar.
- To see the chunk size on a RAID (here md0):
> cat /sys/block/md0/md/chunk_size
131072
The value is in bytes, so here it’s 128KiB.
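You can do the conversion directly from the shell, for example:
> echo "$(cat /sys/block/md0/md/chunk_size)/1024" | bc
128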
Here’s another way to see it:
> cat /proc/mdstat
Personalities : [raid10]
md0 : active raid10 sdc2[3] sda2[1] sdb2[0] sdd2[2]
1949426688 blocks super 1.0 128K chunks 2 near-copies [4/4] [UUUU]
unused devices: <none>
Or even:
> mdadm --detail /dev/md0
/dev/md0:
Version : 1.0
Creation Time : Sat May 12 09:35:34 2012
Raid Level : raid10
Array Size : 1949426688 (1859.12 GiB 1996.21 GB)
Used Dev Size : 974713344 (929.56 GiB 998.11 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Thu Aug 30 12:53:20 2012
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 128K
Name : N7700:2
UUID : 1a83e7dc:daa7d822:15a1de4d:e4f6fd19
Events : 64
Number Major Minor RaidDevice State
0 8 18 0 active sync /dev/sdb2
1 8 2 1 active sync /dev/sda2
2 8 50 2 active sync /dev/sdd2
3 8 34 3 active sync /dev/sdc2
- It’s possible to define the chunk size when creating the RAID with the -c or --chunk argument. Let’s also see how to calculate it optimally. First, let’s use iostat to get the avgrq-sz value:
> iostat -x sda 1 5
avg-cpu: %user %nice %system %iowait %steal %idle
0,21 0,00 0,29 0,05 0,00 99,45
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0,71 1,25 1,23 0,76 79,29 15,22 47,55 0,01 2,84 0,73 0,14
avg-cpu: %user %nice %system %iowait %steal %idle
0,00 0,00 0,00 0,00 0,00 100,00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0,00 0,00 1,00 0,00 16,00 0,00 16,00 0,00 1,00 1,00 0,10
avg-cpu: %user %nice %system %iowait %steal %idle
0,00 0,00 0,00 0,00 0,00 100,00
Now let’s do the calculation to get the chunk size in KiB:
> echo "47.55*512/1024" | bc -l
23.77500000000000000000
We then divide this value by the number of disks (let’s say 2) and round it to the nearest power of 2:
Chunk Size (KB) = 23.775 / 2 = 11.88 ≈ 8
Here the chunk size to use is 8KB, since it’s the power of 2 closest to 11.88.
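The whole calculation can be scripted. Here is a minimal sketch, assuming sysstat’s iostat with the column layout shown above (the disk name, the field number and the number of data disks are assumptions to adapt to your setup):
DISK=sda
DATA_DISKS=2
# avgrq-sz is the 8th field of the device line; keep the last sample
AVGRQ=$(LC_ALL=C iostat -x "$DISK" 1 5 | awk -v d="$DISK" '$1 == d { v = $8 } END { print v }')
# avgrq-sz is expressed in 512-byte sectors: convert to KiB and divide by the data disks
echo "$AVGRQ * 512 / 1024 / $DATA_DISKS" | bc -l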
To create a RAID 0 while defining the chunk size:
mdadm -C /dev/md0 -l 0 -n 2 --chunk=32 /dev/sd[ab]1
Stride
Stride is a parameter passed when creating the filesystem on top of the RAID; it optimizes how the filesystem places its data blocks on the disks before moving to the next one. With ext2/3/4, you can set it with the -E option of mkfs: it corresponds to the number of filesystem blocks in a chunk. To calculate the stride:
Stride = chunk size / block size
For example, for a RAID 0 with a chunk size of 64 KiB and 4 KiB blocks (64 KiB / 4 KiB = 16):
mkfs.ext4 -b 4096 -E stride=16 /dev/mapper/vg1-lv0
Some disk controllers abstract the physical layout, making it impossible for the kernel to know it. Here’s an example to see the stride size:
> dumpe2fs /dev/mapper/vg1-lv0 | grep -i stride
dumpe2fs 1.42 (29-Nov-2011)
RAID stride: 16
Here the stride is 16 filesystem blocks (16 × 4 KiB = 64 KiB, i.e. one chunk).
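If you prefer to recompute it yourself, the stride is simply chunk size / filesystem block size; a small sketch using the values exposed by the system (device names taken from the examples above):
CHUNK=$(cat /sys/block/md0/md/chunk_size)                                            # chunk size, in bytes
BLOCK=$(dumpe2fs -h /dev/mapper/vg1-lv0 2>/dev/null | awk '/^Block size:/ {print $3}')  # fs block size, in bytes
echo $(( CHUNK / BLOCK ))                                                             # stride, in filesystem blocks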
To calculate the stride, there is also a website: http://busybox.net/~aldot/mkfs_stride.html
The Round Robin
Non-parity RAID levels stripe data across multiple disks in round-robin fashion to increase performance. The segment size is defined when the RAID is created: it is the chunk size discussed above.
The size of a RAID is defined by the smallest disk when creating the RAID. The size can vary in the future if all disks are replaced with larger capacity disks. Disk resynchronization will occur and the filesystem can be extended.
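As a sketch of that scenario, once every member has been replaced by a bigger disk and resynchronized, the array and then the filesystem can be grown (illustrative commands, assuming the filesystem sits directly on the array; with LVM in between you would extend the PV and LV first):
mdadm --grow /dev/md0 --size=max     # let the array use the full size of the new disks
resize2fs /dev/md0                   # then extend the ext filesystem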
So for tuning Round Robin, you need to correctly tune the chunk size and stride for optimal algorithm usage! That’s all :-)
Parity RAIDs
One of the major performance constraints of RAID 5 and 6 is the parity calculation. Before data can be written, the parity must be computed; only then can the data and parity be written.
Each data update requires 4 IO operations:
- The data to be updated is first read from the disks
- The new data is written (the parity is not yet correct at this point)
- The remaining blocks of the stripe are read and the new parity is calculated
- Finally, the new parity is written to disk
In RAID 5, it’s recommended to use stripe cache:
echo 256 > /sys/block/md0/md/stripe_cache_size
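Keep in mind that the stripe cache consumes memory, roughly page size × number of member devices × stripe_cache_size. A small sketch to estimate it for a RAID 5 array (the device count is an assumption to adapt):
PAGE=$(getconf PAGESIZE)
DEVS=4                                            # number of member devices (assumption)
CACHE=$(cat /sys/block/md0/md/stripe_cache_size)
echo $(( PAGE * DEVS * CACHE / 1024 )) KiB        # approximate memory used by the stripe cache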
For more information on RAID optimizations, see http://kernel.org/doc/Documentation/md.txt[^4][^5]. For the optimization part, look at the following parameters (a short sysfs example follows the list):
- chunk_size
- component_size
- new_dev
- safe_mode_delay
- sync_speed_{min,max}
- sync_action
- stripe_cache_size
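These parameters live under /sys/block/<md>/md/ and can be read or written directly; the values below are purely illustrative, not recommendations:
cat /sys/block/md0/md/chunk_size                 # current chunk size, in bytes
cat /sys/block/md0/md/sync_action                # idle, resync, recover, check...
echo 50000  > /sys/block/md0/md/sync_speed_min   # per-array resync floor, in KiB/s
echo 200000 > /sys/block/md0/md/sync_speed_max   # per-array resync ceiling, in KiB/s
echo check  > /sys/block/md0/md/sync_action      # start a background consistency check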
RAID 1
The RAID driver writes to the bitmap when changes have been detected since the last synchronization. A major drawback of RAID 1 is that after an unclean shutdown (a power outage, for instance) the array normally has to be resynchronized completely. With a ‘write-intent’ bitmap, only the parts that have changed need to be synchronized, greatly reducing rebuild time.
If a disk fails and is removed from the RAID, md stops clearing bits in the bitmap. If the same disk is later reintroduced into the RAID, md only needs to resynchronize the difference. When creating the RAID, if a write-intent bitmap is combined with ‘--write-behind’, write requests to devices flagged ‘--write-mostly’ do not have to complete before the write is acknowledged. The ‘--write-behind’ option can be useful for RAID 1 over slow links.
New mdraid arrays support write intent bitmaps. These help the system identify the problematic parts of an array, so that after an unclean shutdown only those parts need to be resynchronized rather than the entire disk, which drastically reduces the time required for resynchronization. Newly created arrays automatically get a write intent bitmap when possible; arrays used as swap and very small arrays (such as /boot arrays), for example, will not benefit from one. It is possible to add a write intent bitmap to an existing array via the mdadm --grow command once the device update is complete. However, write intent bitmaps incur a performance impact (about 3-5% with a bitmap chunk size of 65536, but up to 10% or more with smaller bitmap chunks, such as 8192). This means that if a write intent bitmap is added to an array, it is better to keep its size relatively large. The recommended size is 65536[^4].
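As a hedged example, mdadm’s --bitmap-chunk option (size in KiB) can be passed when the bitmap is added, simply mirroring the recommendation above:
mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=65536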
To see if a RAID is persistent:
> mdadm --detail /dev/md0
/dev/md0:
Version : 1.0
Creation Time : Sat May 12 09:35:34 2012
Raid Level : raid10
Array Size : 1949426688 (1859.12 GiB 1996.21 GB)
Used Dev Size : 974713344 (929.56 GiB 998.11 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Thu Aug 30 16:43:17 2012
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : near=2
Chunk Size : 128K
Name : N7700:2
UUID : 1a83e7dc:daa7d822:15a1de4d:e4f6fd19
Events : 64
Number Major Minor RaidDevice State
0 8 18 0 active sync /dev/sdb2
1 8 2 1 active sync /dev/sda2
2 8 50 2 active sync /dev/sdd2
3 8 34 3 active sync /dev/sdc2
To add write intent bitmap (internal):
mdadm /dev/md0 --grow --bitmap=internal
To add write intent bitmap (external):
mdadm /dev/md0 --grow --bitmap=/mnt/mon_fichier
And to remove it:
mdadm /dev/md0 --grow --bitmap=none
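To check whether a bitmap is present (for example after adding or removing one), /proc/mdstat gains a “bitmap:” line, and mdadm can dump the bitmap superblock; the member device name below is the one from the example array:
cat /proc/mdstat                    # a "bitmap: ..." line appears under the array
mdadm --examine-bitmap /dev/sdb2    # or -X: dump the internal bitmap superblock of a member
                                    # (point it at the bitmap file instead for an external bitmap)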
To flag the slower disk as ‘write-mostly’ and set the write-behind limit:
mdadm -C /dev/md0 -l1 -n2 -b /tmp/md0 --write-behind=256 /dev/sda1 --write-mostly /dev/sdb1
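To verify the flags afterwards, for example:
cat /proc/mdstat                          # write-mostly members are marked with "(W)"
mdadm --detail /dev/md0 | grep -i mostly  # the device state should include "writemostly"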
Last updated 31 Aug 2012, 20:45 CEST.