Software RAID Configuration
Introduction
Not everyone can afford a RAID 5 card with proper disks. That’s why a small software RAID 5 can be a good solution, especially for home use!
Creating a RAID
RAID 1
Creating a RAID 1 is simple: you need two disks, each with a partition of the same size, then run this command:
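For example, assuming the two partitions are /dev/sdb1 and /dev/sdc1 (adjust to your own devices):

```bash
# Create /dev/md0 as a RAID 1 over two blank partitions
# (--assume-clean skips the initial resync: only safe on blank devices)
mdadm --create /dev/md0 --assume-clean --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
```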
- create: creates the array
- assume-clean: makes the array immediately usable, without a full initial synchronization. This requires being 100% sure that both disks/partitions are blank.
- level: the RAID level (here RAID 1)
- raid-devices: the number of devices used
Monitoring
To see if everything is working properly, here are several solutions:
- The /proc/mdstat file:
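For instance:

```bash
# Lists every md array with its members, state and resync/rebuild progress
cat /proc/mdstat
```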
- The mdadm command which will allow us to have an exact view of the raid status:
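For example, on /dev/md0:

```bash
# Detailed view: state, UUID, member devices, failed/spare counts...
mdadm --detail /dev/md0
```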
- And finally the best option, to be alerted in case of problems:
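A possible invocation (the mail address and polling delay are only examples):

```bash
# Watch every array in the background and send a mail on any event
mdadm --monitor --scan --daemonise --mail=root@localhost --delay=1800
```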
I won’t go into details, the options speak for themselves.
Problem Cases
Compared to what we saw above, here is what it looks like when a problem occurs:
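For instance, /proc/mdstat on a degraded RAID 1 typically looks like this (illustrative output):

```bash
cat /proc/mdstat
# md0 : active raid1 sdb1[0] sdc1[2](F)
#       1048512 blocks [2/1] [U_]
```

The (F) flags the faulty member and [U_] shows that only one of the two mirrors is still up.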
And finally, a little mdadm:
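For example (on a degraded array, the State line reports "clean, degraded" and the failed member shows up as faulty):

```bash
mdadm --detail /dev/md0
```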
Repairing Your RAID 5
Replace the problematic disk, then add it to your raid:
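Assuming the replacement partition shows up as /dev/sdc1 (adjust to your layout):

```bash
# Add the new partition; reconstruction starts automatically
mdadm --manage /dev/md0 --add /dev/sdc1
```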
Now, you can monitor the restoration via this command:
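For example:

```bash
# Refresh the rebuild progress every second
watch -n1 cat /proc/mdstat
```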
We can also view the reconstruction like this:
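For example:

```bash
# While rebuilding, the output contains a "Rebuild Status : xx% complete" line
mdadm --detail /dev/md0
```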
Increasing RAID Performance
I won’t discuss the different RAID types here, but will rather refer you to Wikipedia[^1]. For using software RAID under Linux, I recommend this documentation[^2]. We will focus on performance, since that is the subject here. RAID 0 is the fastest of all RAID levels, but it obviously provides no data protection if a disk is lost.
The MTBF (Mean Time Between Failures) also matters for RAID arrays: it is an estimate of how long the array can be expected to run before a disk is detected as failing.
Chunk Size
The “chunk size” (also called stripe size or element size by some vendors) is the amount of data (in KiB) written to or read from each device before moving on to the next one. The algorithm used is Round Robin. The chunk size must be an integer multiple of the block size. The larger the chunk size, the faster the writes on very large data, but conversely the slower they are on small data. If the average size of IO requests is smaller than a chunk, the request will land on a single disk of the RAID, cancelling all the advantages of the RAID. Reducing the chunk size breaks large files into smaller pieces distributed across more disks, which improves performance; on the other hand, it increases positioning (seek) overhead. Some hardware does not allow writing until a stripe is complete, which cancels this positioning latency effect.
A good rule to define the chunk size is to divide roughly the size of IO operations by the number of disks on the RAID (remove parity disks if RAID 5 or 6).
Quick reminder:
RAID 0: No parity
RAID 1: No parity
RAID 5: 1 parity disk
RAID 6: 2 parity disks
RAID 10: No parity disks
If you have no idea about your IOs, take a value between 32 KB and 128 KB, as a multiple of 2 KB (or 4 KB if you have larger block sizes). The chunk size (stripe size) is an important factor in the performance of your RAID. If the stripe is too wide, the array may have a “hot spot”: the disk that receives the most IO, which reduces the performance of your RAID. Performance is obviously best when data is spread across all disks. The formula is therefore:
Chunk size = average request IO size (avgrq-sz) / number of disks
To get the average request size, I invite you to check the Sysstat documentation[^3], which covers iostat and sar.
- To see the chunk size on a RAID (here md0):
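One way is through mdadm (output shortened, values illustrative):

```bash
mdadm --detail /dev/md0 | grep -i chunk
# Chunk Size : 128K
```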
It’s therefore 128KB here.
Here’s another way to see it:
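/proc/mdstat prints the chunk size at the end of the array line (illustrative output):

```bash
cat /proc/mdstat
# md0 : active raid0 sdb1[0] sdc1[1]
#       976770560 blocks 128k chunks
```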
Or even:
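Or through sysfs, where the value is reported in bytes:

```bash
cat /sys/block/md0/md/chunk_size
# 131072 bytes = 128 KiB
```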
- It’s possible to define the chunk size when creating the RAID with the -c or --chunk argument. Let’s also see how best to calculate it. First, let’s use iostat to get the avgrq-sz value:
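For example, sampling extended statistics every 2 seconds, 5 times (the avgrq-sz column is the one we are after):

```bash
# avgrq-sz = average size of the requests issued to the device, in 512-byte sectors
iostat -x 2 5
```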
Let’s then do the calculation to get the chunk size in KiB:
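avgrq-sz is expressed in 512-byte sectors, so it has to be converted to KiB. Assuming a measured value of 47.55 (which gives the 23.775 used below):

```bash
echo "47.55*512/1024" | bc -l
# 23.775 (KiB)
```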
We must then divide this value by the number of disks (let’s say 2) and round it to the nearest power of 2:
Chunk size (KB) = 23.775 / 2 = 11.88 ≈ 8
Here the chunk size to set is 8, since it’s the power of 2 closest to 11.88.
To create a raid 0 by defining the chunk size:
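For example, with the 8 KiB chunk size computed above and two illustrative partitions:

```bash
mdadm --create /dev/md0 --level=0 --chunk=8 --raid-devices=2 /dev/sdb1 /dev/sdc1
```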
Stride
The stride is a parameter passed when creating the filesystem on a RAID; it optimizes the way the filesystem places its data blocks on the disks before moving to the next one. With ext filesystems (ext2/3/4), it is set with the -E stride option and corresponds to the number of filesystem blocks in a chunk. To calculate the stride:
Stride = chunk size / block size
For example, for a RAID 0 with a chunk size of 64 KiB and 4 KiB filesystem blocks (stride = 64 KiB / 4 KiB = 16):
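A possible mkfs invocation for that case (the device name is an example):

```bash
# -b sets the filesystem block size, -E stride the number of blocks per chunk
mkfs.ext4 -b 4096 -E stride=16 /dev/md0
```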
Some disk controllers abstract the physical block layout, making it impossible for the kernel to know it. Here’s an example of how to see the stride size:
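For example, with tune2fs (illustrative output):

```bash
tune2fs -l /dev/md0 | grep -i stride
# RAID stride: 16
```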
Here, the stride is 16 (in filesystem blocks, i.e. 64 KiB with 4 KiB blocks).
To calculate the stride, there’s also a website: http://busybox.net/~aldot/mkfs_stride.html
Round Robin
RAIDs without parity allow data segmentation across multiple disks to increase performance using the Round Robin algorithm. The segment size is defined at the creation of the RAID and refers to the chunk size.
The size of a RAID is defined by the smallest disk at the creation of the RAID. The size can vary in the future if all disks are replaced by larger capacity disks. A resynchronization of the disks will take place and the filesystem can be extended.
So for Round Robin tuning, you need to properly tune the chunk size and stride so that the usage of the algorithm is optimal! That’s all :-)
Parity RAIDs
One of the big performance constraints of RAID 5 and 6 is parity calculation: before data can be committed to the array, the parity has to be computed, and only then can the data and parity be written.
Each data update requires 4 IO operations:
- The data to be updated is first read from the disks
- The new data is written (but the parity is not correct yet)
- The other blocks of the same stripe are read and the parity is computed
- The new data and the new parity are finally written to the disks
In RAID 5, it’s recommended to use stripe caching:
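The stripe cache is set through sysfs; the value of 4096 below is only an example (it is a number of cache entries per device, so it costs memory):

```bash
echo 4096 > /sys/block/md0/md/stripe_cache_size
```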
For more information on RAID optimizations: http://kernel.org/doc/Documentation/md.txt[^4][^5]. For the optimization part, look at the following parameters:
- chunk_size
- component_size
- new_dev
- safe_mode_delay
- sync_speed_{min,max}
- sync_action
- stripe_cache_size
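These parameters live under /sys/block/<md device>/md/; for example:

```bash
# Read the current values
cat /sys/block/md0/md/chunk_size
cat /sys/block/md0/md/sync_action
# Raise the minimum resync speed for this array (in KiB/s)
echo 50000 > /sys/block/md0/md/sync_speed_min
```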
RAID 1
The RAID driver writes to the bitmap when changes have been detected since the last synchronization. A major drawback of RAID 1 is that after an unclean shutdown (a power cut, for instance), the whole array has to be resynchronized. With a ‘write-intent’ bitmap, only the parts that have changed need to be resynchronized, which greatly reduces the reconstruction time.
If a disk fails and is removed from the RAID, md stops clearing bits in the bitmap. If that same disk is reintroduced into the RAID, md only has to resynchronize the difference. When creating the RAID, if a write-intent bitmap is combined with ‘--write-behind’, write requests to devices flagged ‘--write-mostly’ are not waited for before the write is considered complete. The ‘--write-behind’ option can be useful for RAID 1 members reached over slow links.
The new mdraid arrays support write-intent bitmaps. These help the system identify the problematic parts of an array; after an incorrect shutdown, only those parts have to be resynchronized, not the entire disk, which drastically reduces the time required for resynchronization. Newly created arrays automatically get a write-intent bitmap when appropriate; arrays used as swap and very small arrays (such as /boot arrays) do not benefit from one. It is possible to add a write-intent bitmap to a previously existing array with the mdadm --grow command. Write-intent bitmaps do incur a small performance cost (about 3-5% with a bitmap chunk size of 65536, but it can climb to 10% or more with smaller bitmap chunks, such as 8192), so if you add one, it is better to keep its size relatively large. The recommended size is 65536.
To see if a RAID is persistent:
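For example:

```bash
mdadm --detail /dev/md0 | grep -i persistence
# Persistence : Superblock is persistent
```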
To add the write intent bitmap (internal):
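For example, on /dev/md0:

```bash
mdadm --grow /dev/md0 --bitmap=internal
```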
To add the write intent bitmap (external):
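The bitmap file must live on a filesystem that is not hosted on the array itself (the path below is only an example):

```bash
mdadm --grow /dev/md0 --bitmap=/var/lib/md0-bitmap
```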
And to remove it:
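For example:

```bash
mdadm --grow /dev/md0 --bitmap=none
```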
To define the slow disk and the fastest one:
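This is done with --write-mostly (optionally combined with --write-behind) when creating the array; here /dev/sdc1 is assumed to be the slow device:

```bash
# Reads are served from /dev/sdb1; /dev/sdc1 only receives writes,
# and --write-behind (which requires a bitmap) stops waiting for them
mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal \
      --write-behind=256 /dev/sdb1 --write-mostly /dev/sdc1
```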
FAQ
I have an md127 appearing and my md0 is broken
First, you need to repair the RAID with mdadm. Then, you need to add the current configuration to mdadm.conf, so that at boot time, it doesn’t try to guess a wrong configuration. Simply run this command when your RAID is working properly:
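On Debian-like systems the file is /etc/mdadm/mdadm.conf (elsewhere it is often /etc/mdadm.conf):

```bash
# Append the currently running configuration to mdadm.conf
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
```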
References
Last updated 08 Aug 2014, 08:29 CEST.