ZFS: The Filesystem Par Excellence
Introduction
ZFS, or Z File System, is an open-source filesystem released under the CDDL license. The ‘Z’ doesn’t officially stand for anything, but it has been read in various ways in the press, such as Zettabyte (after the storage unit), or ZFS as “the last word in filesystems”.
Produced by Sun Microsystems for Solaris 10 and later, it was designed by Jeff Bonwick’s team. Announced in September 2004, it was integrated into Solaris on October 31, 2005, and released on November 16, 2005, as a feature of OpenSolaris build 27. Sun announced that ZFS was included in the Solaris update of June 2006, one year after the opening of the OpenSolaris community.
This filesystem’s main characteristics are its very high storage capacity and its integration of all the previous concepts of filesystems and volume management into a single product. It has its own on-disk structure, is lightweight, and makes it easy to set up a storage management platform.
Locating Your Disks
Use your usual tools to identify your disks. For example, under Solaris:
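The format command lists the detected disks; piping echo into it skips the interactive prompt:
echo | format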
Zpool
A Zpool is similar to a VG (Volume Group) for those familiar with LVM. The catch is that, at the time of writing this article, you cannot reduce the size of a zpool, only increase it. You can use a zpool directly as a filesystem, since it’s backed by ZFS. Inside it you can create ZFS filesystems (that’s what they’re called, I know it’s confusing, but think of them more like LVs (Logical Volumes) or partitions), and you can also create volumes that can host other filesystem types (NTFS, EXT4, REISERFS…).
Creating a Zpool
To create a zpool, follow these steps:
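The command looks like this, using the parameters described below:
zpool create zpool_name c0t600A0B80005A2CAA000004104947F51Ed0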
- zpool_name: specify the name you want for the zpool
- c0t600A0B80005A2CAA000004104947F51Ed0: this is the device name displayed by the format command
Listing Zpools
Simple Zpool
To know which pools exist on the machine:
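For example:
zpool list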
Raid-Z
A Raid-Z is like a Raid 5, but without one major problem: no parity resynchronization and no parity loss after a power outage (the Raid 5 “write hole”). Here’s how to do it:
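A sketch with three hypothetical devices (replace them with your own):
zpool create zpool_name raidz c0t0d0 c1t0d0 c2t0d0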
Mounting a Zpool
By default, zpools have the mount point /zpool_name. To mount a zpool, we’ll use the zfs command:
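For example:
zfs mount zpool_name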
It will remember where it should be mounted, as this information is stored in the filesystem.
Unmounting a Zpool
This is super simple, as usual:
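For example, assuming the default mount point:
umount /zpool_name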
Just use umount followed by the mount point.
Deleting a Zpool
To delete a zpool:
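For example:
zpool destroy zpool_name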
Expanding a Zpool
To expand a zpool, we’ll use the zpool name and the additional device:
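A sketch with a hypothetical extra device:
zpool add zpool_name c1t0d0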
Modifying Zpool Parameters
Changing the Mount Point of a Zpool
By default, zpools are mounted directly under / (as /zpool_name). To change this:
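Using the values described below:
zfs set mountpoint=/mnt/datas my_zpool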
- /mnt/datas: the desired mount point
- my_zpool: name of the zpool
Importing All Zpools
To import all zpools:
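With the options described below:
zpool import -f -a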
- -f: force (optional and may be dangerous in some cases)
- -a: will import all zpools
Renaming a Zpool
Renaming a zpool is actually not very complicated:
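There is no dedicated rename command; you export the pool and import it under its new name (names here are placeholders):
zpool export my_zpool
zpool import my_zpool my_new_zpool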
And that’s it, the zpool is renamed :-).
Using in a Cluster Environment
In a cluster environment, you’ll need to mount and unmount zpools quite regularly. If you use Sun Cluster (version 3.2 at the time of writing), you are forced to handle mounts at the zpool level: filesystems belonging to the same zpool cannot be mounted and unmounted independently from one node to another.
You’ll therefore need to unmount the zpool, export the information in ZFS, then import it on the other node. Imagine the following scenario:
- sun-node1 (node 1)
- sun-node2 (node 2)
- 1 disk array with 1 LUN of 10 GB
The LUN is a zpool created as described earlier in this document on sun-node1. Now you need to switch this zpool to sun-node2. Since ZFS is not a cluster filesystem, you must unmount it, export the information, then import it. Unmounting is not mandatory since the export will do it, but if you want to do things properly, then let’s proceed on sun-node1:
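A sketch, with my_zpool as a placeholder pool name:
# unmount the pool's filesystem (optional, the export does it anyway)
zfs umount my_zpool
# export the pool so the other node can import it
zpool export my_zpool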
Now, list the available zpools; you should no longer see it. Let’s move to sun-node2:
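Still with the placeholder pool name:
zpool import my_zpool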
There you go: you’ll find your files. Normally the pool is mounted automatically, but if for some reason it isn’t, you can mount it manually (see above).
ZFS
Creating a ZFS Partition
Creating a ZFS partition is extremely simple! You obviously need your zpool created first; then execute:
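A sketch with placeholder names:
zfs create zpool_name/my_fs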
Then you can specify options with -o and see all available options with:
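One way is to display all the properties of the dataset and their current values (placeholder names again):
zfs get all zpool_name/my_fs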
Renaming a Partition
If you want to rename a ZFS partition, nothing could be simpler:
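A sketch with placeholder names:
zfs rename zpool_name/my_fs zpool_name/my_new_fs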
Managing Swap on ZFS
On Solaris, you can combine multiple swap spaces (partitions and files alike).
To list the swap spaces used:
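For example:
swap -l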
To know a bit more, you can also use:
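For instance, listing the ZFS volumes shows which ones back the swap (one possibility; swap -s also gives a usage summary):
zfs list -t volume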
Here we have two ZFS volumes used as swap. By default, when Solaris is installed on a ZFS root pool, a swap volume named “rpool/swap” is created.
Adding SWAP
You can either increase the size of the ZFS volume associated with the existing swap, or add a new one.
Adding a Swap
Let’s verify the number of assigned swaps:
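For example:
swap -l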
Now we add a ZFS volume:
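A sketch creating a 30 GB volume (rpool/swap2 is a placeholder name):
zfs create -V 30G rpool/swap2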
Here we’ve just created the swap volume at 30 GB. Then we declare this new volume as swap:
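Still with the placeholder volume name:
swap -a /dev/zvol/dsk/rpool/swap2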
Now, when I display the list of active swap devices, I can see the new one:
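For example:
swap -l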
If you get a message like this:
/dev/zvol/dsk/rpool/swap is in use for live upgrade -. Please see ludelete(1M).
You’ll need to use the following command to activate it:
|
|
Expanding a Swap
When the machine is running and the swap space is in use, you can increase the size of the swap so the system can use the extra space. This requires deactivating and reactivating the swap for the new space to be taken into account. For this, we’ll expand the ZFS volume:
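For example, growing the swap volume to a hypothetical 40 GB:
zfs set volsize=40G rpool/swap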
Now we’ll deactivate the swap:
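For example:
swap -d /dev/zvol/dsk/rpool/swap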
You now need to delete or comment out the entry in /etc/vfstab that corresponds to the swap, as it will be automatically created in the next step:
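The entry in question typically looks like this (shown here commented out):
#/dev/zvol/dsk/rpool/swap   -   -   swap   -   no   -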
Then reactivate it so the new size is taken into account:
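For example:
swap -a /dev/zvol/dsk/rpool/swap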
You can check the swap size:
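For example:
swap -l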
Advanced Usage
The ZFS ARC Cache
The problem with ZFS is that it’s very RAM-hungry (about 1/8 of the total + swap). This can quickly become problematic on machines with a lot of RAM. Here’s some explanation.
Available Memory (mdb -k) and ZFS ARC I/O Cache
The mdb -k command with the ::memstat option provides a global view of available memory on a Solaris machine:
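For example:
echo ::memstat | mdb -k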
In the example above, this is a machine with 32 GB of physical memory.
ZFS uses a kernel cache called ARC for I/Os. To know the size of the I/O cache used by ZFS at a given time, use the kmastat option with the mdb -k command and look for the Total [zio_buf] statistic:
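For example (filtering on zio_buf to keep the output short):
echo ::kmastat | mdb -k | grep zio_buf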
In the example above, the ZFS I/O cache uses 1.1 GB in memory.
Limiting the ZFS ARC Cache (zfs_arc_max and zfs_arc_min)
For machines with a very large amount of memory, it’s better to limit the ZFS I/O cache so it doesn’t encroach on the memory of other applications. In practice, this cache grows and shrinks dynamically based on the needs of the applications installed on the machine, but it’s safer to cap it. The zfs_arc_max parameter (in bytes) in the /etc/system file limits the amount of memory used by the ZFS I/O cache. Below is an example where the ZFS I/O cache is limited to 4 GB:
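A sketch of the /etc/system entries (4 GB = 4294967296 bytes; a reboot is required for them to take effect):
set zfs:zfs_arc_max = 4294967296
* a floor can be set the same way, for example:
* set zfs:zfs_arc_min = 1073741824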
Similarly, you can specify the minimum amount of memory to allocate to the ZFS I/O cache with the zfs_arc_min parameter in the /etc/system file.
Statistics on the ZFS ARC Cache (kstat zfs)
The kstat command with the zfs option provides detailed statistics on the ZFS ARC cache (hits, misses, size, etc.) at a given time: you’ll find the maximum possible value (c_max) for this cache and its current size (size) in the output of this command. In the example below, the zfs_arc_max parameter hasn’t been applied yet, which explains why the maximum possible size corresponds to the physical memory of the machine.
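For example, dumping the ARC statistics in parseable form:
kstat -p zfs:0:arcstats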
I also recommend the excellent arc_summary script, which provides very precise information, as well as arcstat.
Not Mounting All Zpools at Boot
If you encounter errors during zpool mounting when booting your machine (in my case, a continuous reboot loop), there is a solution that makes ZFS forget all the pools imported during the last session (ZFS remembers imported zpools and automatically reimports them at boot, which is convenient but can be constraining in some cases).
To do this, you’ll need to boot in single user or multi-user mode (if it doesn’t work, try failsafe mode for Solaris (chrooted for Linux)), then we’ll remove the ZFS cache:
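The cache file is /etc/zfs/zpool.cache:
rm /etc/zfs/zpool.cache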
If your OS is installed on ZFS, and you’re in failsafe mode, you’ll need to repopulate the cache (the /a corresponds to the root under Solaris):
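One way to do this, assuming a root pool named rpool mounted under /a, is to point its cachefile property at the cache file inside the mounted root:
zpool set cachefile=/a/etc/zfs/zpool.cache rpool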
Then reboot and reimport the desired zpools.
FAQ
FAULTED
I got a FAULTED pool without really knowing why; it showed up when checking the state of my pools:
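For example, the HEALTH column of the listing reports the pool state:
zpool list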
Here we need to debug a bit. For that, we use the following command:
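For example, a detailed status with verbose error reporting (pool name is a placeholder):
zpool status -v my_zpool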
That doesn’t look good! The error messages are scary. Yet the solution is simple: just export and reimport (forcing if necessary) the defective zpools.
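A sketch of that fix, still with a placeholder pool name:
zpool export my_zpool
zpool import -f my_zpool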
Then, you can check the status of your filesystem via a scrub:
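For example (zpool status then reports the scrub progress and result):
zpool scrub my_zpool
zpool status my_zpool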
How to Repair Grub After Zpool Upgrade
It appears that zpool upgrade can break the grub bootloader.
To fix this, we need to reinstall grub on the partition. Proceed as follows:
- Unplug all fiber cables from the server (and disconnect any disks other than the OS disks)
- Boot from the Solaris 10 install DVD
- At the boot menu, select option 6: “Start Single User Shell”
- It will scan for an existing zpool containing an OS installation, then ask whether you want to mount your rpool on /a. Answer “yes”
- When you get the prompt, launch this to check the status of the rpool:
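For example:
zpool status rpool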
- This way we can see all the disks involved in the zpool (here c3t0d0s0). We will reinstall grub on all the disks in the zpool with this command:
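A sketch for the disk shown above, taking the grub stage files from the rpool mounted under /a:
installgrub /a/boot/grub/stage1 /a/boot/grub/stage2 /dev/rdsk/c3t0d0s0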
- Unmount the zpool
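One way, assuming the installer mounted the root pool for you, is to export it (this unmounts it as well):
zpool export rpool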
- Plug the fiber cables back in
- Reboot
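For example:
init 6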
That’s it!
Last updated 02 Jan 2013, 13:30 +0200.