Multipath on Debian
Installation
To make multipath work on Debian, you'll need the 'multipath-tools-initramfs' and 'multipath-tools' packages. But as noted in the 'multipath-tools-initramfs' bug list, you need to correct '/usr/share/initramfs/hooks/multipath_hook'.
When you look at the bug list at http://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=multipath-tools-initramfs;dist=unstable, you'll see that some tools are missing.
The first, very important, things to add to 'multipath_hook' are:
manual_add_modules dm-multipath
manual_add_modules dm-mod
manual_add_modules dm-round-robin
Then you need to add this:
for helper in /sbin/mpath_prio_*; do
    copy_exec $helper /sbin
done
And finally, if you want to use aliases, add:
copy_exec /etc/multipath.conf /etc/
Optionally, you can comment out the line:
copy_exec /bin/readlink /bin/
And you're done with it. Your file should look like this:
#!/bin/sh
# The environment contains at least:
#
# CONFDIR -- usually /etc/mkinitramfs, can be set on mkinitramfs
#            command line.
#
# DESTDIR -- The staging directory where we are building the image.
#
PREREQ=""

prereqs()
{
    echo "$PREREQ"
}

case $1 in
# get pre-requisites
prereqs)
    prereqs
    exit 0
    ;;
esac

# You can do anything you need to from here on.
#
# Source the optional 'hook-functions' scriptlet, if you need the
# functions defined within it. Read it to see what is available to
# you. It contains functions for copying dynamically linked program
# binaries, and kernel modules into the DESTDIR.
#
. /usr/share/initramfs-tools/hook-functions

copy_exec /sbin/multipathd /sbin/
copy_exec /sbin/scsi_id /sbin/
copy_exec /sbin/kpartx /sbin/
copy_exec /bin/mountpoint /bin/
copy_exec /sbin/devmap_name /sbin/
copy_exec /sbin/multipath /sbin/
# Modified by tchetch
#copy_exec /bin/readlink /bin/
# Added by tchetch
copy_exec /etc/multipath.conf /etc/
for helper in /sbin/mpath_prio_*; do
    copy_exec $helper /sbin
done
manual_add_modules dm-multipath
manual_add_modules dm-mod
manual_add_modules dm-round-robin

mkdir -p $DESTDIR/lib || true
cp /lib/libgcc_s.so.1 $DESTDIR/lib/

exit 0
Configuration
This part depends on your hardware; I've been working only with a SAN from IBM. Now you'll need to configure the file '/etc/multipath.conf'. First, create aliases for your devices:
multipaths {
    multipath {
        wwid  3600a0b8000177d9400002e61463f2ed3
        alias system
    }
    multipath {
        wwid  3600a0b8000177bcc0000256645f7f166
        alias data
    }
}
- alias: the name you want to give to the Logical Drive attached to your Blade.
- wwid: World Wide ID, a unique ID assigned to each Logical Drive.
Now you can configure some options, like devices. For each device connected to your system you can define options. I've got only one SAN attached to my system, so it's easy:
devices {
    device {
        vendor                "IBM.*"
        product               "1722-600"
        path_grouping_policy  group_by_serial
        path_checker          tur
        path_selector         "round-robin 0"
        prio_callout          "/sbin/mpath_prio_tpc /dev/%n"
        failback              immediate
        features              "1 queue_if_no_path"
        no_path_retry         300
    }
}
- vendor: the name of the vendor of your system. This will be used to identify your SAN. For IBM, IBM.* works.
- product: the product name of your SAN. Mine is a DS4300, but Storage Manager reports Product ID: 1722-600.
- path_grouping_policy: depends on how you want to use your SAN. For example, multibus doesn't work on my SAN. I use group_by_serial because I've seen an IBM SAN document that uses it. Other options are failover and multibus. To find the best one, test (I personally tested all of them, and for me group_by_serial works best).
- path_checker: can be readsector0 or tur. On my SAN, readsector0 triggers path switching, so my SAN is not happy with it and reports a problem.
- prio_callout: this is where I spent most of my testing time. To find out which prio_callout helpers you've got, go to /sbin and list the mpath_prio_* binaries. Then test them and choose the one that returns the right value, but do the test in the initrd environment, because the behaviour there is different. I'll explain more later.
- failback: defines when to come back to the original path once it is up again. Set it to immediate, a value in seconds, or manual if you want to disable path failback.
- features: I don't know what it is, but it was used by someone working on an IBM SAN.
- no_path_retry: how many times to retry before failing. Can be a number of retries, fail for immediate failure, or queue to keep trying forever.
Now you can add default values for all the devices:
defaults {
    udev_dir                /dev
    polling_interval        2
    default_getuid_callout  "/sbin/scsi_id -g -u -s /block/%n"
    user_friendly_names     yes
}
- udev_dir: where the device filesystem lives.
- polling_interval: time in seconds between two checks on a path.
- default_getuid_callout: the command used to get the WWID.
- user_friendly_names: if no aliases are set, this defines whether the chosen name will be user friendly (mpathX) or system friendly (the WWID instead).
And finally, you should add this, taken from the Debian example file:
devnode_blacklist {
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z][[0-9]*]"
    devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}
This just sets which devices won't be taken into account when building the multipath maps.
Your file should look like this:
##
## This is a template multipath-tools configuration file
## Uncomment the lines relevant to your environment
##
defaults {
    udev_dir                /dev
    polling_interval        2
    default_getuid_callout  "/sbin/scsi_id -g -u -s /block/%n"
    user_friendly_names     yes
}
devnode_blacklist {
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z][[0-9]*]"
    devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}
devices {
    device {
        vendor                "IBM.*"
        product               "1722-600"
        path_grouping_policy  group_by_serial
        path_checker          tur
        path_selector         "round-robin 0"
        prio_callout          "/sbin/mpath_prio_tpc /dev/%n"
        failback              immediate
        features              "1 queue_if_no_path"
        no_path_retry         300
    }
}
multipaths {
    multipath {
        wwid  3600a0b8000177d9400002e61463f2ed3
        alias system
    }
    multipath {
        wwid  3600a0b8000177bcc0000256645f7f166
        alias data
    }
}
How to get the WWID
This is done with scsi_id. For example, for the sda device, you'd run:
/sbin/scsi_id -g -u -s /block/sda
Don't ask me why it's /block and not /dev !
You might notice something: on my system, for example, sda and sdc report the same ID. That's normal; sda is the first path and sdc is the second path, but the logical drive is the same.
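To check this for yourself, you can loop over all the SCSI disks and print the WWID each one reports; a minimal sketch, assuming scsi_id lives in /sbin and your paths show up as /dev/sd*:

```shell
# Print the WWID reported by each SCSI disk; two paths to the same
# logical drive will print the same ID.
for dev in /dev/sd[a-z]; do
    name=$(basename "$dev")
    echo "$name: $(/sbin/scsi_id -g -u -s /block/"$name")"
done
```

Devices that print identical WWIDs are different paths to the same logical drive.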
Building initrd
Now you're ready to build the initrd. If possible, use the same tool that your distribution uses when upgrading the kernel. On Debian you can do this:
dpkg-reconfigure linux-image-2.6.18-4-686
linux-image-2.6.18-4-686 is the package I installed for the kernel.
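Alternatively, initramfs-tools ships an update-initramfs command that rebuilds the image directly; a sketch (run as root, substituting your own kernel version):

```shell
# Rebuild the initramfs for the currently running kernel so the
# modified multipath hook gets picked up.
update-initramfs -u -k $(uname -r)
```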
Modifying your grub/fstab
Now you have to change grub and fstab to point to the right devices. If you used aliases, your device will likely be accessible as /dev/mapper/aliasX, where alias is the name you chose and X is the partition number.
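For example, with the aliases from the configuration above, the relevant /etc/fstab entries could look like this (a sketch; adapt device names, mount points, and filesystem types to your setup):

```
/dev/mapper/system1  /     ext3  defaults,errors=remount-ro  0  1
/dev/mapper/data     /srv  xfs   defaults                    0  2
```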
On Debian, don't forget to change the kopt value in /boot/grub/menu.lst, so that the next kernel upgrade won't break the hard work you've done:
##
## Start Default Options ##
## default kernel options
## default kernel options for automagic boot options
## If you want special options for specific kernels use kopt_x_y_z
## where x.y.z is kernel version. Minor versions can be omitted.
## e.g. kopt=root=/dev/hda1 ro
##      kopt_2_6_8=root=/dev/hdc1 ro
##      kopt_2_6_8_2_686=root=/dev/hdc2 ro
# kopt=root=/dev/mapper/system1 ro
Now reboot
When you reboot, it might fail and the root file system might not get mounted. In that case, wait until the initrd shell shows up and test the behaviour of the different parameters you set until you have only good values.
For example, on my system, calling mpath_prio_balance_units on the running system returns the correct value, but in the initrd it returns nothing. The environment is different, so you have to find a solution that works in the initrd and then adapt your configuration.
You can always chroot into your root filesystem from the initramfs shell. Use it to reconfigure your initrd.
If your system boots the first time, you're much luckier than I was. Now you can run:
bladeTest:~# multipath -ll
system (3600a0b8000177d9400002e61463f2ed3) dm-0 IBM,1722-600
[size=5.0G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
 \_ 0:0:0:0 sda 8:0  [active][ready]
\_ round-robin 0 [prio=6][active]
 \_ 0:0:1:0 sdc 8:32 [active][ready]
data (3600a0b8000177bcc0000256645f7f166) dm-1 IBM,1722-600
[size=9.0G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=6][active]
 \_ 0:0:0:1 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 0:0:1:1 sdd 8:48 [active][ready]
And see all your paths. For me, a configuration with data set to multibus would give this kind of output:
bladeTest:~# multipath -ll
system (3600a0b8000177d9400002e61463f2ed3) dm-0 IBM,1722-600
[size=5.0G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
 \_ 0:0:0:0 sda 8:0  [active][ready]
\_ round-robin 0 [prio=6][active]
 \_ 0:0:1:0 sdc 8:32 [active][ready]
data (3600a0b8000177bcc0000256645f7f166) dm-1 IBM,1722-600
[size=9.0G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=7][enabled]
 \_ 0:0:0:1 sdb 8:16 [active][ready]
 \_ 0:0:1:1 sdd 8:48 [active][ready]
Testing configuration
When your system boots perfectly but you want to try other configurations without rebooting every time, just attach another partition to your system and work on it. When a partition is mounted you cannot change its multipath table, but if it's not mounted you can clear the table with multipath -f alias and then rebuild a new one with multipath alias.
For example on my test system I've got system and data. So when the system is running I cannot change system because this is the root file system, but data can be used to test.
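Put together, one test cycle on the unmounted data device looks like this; a sketch, assuming data is the alias from the configuration above and is normally mounted on /srv:

```shell
# Flush the existing multipath table, rebuild it with the current
# /etc/multipath.conf, and remount the filesystem.
umount /srv
multipath -f data
multipath data
mount /dev/mapper/data /srv
```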
Hot-adding a host to the system (QLogic)
QLogic provides a little script that scans for new hosts, available at http://download.qlogic.com/ms/56615/readme_dynamic_lun_22.html.
This script will scan for new hosts. Then just run:
bladeTest:~# multipath
sdb: checker msg is "tur checker reports path is down"
sdd: checker msg is "tur checker reports path is down"
sdf: checker msg is "tur checker reports path is down"
sdg: checker msg is "tur checker reports path is down"
sdb: checker msg is "tur checker reports path is down"
sdf: checker msg is "tur checker reports path is down"
sdg: checker msg is "tur checker reports path is down"
and then :
bladeTest:~# multipath -ll
mpath2 (3600a0b8000177bcc0000256545f7aa8a) dm-4 IBM,1722-600
[size=5.0G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=6][enabled]
 \_ 0:0:0:2 sdf 8:80 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 0:0:1:2 sdg 8:96 [active][ready]
system (3600a0b8000177d9400002e61463f2ed3) dm-0 IBM,1722-600
[size=5.0G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=1][enabled]
 \_ 0:0:0:0 sda 8:0  [active][ready]
\_ round-robin 0 [prio=6][active]
 \_ 0:0:1:0 sdc 8:32 [active][ready]
data (3600a0b8000177bcc0000256645f7f166) dm-1 IBM,1722-600
[size=9.0G][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=6][enabled]
 \_ 0:0:0:1 sdb 8:16 [active][ready]
\_ round-robin 0 [prio=1][enabled]
 \_ 0:0:1:1 sdd 8:48 [active][ready]
As seen above, I have a new host, and I can configure it or just use it as /dev/mapper/mpath2 if it's for one-time use. Nice!
Using XFS
Resize partition
Resizing a partition on the SAN has a pretty simple solution. We have data and system as multipath partitions; we added 1G to data, so we need to rescan the whole stack. First, check which SCSI buses the partition sits on:
bladeTest:/# multipath -ll
sdc: checker msg is "readsector0 checker reports path is down"
sdd: checker msg is "readsector0 checker reports path is down"
system (3600a0b8000177d9400002e61463f2ed3) dm-0 IBM,1722-600
[size=5.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 0:0:0:0 sda 8:0  [active][ready]
 \_ 0:0:1:0 sdc 8:32 [failed][faulty]
data (3600a0b8000177bcc0000256645f7f166) dm-1 IBM,1722-600
[size=8.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][active]
 \_ 0:0:0:1 sdb 8:16 [active][ready]
 \_ 0:0:1:1 sdd 8:48 [failed][faulty]
So we see that data uses buses 0:0:0:1 and 0:0:1:1. On the SAN side, the resizing process must already be completed. We assume data is mounted on /srv.
We need to rescan the devices like this:
bladeTest:/# echo 1 > /sys/bus/scsi/devices/0\:0\:0\:1/rescan
bladeTest:/# echo 1 > /sys/bus/scsi/devices/0\:0\:1\:1/rescan
Then we unmount the partition and rebuild the multipath map:
bladeTest:/# umount /srv/
bladeTest:/# multipath -f data
bladeTest:/# multipath data
sdc: checker msg is "readsector0 checker reports path is down"
sdd: checker msg is "readsector0 checker reports path is down"
sdd: checker msg is "readsector0 checker reports path is down"
create: data (3600a0b8000177bcc0000256645f7f166) IBM,1722-600
[size=9.0G][features=0][hwhandler=0]
\_ round-robin 0 [prio=1][undef]
 \_ 0:0:0:1 sdb 8:16 [undef][ready]
 \_ 0:0:1:1 sdd 8:48 [undef][faulty]
bladeTest:/# mount /srv/
The process is quite short, and we can see that the size went up to 9G, which is what we wanted. But if we look at the disk usage we see:
bladeTest:/# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/system1   4.7G  618M  3.9G  14% /
tmpfs                1015M     0 1015M   0% /lib/init/rw
udev                   10M   84K   10M   1% /dev
tmpfs                1015M     0 1015M   0% /dev/shm
/dev/mapper/data      8.0G  384K  8.0G   1% /srv
The partition has not been resized … Why? Because resizing the underlying disk doesn't mean the filesystem on it has been resized. If you use a filesystem like XFS, you can grow it while it is mounted:
bladeTest:/# xfs_growfs /srv/
meta-data=/dev/mapper/data       isize=256    agcount=11, agsize=196608 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=2097152, imaxpct=25
         =                       sunit=0      swidth=0 blks, unwritten=1
naming   =version 2              bsize=4096
log      =internal               bsize=4096   blocks=2560, version=1
         =                       sectsz=512   sunit=0 blks
realtime =none                   extsz=65536  blocks=0, rtextents=0
data blocks changed from 2097152 to 2359296
And that's done. For other filesystems, see the filesystem's documentation.
Filesystem freeze
Filesystem freezes are designed to be used with snapshot/flashcopy systems. A freeze makes the filesystem suspend all I/O while a backup operation is running. The data on the filesystem is not lost, and once the unfreeze takes effect, the system runs normally again. To freeze the filesystem, we just do:
xfs_freeze -f /srv
and when we're finished with it, we unfreeze with:
xfs_freeze -u /srv
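A typical use is wrapping a snapshot/flashcopy operation so that the on-disk state stays consistent; a sketch, where take_snapshot stands in for whatever your SAN tooling provides (a hypothetical placeholder, not a real command):

```shell
# Freeze the XFS filesystem, take the snapshot, then always
# unfreeze, even if the snapshot command fails.
xfs_freeze -f /srv
take_snapshot /srv || true   # hypothetical SAN snapshot command
xfs_freeze -u /srv
```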
And everything is OK!
See also
There are a lot of links out there; I've got over 50 bookmarks just for multipath configuration, but I kept only those I've really been using.