This is the second part of my posts on using RAID1 on Ubuntu 14.04. If you missed the first part, see it here: Installing Ubuntu 14.04 on RAID 1 and LVM
So your shiny RAID1 array lost a disk. DO NOT PANIC.
I’ll say that again, as you probably missed it in your panic.
DO NOT PANIC.
Following this process without taking care can COMPLETELY DESTROY your system. Use a test system first to get comfortable with the process.
This is how I got out of the problem facing you now. I had to spend a few hours trawling documentation and bug posts to find the solution, and then I tried it out on a virtual machine.
First of all, create a test system, as I did in part one. You can try this out there first, before randomly trying things on the live system. If the test system breaks you can build another one; if you break the live system your life will suck big time. Once you find a solution that works for you, test it again before trying it on the live system.
Initial Setup
The test system uses the following:
- boot
- about 750Mb
- mount point /boot
- File system EXT4
- Bootable
- /dev/sda1 + /dev/sdb1 = /dev/md0
- root
- about 24Gb
- LVM volume group vg0
- LVM logical volume lv_root
- mount point / (root)
- File system EXT4
- /dev/sda2 + /dev/sdb2 = /dev/md1
- swap
- about 2Gb
- File system Swap
- /dev/sda3 + /dev/sdb3 = /dev/md2
Some Preamble
Setup Monitoring and Alerts
To get the system to let you know when your RAID array is broken or degraded, set up some monitoring. Run the command below; it will configure some default behaviour for monitoring and checking.
sudo dpkg-reconfigure mdadm
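If I remember the Ubuntu packaging correctly, the answers you give end up in two files, roughly like the fragments below ('root' is the default alert address; substitute your own):

```
# /etc/mdadm/mdadm.conf (excerpt)
# MAILADDR is where mdadm's monitor sends failure alerts
MAILADDR root

# /etc/default/mdadm (excerpt)
# run the monitoring daemon, and enable the periodic redundancy check
START_DAEMON=true
AUTOCHECK=true
```

It is worth sending yourself a test mail (mdadm's monitor can do this with its --test option) to confirm alerts actually reach you before a disk dies.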
Manually Monitoring Your RAID Array
To see that it is all running okay, there are some commands you can use.
Use the df command to see the mounted filesystems and what devices they are.
df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/vg0-lv_root   20G  1.1G   18G   7% /
none                     4.0K     0  4.0K   0% /sys/fs/cgroup
udev                     990M  4.0K  990M   1% /dev
tmpfs                    201M  492K  200M   1% /run
none                     5.0M     0  5.0M   0% /run/lock
none                    1001M     0 1001M   0% /run/shm
none                     100M     0  100M   0% /run/user
/dev/md0                 734M   37M  644M   6% /boot
You can see here that the root file system '/' is on the device '/dev/mapper/vg0-lv_root', which is the logical volume 'lv_root' in the volume group 'vg0'.
‘/boot’ is from the device ‘/dev/md0’.
To check the swap.
swapon -s
Filename    Type       Size     Used  Priority
/dev/md2    partition  1993660  0     -1
You can see it is from the device /dev/md2 and is about 2GB.
If you are quick enough, or your disks are bigger than 25GB, you may even catch the RAID1 system rebuilding. The output below does not show this; either I was too slow or the host I use for the virtuals is very quick. 🙂 It shows a fully mirrored and working set of RAID1 arrays.
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      23420800 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      779712 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      1993664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
If you want to watch the array rebuild itself, use the following command:
watch -n1 cat /proc/mdstat
Taking md1 above as an example, the things to look out for are that two devices, sda2[0] and sdb2[1], are listed; the number in brackets is the order of the devices in the array. On the second line we see '[2/2]': the array should have two devices and two are actually in it. The '[UU]' shows the same thing again, one 'U' per working device.
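If you want to script this check rather than eyeball it, a small function can scan an mdstat-style file for arrays whose status brackets contain a '_' (a missing device). This is a sketch of my own, not a standard tool, and it assumes the /proc/mdstat layout shown above.

```shell
#!/bin/sh
# check_degraded: print the name of every md array in an mdstat-style file
# whose status brackets (e.g. [_U]) show a missing device.
# Reads the real /proc/mdstat when no file is given.
check_degraded() {
    awk '/^md/ { name = $1 }
         /blocks/ && /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}
```

The function prints nothing when all arrays are healthy, which makes it easy to drop into a cron job that mails you only on trouble.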
To view the status of one or more arrays:
sudo mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Mon Apr  6 13:50:03 2015
     Raid Level : raid1
     Array Size : 779712 (761.57 MiB 798.43 MB)
  Used Dev Size : 779712 (761.57 MiB 798.43 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon Apr  6 14:41:15 2015
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : raid-test:0  (local to host raid-test)
           UUID : 90cd04d4:b6754cba:52e2b88d:8749942c
         Events : 17

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
To check the status of a disk in an array:
sudo mdadm -E /dev/sda1
/dev/sda1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 90cd04d4:b6754cba:52e2b88d:8749942c
           Name : raid-test:0  (local to host raid-test)
  Creation Time : Mon Apr  6 13:50:03 2015
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 1559552 (761.63 MiB 798.49 MB)
     Array Size : 779712 (761.57 MiB 798.43 MB)
  Used Dev Size : 1559424 (761.57 MiB 798.43 MB)
    Data Offset : 1024 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 5934ad36:e16b82ac:e0eb32c1:202d3ec8

    Update Time : Mon Apr  6 14:41:15 2015
       Checksum : e0a03142 - correct
         Events : 17

    Device Role : Active device 0
    Array State : AA ('A' == active, '.' == missing)
This test system carves all three RAID1 arrays out of the same pair of disks. This means that when one disk breaks you have to fail, and then remove, all three of its partitions before removing the disk and replacing it with a new one.
Stop here if you have not tested this process. Try it out on a test system first.
Simulating a Failed Drive
To simulate a disk failure, or to mark a working disk as failed, use the following command. Remember to change md0 and sda1 to match your system.
sudo mdadm --fail /dev/md0 /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
Check that the disk was marked as failed as expected
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0] sdb2[1]
      23420800 blocks super 1.2 [2/2] [UU]

md0 : active raid1 sda1[0](F) sdb1[1]
      779712 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sda3[0] sdb3[1]
      1993664 blocks super 1.2 [2/2] [UU]

unused devices: <none>
This shows that md0 has a failed drive, sda1[0](F), and that only one of the two drives is working: [2/1] [_U].
Dealing with a Disk Failure
So we have our failed drive. Remember, we need to manually mark the partitions from the same disk as failed too.
sudo mdadm --fail /dev/md1 /dev/sda2
sudo mdadm --fail /dev/md2 /dev/sda3
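Since all three partitions on the dead disk have to be failed, you could wrap the repetition in a small helper. This is just a sketch of my own: it hard-codes the md0→sda1, md1→sda2, md2→sda3 layout used in this post, and the -n flag only prints the commands so you can sanity-check them before running anything for real.

```shell
#!/bin/sh
# fail_disk: mark every partition of one disk as faulty in its md array.
# Assumes this post's layout: /dev/mdN holds partition N+1 of the disk.
# Pass -n first for a dry run that only prints the commands.
fail_disk() {
    run="sudo"
    if [ "$1" = "-n" ]; then run="echo sudo"; shift; fi
    disk="$1"   # e.g. /dev/sda
    for n in 0 1 2; do
        $run mdadm --fail "/dev/md$n" "${disk}$((n + 1))"
    done
}
```

For example, `fail_disk -n /dev/sda` prints the three mdadm commands; run it again without -n once they look right.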
We can now remove the first disk partition from the array and then check the output is as expected.
sudo mdadm /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[0](F) sdb2[1]
      23420800 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sdb1[1]
      779712 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sda3[0](F) sdb3[1]
      1993664 blocks super 1.2 [2/1] [_U]

unused devices: <none>
Note that md0 now shows only one device, sdb1[1].
We should now remove the remaining failed partitions from md1 and md2.
sudo mdadm --remove /dev/md1 /dev/sda2
sudo mdadm --remove /dev/md2 /dev/sda3
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb2[1]
      23420800 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sdb1[1]
      779712 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sdb3[1]
      1993664 blocks super 1.2 [2/1] [_U]

unused devices: <none>
We will restart the system next, but when we do we need to modify the kernel line at boot, because LVM will not start up correctly. This means we need to interrupt the boot process to fix that.
Now halt the system.
Remove the broken disk and add a new disk to your system. The new disk MUST be the same size or larger than the original.
Restart the system. As it begins to boot, hold the Shift key if you do not normally get a GRUB boot menu. From the menu select the option '*Ubuntu'; it should already be highlighted. Press 'e' to edit the command lines. Find the line that starts with
linux vmlinuz-3.13.0-30-generic
The version number will differ, and yours may not be the generic kernel. Add the following to the end of that line:
break=premount
Then press F10. The system will start to boot. Wait until you see something like 'floppy0: no floppy controllers found', or the two messages 'Incrementally starting RAID arrays…' and 'mdadm: CREATE user root not found', or until the boot simply appears to have stopped.
Press Return to get to the (initramfs) command prompt. If LVM is broken as expected, the command below will produce output like the following:
lvm lvs
-wi-d---- (instead of the expected -wi-a----)
The manpage for lvs says the 'd' attribute means the device tables are missing (presumably device mapper's).
To fix the problem, run the following from the (initramfs) prompt.
lvm vgchange -ay
(1 logical volume(s) in logical volume "vg0" now active)
exit
The system should boot now.
Adding a New Disk
Let's get the new disk partitioned exactly the same as its mirror. We do that by copying the partition table from the working disk, sdy, onto the new disk, sdz. The command must be run from a root prompt, not with sudo.
Important: because I removed sda, the first disk, I swapped the disk drives around, so what was sdb is NOW sda and the new drive is sdb. It won't boot otherwise. Just thought I'd mention that.
If you get this wrong, or put the disk names the wrong way around, YOU will DESTROY your SYSTEM! I have used 'sdy' and 'sdz' purposely, as they will most likely not exist on your system!
sfdisk -d /dev/sdy | sfdisk /dev/sdz
You can run the following to check that both hard drives have the same partition sizes etc.
fdisk -l
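Eyeballing two fdisk listings is error-prone, so here is a small helper of my own that diffs two 'sfdisk -d' dumps after stripping the device names (which will of course differ between the two disks). No output and a zero exit status means the tables match. It needs bash for the process substitution.

```shell
#!/bin/bash
# tables_match: diff two 'sfdisk -d' dump files, ignoring device names, so
# dumps of /dev/sdy and /dev/sdz compare equal when their partitions line up.
# Prints the diff and returns non-zero when the tables differ.
tables_match() {
    diff <(sed 's#/dev/[a-z]*##g' "$1") <(sed 's#/dev/[a-z]*##g' "$2")
}
```

For example: `sfdisk -d /dev/sdy > /tmp/old.dump; sfdisk -d /dev/sdz > /tmp/new.dump; tables_match /tmp/old.dump /tmp/new.dump` (again, substitute your real device names).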
Next we add /dev/sdb1 to /dev/md0, /dev/sdb2 to /dev/md1, and /dev/sdb3 to /dev/md2.
sudo mdadm /dev/md0 --add /dev/sdb1
sudo mdadm /dev/md1 --add /dev/sdb2
sudo mdadm /dev/md2 --add /dev/sdb3
All of the arrays will be resynchronised. If you are quick you will see it happening (my virtuals synced at about 5GB per minute), so run the following:
watch -n 1 cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb2[2] sda2[1]
      23420800 blocks super 1.2 [2/1] [_U]
      [>....................]  recovery =  1.5% (359936/23420800) finish=5.3min speed=71987K/sec

md2 : active raid1 sdb3[2] sda3[1]
      1993664 blocks super 1.2 [2/1] [_U]
        resync=DELAYED

md0 : active raid1 sdb1[2] sda1[1]
      779712 blocks super 1.2 [2/2] [UU]

unused devices: <none>
Once the synchronisation has finished your system is back up again!
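If you would rather have a script wait for that moment than watch the screen, a small helper can poll for resync or recovery activity. Again this is a sketch of mine, reading an mdstat-style file (the real /proc/mdstat by default):

```shell
#!/bin/sh
# resync_done: succeed once no md array in the mdstat-style file is still
# resyncing, recovering, or waiting its turn (resync=DELAYED).
resync_done() {
    ! grep -qE 'resync|recovery' "${1:-/proc/mdstat}"
}

# Example: poll every 10 seconds until the arrays are fully synced.
# while ! resync_done; do sleep 10; done
```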
Thank you