RAID does not start sometimes causing degraded arrays #114

Open
patenteng opened this issue Jan 14, 2025 · 3 comments

@patenteng
There appears to be some sort of race condition that prevents the RAID arrays from starting properly. I have experienced the same issue on two separate servers. This is what I get on one of them:

cat /proc/mdstat
Personalities : [raid1] 
md125 : active raid1 sdb3[1] sda3[0]
      33520640 blocks super 1.2 [2/2] [UU]

md126 : active raid1 sda4[0]
      1917759488 blocks super 1.2 [2/1] [U_]
      bitmap: 1/15 pages [4KB], 65536KB chunk

md127 : active raid1 sdb2[1] sda2[0]
      1046528 blocks super 1.2 [2/2] [UU]

unused devices: <none>

wipefs /dev/sd{a..b}4
DEVICE OFFSET TYPE              UUID                                 LABEL
sda4   0x1000 linux_raid_member dde0deba-d7e7-6f4a-deca-b1cdcbcf900f any:root
sdb4   0x1000 linux_raid_member dde0deba-d7e7-6f4a-deca-b1cdcbcf900f any:root

mdadm --add /dev/md126 /dev/sdb4
mdadm: re-added /dev/sdb4

Once the removed device is re-added to the RAID array everything works as expected. The affected disk appears to be random, i.e. either of the two partitions (sda4 or sdb4) can be the one that drops out of its array.
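For reference, the dropped member can be inspected before re-adding it and the resync watched afterwards (standard mdadm commands, using the device names from the output above):

mdadm --examine /dev/sdb4   # the superblock still carries the array UUID reported by wipefs
watch cat /proc/mdstat      # after the re-add, shows the bitmap-based resync progress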

I have replicated the issue in QEMU using a single-disk RAID array.

[root@test ~]# cat /proc/mdstat 
Personalities : [raid1] 
md125 : active raid1 vda2[0]
      1046528 blocks super 1.2 [1/1] [U]
      
md126 : active raid1 vda4[0]
      6288384 blocks super 1.2 [1/1] [U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md127 : active raid1 vda3[0]
      2094080 blocks super 1.2 [1/1] [U]
      
unused devices: <none>
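For reference, a single-device RAID 1 array like the ones above can be created with mdadm's --force option, which is required when --raid-devices=1 (illustrative command; the partition name matches the layout shown below):

mdadm --create /dev/md/root --level=1 --raid-devices=1 --force --metadata=1.2 /dev/vda4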

This is the partition structure.

[root@test ~]# lsblk
NAME             MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINTS
fd0                2:0    1    4K  0 disk  
sr0               11:0    1  1.1G  0 rom   
vda              254:0    0   10G  0 disk  
├─vda1           254:1    0    1M  0 part  
├─vda2           254:2    0    1G  0 part  
│ └─md125          9:125  0 1022M  0 raid1 /boot
├─vda3           254:3    0    2G  0 part  
│ └─md127          9:127  0    2G  0 raid1 
│   └─swap_crypt 252:1    0    2G  0 crypt [SWAP]
└─vda4           254:4    0    7G  0 part  
  └─md126          9:126  0    6G  0 raid1 
    └─root_crypt 252:0    0    6G  0 crypt /var/log

In my testing the root RAID array fails to start in about 2 out of every 10 boots. After a minute and a half the boot times out and I can SSH into the QEMU guest. From the recovery options I can drop to a shell and assemble the array using mdadm. However, I cannot decrypt the root partition since the cryptsetup unit has already timed out.
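For reference, the manual assembly from that shell looks like this (ordinary mdadm invocations; the device names match the QEMU layout above):

mdadm --assemble --scan --run            # assemble all detected arrays, even if degraded
mdadm --assemble /dev/md/root /dev/vda4  # or assemble the root array explicitly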

Here is my systemd-tool fstab.

[root@test ~]# cat /etc/mkinitcpio-systemd-tool/config/fstab
# This file is part of https://github.com/random-archer/mkinitcpio-systemd-tool

# REQUIRED READING:
# * https://github.com/random-archer/mkinitcpio-systemd-tool/wiki/Root-vs-Fstab
# * https://github.com/random-archer/mkinitcpio-systemd-tool/wiki/System-Recovery

# fstab: mappings for direct partitions in initramfs:
# * file location in initramfs: /etc/fstab
# * file location in real-root: /etc/mkinitcpio-systemd-tool/config/fstab

# fstab format:
# https://wiki.archlinux.org/index.php/Fstab

# how fstab is used by systemd:
# https://www.freedesktop.org/software/systemd/man/systemd-fstab-generator.html
# https://github.com/systemd/systemd/blob/master/src/fstab-generator/fstab-generator.c

# note:
# * remove "root=/dev/mapper/root" stanza from kernel command line
# * provide here root partition mapping (instead of kernel command line)
# * ensure that mapper-path in fstab corresponds to mapper-name in crypttab
# * for x-mount options see: https://www.freedesktop.org/software/systemd/man/systemd.mount.html

#  <block-device>       <mount-point>    <fs-type>    <mount-options>                   <dump>  <fsck>
#  /dev/mapper/root      /sysroot         auto         x-systemd.device-timeout=9999h     0       1
#  /dev/mapper/swap      none             swap         x-systemd.device-timeout=9999h     0       0
/dev/mapper/root_crypt /sysroot auto x-systemd.device-timeout=9999h 0 1

Here is the systemd-tool crypttab.

[root@test ~]# cat /etc/mkinitcpio-systemd-tool/config/crypttab
# This file is part of https://github.com/random-archer/mkinitcpio-systemd-tool

# REQUIRED READING:
# * https://github.com/random-archer/mkinitcpio-systemd-tool/wiki/Root-vs-Fstab
# * https://github.com/random-archer/mkinitcpio-systemd-tool/wiki/System-Recovery

# crypttab: mappings for encrypted partitions in initramfs
# * file location in initramfs: /etc/crypttab
# * file location in real-root: /etc/mkinitcpio-systemd-tool/config/crypttab

# crypttab format:
# https://wiki.archlinux.org/index.php/Dm-crypt/System_configuration#crypttab

# how crypttab is used by systemd:
# https://www.freedesktop.org/software/systemd/man/systemd-cryptsetup-generator.html
# https://github.com/systemd/systemd/blob/master/src/cryptsetup/cryptsetup-generator.c

# note:
# * provide here mapper partition UUID (instead of kernel command line)
# * use password/keyfile=none to force cryptsetup password agent prompt
# * ensure that mapper-path in fstab corresponds to mapper-name in crypttab
# * for x-mount options see: https://www.freedesktop.org/software/systemd/man/systemd.mount.html

# <mapper-name>   <block-device>       <password/keyfile>   <crypto-options>
#  root           UUID={{UUID_ROOT}}       none                luks
#  swap           UUID={{UUID_SWAP}}       none                luks
root_crypt /dev/md/root none luks

My hooks are HOOKS=(base systemd autodetect keyboard sd-vconsole modconf block mdadm_udev sd-encrypt filesystems fsck systemd-tool).

The issue is present only with systemd-tool. I have tested both unencrypted and encrypted RAID in QEMU.

On the servers the system can still boot after the failed RAID member is removed, since there is a redundant device, i.e. each array is a two-device mirror unlike the QEMU images. This is what I have found from testing on the two servers:

  • the issue affects different physical disks, i.e. I checked the PTUUIDs;
  • the issue affects different partitions, i.e. I have separate RAID devices for different partitions;
  • the issue appears to affect only a single RAID device at a time (one of the servers has 6 disks and 18 RAID 1 arrays);
  • the missing device is listed as removed by mdadm --detail and is not identified by a path to /dev, i.e. it just says removed without any further information (see the example commands after this list);
  • once re-added, RAID rebuilds the device;
  • if the affected partition is large and has a bitmap, the rebuild takes around 5 seconds;
  • if the affected partition is small and does not have a bitmap, it takes slightly longer to mirror;
  • the mdadm logs simply state active with 1 out of 2 mirrors; and
  • the affected partition times out on reboot after 30 seconds.
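For reference, the observations above can be reproduced with standard mdadm and journalctl commands (the md device name matches the earlier server output):

mdadm --detail /dev/md126           # lists the missing slot simply as "removed", with no /dev path
journalctl -k | grep -i 'md/raid1'  # kernel log only reports "active with 1 out of 2 mirrors"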

Here are the dmesg logs from one reboot.

[    1.546487] md/raid1:md127: active with 2 out of 2 mirrors
[    1.554870] md127: detected capacity change from 0 to 3835518976
[    1.617467] md/raid1:md126: active with 2 out of 2 mirrors
[    1.617478] md126: detected capacity change from 0 to 2093056
[    1.739809] device-mapper: uevent: version 1.0.3
[    1.739864] device-mapper: ioctl: 4.48.0-ioctl (2023-03-01) initialised: dm-devel@lists.linux.dev
[    1.796344] raid6: skipped pq benchmark and selected avx2x4
[    1.796347] raid6: using avx2x2 recovery algorithm
[    1.799177] xor: automatically using best checksumming function   avx       
[    1.903424] Btrfs loaded, zoned=yes, fsverity=yes
[   32.218708] md/raid1:md125: active with 1 out of 2 mirrors
[   32.218733] md125: detected capacity change from 0 to 67041280

Here is another reboot that affected a different RAID device.

[    1.576421] md/raid1:md126: active with 2 out of 2 mirrors
[    1.576439] md126: detected capacity change from 0 to 67041280
[    1.644082] md/raid1:md125: active with 2 out of 2 mirrors
[    1.644101] md125: detected capacity change from 0 to 2093056
[    1.793562] raid6: skipped pq benchmark and selected avx2x4
[    1.793565] raid6: using avx2x2 recovery algorithm
[    1.796313] xor: automatically using best checksumming function   avx       
[    1.900468] Btrfs loaded, zoned=yes, fsverity=yes
[   31.559597] md/raid1:md127: active with 1 out of 2 mirrors
[   31.600410] md127: detected capacity change from 0 to 3835518976
@Andrei-Pozolotin (Collaborator) commented Jan 14, 2025
it is good that you can replicate this in qemu

usually in such cases there is a need to declare an explicit inter-unit dependency for root_crypt_mdarray.mount, and by root_crypt_mdarray.mount I mean: whatever systemd device unit is manifested by systemd core after the md array becomes "really ready"

https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/systemd

similar to these dependencies:

https://github.com/random-archer/mkinitcpio-systemd-tool/blob/master/src/initrd-sysroot-mount.service

https://github.com/random-archer/mkinitcpio-systemd-tool/blob/master/tool/image/test/unitada/etc/systemd/system/root-entry.service
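A minimal sketch of such a dependency, assuming the drop-in is applied to the initramfs cryptsetup unit and that the assembled array shows up as dev-md-root.device (both unit names are illustrative, derived from the crypttab above rather than taken from the repository):

# hypothetical drop-in: /etc/systemd/system/systemd-cryptsetup@root_crypt.service.d/wait-for-md.conf
[Unit]
# do not attempt the unlock until the fully assembled md device exists
BindsTo=dev-md-root.device
After=dev-md-root.device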
