Today we upgraded cp1052 to the jessie 8.4 point release and Linux 4.4 (T131746, T131928).
After the reboot, the RAID1 root device (md0) came up in a degraded state, apparently because of a race condition: it looks like sdb was still being detected while the md array was being assembled.
Apr 06 13:54:45.020969 cp1052 kernel: md/raid1:md0: active with 1 out of 2 mirrors
[...]
Apr 06 13:54:45.021098 cp1052 kernel: sd 1:0:0:0: [sdb] Attached SCSI removable disk
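The ordering in the excerpt above can be checked mechanically. The following is a minimal Python sketch, not a tool we actually run: the function name `late_disks` and the regular expressions are assumptions for illustration, and it only understands the journal line format shown in this task. It reports disks whose "Attached SCSI" message is timestamped after the array was activated with missing mirrors. Journal timestamps this close together are only indicative of ordering.

```python
import re
from datetime import datetime

# Matches journal lines such as:
# "Apr 06 13:54:45.020969 cp1052 kernel: md/raid1:md0: active with 1 out of 2 mirrors"
LINE_RE = re.compile(r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}\.\d{6}) \S+ kernel: (.*)$")
ACTIVE_RE = re.compile(r"md/raid1:(md\d+): active with (\d+) out of (\d+) mirrors")
ATTACH_RE = re.compile(r"\[(sd[a-z]+)\] Attached SCSI")


def parse_ts(text):
    # The log carries no year, so a dummy one is prepended; only the
    # relative order within a single boot matters here.
    return datetime.strptime("2000 " + text, "%Y %b %d %H:%M:%S.%f")


def late_disks(lines):
    """Return {array: [disks attached after the array went active degraded]}."""
    degraded = {}  # array name -> timestamp of degraded activation
    attached = {}  # disk name -> timestamp of "Attached SCSI" message
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, msg = parse_ts(m.group(1)), m.group(2)
        active = ACTIVE_RE.search(msg)
        if active and int(active.group(2)) < int(active.group(3)):
            degraded[active.group(1)] = ts
        attach = ATTACH_RE.search(msg)
        if attach:
            attached[attach.group(1)] = ts
    return {
        array: sorted(disk for disk, t in attached.items() if t > when)
        for array, when in degraded.items()
    }


# Three lines taken from the cp1052 log in this task.
sample = [
    "Apr 06 13:54:45.020267 cp1052 kernel: sd 0:0:0:0: [sda] Attached SCSI removable disk",
    "Apr 06 13:54:45.020969 cp1052 kernel: md/raid1:md0: active with 1 out of 2 mirrors",
    "Apr 06 13:54:45.021098 cp1052 kernel: sd 1:0:0:0: [sdb] Attached SCSI removable disk",
]
print(late_disks(sample))  # {'md0': ['sdb']}
```

Feeding it the full boot log of another host (for example from `journalctl -k -o short-precise`) should show whether the same ordering occurs there.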
We manually fixed the issue by adding /dev/sdb1 to the array:
mdadm --manage /dev/md0 --add /dev/sdb1
Before carrying on rebooting other cp* hosts we need to make sure this problem is fixed.
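To verify a host after each reboot, the state of the arrays can be read from /proc/mdstat. The following is a minimal Python sketch under stated assumptions: the function name `degraded_arrays` is hypothetical, the sample text is made up for illustration (it is not output captured from cp1052), and the parser only handles the usual `[expected/active] [U_]` status line.

```python
import re

# Status line fragment such as "[2/1] [U_]": expected devices / active devices.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
ARRAY_RE = re.compile(r"^(md\d+)\s*:")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays with fewer active than expected devices."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            current = m.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded.append(current)
            current = None
    return degraded


# Hypothetical /proc/mdstat content for a two-disk mirror missing one member.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sda1[0]
      9756672 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""
print(degraded_arrays(SAMPLE))  # ['md0']
```

On a real host the input would come from `open("/proc/mdstat").read()`; an empty list means no array is degraded.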
We could try one of the following kernel boot parameters to see whether it is a viable workaround. A better fix would be a proper dependency that makes mdadm assemble the array only after the SCSI devices have been detected.
scsi_mod.scan=sync
rootdelay=10
More logs:
Apr 06 13:54:45.019853 cp1052 kernel: sd 0:0:0:0: [sda] 781422768 512-byte logical blocks: (400 GB/373 GiB)
Apr 06 13:54:45.019935 cp1052 kernel: sd 0:0:0:0: [sda] Write Protect is off
Apr 06 13:54:45.020016 cp1052 kernel: sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Apr 06 13:54:45.020100 cp1052 kernel: sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 06 13:54:45.020132 cp1052 kernel: ata1.00: Enabling discard_zeroes_data
Apr 06 13:54:45.020159 cp1052 kernel: sda: sda1 sda2 sda3
Apr 06 13:54:45.020191 cp1052 kernel: ata1.00: Enabling discard_zeroes_data
Apr 06 13:54:45.020267 cp1052 kernel: sd 0:0:0:0: [sda] Attached SCSI removable disk
Apr 06 13:54:45.020301 cp1052 kernel: ata2.00: Enabling discard_zeroes_data
Apr 06 13:54:45.020379 cp1052 kernel: sd 1:0:0:0: [sdb] 781422768 512-byte logical blocks: (400 GB/373 GiB)
Apr 06 13:54:45.020414 cp1052 kernel: md: md0 stopped.
Apr 06 13:54:45.020441 cp1052 kernel: md: bind<sda1>
Apr 06 13:54:45.020518 cp1052 kernel: sd 1:0:0:0: [sdb] Write Protect is off
Apr 06 13:54:45.020600 cp1052 kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
Apr 06 13:54:45.020688 cp1052 kernel: sd 1:0:0:0: Attached scsi generic sg1 type 0
Apr 06 13:54:45.020771 cp1052 kernel: sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
Apr 06 13:54:45.020855 cp1052 kernel: sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 06 13:54:45.020890 cp1052 kernel: ata2.00: Enabling discard_zeroes_data
Apr 06 13:54:45.020916 cp1052 kernel: md: raid1 personality registered for level 1
Apr 06 13:54:45.020943 cp1052 kernel: sdb: sdb1 sdb2 sdb3
Apr 06 13:54:45.020969 cp1052 kernel: md/raid1:md0: active with 1 out of 2 mirrors
Apr 06 13:54:45.020995 cp1052 kernel: md0: detected capacity change from 0 to 9990832128
Apr 06 13:54:45.021022 cp1052 kernel: ata2.00: Enabling discard_zeroes_data
Apr 06 13:54:45.021098 cp1052 kernel: sd 1:0:0:0: [sdb] Attached SCSI removable disk
Apr 06 13:54:45.021270 cp1052 kernel: usb 1-1.6: new high-speed USB device number 3 using ehci-pci
Apr 06 13:54:45.021308 cp1052 kernel: EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null)