Dead disk in a RAID array

07/02/2023

List the disks:

# clear output
hwinfo --disk --short
# a bit more complete
fdisk -l
# unreadable on my machine
lsblk

In general they are:

/dev/sda
/dev/sdb
/dev/sdc
...

The partitions on each disk are:

/dev/sda1
/dev/sda2
/dev/sda3
...
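
Both naming schemes show up with a plain ls (sda/sdb are just examples, adapt to your machine):

ls /dev/sd*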

Check the RAID status:

cat /proc/mdstat

The RAID devices are named:

/dev/md0
/dev/md1
...

You can see the X disks in each array.
In the brackets, a U means the disk is up and a _ means it is dead.
Here it shows [UU_], so 2 disks are OK and one is dead.
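
For a quick scripted check, one rough sketch is to look for a "_" inside those brackets:

# prints the status lines of any degraded array, exits 1 if everything is fine
grep -E '\[U*_' /proc/mdstat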

Check the status programmatically:

cat /sys/block/md*/md/dev-*/state

https://www.kernel.org/doc/html/v4.15/admin-guide/md.html

faulty: device has been kicked from active use due to a detected fault, or it has unacknowledged bad blocks
in_sync: device is a fully in-sync member of the array
writemostly: device will only be subject to read requests if there are no other options. This applies only to raid1 arrays.
blocked: device has failed, and the failure hasn't been acknowledged yet by the metadata handler. Writes that would write to this device if it were not faulty are blocked.
spare: device is working, but not a full member. This includes spares that are in the process of being recovered to.
write_error: device has ever seen a write error.
want_replacement: device is (mostly) working but probably should be replaced, either due to errors or due to user request.
replacement: device is a replacement for another active device with same raid_disk.
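
To dump all of these per-device states in one go, a small loop like this works (just a sketch, it prints the raw state next to each sysfs path):

for f in /sys/block/md*/md/dev-*/state; do
    printf '%s : %s\n' "$f" "$(cat "$f")"
done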

mdadm --detail /dev/md0 | grep -e '^\s*State : ' | awk '{ print $NF; }'

This will output "clean" or "active" for a healthy array. You can also loop over /dev/md/* to cover all arrays (see the sketch after this paragraph).
I have a RAID1 on 3 disks.
I see "clean" when the array is fully healthy,
and "active, degraded, recovering" when one disk is healthy, the second is removed/dead, and the third is recovering.
The possible values (can be comma separated) are described here:
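
The loop mentioned above could look like this (a sketch: it assumes the arrays are visible under /dev/md/, and uses sed to keep the full comma-separated state):

for md in /dev/md/*; do
    state=$(mdadm --detail "$md" | sed -n 's/^ *State : *//p')
    echo "$md : $state"
done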

To see the details of a RAID array:

mdadm -D /dev/md0

The last lines list the disks.
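
To print only that device table, one option is to keep everything from the column header onwards (a sketch relying on the "RaidDevice" header):

mdadm -D /dev/md0 | sed -n '/RaidDevice/,$p'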

Example with one dead disk and a second disk currently resyncing:

cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [linear] [multipath] [raid10] 
md2 : active raid1 sdb2[1] sda2[3]
      1952461760 blocks [3/1] [_U_]
      [========>............]  recovery = 42.7% (834347520/1952461760) finish=1703.6min speed=10937K/sec
      bitmap: 15/15 pages [60KB], 65536KB chunk

md1 : active raid1 sdb1[1] sda1[0]
      523200 blocks [3/2] [UU_]
      
unused devices: <none>
root@raphaelpiccolo:~# mdadm -D /dev/md2
/dev/md2:
           Version : 0.90
     Creation Time : Wed Jun 10 15:12:05 2020
        Raid Level : raid1
        Array Size : 1952461760 (1862.01 GiB 1999.32 GB)
     Used Dev Size : 1952461760 (1862.01 GiB 1999.32 GB)
      Raid Devices : 3
     Total Devices : 2
   Preferred Minor : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Sun Feb 12 21:57:17 2023
             State : active, degraded, recovering 
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 1

Consistency Policy : bitmap

    Rebuild Status : 42% complete

              UUID : e879a2ab:989cc6fb:a4d2adc2:26fd5302
            Events : 0.2129993

    Number   Major   Minor   RaidDevice State
       3       8        2        0      spare rebuilding   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       -       0        0        2      removed

Check the disk's health:

smartctl -i /dev/sdc
# if needed, run an analysis
smartctl -a /dev/sdc
smartctl -t short /dev/sdc
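
Once the short test has finished (a few minutes), its result and the overall verdict can be read back like this (sdc is still just the example device):

# self-test log
smartctl -l selftest /dev/sdc
# overall health assessment
smartctl -H /dev/sdc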

Check the disk:

fsck -f -y /dev/sdc1
fsck -f -y /dev/sdc2
fsck -f -y /dev/sdc3

Add the disk back into the RAID

Choose the right array (md1) and the right partition (sdc1) based on the output of the command "mdadm -D /dev/md1".

mdadm /dev/md1 --add /dev/sdc1
mdadm /dev/md2 --add /dev/sdc2
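
After the partitions are added back, the resync progress shows up in /proc/mdstat; one way to keep an eye on it (the 60-second refresh interval is arbitrary):

watch -n 60 cat /proc/mdstat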