This is part of my series on renovating my homelabs to ring in the roaring ’20s.

Today I’ll be attempting to recover a machine that failed a Fedora version upgrade (v29->v30) and now won’t boot.

Troubleshooting GRUB

When we try to boot the machine, we’re greeted with the following prompt:

grub>

This usually indicates that GRUB itself is okay, but it can’t find its config or the files it needs to boot the OS (the kernel and initramfs). We can poke around in the GRUB prompt to see if anything is obviously out of place. This GRUB command reference has some useful exploratory tools.

Executing ls shows bootable devices:

grub> ls
(hd0) (hd0,msdos1) (hd0,msdos2)

One of these should have a grub2/ directory, so we can ls them individually until we find it:

grub> ls (hd0,msdos1)/
...
grub2
...

I checked that the prefix was what I expected it to be ((hd0,1)/grub2), then tried set prefix and set root just to be sure, but boot still just dropped me back to the grub> prompt. No dice.
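For reference, a typical manual recovery attempt from the GRUB prompt looks roughly like this (device and path names are taken from the ls output above; insmod normal loads the module that reads grub.cfg and presents the menu):

```
grub> set prefix=(hd0,msdos1)/grub2
grub> set root=(hd0,msdos1)
grub> insmod normal
grub> normal
```

When even this drops you back to the grub> prompt, it points to a deeper problem with the boot files themselves rather than a misconfigured GRUB.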

Booting a Linux USB and chrooting

Something’s clearly wrong between GRUB and the initramfs, and the GRUB CLI doesn’t give us the tools to fix it. Time to boot a Linux live environment and try to recover from a more full-featured system.

The only bootable USB that was on-site was an Ubuntu 18.04 disk…which may or may not be a problem since we’re troubleshooting Fedora… but let’s try.

So I get someone to put the disk in, reset the machine, and mash F11 to get to the BIOS boot menu, and… no USB is listed. Fine. Try another port? …no USB listed. Try another port? …there it is!

The Ubuntu installer comes up, so we “Try Ubuntu” to get the live environment to boot. A $ sudo systemctl start ssh gives an error, so we $ sudo apt install openssh-server and $ sudo systemctl start ssh again. At this point we still can’t ssh in because the default user (ubuntu) does not have a password set, so we set that with $ sudo passwd ubuntu, and finally, FINALLY, we can log in to the machine - albeit with a totally different, temporary OS booted.

Mounting disks

We need to identify and mount the disks before we can chroot in. These docs on chroot troubleshooting from the Fedora 22 days are still helpful here. We can get a quick overview of our disks with lsblk, and Ubuntu should even show us the LVM volumes:

$ sudo lsblk
...
sde               8:64   0 111.8G  0 disk
├─sde1            8:65   0     1G  0 part
└─sde2            8:66   0 110.8G  0 part
  ├─fedora-swap 253:0    0   7.9G  0 lvm
  └─fedora-root 253:1    0    15G  0 lvm

I mounted my fedora-root to /mnt/fedora:

sudo mount /dev/mapper/fedora-root /mnt/fedora

and then mounted /dev/sde1, which I identified as my GRUB partition, into it:

sudo mount /dev/sde1 /mnt/fedora/boot/

then mounted /dev, /proc, and /sys from the Ubuntu host:

for dir in /dev /proc /sys; do sudo mount --bind $dir /mnt/fedora/$dir ; done

Now that everything is mounted, we can chroot and work in Fedora:

$ sudo chroot /mnt/fedora

Recovering the Fedora install

dnf upgrade

The first thing I tried was updating packages, which failed:

$ dnf upgrade -y
Fedora Modular 30 - x86_64 - Updates                                                               0.0  B/s |   0  B     00:00    
Error: Failed to download metadata for repo 'updates-modular': Cannot prepare internal mirrorlist: Curl error (6): Couldn't resolve host name for https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f30&arch=x86_64 [Could not resolve host: mirrors.fedoraproject.org]

Why did this fail?

When we try to check the DNS settings in /etc/resolv.conf, we get an unusual error:

$ cat /etc/resolv.conf
cat: /etc/resolv.conf: No such file or directory

That file should exist. And if we stat it, it turns out that it does - sort of:

$ stat /etc/resolv.conf
  File: /etc/resolv.conf -> /var/run/NetworkManager/resolv.conf
  Size: 35              Blocks: 0          IO Block: 4096   symbolic link
Device: fd01h/64769d    Inode: 8449420     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
...

The important thing to notice here is symbolic link, and where it’s pointing: -> /var/run/NetworkManager/resolv.conf. Fedora uses NetworkManager, which controls DNS resolution by symlinking /etc/resolv.conf to its own runtime resolv.conf. But because we’re in a chroot - the system is not “booted”, exactly, and has no init process - NetworkManager is not running, so the symlink’s target doesn’t exist. It also can’t be started with $ systemctl start NetworkManager because, again, we’re chrooted and systemd won’t start services in a chroot.
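We can reproduce this dangling-symlink behavior anywhere, which makes the cat/stat discrepancy less mysterious (the temporary paths below are just for illustration):

```shell
# Mimic the chroot situation: a symlink whose target doesn't exist
tmp=$(mktemp -d)
ln -s "$tmp/missing-target" "$tmp/resolv.conf"

# cat follows the link, so it fails with "No such file or directory"
cat "$tmp/resolv.conf" 2>/dev/null || echo "cat fails: dangling symlink"

# stat (without -L) inspects the link itself, so it succeeds and shows the target
stat -c '%N' "$tmp/resolv.conf"

rm -r "$tmp"
```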

We can fix this temporarily by writing our own /etc/resolv.conf:

$ mv /etc/resolv.conf /etc/resolv.conf.bak
$ echo "nameserver 8.8.8.8" > /etc/resolv.conf

and now if we retry the DNF update, it resolves the repos and succeeds.

Grub and the initramfs

Now we can rebuild the initramfs against the freshly installed updates. My booted (live USB) kernel version was 4.15, but my Fedora install has 5.3, so I need to specify the installed kernel version explicitly when calling dracut:

$ dracut -v -f /boot/initramfs-5.3.16-200.fc30.x86_64.img 5.3.16-200.fc30.x86_64 

With the initramfs built, we can also write a new grub config to resolve the boot issue:

$ grub2-mkconfig -o /boot/grub2/grub.cfg
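If the GRUB core image itself were suspect, this would also be the point to reinstall it to the disk’s MBR (this machine uses a BIOS/msdos layout; the device name below comes from the earlier lsblk output and may differ on your system):

```
$ grub2-install /dev/sde
```

In my case the grub> prompt was coming up fine, so the core image was intact and regenerating grub.cfg was enough.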

And reboot, letting it boot from the internal drive instead of the USB this time.

Data transfer

This isn’t the first time this machine has flaked out, so I want to migrate data off of it preventatively. I need to transfer several TBs of data across my LAN, with some individual files in the 10s of GB range each. There are several tools available for this (I like the looks of croc, but don’t want to use a public relay server and I don’t want to learn how to configure a private one right now).
I’m going to use rsync in pull mode to transfer my data - also known as “daemonized” or “remote-source” rsync. I’m also going to use screen so I can detach from the rsync session while leaving it running in the background.


On the source machine, we need to create an rsync config in /etc/rsyncd.conf. Note that rsyncd.conf comments need to sit on their own lines - a trailing # isn’t stripped, so an inline comment becomes part of the value:

# name of the rsync module to share
[data]
    # path to the data for rsync to share
    path = /path/to/data
    # don't let rsync write to the share
    read only = yes
    timeout = 30

and then start the rsync daemon with:

$ sudo rsync --daemon
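To sanity-check the daemon before starting the real transfer, we can ask it to list its modules from the destination machine (the IP is my source machine’s LAN address; yours will differ):

```
$ rsync 192.168.0.200::
```

This should print the data module we just configured.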

On the destination machine, I’m going to run rsync inside a screen session so I can detach from it without killing the transfer, and reattach to check the progress. I’ll create a screen session named “rsync” with:

$ screen -S rsync

This drops me right into the screen session. The key bindings I’ll be using from within screen to detach, create windows, and move around are:

Keys Effect
<Ctrl-A>+d Detach
<Ctrl-A>+c Create a new window
<Ctrl-A>+n Move to the next window

And the screen flags to find and reattach to a session are:

Command Effect
$ screen -ls List sessions
$ screen -r name Reattach to the session named “name”

With the rsync daemon running on the recovered machine, serving “data”, we can run rsync on the destination machine in “remote-source” mode. I’m copying the files on to my ZFS pool:

$ rsync -avhzP --inplace 192.168.0.200::data/files /recovered/files

Rsync logs the transfer status to stdout in the terminal, showing us how things are going. Note that rsync is serial (un-parallelized), so I created a new screen window and ran iftop and htop to see whether it was saturating my network or CPU (it was neither), checked the same things on the source machine (also neither), and then started a second, parallel sync of another directory (::media/movies) to take advantage of my gigabit LAN, fast disks, and multicore CPUs. Repeat as needed - aim to fully saturate the network - and you can parallelize up to roughly 1.5x the number of threads in the system before rsync starts getting CPU-bound. I’m able to read data reasonably fast because my drive array is RAID10, but unstriped data reads more slowly and could bottleneck on disk I/O before saturating either the network or the CPU. If adding another rsync process doesn’t increase CPU or network usage, you’ve probably maxed out your disks.

