Recovering unbootable Fedora
This is part of my series on renovating my homelabs to ring in the roaring ’20s.
Today I’ll be attempting to recover a machine that failed a Fedora version upgrade (v29->v30) and now won’t boot.
Troubleshooting GRUB
When we try to boot the machine, we’re greeted with the following prompt:
grub>
This usually indicates that GRUB is okay but it can’t find the config or the files it needs to boot the OS (the initramfs). We can poke around in the GRUB prompt to see if anything is obviously out of place. This GRUB command reference has some useful exploratory tools.
Executing ls
shows bootable devices:
grub> ls
(hd0) (hd0,msdos1) (hd0,msdos2)
One of these should have a grub2/
directory, so we can ls
them individually until we find it:
grub> ls (hd0,msdos1)/
...
grub2
...
I checked that the prefix
was what I expected it to be ((hd0,1)/grub2
), then tried set prefix
and set root
just to be sure, but boot
still just returned us to the grub>
prompt. No dice.
Booting a Linux USB and chrooting
Something’s clearly wrong between GRUB and the initramfs, and GRUB CLI does not have the capability for us to fix it. Time to boot a Linux live environment and try to recover it from a more full-featured system.
The only bootable USB that was on-site was an Ubuntu 18.04 disk…which may or may not be a problem since we’re troubleshooting Fedora… but let’s try.
So I get someone to put the disk in, reset the machine, and mash F11 to get to the BIOS boot menu, and… no USB is listed. Fine. Try another port? …no USB listed. Try another port? …there it is!
The Ubuntu installer comes up, so we “Try Ubuntu” to get the live environment to boot. A $ sudo systemctl start ssh
gives an error, so we $ sudo apt install openssh-server
and $ sudo systemctl start ssh
again. At this point we still can’t ssh in because the default user (ubuntu
) does not have a password set, so we set that with $ sudo passwd ubuntu
, and finally, FINALLY, we can log in to the machine - albeit with a totally different, temporary OS booted.
Mounting disks
We need to identify and mount the disks before we can chroot in. These chroot troubleshooting docs from Fedora 22 are still helpful here.
We can get a quick overview of our disks with lsblk
, and Ubuntu should even show us the LVM volumes:
$ sudo lsblk
...
sde 8:64 0 111.8G 0 disk
├─sde1 8:65 0 1G 0 part
└─sde2 8:66 0 110.8G 0 part
├─fedora-swap 253:0 0 7.9G 0 lvm
└─fedora-root 253:1 0 15G 0 lvm
I mounted my fedora-root
to /mnt/fedora
:
sudo mount /dev/mapper/fedora-root /mnt/fedora
and then mounted /dev/sde1
, which I identified as my GRUB partition, into that:
sudo mount /dev/sde1 /mnt/fedora/boot/
then mounted /dev
, /proc
, and /sys
from the Ubuntu host:
for dir in /dev /proc /sys; do sudo mount --bind "$dir" "/mnt/fedora$dir"; done
Now that everything is mounted, we can chroot
and work in Fedora:
$ sudo chroot /mnt/fedora
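The mount steps above can be collected into one helper. This is only a sketch: the function names `run` and `prepare_chroot` and the `DRY_RUN` variable are my own, and the `mkdir -p` is an addition that assumes the mount point may not exist yet. With `DRY_RUN=1` the commands are printed instead of executed, which is a cheap way to review them before touching any disks.

```shell
#!/bin/sh
# Sketch of the mount sequence from this section. Names are hypothetical.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# prepare_chroot ROOT_DEV BOOT_DEV TARGET
prepare_chroot() {
    root_dev=$1   # e.g. /dev/mapper/fedora-root
    boot_dev=$2   # e.g. /dev/sde1
    target=$3     # e.g. /mnt/fedora

    run mkdir -p "$target" || return 1
    run mount "$root_dev" "$target" || return 1
    run mount "$boot_dev" "$target/boot" || return 1
    # Bind the live system's virtual filesystems into the target.
    for dir in dev proc sys; do
        run mount --bind "/$dir" "$target/$dir" || return 1
    done
}
```

Run as root, `prepare_chroot /dev/mapper/fedora-root /dev/sde1 /mnt/fedora` followed by `chroot /mnt/fedora` should be equivalent to the commands above.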
Recovering the Fedora install
dnf upgrade
The first thing I tried was updating packages, which failed:
$ dnf upgrade -y
Fedora Modular 30 - x86_64 - Updates 0.0 B/s | 0 B 00:00
Error: Failed to download metadata for repo 'updates-modular': Cannot prepare internal mirrorlist: Curl error (6): Couldn't resolve host name for https://mirrors.fedoraproject.org/metalink?repo=updates-released-modular-f30&arch=x86_64 [Could not resolve host: mirrors.fedoraproject.org]
Why did this fail?
When we try to check the DNS settings in /etc/resolv.conf
, we get an unusual error:
$ cat /etc/resolv.conf
cat: /etc/resolv.conf: No such file or directory
That file should exist. And if we stat
it, it indeed does:
$ stat /etc/resolv.conf
File: /etc/resolv.conf -> /var/run/NetworkManager/resolv.conf
Size: 35 Blocks: 0 IO Block: 4096 symbolic link
Device: fd01h/64769d Inode: 8449420 Links: 1
Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root)
...
The important thing to notice here is symbolic link
, and where it’s pointing - -> /var/run/NetworkManager/resolv.conf
. The problem is that Fedora uses NetworkManager, which controls DNS resolution by making /etc/resolv.conf
a symlink to its own runtime resolv.conf
. But because we’re in a chroot - the system is not “booted”, exactly, and has no init process - NetworkManager is not running, so the symlink’s target doesn’t exist. It also can’t be started with $ systemctl start NetworkManager
because, again, we’re chrooted and systemd won’t start processes in a chroot.
We can fix this temporarily by writing our own /etc/resolv.conf
:
$ mv /etc/resolv.conf /etc/resolv.conf.bak
$ echo "nameserver 8.8.8.8" > /etc/resolv.conf
and now if we retry the DNF update, it resolves the repos and succeeds.
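The same workaround can be written as a small guard that only acts when the file really is a dangling symlink. This is a sketch, not something I ran during the recovery; the function name `fix_resolv_conf` is made up, and the nameserver is whatever you pass in.

```shell
#!/bin/sh
# fix_resolv_conf FILE NAMESERVER  (hypothetical helper)
# Replaces FILE with a one-line resolver config, but only if FILE is a
# symlink whose target does not exist. The symlink is kept as FILE.bak.
fix_resolv_conf() {
    file=$1
    nameserver=$2
    # -L is true for a symlink; -e follows the link, so it is false
    # when the target is missing.
    if [ -L "$file" ] && [ ! -e "$file" ]; then
        mv "$file" "$file.bak" || return 1
        printf 'nameserver %s\n' "$nameserver" > "$file"
        echo "replaced dangling symlink"
    else
        echo "left unchanged"
    fi
}
```

Inside the chroot this would be called as `fix_resolv_conf /etc/resolv.conf 8.8.8.8`. Moving `/etc/resolv.conf.bak` back afterwards restores the original symlink.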
Grub and the initramfs
Now we can rebuild the initramfs against the freshly updated packages. By default dracut builds for the running kernel, and the kernel I booted (from the Ubuntu USB) was 4.15
but my Fedora install has 5.3
, so I need to specify the latest installed kernel version when calling dracut
:
$ dracut -v -f /boot/initramfs-5.3.16-200.fc30.x86_64.img 5.3.16-200.fc30.x86_64
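Rather than typing the version by hand, it can be looked up from the chroot's `/lib/modules`. This sketch assumes that directory holds one subdirectory per installed kernel and that GNU `sort -V` (version sort) is available; `latest_kernel` is a name I made up.

```shell
#!/bin/sh
# latest_kernel [MODULES_DIR]  (hypothetical helper)
# Prints the highest kernel version that has a directory under
# MODULES_DIR (default: /lib/modules).
latest_kernel() {
    modules_dir=${1:-/lib/modules}
    for d in "$modules_dir"/*/; do
        [ -d "$d" ] || continue
        basename "$d"
    done | sort -V | tail -n 1
}

# Possible use inside the chroot:
#   kver=$(latest_kernel)
#   dracut -v -f "/boot/initramfs-$kver.img" "$kver"
```

This sidesteps `uname -r`, which in the chroot still reports the live USB's kernel.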
With the initramfs built, we can also write a new grub config to resolve the boot issue:
$ grub2-mkconfig -o /boot/grub2/grub.cfg
And reboot, letting it boot from the internal drive instead of the USB this time.
Data transfer
This isn’t the first time this machine has flaked out, so I want to migrate data off of it preventatively. I need to transfer several TBs of data across my LAN, with some individual files in the 10s of GB range each. There are several tools available for this (I like the looks of croc, but don’t want to use a public relay server and I don’t want to learn how to configure a private one right now).
I’m going to use rsync in pull mode to transfer my data. This is also known as “daemonized” or “remote-source” rsync. I’m also going to leverage screen to be able to detach from the rsync session while leaving it running in the background.
On the source machine, we need to create an rsync config in /etc/rsyncd.conf
# name your rsync server share
[data]
# path to data for rsync to share
path = /path/to/data
# don't let rsync write to the share
read only = yes
timeout = 30
and then start the rsync daemon with:
$ rsync --daemon
On the destination machine, I’m going to run rsync in a screen so I can detach from it without killing the transfer, and reattach to check the progress. I’ll create a screen session named “rsync” with:
$ screen -S rsync
This drops me right into the screen session. The screen incantations I’ll be using from within screen to detach, create, and move around are:
Keys | Effect |
---|---|
<Ctrl-A>+d | Detach |
<Ctrl-A>+c | Create new nested screen |
<Ctrl-A>+n | Move to next screen |
And the screen flags to find and reattach to a session are:
Command | Effect |
---|---|
$ screen -ls | List sessions |
$ screen -r name | Reattach to the session named “name” |
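As an alternative to attaching and detaching by hand, screen can start a session that is already detached by combining `-d -m` with `-S`. A minimal sketch follows; the wrapper name `start_detached` and the `SCREEN_BIN` override are my own additions, the latter only so the command can be previewed.

```shell
#!/bin/sh
# start_detached NAME COMMAND [ARGS...]  (hypothetical helper)
# Starts COMMAND in a new, already-detached screen session called NAME.
# SCREEN_BIN can be set to "echo" to print the arguments instead.
start_detached() {
    name=$1
    shift
    ${SCREEN_BIN:-screen} -dmS "$name" "$@"
}

# Possible use:
#   start_detached rsync rsync -avhzP --inplace 192.168.0.200::src/files /recovered/files
#   screen -r rsync    # reattach later to check progress
```

One caveat: a session started this way ends when its command exits, so the final output is gone unless it was also logged somewhere.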
With the rsync daemon running on the recovered machine, serving “data”, we can run rsync on the destination machine in “remote-source” mode. I’m copying the files on to my ZFS pool:
$ rsync -avhzP --inplace 192.168.0.200::src/files /recovered/files
Rsync logs the transfer status to stdout, showing us how things are going. Note that a single rsync transfer is serial (not parallelized), so I created a new screen window and ran iftop
and htop
to see whether it was saturating my network or CPU (neither), checked the same on the source machine (neither again), and then started a second, parallel sync of another directory (::media/movies
) to take advantage of my gigabit LAN, fast disks, and multicore CPUs. Repeat as needed: aim to fully saturate the network, and parallelize up to about 1.5X the number of threads in the system, at which point rsync will start getting CPU bound. I’m able to read data reasonably fast because my drive array is RAID10, but unstriped data reads slower and could bottleneck on disk I/O before saturating either the network or the CPU. If adding an rsync
process doesn’t increase either CPU or network usage, you’ve probably maxed out your disks.
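Starting the parallel transfers by hand works, but the same idea can be sketched as a loop that backgrounds one rsync per directory and waits for all of them. The function name `parallel_pull` and the `RSYNC_BIN` override are hypothetical; the flags are the ones used above.

```shell
#!/bin/sh
# parallel_pull HOST DEST MODULE_PATH...  (hypothetical helper)
# Starts one rsync per MODULE_PATH in the background, then waits for
# all of them. RSYNC_BIN can be set to "echo" to preview the commands.
parallel_pull() {
    host=$1
    dest=$2
    shift 2
    for src in "$@"; do
        ${RSYNC_BIN:-rsync} -avhzP --inplace "$host::$src" "$dest/" &
    done
    # Plain `wait` blocks until every job ends but does not report
    # which, if any, of them failed.
    wait
}

# Possible use:
#   parallel_pull 192.168.0.200 /recovered src/files media/movies
```

The progress output of the jobs will interleave in one terminal, so separate screen windows are still easier to read; the loop mainly helps once you know how many transfers your disks and network can sustain.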
Read the other articles in this series here:
- New Year, New Lab
- #TODO - Epyc EKWB liquid cooled server build
- ZFS on Linux, ZED, and Postfix
- Configuring Postfix with Gmail
- WireGuard VPN mesh
- PiHole and DNS over WireGuard
- Private DNS with CoreDNS
- #TODO - VFIO GPU Passthrough
- #TODO - Networking: Unifi, VLANs, and (Core)DNS localzones over WireGuard
- Rescuing a bad Fedora upgrade via chroot