Lately I’ve been fielding some questions about Zos nodes that get stuck while booting up. This happens occasionally at different places in the boot sequence, seemingly at random. Without a clear cause or steps to reliably reproduce these issues, it is difficult to pursue fixes in our code bases. It’s not even clear whether adding or changing code in the bootstrap or Zos could provide a solution.
Since the introduction of the farmerbot, Zos is getting booted a lot more and failure to boot is a more serious matter for farmers. Surely there’s something we can do?
While looking for an answer, I ventured into the arcane realm of modern manifestations of the humble and once ubiquitous serial port.
What about a watchdog?
One idea from a farmer about how we could address nodes stuck during boot was to add some kind of watchdog that can detect when boot progress is stalled and take some action to unstall it, like rebooting the node. Indeed, watchdog timers are a staple of embedded systems, a domain where serial ports especially maintain their relevance. If your code doesn’t reset the watchdog timer soon enough, the system reboots, thus providing an escape hatch for any potential issue stalling progress.
So in a way it seems simple enough—just watch the output from the node on screen and hit the reset switch if stuff isn’t happening fast enough. Of course, that’s a plan that requires a lot of screens and a lot of specially trained robots. Thankfully, in the world of enterprise server gear, we have out-of-band (aka, lights-out) management systems like iDRAC and iLO, which provide remote console access. That’s usually a matter of opening a web browser and getting a view of a virtual screen that conveniently has a virtual power button beside it.
Well, with virtual screens, and virtual power buttons, why not virtual robots? AI is getting pretty good these days, after all…
The bliss of boring old technology
OOB systems don’t just provide slick browser-based interfaces. They also provide a smattering of APIs, bespoke consoles accessed via SSH, and implementations of standardized protocols designed for managing networks and servers.
I opened the iLO browser UI on one of my test machines and started poking around. Here’s a look at some of the options we have for interacting with the OOB system:
Wow, that’s kinda a lot, huh?
This feels like a good time for a public service announcement. Please be careful with OOB systems. Even with security features enabled, they should never be exposed directly to the public internet.
I started looking up some acronyms from this page and ended up on the Wikipedia page for IPMI. There I learned that this standardized and widely adopted protocol for server management included serial over LAN (SOL) since version 2.0, “whereby serial console output can be remotely viewed over the LAN.”
So far I hadn’t been sure if it would be possible to get the Zos logs as text via some OOB feature. It’s certainly possible to see them as an image using the remote console. This is where some thinking around AI entered the picture, in terms of doing optical character recognition on the video output.
Thankfully I now had a much more promising lead: the simple old and boring technology of RS-232 serial (OCR is actually quite an old technology too, but it’s decidedly less boring in the current hype cycle). The only problem was that I didn’t have a clue if Zos could talk to a serial port, nor where I’d find a serial port, since I was pretty sure my test machine has no DB-9 (sic) connector on its back panel.
IPMI/DCMI over LAN Access = Enabled
My screenshot above is actually an anachronism. When I first opened that menu, IPMI/DCMI over LAN Access was disabled. So of course I enabled it and then went looking for some instructions about what the heck to do next.
I landed at this blog post that described how you could use IPMI and SOL to get a shell on a server thousands of miles away after accidentally blowing up your SSH access. My machine isn’t that far away, but it is down a set of stairs in a spider-infested basement that’s very cold this time of year. So I get the sentiment, and the remote console feature in the browser has indeed been super helpful.
It turns out that in Linux land we just need to install ipmitool and issue the correct incantation to wire our local terminal emulator up to the serial over LAN port:
# apt install ipmitool
pacman -S ipmitool
ipmitool -I lanplus -H 192.168.0.42 -U $user -P $password sol activate
In my case, the user name and password are the same ones I use to log in to iLO in the browser (the default being on a sticker on the chassis). Actually, first I did a simpler test with the chassis power status command, and it worked! I could check whether the system was powered on or off from my shell.
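For anyone who wants to script this check, here is a minimal sketch. The helper names `ipmi` and `parse_power_state` are made up for illustration, and `OOB_HOST` and `OOB_USER` are placeholder variables of my own. It assumes ipmitool’s `-E` option, which reads the password from the `IPMI_PASSWORD` environment variable so it doesn’t have to appear on the command line, and it assumes the status line reads "Chassis Power is on" or "Chassis Power is off".

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not part of ipmitool.
OOB_HOST=${OOB_HOST:-192.168.0.42}   # example address used in this post

# Run an ipmitool subcommand against the OOB controller.
# -E tells ipmitool to take the password from $IPMI_PASSWORD.
ipmi() {
    ipmitool -I lanplus -H "$OOB_HOST" \
        -U "${OOB_USER:?set OOB_USER first}" -E "$@"
}

# Reduce the status line to "on" or "off".
# Anything unexpected is reported as "unknown".
parse_power_state() {
    case $1 in
        "Chassis Power is on")  echo on ;;
        "Chassis Power is off") echo off ;;
        *)                      echo unknown ;;
    esac
}

# Usage (needs a reachable OOB controller):
#   export OOB_USER='...' IPMI_PASSWORD='...'
#   parse_power_state "$(ipmi chassis power status)"
```

Keeping the parsing separate from the network call means the parsing can be exercised without any hardware attached.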
Running sol activate dropped me into a console session like this, and the fact that something happened seemed to indicate success:
I tried booting up the machine with this session open. The screen blanked as if preparing for a big show, but ultimately nothing happened.
Preboot
I know that the initial phase of Zos boot is done using iPXE. So I thought that would be a good place to go look for tips on how to get it talking to the serial port I was fairly sure I’d managed to open up. That led me to this page which describes how iPXE interacts with the console. This in particular was of interest:
Some BIOSes provide “console redirection” and “serial over LAN” features that can be used to access the BIOS console remotely. If your BIOS is already providing console redirection, then you should not enable the iPXE serial port console, since it will interfere with the BIOS’ own use of the serial port.
What I took from this was that the default config in iPXE, which is to use the “BIOS console”, is probably fine, and I should look for how to activate this “console redirection”. There wasn’t anything about this in the iLO settings. I’d have to get into the BIOS/UEFI settings (via remote console in the browser) and have a look around.
First I found this page:
The BIOS Serial Console Port was set to Auto, so I changed it to Virtual Serial Port. The only other option is Disabled, so this wasn’t especially satisfying. It didn’t really make sense that Auto would be disabling the feature, and there was no other choice aside from virtual.
Thus I went looking elsewhere, via a procedure that might feel familiar to anyone who ever entered one of these BIOS menus. That is to say, I started entering every menu option from the top and descending recursively into every submenu in the same way. Fun times.
Interlude in the key of ttyS1
I should back up and say that along the way I learned something about how serial ports work on Linux. Zos is after all Linux based, and I figured that might be relevant after iPXE hands off control to the next bootstrapping phase. What I found is that serial ports on Linux, in “everything is a file” fashion, can be found under /dev/ttyS0 et cetera, where the 0 can be changed to 1 and so on.
Our iPXE stage downloads and executes a Linux kernel and initramfs. The code to generate them is found in a separate repo from Zos, so I went there and did a search for ttyS. What I found is that we do have a serial tty configured by default, and it’s ttyS1:
Alright, so back in this config menu, I found another section related to serial ports. Thankfully it’s right up toward the top so I didn’t have to dig far:
So here is where we can choose which COM port the virtual serial port gets wired up to. There are two options in this menu. Originally it was set to COM 1 and I switched it to COM 2. Why? Well, since we counted starting at zero in one place and we started counting at one in another, our mapping actually looks like this:
COM 1 > ttyS0
COM 2 > ttyS1
Since our default setting in the initramfs config is ttyS1, I thought maybe I could also get the kernel talking over the virtual serial port by switching this.
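The off-by-one mapping is easy to get backwards, so here it is as a tiny helper. This is only a sketch of the mapping as I observed it on this machine; the function name `com_to_tty` is my own, and other firmware may number things differently.

```shell
#!/bin/sh
# com_to_tty N: print the Linux device name for BIOS port "COM N",
# assuming COM ports count from 1 and ttyS devices count from 0.
com_to_tty() {
    case $1 in
        [1-9]|[1-9][0-9])
            echo "ttyS$(( $1 - 1 ))"
            ;;
        *)
            echo "usage: com_to_tty N (N >= 1)" >&2
            return 1
            ;;
    esac
}

# Usage:
#   com_to_tty 2    # the port I selected in the BIOS menu
```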
Just boot the thing already
I saved these settings and rebooted the machine. Watching in my terminal with ipmitool running, I could see a textual version of what shows up during POST, like system specs, an option to enter a textual version of various menus (including the one I had just exited, and also one time boot options), and then finally:
Here are the logs from iPXE, showing up directly in my terminal. Nice! After this, I could also see the kernel messages:
But what’s missing is the actual log output from the Zos bootstrap utility that the kernel is running at this point. That output is important, because some of the freezes we’re interested in happen during this stage. The kernel also doesn’t produce enough logs to serve as a proxy for ongoing bootstrap activity.
One last kernel param
I did some searching for answers to why output from the kernel might be present while the other console output was not. This GitHub issue gave me a clue (thanks to the kind soul who posted a solution to this issue three years after the issue’s creator closed it saying they didn’t need an answer anymore):
Try swaping your commandline
form console=serial0,115200 console=tty1
to console=tty1 console=serial0,115200
initramfs init script is only using the last console
If we recall the line from our initramfs config from before, we have the same ordering, with the serial console coming first and the regular tty1 coming second. So maybe if we can change which console setting comes last, we can control where the output from the bootstrap goes.
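To make the "last one wins" idea concrete, here is a small sketch that picks out the final console= argument from a kernel command line. The function name `last_console` is hypothetical, and the command lines in the usage notes are illustrative rather than copied from our initramfs config. On a running Linux system the real command line can be read from /proc/cmdline.

```shell
#!/bin/sh
# last_console CMDLINE: print the value of the final console= argument
# in a kernel command line (an empty line if there is none).
# The body runs in a subshell so that "set -f" does not leak out.
last_console() (
    set -f                  # stop words like "*" from being glob-expanded
    last=
    for arg in $1; do       # deliberate word splitting on whitespace
        case $arg in
            console=*) last=${arg#console=} ;;
        esac
    done
    printf '%s\n' "$last"
)

# Usage:
#   last_console 'console=ttyS1,115200 console=tty1'
#   last_console "$(cat /proc/cmdline)"
```

If the clue from that issue applies to our init as well, the value this prints is the device that userspace output ends up on.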
Well, these are kernel command line parameters, and I happen to know that we can add to this list using the Zos bootstrap generator.
Here’s what I tried, using expert mode:
As I’m only noticing now, there’s a mix of ttyS and serial in the console names, and I also forgot to specify the baud rate (115200). But this worked. When I boot from this bootstrap, I see all logs from the boot process in the IPMI SOL console.
As maybe should also be expected, the non-kernel logs no longer appear on the regular remote console, meaning they also won’t appear on a screen if one were connected. To me, the trade-off here is okay. If we capture these logs, we actually get a much more complete view than the small window that’s available on the display. The logs are also in text, which is much nicer to work with. While I didn’t test it yet, I did notice that iLO also has the ability to capture logs from the virtual serial port. All watchdog plans aside, having a history of node logs captured in the OOB tool seems like an upgrade from viewing logs in video mode on the remote console.
Conclusion
One thing I recognized here is that switching to COM 2 was unnecessary. Instead, setting console=ttyS0 should do the trick.
We could distill these steps generally as:
- Ensure IPMI over LAN is activated in the OOB config
- Redirect the BIOS console to virtual serial port/serial over LAN
- Generate a new Zos bootstrap with console=ttyS0 (or use the tty that matches the virtual serial port number)
- Install ipmitool on a system in the same LAN as the OOB
- Connect to the serial over LAN using a line like this:
ipmitool -I lanplus -H 192.168.0.42 -U $user -P $password sol activate
- Boot the node with the new bootstrap
It would be great if anyone can test this method on hardware other than HPE with iLO. The next step from here would be generating some code that can monitor the logs coming from a collection of nodes and take action if the node gets stuck.
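As a starting point for that monitoring code, here is a rough and untested-on-real-nodes sketch of a stall detector. It makes several assumptions: the serial output of a node is being appended to a plain log file by some other means (how that file gets written is outside this sketch), "no new bytes for one polling interval" is a good enough definition of stuck, and there is some log line that marks a finished boot. The function names and the done pattern are hypothetical.

```shell
#!/bin/sh
# log_size FILE: print FILE's size in bytes, or 0 if it does not exist yet.
log_size() {
    if [ -f "$1" ]; then
        echo $(( $(wc -c < "$1") ))   # arithmetic strips any padding from wc
    else
        echo 0
    fi
}

# watch_log FILE INTERVAL DONE_PATTERN ACTION...
# Poll FILE every INTERVAL seconds.
#   - return 0 as soon as a line matching DONE_PATTERN is present
#   - if FILE did not grow during one interval, run ACTION and return 1
watch_log() {
    file=$1 interval=$2 done_pattern=$3
    shift 3
    while :; do
        if [ -f "$file" ] && grep -q -e "$done_pattern" "$file"; then
            return 0
        fi
        before=$(log_size "$file")
        sleep "$interval"
        if [ "$(log_size "$file")" -eq "$before" ]; then
            "$@"            # e.g. power cycle the node via its OOB controller
            return 1
        fi
    done
}

# Usage (the pattern and file name are placeholders):
#   watch_log node1.log 120 'HYPOTHETICAL BOOT FINISHED LINE' \
#       ipmitool -I lanplus -H 192.168.0.42 -U "$user" -E chassis power cycle
```

Polling the file size keeps the sketch in plain POSIX sh. A real tool would probably want per-stage timeouts and a limit on how many times it power cycles a node before giving up.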
If it looks like this can work on a variety of hardware and there’s sufficient interest, I’ll keep working on it. Let me know!