Saturday, September 15, 2012

kexec


Kexec is a patch to the Linux kernel that allows you to boot directly into a new kernel from the currently running one. Compared with a normal boot sequence, kexec skips the entire firmware and bootloader stage and jumps directly into the kernel that you want to boot. There is no hardware reset, no firmware operation, and no bootloader involved. The weakest link in the boot sequence, the firmware, is completely avoided. The big gain from this feature is that system reboots become extremely fast. For enterprise-class systems, kexec drastically reduces reboot-related system downtime. For kernel and system software developers, kexec lets you reboot quickly during development or testing without going through the costly firmware stage every time.
The kexec patch is the work of Eric Biederman and the project is under active development (see the Resources section for more details on the project and how to contribute to it).
Obviously, since this feature touches so many sensitive parts of the operating system, a great deal of care is needed to make it all work properly. The biggest challenge for kexec is that, in Linux, the new kernel that is to be rebooted to needs to sit in the same place in memory as the currently executing one. Replacing the existing kernel in memory with the new one, while still running in the context of the existing kernel, is a tough task. Another big issue is the state of the devices in the system. Firmware always initializes (or resets) the devices to a known "sane" state. The fact that kexec bypasses the firmware stage means that the state of the devices is unreliable.
Subsequent sections of this article will show you how to overcome these challenges, and how direct booting to a new kernel is achieved. Note that kexec is currently available only on the x86 32-bit platform. Although work is underway to port kexec to other platforms, there is no working version of the code yet. Hence, all technical details in the subsequent sections are specific to the x86 platform.

Kexec has two components. The first is the userspace component known as "kexec-tools." The second is the actual kernel patch. The two parts achieve the two main operations of kexec: loading the new kernel into memory and rebooting to it. Getting a kexec-enabled kernel is simple. Download the kexec-tools package and the kernel-specific patch (see the link in the Resources section), build the kexec-tools package to obtain the kexec tool, apply the kernel-specific patch to the kernel tree, build that kernel, and reboot to it. Make sure you have selected the CONFIG_KEXEC option while building the kernel.
As mentioned above, using kexec consists of (1) loading the kernel to be rebooted to into memory, and (2) actually rebooting to it. To load a kernel, the syntax is as follows:
kexec -l <kernel-image> --append="<command-line-options>"
where <kernel-image> is the kernel file that you intend to reboot to and <command-line-options> contain the command-line parameters that need to be passed to the new kernel. Because the wrong command-line options can cause problems during the reboot, passing the contents of /proc/cmdline is the safest way to ensure that legal values are passed to the rebooting kernel.
For example, if the kernel image you want to reboot is /boot/bzImage, and the contents of /proc/cmdline are "root=/dev/hda1", the command to load the kernel would be:
kexec -l /boot/bzImage --append="root=/dev/hda1"
Then, to actually reboot to the loaded kernel, just type:
kexec -e
The system will reboot immediately. Unlike the normal reboot process, kexec does not perform a clean shutdown of the system before rebooting. It is left to you to kill all applications and unmount file systems before attempting a kexec reboot.
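The load step described above can be sketched in C. This is a minimal illustration, not part of kexec-tools: the helper names build_kexec_load_cmd and read_proc_cmdline are hypothetical, and the sketch assumes the command line contains no double quotes that would need escaping.

```c
#include <stdio.h>
#include <string.h>

/* Strip one trailing newline, as found at the end of /proc/cmdline. */
static void chomp(char *s)
{
    size_t n = strlen(s);
    if (n > 0 && s[n - 1] == '\n')
        s[n - 1] = '\0';
}

/*
 * Hypothetical helper: format the "kexec -l" command into out.
 * Returns 0 on success, -1 if the result does not fit in outsz bytes.
 */
int build_kexec_load_cmd(char *out, size_t outsz,
                         const char *kernel_image, const char *cmdline)
{
    int n = snprintf(out, outsz, "kexec -l %s --append=\"%s\"",
                     kernel_image, cmdline);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/*
 * Hypothetical helper: read the running kernel's command line.
 * Returns 0 on success, -1 if /proc/cmdline cannot be read.
 */
int read_proc_cmdline(char *buf, size_t bufsz)
{
    FILE *f = fopen("/proc/cmdline", "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)bufsz, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    chomp(buf);
    return 0;
}
```

A caller would read /proc/cmdline into a buffer, pass it to build_kexec_load_cmd together with the kernel image path, and run the resulting string from a root shell before issuing kexec -e.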
One of the biggest challenges in the development of kexec comes from the fact that the Linux kernel runs from a fixed address in memory. This means that the new kernel needs to sit at the same place that the current kernel is running from. On x86 systems, the kernel sits at the physical address 0x100000 (virtual address 0xc0100000, that is, PAGE_OFFSET, 0xc0000000, plus 0x100000). The task of overwriting the old kernel with the new one is done in three stages:
  1. Copy the new kernel into memory.
  2. Move this kernel image into dynamic kernel memory.
  3. Copy this image into the real destination (overwriting the current kernel), and start the new kernel.
The first two stages are achieved during the "loading" of the kernel. The first task here is to interpret the contents of the kernel image file. Kexec-tools has been built so that, in principle, you could load and boot to any (even a non-Linux) kernel. Currently, it is possible to boot to any elf32-format kernel image. The file is parsed and the kernel "segments" are loaded into buffers. These segments are categorized based on the nature of the code. For example, in the case of the commonly used "bzImage" kernel file format, the typical segments are for 16-bit kernel code, 32-bit kernel code, and init ramdisk code. The structure used to track these segments is known as kexec_segment and is a fairly simple structure:

Listing 1. The kexec_segment structure
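struct kexec_segment {
   void *buf;      /* userspace buffer holding the segment */
   size_t bufsz;   /* size of the userspace buffer */
   void *mem;      /* final destination of the segment */
   size_t memsz;   /* size of the destination region */
};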

The first two elements of the structure point to the userspace buffer and its size, while the next two elements indicate the final destination of the segment and its size.
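As a rough sketch of how a loader might fill in one such segment, the following C fragment describes a buffer and its destination. The helper fill_segment and the constant SKETCH_PAGE_SIZE are hypothetical, and the page-alignment rules (destination aligned to a page, memsz rounded up to whole pages) are assumptions about what the kernel side expects rather than something stated in this article.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL /* assumed x86 page size */

struct kexec_segment {
    void *buf;
    size_t bufsz;
    void *mem;
    size_t memsz;
};

/* Round n up to the next multiple of the page size. */
static size_t page_round_up(size_t n)
{
    return (n + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Hypothetical helper: describe one loaded segment.
 * buf/bufsz cover the userspace copy; mem/memsz cover the
 * page-aligned destination region (assumption, see lead-in).
 * Returns -1 if the destination address is not page-aligned.
 */
int fill_segment(struct kexec_segment *seg, void *buf, size_t bufsz,
                 unsigned long dest)
{
    if (dest & (SKETCH_PAGE_SIZE - 1))
        return -1;
    seg->buf = buf;
    seg->bufsz = bufsz;
    seg->mem = (void *)dest;
    seg->memsz = page_round_up(bufsz);
    return 0;
}
```

With a 5000-byte buffer destined for 0x100000, this yields bufsz of 5000 and memsz of 8192, so the destination region covers two whole pages.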
Once the kernel-file format-specific module loads the image into user memory, the image is transferred to dynamic kernel memory through the use of the sys_kexec system call. This system call allocates dynamic kernel pages for each of the segments that have been passed from userspace and copies the segments onto these kernel pages.
Kexec also allocates a kernel page to store a small stub of assembly code, known as the reboot_code_buffer. This stub of code does the actual job of overwriting the current kernel with the to-be-rebooted kernel and jumps to it. The reboot_code_buffer is the only buffer that resides in its final resting place. In other words, it is executed from the same place that it is initially loaded to. In order to achieve this, on systems with the MMU enabled, the page holding the code is identity mapped. Simply speaking, this involves creating a page table entry in init_mm (the kernel's page table structure) with the same physical and virtual address. This is necessary to be able to access this piece of code during the reboot operation, as discussed later.
Information about the reboot_code_buffer, the various segments, and other details is maintained through the use of the kimage structure:
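To make the idea of an identity mapping concrete, here is a toy single-level page table in C. It is only a model: the real x86 tables are multi-level and live in init_mm, and every name below (toy_page_table, toy_identity_map, toy_translate) is hypothetical. The point it illustrates is that an identity-mapped address translates to itself, so code running there keeps working when paging is switched off and addresses are taken as physical.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_NR_PAGES   1024        /* toy table covers 4 MB */
#define TOY_UNMAPPED   UINT32_MAX

/* Toy page table: index = virtual page number, value = physical frame. */
struct toy_page_table {
    uint32_t frame[TOY_NR_PAGES];
};

/* Mark every virtual page as unmapped. */
void toy_init(struct toy_page_table *pt)
{
    for (size_t i = 0; i < TOY_NR_PAGES; i++)
        pt->frame[i] = TOY_UNMAPPED;
}

/* Identity-map the page containing phys_addr: virtual page == frame. */
int toy_identity_map(struct toy_page_table *pt, uint32_t phys_addr)
{
    uint32_t page = phys_addr >> TOY_PAGE_SHIFT;
    if (page >= TOY_NR_PAGES)
        return -1;
    pt->frame[page] = page;
    return 0;
}

/* Translate a virtual address; TOY_UNMAPPED if no entry exists. */
uint32_t toy_translate(const struct toy_page_table *pt, uint32_t vaddr)
{
    uint32_t page = vaddr >> TOY_PAGE_SHIFT;
    if (page >= TOY_NR_PAGES || pt->frame[page] == TOY_UNMAPPED)
        return TOY_UNMAPPED;
    return (pt->frame[page] << TOY_PAGE_SHIFT) |
           (vaddr & ((1u << TOY_PAGE_SHIFT) - 1));
}
```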

Listing 2. The kimage structure
struct kimage {
        kimage_entry_t head;
        kimage_entry_t *entry;
        kimage_entry_t *last_entry;

        unsigned long destination;
        unsigned long offset;

        unsigned long start;
        struct page *reboot_code_pages;

        unsigned long nr_segments;
        struct kexec_segment segment[KEXEC_SEGMENT_MAX+1];

        struct list_head dest_pages;
        struct list_head unuseable_pages;
};

The most important parts of this structure are, of course, the segment[KEXEC_SEGMENT_MAX+1] elements, which point to the buffers in kernel memory containing the image, and the reboot_code_pages pointer to the assembly stub used during reboot.
Once the kernel image has been loaded, the system is ready to reboot into it. The actual reboot into the new kernel starts with the kexec -e command. This command essentially asks the kernel to perform a reboot using the sys_reboot system call, but with a special flag, LINUX_REBOOT_CMD_KEXEC.
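For illustration, the request that kexec -e makes can be approximated from C through the raw reboot system call. This is a sketch under assumptions, not the kexec-tools source: the wrapper name kexec_execute and its dry_run parameter are hypothetical, and the constants come from the Linux header linux/reboot.h on a kexec-enabled kernel. Calling it with dry_run set to 0 as root, with an image loaded, would reboot the machine immediately.

```c
#define _GNU_SOURCE
#include <linux/reboot.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: ask the kernel to jump into a previously
 * loaded image. With dry_run set, nothing is invoked and 0 is
 * returned. On failure the system call returns -1 and sets errno
 * (for example EPERM when the caller lacks privileges).
 */
long kexec_execute(int dry_run)
{
    if (dry_run)
        return 0;
    return syscall(SYS_reboot, LINUX_REBOOT_MAGIC1, LINUX_REBOOT_MAGIC2,
                   LINUX_REBOOT_CMD_KEXEC, NULL);
}
```

Because no clean shutdown is performed, a real caller would stop applications and unmount file systems before making this call.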
The reboot system call, upon seeing the special flag, transfers control to the machine_kexec() function. The actions performed by machine_kexec() are extremely architecture-specific. In the current x86 implementation, the sequence of actions is as follows:
  1. To access the identity-mapped reboot_code_buffer, switches from the current process's mm struct to using the kernel's init_mm structure.
  2. Stops the APICs and disables interrupts.
  3. Copies the assembly stub code into the reboot_code_buffer that was allocated during the loading of the kernel image. The assembly code is found in the relocate_new_kernel routine.
  4. Loads all the segment registers with the kernel data segment (__KERNEL_DS) value, and invalidates the GDT and IDT.
  5. Jumps to the code in the reboot_code_buffer, and passes some vital information as parameters to the new kernel, such as the indirection page containing the source/destination addresses of the kernel image, the starting address of the new kernel, the address of the reboot_code_buffer page, and a flag indicating whether the system has physical address extension (PAE) enabled.
The assembly stub code performs the following operations:
  • Reads the arguments from the stack and stores them on registers, and disables interrupts.
  • Using the address of its own page, which has been passed to it as an argument, sets up a stack at the end of that page.
  • Stores the starting address of the new kernel image onto the stack so that a return from the stub code automatically takes the system to the new kernel image.
  • Disables paging by setting appropriate bits on the cr0 register.
  • Resets the page directory base register, cr3, to zero.
  • Flushes the Translation Lookaside Buffers (TLBs).
  • Copies all the kernel image pages onto their final destination pages.
  • Flushes the TLB once again.
  • Resets all the registers to zero, except the stack pointer register esp (as it is pointing to the stack containing the starting address of the new kernel).
  • "Returns" from the stub code. This automatically takes the system to the new kernel.
After this sequence completes, the new kernel takes control and the system is booted up normally.
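The page-copying step performed by the stub can be modeled in userspace C. This is a simplified sketch: offsets into a byte array stand in for physical addresses, the function name toy_relocate is hypothetical, and the flag names and values are assumptions modeled on the kernel's kexec header as best recalled. The real list can also chain to further pages of entries through an indirection flag, which this model leaves out.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE   4096UL
#define IND_DESTINATION 0x1UL  /* entry sets the current destination page */
#define IND_DONE        0x4UL  /* entry terminates the list */
#define IND_SOURCE      0x8UL  /* entry names a page to copy */
#define IND_FLAGS       (TOY_PAGE_SIZE - 1)

/*
 * Walk a list of (page offset | flag) entries and copy pages within mem.
 * Offsets are page-aligned byte offsets into mem, standing in for
 * physical addresses. Returns the number of pages copied, or -1 if an
 * entry is malformed or falls outside mem.
 */
long toy_relocate(unsigned char *mem, size_t memsz, const unsigned long *entry)
{
    unsigned long dest = 0;
    long copied = 0;

    for (; !(*entry & IND_DONE); entry++) {
        unsigned long addr = *entry & ~IND_FLAGS;

        if (addr + TOY_PAGE_SIZE > memsz)
            return -1;
        if (*entry & IND_DESTINATION) {
            dest = addr;                 /* start a new destination run */
        } else if (*entry & IND_SOURCE) {
            if (dest + TOY_PAGE_SIZE > memsz)
                return -1;
            memmove(mem + dest, mem + addr, TOY_PAGE_SIZE);
            dest += TOY_PAGE_SIZE;       /* next source lands one page on */
            copied++;
        } else {
            return -1;
        }
    }
    return copied;
}
```

A list holding one destination entry followed by two source entries copies the two source pages to consecutive pages starting at the destination, which mirrors how scattered kernel pages are gathered into their final contiguous location.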

Systems with high availability requirements and kernel developers who have to constantly reboot their systems will benefit most from kexec. Because kexec skips the most time-consuming parts of system reboot, namely the firmware stage, reboots are extremely quick and availability is increased.
Kexec also has interesting applications in crash dumping tools. The Linux Kernel Crash Dumps (LKCD) project has used kexec to develop a different dumping mechanism. At a system panic or user dump initiation, the system memory image is compressed and stored in available free memory pages. Next, the system is rebooted to another kernel using kexec. This new kernel is told where the dump is stored, and prevents the use of those memory regions by anyone. Subsequently, the memory dump can be written out to either a disk partition or across the network to a different machine.
The key to this design is the fact that by avoiding the firmware stage during reboot, LKCD is able to prevent the physical memory contents from being erased by the firmware. In a crash situation, LKCD also does not have to depend on an unreliable disk or network device driver to write out the memory image to the destination. Once a reboot has been performed and the system is in a reliable state, the dump is written out to the destination using normal system device drivers.
Kexec is currently available on the x86 32-bit platform only. Having it on other architecture platforms such as PPC 64 and AMD 64 would be helpful. Also, better integration with the shutdown interface for graceful termination of processes, shutdown of devices, and unmounting of file systems would make it much more convenient for the average user.
You can contribute to the development of kexec. To get started, try out kexec on a test system. You can also join the "fastboot" mailing list, where all the technical discussions about the project take place.
