There is one area where the DOS era’s “I own the whole system” attitude persists, and it’s a virtualisation millstone: device drivers. You may not realise how ugly the problem and present solution are — or how simple and elegant the real solution will be.
Let’s address ugly first. At boot time, each peripheral controller in an x86 system is mapped to a range of memory addresses in the OS kernel’s address space. That location is fixed, and the device driver takes sole ownership of it. After that one-to-one relationship between a driver and a controller is established, there’s no place where a virtualisation host can tap traffic between devices and drivers so the devices can be shared. Usually, there is no apparent traffic — just reads and writes to memory, and software cannot intercept these.
Today, virtualisation hosts are forced to take ownership of all system devices and present dummy, emulated devices to guest OS instances. This gets the job of sharing done, but with gross overhead and limitations. For example, imagine that you’re running a server with a top-end storage array controller. When the virtualisation host initialises, its device driver grabs hold of this controller.
When a guest OS boots, the host makes that fancy controller look like a simple device, like an Adaptec PCI SCSI or an Intel parallel ATA adapter.
The host mimics such old and simple peripheral controllers, and does so for network adapters, video cards and all the rest, because older devices are easier to emulate, they use fewer resources and it’s likely that all guest OSes will have drivers for them.
In the worst case, which is often the usual case, every disk I/O request a guest makes gets converted to a user-level system call by the host, which then trickles down through another two or three layers to get to the real device driver, and the result bubbles up to the guest after the data’s been copied from one location to another several times.
Throughout this whole process, the guest’s driver probably has the guest OS kernel locked while it waits for a response.
What’s worse, your fancy storage controller with a 256MB cache might be emulated as a controller with a 128KB buffer. So not only does every request for a block of disk data have to travel all the way up and down the whole host/guest stack, it has to be broken into much smaller bits and many more requests.
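To put a rough number on that fragmentation, here is a minimal sketch in Python. It assumes the emulated controller's 128KB buffer caps the size of any single transfer, and it uses a 1MB guest read as an illustrative request size; neither the function name `split_request` nor the 1MB figure comes from any real hypervisor.

```python
KIB = 1024
MIB = 1024 * KIB


def split_request(offset, length, max_chunk):
    """Split one I/O request into (offset, size) pieces of at most max_chunk bytes.

    Hypothetical helper for illustration only; real emulation layers
    apply their own limits and alignment rules.
    """
    if length < 0 or max_chunk <= 0:
        raise ValueError("length must be >= 0 and max_chunk must be > 0")
    chunks = []
    end = offset + length
    while offset < end:
        size = min(max_chunk, end - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


if __name__ == "__main__":
    # Assumed example: a 1MB read squeezed through a 128KB emulated buffer.
    pieces = split_request(0, 1 * MIB, 128 * KIB)
    print(len(pieces))  # 8 separate requests instead of 1
```

Under those assumptions, each of the eight pieces makes its own round trip through the host/guest stack described above.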
And now for the elegant. The specification for AMD’s IOMMU (I/O memory management unit) shows that the problem has a pretty simple fix: with IOMMU, the virtualisation host can create real one-to-one mappings between peripherals and drivers that are unique for each guest and managed entirely by the CPU.
The CPU inserts the tap between drivers and peripherals by watching memory transfers and, through address translation that the x86 architecture does not otherwise extend to I/O, gives each guest the impression that it is dealing directly with the device.
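The idea of per-guest address translation can be modelled in a few lines of Python. This is a toy sketch, not the table format in AMD's specification: the class `ToyIommu`, its method names, the guest labels and the 4KB page size are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class IoPageFault(Exception):
    """Raised when a device access hits an address its guest was never given."""


class ToyIommu:
    """Hypothetical model: one translation table per guest."""

    def __init__(self):
        # {guest_id: {guest_page_number: host_page_number}}
        self._tables = {}

    def map_page(self, guest_id, guest_page, host_page):
        """Let guest_id's device traffic reach host_page via guest_page."""
        self._tables.setdefault(guest_id, {})[guest_page] = host_page

    def translate(self, guest_id, guest_addr):
        """Turn an address as the guest sees it into a host address."""
        page, offset = divmod(guest_addr, PAGE_SIZE)
        try:
            host_page = self._tables[guest_id][page]
        except KeyError:
            raise IoPageFault(f"{guest_id}: no mapping for {guest_addr:#x}") from None
        return host_page * PAGE_SIZE + offset


if __name__ == "__main__":
    iommu = ToyIommu()
    # Both guests use the same address, yet land in different host pages.
    iommu.map_page("guest-a", guest_page=1, host_page=0x200)
    iommu.map_page("guest-b", guest_page=1, host_page=0x300)
    print(hex(iommu.translate("guest-a", 0x1010)))  # 0x200010
    print(hex(iommu.translate("guest-b", 0x1010)))  # 0x300010
```

Each guest keeps using the addresses it expects, and an access outside its own table raises a fault rather than reaching another guest's memory. That is the sense in which every guest appears to have the device to itself.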
The host still has to receive and route to guests the hardware interrupts that say “I’m finished”, but that is comparatively simple. IOMMU is useful even in non-virtualised environments, where it sets up the tantalising possibility of permitting highly performance-sensitive processes to complete their access to peripherals without the overhead of a driver.
Virtualisation needn’t be an all-or-nothing arrangement. All the hardware and OS facilities we’re putting in place to support efficient virtualisation will make non-virtualised systems and applications more flexible, reliable and efficient.