Fun and games in userland

A brief history of virtualisation

By Liam ProvenGet more from this author

Posted in IT Management, 18th July 2011 01:00 GMT

Operating-system level virtualisation

As we explained in part 2 of this series, A brief history of virtualisation, in the 1960s, it was a sound move to run one OS on top of a totally different one. On the hardware of the time, full multi-user time-sharing was a big challenge, which virtualisation neatly sidestepped by splitting a tough problem into two smaller, easier ones.

Within a decade, though, a new generation of hardware made it easy enough that a skunkworks project at AT&T was able to create a relatively small, simple OS that was nonetheless a full multi-user, time-sharing one: Unix.

After its early years as a research project, Unix spent a few decades as a proprietary product, with dozens of competing companies offering their own versions – meaning that it splintered into many incompatible varieties. Each company implemented its own enhancements and then its rivals would copy it to create their own versions.

Fairly late in this process, a new form of virtualisation emerged as one of these features. First implemented as part of FreeBSD 4 in 2000, a version came to Linux in 2001, Sun Solaris in 2005 and IBM AIX in 2007. Each Unix calls it by a different name and has slightly different functionality, but the overall concept is the same.

It springs from a far simpler piece of Unix functionality – the humble chroot command, which dates back all the way to Version 7 Unix in 1979. For those Windows-only types out there, a tiny bit of Unix background is needed at this point.

One big directory tree

In all flavours of Unix, there is just one, system-spanning directory tree, starting at the root directory. On CP/M and DOS and Windows, the base level of storage is an assortment of drive letters, on each of which is a directory tree.

Unix does things the other way round: there’s one big directory tree, starting at the root directory – called just “/” – and disk partitions and volumes appear as directories within it.

What chroot lets you do is transform a subdirectory into the root filesystem just for one particular process. You build a skeletal Unix filesystem containing only whatever files are necessary for that process to run, then imprison that process – and any subprocesses it might create – within it, so that it can no longer see the whole directory tree, just its particular subtree.

In essence, then, it virtualises just the filesystem: it doesn't protect the system against a program with superuser – ie, administrative – rights, but it is very handy for testing purposes.

The chroot command proved to be such a useful tool that in time it was extended into a more complete form, which also virtualised the memory space, I/O and so on of the operating system. The result is that a process locked inside the virtual environment seemed to have the whole computer to itself.

To understand how this works, consider how modern operating systems work. In most processors, there are at least two privilege levels at which code can execute, which are generally called something like kernel space and user space.

Code running in kernel space – usually the OS kernel and any essential device drivers – is in direct control of the hardware and can directly manipulate peripherals and so on. In contrast, code running in user space can’t – it just gets given its own block of memory, which is all it’s allowed to access, and it has to ask the kernel nicely for I/O. On x86 chips, kernel code runs at a level called Ring 0 and user code in Ring 3, and the levels in between are left unused.

Splitting up

There's only one kernel, and generally, in most systems, there's only one big program running in user space. Sometimes it's called "userland", and it encompasses all the bits that you actually interact with.

As far as the kernel is concerned, userland can effectively be considered as one big program. One original parent process – on Unix boxes, traditionally called init – starts up all the rest and thus is the parent of the whole tree of dozens to hundreds of others.

So if you set things up so that the kernel is able to run more than one userland at a time, you can effectively virtualise the whole visible face of the OS. So long as the primary copy stays in control of the filesystem and the secondary copies are penned up in subdirectories, you can suddenly split your computer into multiple identical “virtual environments” (VEs).

There’s only one copy of the actual OS installed and only one kernel running, but you can have lots of separate root directories and install whatever you like in each of them without it affecting the others.

Each one starts with a skeletal copy of the filesystem with just the essential files it needs – which is what you do with the chroot command anyway – and then it can put whatever it likes, wherever it likes, and it all stays neatly penned up and separate from all the other software on the computer.

This is called “operating-system level virtualisation” or “kernel-level isolation.”

Every sysadmin's dream

No more “DLL hell,” no more clashing system requirements, no more trying to untangle which directories or files belong to which app. Apps are completely isolated, simplifying management – for instance, they can be removed without a trace, as every file the app ever wrote to disk is locked inside its VE.

So far, so good. Sounds like running a few Windows VMs on a server, doesn’t it?

But it isn’t. Because at the same time, there’s only a single install of a single OS to configure, patch and update; one set of device drivers; and rather importantly if you’re running a commercial OS, one licence to pay for.

No more “DLL hell,” no more clashing system requirement

To anyone who’s ever been a sysadmin, it’s a dream come true. Every app on the machine is locked away from every other one in its own little walled garden.It’s also very different from a performance or administration point of view – each VE is equivalent to just a program, rather than a whole OS instance. You don’t need to allocate storage to VEs – they all share the same pool of memory and disk space, managed by the kernel, so it is vastly more efficient to run multiple VEs than full VMs.

A dozen copies of a full OS under a hypervisor means thirteen times the hardware resources needed by one – so suddenly you need a dual-socket eight-core server with 32GB of RAM to make it all work. Not so with VEs – it only needs the resources for one OS plus the dozen apps.

It must be admitted that this approach doesn’t work for every type of program. If an application has to modify the kernel or change its behaviour, or needs kernel privileges, then you can’t run it inside a virtual environment, because that would affect all of them – so certain apps need to run in the base or parent environment. You can’t just install absolutely anything alongside anything else. Some don’t get along and won’t share.

But overall, VEs are a very useful and powerful tool. The snag is that at the moment, you only get this if you’re running a version of Unix with long trousers.

Solaris 10 calls VEs Zones, management by Containers, and it’s a very powerful implementation. For instance, each zone can have its own network interfaces and sets of user IDs.

Zones can be bound to particular processors for performance optimisation, but they don’t need a dedicated one, nor must they be allocated any memory. The OS not only supports zones offering the native Solaris API but also ones emulating older versions of Solaris as well as Linux-branded zones.

AIX 6.1 does it, too; IBM call them “workload partitions” – WPARs for short – as opposed to LPARs, IBM’s name for full-system virtualisation, as we described in the previous article. WPARs offer several levels of isolation, from some shared resources to none, right down to a single process. A running WPAR can even be migrated onto a different host server.

And for the Free Software user, FreeBSD offers "jails". Jails don’t have all the bells and whistles of their commercial rivals – you just get multiple instances of the same version of the same OS – but are still a very useful tool. Most of the other BSDs shared a similar implementation under the name "sysjails", but an insecurity in their implementation caused development to stop in 2009.

With Linux, the situation is more complex. VEs are not a standard part of the Linux kernel, but there are multiple competing tools offering variations on the same functionality. The newest and possibly simplest is LXC (“Linux Containers”), which builds on the cgroups functionality that’s been built into the kernel since 2.6.29. Rather more mature is Linux-VServer, which is sufficiently robust to allow other distributions’ userlands to be started inside a VE.

Parallel worlds

Probably the most capable for now is OpenVZ, which allows VEs to have their own network and I/O devices. OpenVZ is the basis of Parallels’ commercial Virtuozzo Containers product and its development is sponsored by Parallels.

Aimed at service providers, Virtuozzo builds on OpenVZ with additional management and provisioning tools. It can support a higher density of containers with closer management of their resources and it integrates with Parallels’ Plesk management tools.

Parallels logoInterestingly, Virtuozzo also runs on Windows. Parallels is not as well-known in PC virtualisation circles as it is on Linux and on the Mac, where its Parallels Desktop product brought several new features to Mac users wishing to run Windows. Virtuozzo Containers for Windows brings Unix-style partial virtualisation to Microsoft’s platform.

Each container takes only about 60MB of files, but appears from the management console to be a complete, independent machine – you can even assign it its own IP address and connect to it with Remote Desktop. Obviously, all the containers on a host run the same OS as the host itself, but the memory and disk footprint is dramatically reduced as you're only running a single OS instance.

Virtuozzo or something like it might yet cause a small revolution in PC virtualisation. If it does, going by the company's history, it’s likely that the technique will be imitated by Microsoft itself. Partly because that's what it's currently doing with Hyper-V, which is progressively acquiring more and more of the features of VMware’s VSphere and VCenter management tools. Mostly, though, because Microsoft is in the best position to incorporate OS-level virtualisation into its own OS.

To be fair, the notion of OS-level virtualisation is not a Parallels innovation – as we've discussed, it's been around for more than a decade and has been implemented in multiple OSs. Parallels is just the first company to make it happen on Windows. If the concept were to catch on in the Windows world, it would make virtualisation a great deal simpler, faster and more efficient.

In the next part of this series, we will look at the state of the virtualisation market today – and in the final one, where it might go next. For the historically-inclined, one of the only PC OSs ever to actually use more than rings 0 and 3 was IBM's OS/2. Its kernel ran in ring 0 and ordinary unprivileged code in ring 3, as usual, but unprivileged code that did I/O rang in ring 2.

This is why OS/2 won’t run under Oracle’s open-source hypervisor VirtualBox in its software-virtualisation mode, which forces Ring 0 code in the guest OS to run in Ring 1. ®

Source: http://www.theregister.co.uk/2011/07/18/brief_history_of_virtualisation_part_3/page2.html