Different Scalability
Virtual Machine: abstraction of OS-visible hardware
Fidelity: Programs running in the virtualized environment run identically to running natively.
Performance: A statistically dominant subset of the instructions must be executed directly on the CPU. (Distinguishes VMMs from emulators)
Safety and isolation:
The VMM must completely control access to system resources.
Virtual Machine Monitor (VMM, hypervisor): implements VM abstraction
software layer between OS and hardware
functionally invisible to OS and apps
able to multiplex hardware among multiple VMs
Purpose:
transform "capital expenses" (CAPEX) into "operational expenses" (OPEX): instead of buying a new machine, we do some software work
elasticity: flexible allocation
Encapsulation and portability: can capture a snapshot of a machine's complete running state and recreate the machine from that snapshot
Interposition: can inspect and transform instructions and I/O operations before they reach the hardware
isolation: The VMM must completely control access to system resources.
History
mid-1960s to early 1970s: birth and emergence
early 1970s to late 1970s: extensive commercial use (VM/CMS)
late 1970s to early 1980s: emergence of personal computers (IBM PC)
late 1980s to late 1990s: “demise” of VMs
late 1990s: rebirth of VMs (VMware)
early 2000s: resurgence of research interest in VMs
late 2000s to present: explosion of commercial interest
now: cloud computing
Hardware virtualization: easier to implement than process-level (software) virtualization
narrow interface: fewer things to implement
vendor neutrality: no one vendor can control the definition, revision or distribution of a specification
stable: longevity / ubiquity
Types of System Virtualization
Type 1: native / bare metal
Type 2: hosted
Privileged instructions (e.g., I/O requests, updating CPU state, manipulating the page table):
when executed from user mode: "trap to OS" and executed from kernel mode
when using VM: "Trap to VMM" (and then VMM calls the real kernel)
Non-privileged instructions (e.g., Load from mem):
when executed from user mode: executes in user mode
when using VM: directly run on native CPU
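The trap behaviour above can be sketched as a toy simulation. This is only an illustration, not how any real VMM is written: the instruction names, the Trap exception, and the VirtualCPU fields are all hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass, field

# Hypothetical instruction names, chosen only for this sketch.
PRIVILEGED = {"io_request", "update_cpu_state", "write_page_table"}


class Trap(Exception):
    """Raised when deprivileged guest code attempts a privileged instruction."""

    def __init__(self, instr, operand):
        super().__init__(instr)
        self.instr = instr
        self.operand = operand


@dataclass
class VirtualCPU:
    """Per-VM state that the VMM maintains on the guest's behalf."""
    registers: dict = field(default_factory=dict)
    page_table_base: int = 0
    log: list = field(default_factory=list)


def cpu_execute(instr, operand, vcpu):
    """Model of the real CPU running guest code in a deprivileged mode."""
    if instr in PRIVILEGED:
        raise Trap(instr, operand)          # control transfers to the VMM
    if instr == "load":                     # non-privileged: runs directly
        reg, value = operand
        vcpu.registers[reg] = value
    vcpu.log.append(("direct", instr))


def vmm_emulate(trap, vcpu):
    """VMM trap handler: apply the effect to the virtual CPU only."""
    if trap.instr == "write_page_table":
        vcpu.page_table_base = trap.operand
    vcpu.log.append(("emulated", trap.instr))


def run_guest(program, vcpu):
    """Run a list of (instruction, operand) pairs as guest code."""
    for instr, operand in program:
        try:
            cpu_execute(instr, operand, vcpu)
        except Trap as trap:
            vmm_emulate(trap, vcpu)
```

Running a program that mixes a load with a page-table write leaves the load marked "direct" and the page-table write marked "emulated" in vcpu.log, which is the split the Performance and Safety requirements describe.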
3 layers of memory
Logical: process address space in a VM. (OS's original virtualization)
Physical: abstraction of hardware memory. (The guest OS believes it is hardware memory, but it is not)
Machine: actual hardware memory (e.g. 2GB of DRAM). (Managed by VMM)
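The three layers amount to two translations applied in sequence. A minimal sketch, assuming 4 KiB pages and flat dictionaries as page tables; the table contents and the names guest_page_table and vmm_map are made up for illustration:

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def translate(addr, table):
    """Translate an address through one page-number mapping."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in table:
        raise KeyError(f"page fault at {addr:#x}")
    return table[page] * PAGE_SIZE + offset


# Logical page -> physical page: maintained by the guest OS.
guest_page_table = {0: 5, 1: 7}
# Physical page -> machine page: maintained by the VMM.
vmm_map = {5: 20, 7: 3}


def logical_to_machine(addr):
    """Compose both translations: logical -> physical -> machine."""
    physical = translate(addr, guest_page_table)
    return translate(physical, vmm_map)
```

Here logical address 4112 (page 1, offset 16) maps to physical page 7 and then to machine page 3, so logical_to_machine(4112) returns 12304. The guest OS only ever sees and edits the first table.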
I/O Virtualization:
Direct access (type 1): VMs can directly access devices (requires hardware support, e.g. DMA passthrough, SR-IOV). (I/O passthrough)
Shared access (type 2): VMM provides an emulated device and routes I/O data between the device and the VMs. (I/O emulated by VMM)
Live migration: move a running guest OS to a different machine without the guest OS noticing
purpose: no downtime for upgrades and maintenance
method: copy OS state to another machine
Containers:
Motivation
Downside: no interposition (since instructions run directly on CPU and there is no hypervisor) means no security isolation
Solution: isolate programs and limit the resources they can access (files, other processes, device), but share kernel
Other names: OS-level virtualization, partitions, jails (FreeBSD jail, chroot jail)
Implementation: each process is assigned a "namespace" per resource type (PIDs, UIDs, networks, IPC)
system calls only show resources attached to the caller's own namespace
subprocesses inherit namespace
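The namespace rules above can be modelled in a few lines. This is a simplified sketch for PIDs only: the classes PidNamespace and Process and the method list_processes are hypothetical, and real Linux PID namespaces are hierarchical, which this flat model ignores.

```python
import itertools


class PidNamespace:
    """A private PID number space and the processes attached to it."""

    def __init__(self, name):
        self.name = name
        self._next_pid = itertools.count(1)   # PIDs restart at 1 per namespace
        self.members = {}                     # pid inside namespace -> Process


class Process:
    def __init__(self, cmd, namespace):
        self.cmd = cmd
        self.namespace = namespace
        self.pid = next(namespace._next_pid)
        namespace.members[self.pid] = self

    def fork(self, cmd):
        """A child process inherits the parent's namespace."""
        return Process(cmd, self.namespace)

    def list_processes(self):
        """Stand-in for a system call that enumerates processes."""
        return sorted((pid, p.cmd) for pid, p in self.namespace.members.items())
```

Two containers can each have their own PID 1, and a process in one namespace never sees the processes of the other, even though both run on the same kernel.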
Resource Isolation: usage counters for groups of processes (cgroups, kernfs)
Compressible resources (CPU, I/O bandwidth): rate limiting
Non-compressible resources (memory, disk space): require terminating containers (e.g., OOM killer per cgroup)
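The difference between the two resource kinds can be sketched with a toy usage counter. The Cgroup class below is a hypothetical model, not the kernel's cgroup interface: CPU time is granted up to a per-period quota (the rest simply waits), while a memory request over the limit marks the group as killed.

```python
class Cgroup:
    """Toy usage counters for one group of processes."""

    def __init__(self, cpu_quota_ms, memory_limit):
        self.cpu_quota_ms = cpu_quota_ms    # CPU time allowed per period
        self.memory_limit = memory_limit    # bytes the group may hold
        self.cpu_used_ms = 0
        self.memory_used = 0
        self.killed = False

    def request_cpu(self, ms):
        """Compressible: grant what is left of the quota; the rest must wait."""
        granted = min(ms, self.cpu_quota_ms - self.cpu_used_ms)
        self.cpu_used_ms += granted
        return granted

    def new_period(self):
        """Start of a new accounting period: the CPU quota is refilled."""
        self.cpu_used_ms = 0

    def allocate(self, nbytes):
        """Non-compressible: going over the limit terminates the group."""
        if self.memory_used + nbytes > self.memory_limit:
            self.killed = True
            return False
        self.memory_used += nbytes
        return True
```

A throttled group loses nothing but time, whereas memory already handed out cannot be taken back without destroying state, which is why the only remedy is termination.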
Layering of Filesystem: copy on write
Upper Layer: read-write, per container (ephemeral)
Lower Layer: read only for original files (persistent)
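The two layers can be sketched with dictionaries standing in for directories. This is an illustrative model only: LayeredFS and its methods are hypothetical names, file contents are plain strings, and deletion is left out.

```python
from types import MappingProxyType


class LayeredFS:
    """Read-write upper layer over a shared read-only lower layer."""

    def __init__(self, lower):
        self.lower = lower    # shared image files, never modified
        self.upper = {}       # per-container changes, discarded with it

    def read(self, path):
        """The upper layer shadows the lower layer."""
        if path in self.upper:
            return self.upper[path]
        return self.lower[path]

    def append(self, path, data):
        """Copy on write: copy the file up before the first modification."""
        if path not in self.upper:
            self.upper[path] = self.lower.get(path, "")
        self.upper[path] += data


# MappingProxyType gives a read-only view, so the image cannot be changed.
image = MappingProxyType({"/etc/conf": "a=1\n"})
```

Two containers built on the same image share its files until one of them writes; the write lands in that container's upper layer and the other container still reads the original.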
Fast Boot: 100 milliseconds
High Density: 1000 containers per machine
Limitations of Containers
Hard to Implement: interface too wide
Less General than VM: must share OS
Harder to Migrate: container state is not fully encapsulated, and some state leaks into the host OS (no container migration in practice; better to start a new one)
Adversarial Attack: containers have access to system calls, and there are 400+ system calls (+10/year); one could do "system call filtering", but it is very complicated to get right
In practice, we use VMs to isolate between different users, and containers to isolate different applications/services of a single user