Developing hypervisor from scratch: Part 3 - Setting up VMCS

In this article series you are going to learn how to develop your own hypervisor for virtualization in the Linux ecosystem. In this part we set up the VMCS structure.


In the previous parts of the series we talked about VMM basics and the VMXON operation. If you are not able to execute VMXON successfully (that is, enter VMX root operation), you may have missed setting a few bits. To summarize, here are the steps needed to enter VMX operation:

  • Check VMX support in processor using CPUID.
  • Determine the VMX capabilities supported by the processor through the VMX capability MSRs.
  • Create a VMXON region in non-pageable memory of the size specified by the IA32_VMX_BASIC MSR (or just a 4 KB page), aligned to a 4-KByte boundary.
  • Initialize the version identifier in the VMXON region (bits 30:0 of the first 4 bytes) with the VMCS revision identifier reported by the capability MSRs.
  • Ensure the current processor operating mode meets the required CR0 fixed bits (CR0.PE = 1, CR0.PG = 1).
    Other required CR0 fixed bits can be detected through the IA32_VMX_CR0_FIXED0 and IA32_VMX_CR0_FIXED1 MSRs.
  • Enable VMX operation by setting CR4.VMXE = 1. Ensure the resultant CR4 value supports all the CR4 fixed bits
    reported in the IA32_VMX_CR4_FIXED0 and IA32_VMX_CR4_FIXED1 MSRs.
  • Ensure that the IA32_FEATURE_CONTROL MSR (MSR index 3AH) has been properly programmed and that its lock bit is set (Bit 0 = 1).
  • Execute VMXON with the physical address of the VMXON region as the operand. Check successful execution of VMXON by checking if RFLAGS.CF = 0.

If you have followed these steps properly, VMXON should succeed.
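
The fixed-bit checks in the list above come down to two mask tests. Here is a minimal sketch of them as plain C, so it can be tried outside the kernel. The helper names are made up for illustration and are not part of the series' code. The rule it encodes is the one from the Intel SDM: a bit that is 1 in the FIXED0 MSR must be 1 in the control register, and a bit that is 0 in the FIXED1 MSR must be 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when `cr` satisfies both fixed-bit MSRs.
 * fixed0: bits that must be 1.  fixed1: bits that are allowed to be 1. */
static inline bool cr_fixed_bits_ok(uint64_t cr, uint64_t fixed0,
				    uint64_t fixed1)
{
	return (cr & fixed0) == fixed0 && (cr & ~fixed1) == 0;
}

/* Hypothetical helper: force `cr` into the allowed range by setting the
 * must-be-1 bits and clearing the must-be-0 bits. */
static inline uint64_t cr_apply_fixed_bits(uint64_t cr, uint64_t fixed0,
					   uint64_t fixed1)
{
	return (cr | fixed0) & fixed1;
}
```

In a kernel module you would feed these helpers the values read from IA32_VMX_CR0_FIXED0/FIXED1 (or the CR4 pair) together with the current register value, before executing VMXON.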

Setting up VMCS (VM Control Structures)

A major part of hypervisor development consists of setting up the VMCS properly. The VMCS (virtual-machine control structure) is a structure in memory that the processor uses to store and track the data required during VMX transitions, i.e. when switching from VMX non-root to VMX root operation and vice versa. Things like the guest EIP to use when entering VMX non-root operation (running the guest OS), the host EIP to use when returning from VMX non-root operation, and which instructions cause a VM exit are all stored in the VMCS.

You create a VMCS for a VM by allocating a region in memory (known as the VMCS region). This memory should be 4 KB aligned and zeroed out, just like the VMXON region. Before allocating the memory, let's first discuss the properties of the VMCS region.

Properties of VMCS structure:

  • A VMM can use a different VMCS for each virtual machine that it supports. For a virtual machine with multiple logical processors (virtual processors), the VMM can use a different VMCS for each virtual processor.
  • A logical processor can have many VMCSs (for multiple VMs) active at once, but it operates on only one of them at a given time, called the current VMCS.
  • All accesses to VMCS fields go through the VMREAD and VMWRITE instructions, which operate on the current VMCS.
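
VMREAD and VMWRITE do not take a pointer into the VMCS. They take a 32-bit field encoding, and constants such as 0x00004000 later in this article are exactly such encodings. The sketch below decodes an encoding into its parts, following the layout described in the Intel SDM (bits 9:1 index, bits 11:10 type, bits 14:13 width). The helper and enum names are invented for illustration.

```c
#include <stdint.h>

/* Width of a VMCS field, taken from bits 14:13 of its encoding. */
enum vmcs_field_width {
	VMCS_WIDTH_16 = 0,
	VMCS_WIDTH_64 = 1,
	VMCS_WIDTH_32 = 2,
	VMCS_WIDTH_NATURAL = 3,
};

/* Type of a VMCS field, taken from bits 11:10 of its encoding. */
enum vmcs_field_type {
	VMCS_TYPE_CONTROL = 0,
	VMCS_TYPE_EXIT_INFO = 1,	/* read-only VM-exit information */
	VMCS_TYPE_GUEST_STATE = 2,
	VMCS_TYPE_HOST_STATE = 3,
};

/* Hypothetical decoders for the three parts of an encoding. */
static inline unsigned int vmcs_field_index(uint32_t enc)
{
	return (enc >> 1) & 0x1FF;
}

static inline unsigned int vmcs_field_type(uint32_t enc)
{
	return (enc >> 10) & 0x3;
}

static inline unsigned int vmcs_field_width(uint32_t enc)
{
	return (enc >> 13) & 0x3;
}
```

For example, 0x00004000 decodes to a 32-bit control field with index 0, and 0x0000400c to a 32-bit control field with index 6.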

Different states of VMCS:

Let's look at the figure below:

State of VMCS

We have already talked about the Active and Current states of a VMCS. What remains is the Launched state, which records whether that VMCS has been launched before. To launch a VMCS we use VMLAUNCH; once it has been launched, we must use VMRESUME to enter it again.

Note: Launching just means starting the VM, i.e. entering VMX non-root operation for that VMCS.

Let's see the operations that you can use on a VMCS:

VMPTRLD - Makes the VMCS active and current.
VMCLEAR - Makes the VMCS inactive and not current, and resets its launch state to clear. It should be executed on a VMCS region before the first VMPTRLD so that the region starts from a well-defined state.
VMLAUNCH - Launches the VM defined by the current VMCS. The launch state of that VMCS must be clear.
VMRESUME - Re-enters a VM whose VMCS was launched previously. Used to resume the VM after a VM exit.

The above figure can be easily understood by keeping in mind these terms.
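
To make the state transitions concrete, here is a toy model of them in plain C. It is only a sketch of the rules described above, with invented names. It does not execute any VMX instruction, and it ignores details such as the VMCS data being flushed to memory.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one VMCS: only the two state bits we care about. */
struct vmcs_model {
	bool active;
	bool launched;
};

/* Toy model of a logical processor: it has at most one current VMCS. */
struct cpu_model {
	struct vmcs_model *current;
};

/* VMPTRLD: the VMCS becomes active and current. */
static inline void model_vmptrld(struct cpu_model *cpu, struct vmcs_model *v)
{
	v->active = true;
	cpu->current = v;
}

/* VMCLEAR: the VMCS becomes inactive, not current, launch state clear. */
static inline void model_vmclear(struct cpu_model *cpu, struct vmcs_model *v)
{
	v->active = false;
	v->launched = false;
	if (cpu->current == v)
		cpu->current = NULL;
}

/* VMLAUNCH: needs a current VMCS whose launch state is clear. */
static inline bool model_vmlaunch(struct cpu_model *cpu)
{
	if (cpu->current == NULL || cpu->current->launched)
		return false;
	cpu->current->launched = true;
	return true;
}

/* VMRESUME: needs a current VMCS that was launched before. */
static inline bool model_vmresume(struct cpu_model *cpu)
{
	return cpu->current != NULL && cpu->current->launched;
}
```

Walking one VMCS through VMPTRLD, VMLAUNCH and VMRESUME in this model follows the same arrows as the figure.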

Before moving further with the theoretical aspects of the VMCS, let's first allocate a 4 KiB region for it and zero it out.

// Allocate a zeroed 4 KiB region for the VMCS
#define MYPAGE_SIZE 4096
uint64_t *vmcsRegion = NULL;
...
bool allocVmcsRegion(void) {
	// kzalloc returns memory that is already zeroed out
	vmcsRegion = kzalloc(MYPAGE_SIZE, GFP_KERNEL);
	if (vmcsRegion == NULL) {
		printk(KERN_INFO "Error allocating vmcs region\n");
		return false;
	}
	return true;
}

The format of the VMCS region is as follows.

VMCS region format

The first 4 bytes of the VMCS region contain the VMCS revision identifier at bits 30:0. So, let's put that in the first 4 bytes:

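uint64_t vmcsPhyRegion = 0;
	if (allocVmcsRegion()) {
		// VMPTRLD works on the physical address of the region
		vmcsPhyRegion = __pa(vmcsRegion);
		// bits 30:0 of the first 4 bytes hold the VMCS revision identifier
		*(uint32_t *)vmcsRegion = vmcs_revision_id();
	}
	else {
		return false;
	}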

The next 4 bytes are the VMX-abort indicator. A logical processor writes a non-zero value into these bits if a VMX abort occurs. We are not required to write anything to these bytes.
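
The two header fields can be pictured as a small struct, and the revision identifier itself comes from the low bits of IA32_VMX_BASIC. The sketch below is plain C with invented names. It is not the `vmcs_revision_id()` helper used in this series, just an illustration of the bit positions (bits 30:0 for the revision identifier, bits 44:32 for the region size in bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the first 8 bytes of a VMCS region. */
struct vmcs_region_header {
	uint32_t revision_id;		/* bits 30:0: VMCS revision identifier */
	uint32_t abort_indicator;	/* written by the CPU on a VMX abort */
	/* implementation-specific VMCS data follows */
};

/* Revision identifier: bits 30:0 of IA32_VMX_BASIC. */
static inline uint32_t vmx_basic_revision_id(uint64_t vmx_basic)
{
	return (uint32_t)(vmx_basic & 0x7FFFFFFFu);
}

/* Size in bytes of the VMXON/VMCS region: bits 44:32 of IA32_VMX_BASIC. */
static inline uint32_t vmx_basic_region_size(uint64_t vmx_basic)
{
	return (uint32_t)((vmx_basic >> 32) & 0x1FFFu);
}
```

In the module, the argument would be the value read from the IA32_VMX_BASIC MSR. Nothing after the first 8 bytes should be accessed directly, because the layout of the VMCS data is implementation-specific.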

The remainder of the VMCS region is used for VMCS data (those parts of the VMCS that control VMX non-root operation and the VMX transitions). Setting up this data correctly is our biggest task in getting the hypervisor to run. You can only change VMCS data while the VMCS is the current (and active) VMCS. So, let's first make our VMCS current using VMPTRLD. VMPTRLD takes a memory operand that holds the 64-bit physical address of the VMCS region.

VMPTRLD instruction

Let's do the same in our code. (We use setna to check whether the instruction succeeded: it sets the result to 1 if CF = 1 or ZF = 1, which is how VMX instructions report failure.)


static inline int _vmptrld(uint64_t vmcs_pa)
{
	uint8_t ret;

	__asm__ __volatile__ ("vmptrld %[pa]; setna %[ret]"
		: [ret]"=rm"(ret)
		: [pa]"m"(vmcs_pa)
		: "cc", "memory");
	return ret;
}
...
//making the vmcs active and current
	if (_vmptrld(vmcsPhyRegion))
		return false;
	return true;

Now what remains is to initialize the VMCS data structure.

Initializing VMCS data

VMCS data is organized in 6 parts:

  • Guest-state area - Holds the guest's processor state. It is loaded on VM entry (transition to VMX non-root operation) and saved on VM exit. In simple terms, it is the register state of the guest for the next VM entry: guest EIP, ESP, the values of various MSRs, etc.
  • Host-state area - Holds the processor state that is loaded back when a VM exit occurs (transition to VMX root operation).
  • VM-exit control fields - These fields govern the behavior of VM exits. For example, which MSRs need to be saved on VM exit.
  • VM-execution control fields - These fields control processor behavior in VMX non-root operation. They determine in part the causes of VM exits.
  • VM-entry control fields - These fields control VM entries. For example, which registers and MSRs are loaded on VM entry.
  • VM-exit information fields - These fields receive information on VM exits and describe the cause and the nature of VM exits. They are read-only and are filled in by the processor, so we don't need to initialize them.

We need to set up all the fields except the VM-exit information fields. First we are going to set up the VM-execution controls, then the remaining ones. So, let's start coding again.

VM-execution control field

The VM-execution control fields are further divided into the following fields:

  • Pin-based (asynchronous) controls
  • Processor-based (synchronous) controls
  • Exception bitmap
  • I/O bitmap addresses
  • Timestamp Counter offset
  • CR0/CR4 guest/host masks
  • CR3 targets
  • MSR Bitmaps
  • Extended-Page-Table Pointer (EPTP) (Will ignore for now)
  • Virtual-Processor Identifier (VPID) (Will ignore for now)

Let's learn about each of these fields and set the ones that are required for VM entry. But first you need to know a few terms:

default 0-settings - reserved control bits that must be set to 0

default 1-settings - reserved control bits that must be set to 1

Pin based controls - One 32-bit value that controls asynchronous events in VMX non-root operation.

The settings it supports are reported by the IA32_VMX_PINBASED_CTLS MSR. According to the Intel documentation, this is how we can set the pin-based controls.

The complex wording above just means the following: the low 32 bits of IA32_VMX_PINBASED_CTLS report the control bits that must be 1, and the high 32 bits report the control bits that are allowed to be 1. ANDing the low 32 bits with the high 32 bits therefore gives a value in which only the required bits are set, and that value is always accepted for the pin-based controls.

#define MSR_IA32_VMX_PINBASED_CTLS		0x00000481
#define PIN_BASED_VM_EXEC_CONTROLS		0x00004000
...
bool initVmcsControlField(void) {
    uint32_t pinbased_control0 = __rdmsr1(MSR_IA32_VMX_PINBASED_CTLS);
    uint32_t pinbased_control1 = __rdmsr1(MSR_IA32_VMX_PINBASED_CTLS) >> 32;
    uint32_t pinbased_control_final = (pinbased_control0 & pinbased_control1);
	vmwrite(PIN_BASED_VM_EXEC_CONTROLS, pinbased_control_final);

Processor based controls - Control the handling of synchronous events, i.e. events caused by the execution of specific instructions. As with the pin-based controls, we can set the processor-based controls using IA32_VMX_PROCBASED_CTLS.

#define MSR_IA32_VMX_PROCBASED_CTLS		0x00000482
#define PROC_BASED_VM_EXEC_CONTROLS		0x00004002
...
bool initVmcsControlField(void) {
...
    uint32_t procbased_control0 = __rdmsr1(MSR_IA32_VMX_PROCBASED_CTLS);
    uint32_t procbased_control1 = __rdmsr1(MSR_IA32_VMX_PROCBASED_CTLS) >> 32;
    uint32_t procbased_control_final = (procbased_control0 & procbased_control1);
    vmwrite(PROC_BASED_VM_EXEC_CONTROLS, procbased_control_final);

There are also secondary processor-based controls, which you can use for further settings of synchronous events. They are only consulted when bit 31 of the primary processor-based controls is 1, and you can set them up in the same way using IA32_VMX_PROCBASED_CTLS2.
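
ANDing the two halves of a capability MSR, as the code above does, gives the minimum setting the processor accepts. Once you want to turn on a specific feature, a common pattern is to start from the bits you want and then adjust them with the MSR. The helper below is a sketch of that pattern in plain C. The name `vmx_adjust_controls` is made up and is not part of this series' code.

```c
#include <stdint.h>

/* Hypothetical helper: combine the bits we would like to set with what the
 * capability MSR allows.
 *   low 32 bits of the MSR:  control bits that must be 1
 *   high 32 bits of the MSR: control bits that are allowed to be 1 */
static inline uint32_t vmx_adjust_controls(uint32_t desired, uint64_t cap_msr)
{
	uint32_t must_be_one = (uint32_t)cap_msr;
	uint32_t may_be_one = (uint32_t)(cap_msr >> 32);

	return (desired | must_be_one) & may_be_one;
}
```

With `desired` set to 0 this returns the same value as the AND used above. A desired bit that the processor does not support is silently dropped, so a real module should compare the result with what it asked for.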

Exception bitmap - This is a 32-bit field with one bit per exception vector. A set bit means that the corresponding exception causes a VM exit. We are just going to set it to 0 so that no guest exception causes a VM exit.

#define EXCEPTION_BITMAP				0x00004004
...
bool initVmcsControlField(void) {
...
	vmwrite(EXCEPTION_BITMAP, 0);
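
If you later want some exceptions to cause a VM exit, the bitmap is built by setting the bit whose position equals the exception vector. A small sketch in plain C follows. The macro and function names are invented for illustration, and the vector numbers are the architectural x86 ones.

```c
#include <stdint.h>

/* Architectural x86 exception vectors (a few common ones). */
#define EXC_VECTOR_DE	0	/* divide error */
#define EXC_VECTOR_BP	3	/* breakpoint */
#define EXC_VECTOR_UD	6	/* invalid opcode */
#define EXC_VECTOR_GP	13	/* general protection */
#define EXC_VECTOR_PF	14	/* page fault */

/* Hypothetical helper: bitmap bit for one exception vector (0..31). */
static inline uint32_t exception_bit(unsigned int vector)
{
	return vector < 32 ? (UINT32_C(1) << vector) : 0;
}
```

For example, `exception_bit(EXC_VECTOR_BP) | exception_bit(EXC_VECTOR_UD)` gives a bitmap that makes breakpoints and invalid opcodes in the guest exit to the hypervisor.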

I/O bitmap - Tells which I/O port accesses cause a VM exit. There are two 4 KB bitmaps (A and B):
A contains one bit for each I/O port in the range 0000h through 7FFFh
B contains one bit for each I/O port in the range 8000h through FFFFh

We can ignore these fields, as they are only used when the "use I/O bitmaps" control is set.
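
For reference, here is how a port number maps to a bit in those two bitmaps. This is a sketch in plain C with invented names. In a real hypervisor each bitmap would be its own 4 KB aligned page, and its physical address would be written to the VMCS.

```c
#include <stdint.h>
#include <string.h>

#define IO_BITMAP_BYTES 4096

/* Hypothetical container for the two bitmaps. */
struct io_bitmaps {
	uint8_t a[IO_BITMAP_BYTES];	/* ports 0x0000 through 0x7FFF */
	uint8_t b[IO_BITMAP_BYTES];	/* ports 0x8000 through 0xFFFF */
};

/* Clear both bitmaps: no port access causes a VM exit. */
static inline void io_bitmaps_clear(struct io_bitmaps *bm)
{
	memset(bm, 0, sizeof(*bm));
}

/* Mark one port so that guest accesses to it cause a VM exit. */
static inline void io_bitmap_intercept(struct io_bitmaps *bm, uint16_t port)
{
	uint8_t *map = (port < 0x8000) ? bm->a : bm->b;
	unsigned int bit = port & 0x7FFF;

	map[bit / 8] |= (uint8_t)(1u << (bit % 8));
}
```

Intercepting port 0x3F8 (the first serial port on a PC), for instance, sets bit 0 of byte 127 in bitmap A.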

Timestamp Counter offset - A value that is added to the TSC seen by the guest when TSC offsetting is enabled. We can ignore it for now.

CR0/CR4 guest/host masks - Define which bits in CR0/CR4 are owned by the host and which by the guest. Each mask is paired with a read shadow.
- For bits set to 1 in the mask, the bit is owned by the host:
  - Guest writes: a VM exit occurs if the guest tries to set such a bit to a value that differs from the corresponding read-shadow bit
  - Guest reads: the guest sees the corresponding read-shadow bit
- For bits set to 0 in the mask, the bit is owned by the guest

We don't care whether our guest accesses CR0/CR4, so we are going to ignore them for now.
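
The mask and read-shadow rules are easier to see as two bit expressions. The sketch below models them in plain C. The function names are invented, and it only models the decision logic, not the actual VMCS fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value the guest sees when it reads CR0/CR4: host-owned bits come from the
 * read shadow, guest-owned bits come from the real register. */
static inline uint64_t cr_guest_read(uint64_t real, uint64_t mask,
				     uint64_t shadow)
{
	return (shadow & mask) | (real & ~mask);
}

/* A guest write causes a VM exit when it tries to give any host-owned bit a
 * value that differs from the read shadow. */
static inline bool cr_write_exits(uint64_t new_value, uint64_t mask,
				  uint64_t shadow)
{
	return ((new_value ^ shadow) & mask) != 0;
}
```

For example, with CR0.PG (bit 31) host-owned and a read shadow of 0, the guest reads PG as 0 even though paging is really on, and any attempt to set PG exits to the hypervisor.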

CR3-target values - Allow exceptions to the rule of exiting on every MOV to CR3. A MOV to CR3 does not cause a VM exit if its source operand matches one of these values.

We can ignore that too.

MSR Bitmaps - A 4 KB region partitioned into four contiguous 1 KB blocks:

  • Read bitmap for low MSRs
  • Read bitmap for high MSRs
  • Write bitmap for low MSRs
  • Write bitmap for high MSRs

If the bitmaps are used, an execution of RDMSR or WRMSR causes a VM exit if the value of RCX is in neither of the ranges covered by the bitmaps or if the appropriate bit in the MSR bitmaps (corresponding to the instruction and the RCX value) is 1.

We can happily ignore that for our minimalist hypervisor.:)
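
Should you need it later, this is roughly how an MSR number maps into the four blocks. The sketch is plain C with an invented function name. The ranges are the ones the Intel SDM gives for the bitmaps: low MSRs 00000000h through 00001FFFh and high MSRs C0000000h through C0001FFFh.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_BITMAP_BYTES 4096

/* Hypothetical helper: make RDMSR (on_write == false) or WRMSR
 * (on_write == true) of `msr` cause a VM exit.
 * Block offsets: read-low 0, read-high 1024, write-low 2048, write-high 3072.
 * Returns false if the MSR is outside both ranges covered by the bitmaps. */
static inline bool msr_bitmap_intercept(uint8_t *bitmap, uint32_t msr,
					bool on_write)
{
	uint32_t base = on_write ? 2048 : 0;
	uint32_t bit;

	if (msr <= 0x1FFF) {
		bit = msr;
	} else if (msr >= 0xC0000000u && msr <= 0xC0001FFFu) {
		bit = msr - 0xC0000000u;
		base += 1024;
	} else {
		return false;
	}

	bitmap[base + bit / 8] |= (uint8_t)(1u << (bit % 8));
	return true;
}
```

An MSR for which the helper returns false needs no bit at all, since accesses outside both ranges always cause a VM exit when the bitmaps are in use.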

Since we are done with the VM-execution control fields, let's now set up the VM-exit control fields and the VM-entry control fields.

VM-exit control fields

The VM-exit control fields consist of two groups:

  • VM-Exit Controls
  • VM-Exit Controls for MSRs

VM-Exit Controls - The VM-exit controls constitute a 32-bit vector that governs the basic operation of VM exits. The table below gives details of what each bit corresponds to.

VM-Exit Controls for MSRs - A VMM may specify lists of MSRs to be stored and loaded on VM exits. The fields below show how MSRs can be saved and restored on VM exit.

We can skip the VM-exit controls for MSRs since they are optional. So, what remains is to set the VM-exit controls. Their value depends on whether the host runs in 64-bit mode after the VM exit. Since our host is 64-bit, we need to set bit 9 (host address-space size) to 1, and we can take the remaining bits from MSR_IA32_VMX_EXIT_CTLS.

#define VM_EXIT_CONTROLS				0x0000400c
#define MSR_IA32_VMX_EXIT_CTLS			0x00000483
#define VM_EXIT_HOST_ADDR_SPACE_SIZE	0x00000200
...
bool initVmcsControlField(void) {
...
    /* low 32 bits of the MSR: control bits that must be 1 */
    vmwrite(VM_EXIT_CONTROLS, (uint32_t)__rdmsr1(MSR_IA32_VMX_EXIT_CTLS) |
            VM_EXIT_HOST_ADDR_SPACE_SIZE);

VM-entry control fields

The VM-entry control fields consist of three groups:

  • VM-Entry Controls
  • VM-Entry Controls for MSRs
  • VM-Entry Controls for Event Injection

VM-Entry Controls - A 32-bit vector that governs the basic operation of VM entries.

VM-Entry Controls for MSRs - A VMM may specify a list of MSRs to be loaded on VM entries. The following VM-entry control fields manage this functionality:

  • VM-entry MSR-load count (32 bits)- This field contains the number of MSRs to be loaded on VM entry.
  • VM-entry MSR-load address (64 bits)- This field contains the physical address of the VM-entry MSR-load area.
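
Each entry in such an MSR-load area is 16 bytes: the MSR index, a reserved dword, and the 64-bit value to load. The sketch below shows that layout in plain C. The helper name is invented, and the struct mirrors the entry format described in the Intel SDM.

```c
#include <stdint.h>

/* One entry of a VM-entry MSR-load area (16 bytes). */
struct vmx_msr_entry {
	uint32_t index;		/* bits 31:0: MSR index */
	uint32_t reserved;	/* bits 63:32: reserved, keep 0 */
	uint64_t value;		/* bits 127:64: value to load */
};

/* Hypothetical helper: append one MSR to the area and return the new count,
 * which is what goes into the VM-entry MSR-load count field. The caller must
 * make sure the area has room for the entry. */
static inline uint32_t msr_area_add(struct vmx_msr_entry *area, uint32_t count,
				    uint32_t index, uint64_t value)
{
	area[count].index = index;
	area[count].reserved = 0;
	area[count].value = value;
	return count + 1;
}
```

The area itself has to live in memory whose physical address you can write to the VM-entry MSR-load address field, and that address must be 16-byte aligned.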

VM-Entry Controls for Event Injection - VMX operation allows injecting events into a guest virtual machine through the VM-entry interruption-information field in the VMCS. The event is delivered on the next VM entry, after all guest state has been loaded.
It allows injection of:

  • External interrupts
  • Non-maskable interrupts
  • Exceptions (eg Page faults)
  • Traps

If the interruption-information field indicates a valid interrupt, exception or trap event upon the next VM entry, the processor uses the information in the field to deliver the event through the guest IDT after all guest state and MSRs are loaded.
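
The interruption-information field is a packed 32-bit value: the vector in bits 7:0, the event type in bits 10:8, a "deliver error code" flag in bit 11 and a valid flag in bit 31. Here is a sketch in plain C that builds such a value. The names are made up for illustration, and the type numbers follow the Intel SDM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Event types for bits 10:8 of the interruption-information field. */
enum vmx_event_type {
	VMX_EVENT_EXT_INTR = 0,		/* external interrupt */
	VMX_EVENT_NMI = 2,		/* non-maskable interrupt */
	VMX_EVENT_HW_EXCEPTION = 3,	/* hardware exception, e.g. #PF, #GP */
	VMX_EVENT_SW_INTR = 4,		/* software interrupt (INT n) */
	VMX_EVENT_SW_EXCEPTION = 6,	/* software exception (INT3, INTO) */
};

/* Hypothetical helper: build a valid interruption-information value. */
static inline uint32_t vmx_entry_intr_info(uint8_t vector,
					   enum vmx_event_type type,
					   bool has_error_code)
{
	return (uint32_t)vector |
	       ((uint32_t)type << 8) |
	       (has_error_code ? (1u << 11) : 0) |
	       (1u << 31);	/* bit 31: the field is valid */
}
```

Injecting a general-protection fault (vector 13, with an error code) would use `vmx_entry_intr_info(13, VMX_EVENT_HW_EXCEPTION, true)`, with the error code itself going into a separate VM-entry field.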

We will configure the VM-entry controls in the same way we configured the VM-exit controls, this time setting bit 9 (IA-32e mode guest).

#define VM_ENTRY_CONTROLS				0x00004012
#define MSR_IA32_VMX_ENTRY_CTLS			0x00000484
#define VM_ENTRY_IA32E_MODE				0x00000200
...
bool initVmcsControlField(void) {
...
    /* low 32 bits of the MSR: control bits that must be 1 */
    vmwrite(VM_ENTRY_CONTROLS, (uint32_t)__rdmsr1(MSR_IA32_VMX_ENTRY_CTLS) |
            VM_ENTRY_IA32E_MODE);

That's it for this part. We will set up the guest state and host state in the next part. You can see the complete code here.