Re: [Qemu-devel] [qemu-web PATCH] Add "Understanding QEMU devices" blog post


From: Eric Blake
Subject: Re: [Qemu-devel] [qemu-web PATCH] Add "Understanding QEMU devices" blog post
Date: Fri, 26 Jan 2018 08:46:39 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.2

On 01/26/2018 06:40 AM, Paolo Bonzini wrote:
> On 26/01/2018 10:19, Thomas Huth wrote:
>> Last July, Eric Blake wrote a nice summary for newcomers about what
>> QEMU has to do to emulate devices for the guests. So far, we never
>> integrated this into the QEMU web site or wiki, so let's
>> publish this now as a nice blog post for the users.
> 
> It's very nice!  Some proofreading and corrections follow.

Thanks for digging up my original email, and enhancing it (I guess the
fact that I don't blog very often, and stick to email, means that I rely
on others helping to polish my gems for the masses).

>> +++ b/_posts/2018-01-26-understanding-qemu-devices.md
>> @@ -0,0 +1,139 @@
>> +---
>> +layout: post
>> +title:  "Understanding QEMU devices"
>> +date:   2018-01-26 10:00:00 +0100

That's when you're posting it online, but should it also mention when I
first started these thoughts in email form?

>> +author: Eric Blake
>> +categories: blog
>> +---
>> +Here are some notes that may help newcomers understand what is actually
>> +happening with QEMU devices:
>> +
>> +With QEMU, one thing to remember is that we are trying to emulate what
>> +an OS would see on bare-metal hardware.  All bare-metal machines are
> 
> s/All/Most/ (s390 anyone? :))

Also, s/OS/Operating System (OS)/ to make the acronym easier to follow
in the rest of the document.

> 
>> +basically giant memory maps, where software poking at a particular
>> +address will have a particular side effect (the most common side effect
>> +is, of course, accessing memory; but other common regions in memory
>> +include the register banks for controlling particular pieces of
>> +hardware, like the hard drive or a network card, or even the CPU
>> +itself).  The end-goal of emulation is to allow a user-space program,
>> +using only normal memory accesses, to manage all of the side-effects
>> +that a guest OS is expecting.
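
To make "poking at an address causes a side effect" concrete for
readers, maybe the post could show what such a guest access looks
like; a minimal sketch (device base address, register offset and
bit value are all invented for illustration):

  /* Hypothetical guest driver code: one register write, whose only
   * purpose is the side effect in the device behind the address. */
  #include <stdint.h>

  #define DEV_BASE   0xfe000000UL  /* where the board wired the device */
  #define REG_CTRL   0x00          /* control register offset */
  #define CTRL_START 0x01          /* "go" bit */

  static inline void mmio_write32(uintptr_t addr, uint32_t val)
  {
      /* volatile: the compiler must not optimize away or reorder an
       * access whose point is the side effect, not the memory */
      *(volatile uint32_t *)addr = val;
  }

  void start_device(void)
  {
      mmio_write32(DEV_BASE + REG_CTRL, CTRL_START);
  }
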
>> +
>> +As an implementation detail, some hardware, like x86, actually has two
>> +memory spaces, where I/O space uses different assembly codes than
>> +normal; QEMU has to emulate these alternative accesses.  Similarly, many
>> +modern hardware is so complex that the CPU itself provides both
>> +specialized assembly instructions and a bank of registers within the
>> +memory map (a classic example being the management of the MMU, or
>> +separation between Ring 0 kernel code and Ring 3 userspace code - if
>> +that's not crazy enough, there's nested virtualization).
> 
> I'd say the interrupt controllers are a better example so:
> 
> Similarly, many modern CPUs provide themselves a bank of CPU-local
> registers within the memory map, such as for an interrupt controller.

Is it still worth a mention of nested virtualization?
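
Either way, the "two memory spaces" point might land better with a
short x86-only sketch showing that I/O space really does use
different instructions (this is the standard in/out idiom, not QEMU
code):

  #include <stdint.h>

  /* x86 I/O-port space is reached with dedicated instructions
   * (in/out) instead of ordinary loads and stores; QEMU has to
   * emulate both kinds of access. */
  static inline void outb(uint16_t port, uint8_t val)
  {
      __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
  }

  static inline uint8_t inb(uint16_t port)
  {
      uint8_t val;
      __asm__ volatile ("inb %1, %0" : "=a"(val) : "Nd"(port));
      return val;
  }
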

> 
> And then a paragraph break.
> 
>> +With certain
>> +hardware, we have virtualization hooks where the CPU itself makes it
>> +easy to trap on just the problematic assembly instructions (those that
>> +access I/O space or CPU internal registers, and therefore require side
>> +effects different than a normal memory access), so that the guest just
>> +executes the same assembly sequence as on bare metal, but that execution
>> +then causes a trap to let user-space QEMU then react to the instructions
>> +using just its normal user-space memory accesses before returning
>> +control to the guest.  This is the kvm accelerator, and can let a guest
> 
> This is supported in QEMU through "accelerators" such as KVM.

Yeah, when I first wrote the email, we didn't have as many accelerators
in qemu.git :)
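
For readers who want to see what "trap and react" means in code, the
KVM flavor of that loop looks roughly like this (heavily condensed;
all setup and error handling omitted):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* vcpu_fd comes from KVM_CREATE_VCPU; run is the mmap'ed
   * struct kvm_run for that vcpu. */
  void vcpu_loop(int vcpu_fd, struct kvm_run *run)
  {
      for (;;) {
          ioctl(vcpu_fd, KVM_RUN, 0);     /* enter the guest */
          switch (run->exit_reason) {     /* a vmexit brought us back */
          case KVM_EXIT_MMIO:
              /* the guest touched an emulated device's memory;
               * QEMU now performs the side effect in user space */
              break;
          case KVM_EXIT_IO:
              /* the guest executed an x86 in/out instruction */
              break;
          /* ... many other exit reasons ... */
          }
      }
  }
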

> 
>> +run nearly as fast as bare metal, where the slowdowns are caused by each
>> +trap from guest back to QEMU (a vmexit) to handle a difficult assembly
>> +instruction or memory address.  QEMU also supports a TCG accelerator,
> 
> QEMU also supports other virtualizing accelerators (such as
> [HAXM](https://www.qemu.org/2017/11/22/haxm-usage-windows/) or macOS's
> Hypervisor.framework) and also TCG,
> 
>> +which takes the guest assembly instructions and compiles them on the fly
>> +into comparable host instructions or calls to host helper routines (not
>> +as fast, but results in QEMU being able to do cross-hardware emulation).
> 
> While not as fast, TCG is able to do cross-hardware emulation, such as
> running ARM code on x86. (Removing the parentheses)
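
If we want to make "compiles them on the fly" tangible, a toy
translator conveys the flavor (a made-up three-opcode guest ISA;
real TCG generates host machine code rather than filling a table of
function pointers):

  #include <stdio.h>

  enum { OP_INC, OP_PRINT, OP_HALT };      /* toy "guest ISA" */

  typedef void (*host_fn)(long *reg);
  static void do_inc(long *r)   { (*r)++; }
  static void do_print(long *r) { printf("%ld\n", *r); }

  int main(void)
  {
      unsigned char guest_code[] = { OP_INC, OP_INC, OP_PRINT, OP_HALT };
      host_fn tb[16];                      /* "translation block" */
      int n = 0, i;

      /* translate once... */
      for (i = 0; guest_code[i] != OP_HALT; i++)
          tb[n++] = guest_code[i] == OP_INC ? do_inc : do_print;

      /* ...then run host code; real TCG caches translated blocks so
       * the translation cost is paid once per block, not per run */
      long reg = 0;
      for (i = 0; i < n; i++)
          tb[i](&reg);
      return 0;
  }
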
> 
>> +The next thing to realize is what is happening when an OS is accessing
>> +various hardware resources.  For example, most OS ship with a driver
> 
> most operating systems
> 
>> +that knows how to manage an IDE disk - the driver is merely software
>> +that is programmed to make specific I/O requests to a specific subset of
>> +the memory map (wherever the IDE bus lives, as hard-coded by the
>> +hardware board designers),
> 
> (wherever the IDE bus lives, which is specific the hardware board).

specific to the

> 
>  in order to make the disk drive hardware then
>> +obey commands to copy data from memory to persistent storage (writing to
>> +disk) or from persistent storage to memory (reading from the disk).
> 
> When the IDE controller hardware receives those I/O requests it
> communicates with the disk drive hardware, ultimately resulting in data
> being copied from memory...
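
The "specific I/O requests to a specific subset of the memory map"
could even be shown literally; this is the well-known legacy ATA PIO
sequence for reading one sector (standard primary-channel ports;
outb as in the earlier sketch):

  #include <stdint.h>

  #define ATA_NSECT   0x1f2   /* sector count */
  #define ATA_LBA_LO  0x1f3
  #define ATA_LBA_MID 0x1f4
  #define ATA_LBA_HI  0x1f5
  #define ATA_DEVICE  0x1f6
  #define ATA_CMD     0x1f7   /* write: command, read: status */
  #define CMD_READ    0x20    /* READ SECTORS, PIO */

  void ata_read_sector(uint32_t lba)
  {
      outb(ATA_NSECT,   1);
      outb(ATA_LBA_LO,  lba & 0xff);
      outb(ATA_LBA_MID, (lba >> 8) & 0xff);
      outb(ATA_LBA_HI,  (lba >> 16) & 0xff);
      outb(ATA_DEVICE,  0xe0 | ((lba >> 24) & 0x0f)); /* LBA mode */
      outb(ATA_CMD,     CMD_READ);
      /* ...then poll the status register and pull 256 words
       * from the data port at 0x1f0... */
  }
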
> 
>> +When you first buy bare-metal hardware, your disk is uninitialized; you
>> +install the OS that uses the driver to make enough bare-metal accesses
>> +to the IDE hardware portion of the memory map to then turn the disk into
>> +a set of partitions and filesystems on top of those partitions.
>> +
>> +So, how does QEMU emulate this? In the big memory map it provides to the
>> +guest, it emulates an IDE disk at the same address as bare-metal would.
>> +When the guest OS driver issues particular memory writes to the IDE
>> +control registers in order to copy data from memory to persistent
>> +storage, QEMU traps on those writes (whether via kvm hypervisor assist,
>> +or by noticing during TCG translation that the addresses being accessed
>> +are special),
> 
> the accelerator knows that these writes must trap (remove everything in
> parentheses) and passes them to the QEMU IDE controller _device model_.
> 
>  and emulates the same side effects by issuing host
>> +commands to copy the specified guest memory into host storage.
> 
> The device model parses the I/O requests, then emulates them by issuing
> host system calls.  The result is that guest memory is copied into host
> storage.

Works for me.  Thanks for helping clarify my concepts.
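
Maybe also worth showing the other side of the trap: in QEMU a
device model registers read/write callbacks for its region of the
memory map, roughly like this (a fragment, with names shortened;
in-tree this needs QEMU's memory API headers):

  /* Rough shape of a QEMU device model.  The accelerator routes a
   * trapped guest access to these callbacks. */
  static uint64_t mydev_read(void *opaque, hwaddr addr, unsigned size)
  {
      /* emulate the side effect of the guest reading a register */
      return 0;
  }

  static void mydev_write(void *opaque, hwaddr addr,
                          uint64_t val, unsigned size)
  {
      /* emulate the side effect of the guest writing a register,
       * e.g. start copying between guest RAM and host storage */
  }

  static const MemoryRegionOps mydev_ops = {
      .read  = mydev_read,
      .write = mydev_write,
  };
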

> 
> (New paragraph).
> 
>>  On the
>> +host side, the easiest way to emulate persistent storage is by treating
>> +a file in the host filesystem as raw data (a 1:1 mapping of offsets in
>> +the host file to disk offsets being accessed by the guest driver), but
>> +QEMU actually has the ability to glue together a lot of different host
>> +formats (raw, qcow2, qed, vhdx, ...) and protocols (file system, block
>> +device, NBD, sheepdog, gluster, ...) where any combination of host
> 
> Can we link NBD, sheepdog and gluster?  Maybe Ceph instead of Sheepdog.
> 
>> +format and protocol can serve as the backend that is then tied to the
>> +QEMU emulation providing the guest device.
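
The raw case might deserve a sketch precisely because it is so
direct; conceptually the backend boils down to this (the real block
layer adds format drivers, caching and async I/O on top):

  #include <unistd.h>

  /* raw format: guest disk offset == host file offset */
  ssize_t backend_read(int image_fd, void *buf, size_t len,
                       off_t guest_offset)
  {
      return pread(image_fd, buf, len, guest_offset);
  }

  ssize_t backend_write(int image_fd, const void *buf, size_t len,
                        off_t guest_offset)
  {
      return pwrite(image_fd, buf, len, guest_offset);
  }
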
>> +
>> +Thus, when you tell QEMU to use a host qcow2 file, the guest does not
>> +have to know qcow2, but merely has its normal driver make the same
>> +register reads and writes as it would on bare metal, which cause vmexits
>> +into QEMU code, then QEMU maps those accesses into reads and writes in
>> +the appropriate offsets of the qcow2 file.  When you first install the
>> +guest, all the guest sees is a blank uninitialized linear disk
>> +(regardless of whether that disk is linear in the host, as in raw
>> +format, or optimized for random access, as in the qcow2 format); it is
>> +up to the guest OS to decide how to partition its view of the hardware
>> +and install filesystems on top of that, and QEMU does not care what
>> +filesystems the guest is using, only what pattern of raw disk I/O
>> +register control sequences are issued.
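
A concrete command pair might anchor this paragraph for newcomers,
e.g. creating the host file with "qemu-img create -f qcow2
guest.qcow2 20G" and handing it to the guest with "-drive
file=guest.qcow2,format=qcow2": the guest then sees exactly the
blank linear disk described here.
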
>> +
>> +The next thing to realize is that emulating IDE is not always the most
>> +efficient.  Every time the guest writes to the control registers, it has
>> +to go through special handling, and vmexits slow down emulation.  One
>> +way to speed this up is through paravirtualization, or cooperation
>> +between the guest and host.
> 
> Replace last sentence with:
> 
> Of course, different hardware models have different performance
> characteristics when virtualized.  In general, however, what works best
> for real hardware does not necessarily work best for virtualization and,
> until recently, hardware was not designed to operate fast when emulated
> by software such as QEMU.  Therefore, QEMU includes _paravirtualized_
> devices that _are_ designed specifically for this purpose.
> 
> The meaning of "paravirtualization" here is slightly different from the
> original one of "virtualization through cooperation between the guest
> and host".  (Continue with next sentence in the same paragraph).
> 
>>  The QEMU developers have produced a
>> +specification for a set of hardware registers and the behavior for those
>> +registers which are designed to result in the minimum number of vmexits
>> +possible while still accomplishing what a hard disk must do, namely,
>> +transferring data between normal guest memory and persistent storage.
>> +This specification is called virtio; using it requires installation of a
>> +virtio driver in the guest.  While there is no known hardware that
> 
> s/there is no known hardware/no physical device exists/
> 
>> +follows the same register layout as virtio, the concept is the same: a
>> +virtio disk behaves like a memory-mapped register bank, where the guest
>> +OS driver then knows what sequence of register commands to write into
>> +that bank to cause data to be copied in and out of other guest memory.
>> +Much of the speedup in virtio comes from its design - the guest sets aside
>> +a portion of regular memory for the bulk of its command queue, and only
>> +has to kick a single register to then tell QEMU to read the command
>> +queue (fewer mapped register accesses mean fewer vmexits), coupled with
>> +handshaking guarantees that the guest driver won't be changing the
>> +normal memory while QEMU is acting on it.
> 
> Maybe add a short paragraph here like:
> 
> As an aside, just like recent hardware is fairly efficient to emulate,
> virtio is evolving to be also efficient to implement in hardware, of
> course without sacrificing performance for emulation or virtualization.
> Therefore, in the future you could stumble upon physical virtio devices
> as well.
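
It might also help to show how small the guest-visible contract is;
the building block of a virtio queue is just this descriptor (from
the virtio spec, split-virtqueue layout; fields are little-endian):

  #include <stdint.h>

  /* The guest fills a table of these in ordinary RAM and then
   * writes one "kick" register; QEMU walks the table from the
   * other side.  Few register accesses, few vmexits. */
  struct vring_desc {
      uint64_t addr;   /* guest-physical address of the buffer */
      uint32_t len;    /* buffer length in bytes */
      uint16_t flags;  /* e.g. NEXT (chained), WRITE (device fills) */
      uint16_t next;   /* index of the next descriptor in the chain */
  };
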
> 
>> +In a similar vein, many operating systems support a number of network
>> +cards, a common example being the e1000 card on the PCI bus.
>> +OS will probe PCI space, see that a bank of registers with the signature
>> +for e1000 is populated, and load the driver that then knows what
>> +register sequences to write in order to let the hardware card transfer
>> +network traffic in and out of the guest.  So QEMU has, as one of its
>> +many network card emulations, an e1000 device, which is mapped to the
>> +same guest memory region as a real one would live on bare metal.  And
>> +once again, the e1000 register layout tends to require a lot of register
>> +writes (and thus vmexits) for the amount of work the hardware performs,
>> +so the QEMU developers have added the virtio-net card (a PCI hardware
>> +specification, although no bare-metal hardware exists that actually
>> +implements it), such that installing a virtio-net driver in the guest OS
>> +can then minimize the number of vmexits while still getting the same
>> +side-effects of sending network traffic.  If you tell QEMU to start a
>> +guest with a virtio-net card, then the guest OS will probe PCI space and
>> +see a bank of registers with the virtio-net signature, and load the
>> +appropriate driver like it would for any other PCI hardware.
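
To tie the probing together, the classic config-space mechanism
could be sketched; the IDs are the signatures drivers match on
(QEMU's e1000 shows Intel's 8086:100e, virtio-net shows Red Hat's
1af4:1000), and on the QEMU command line the choice is just
"-device e1000" versus "-device virtio-net-pci":

  #include <stdint.h>

  /* legacy x86 PCI configuration access via ports 0xcf8/0xcfc;
   * outl/inl are the 32-bit versions of the earlier in/out sketch */
  static inline void outl(uint16_t port, uint32_t val)
  {
      __asm__ volatile ("outl %0, %1" : : "a"(val), "Nd"(port));
  }

  static inline uint32_t inl(uint16_t port)
  {
      uint32_t val;
      __asm__ volatile ("inl %1, %0" : "=a"(val) : "Nd"(port));
      return val;
  }

  /* config offset 0 holds device ID (high 16) | vendor ID (low 16) */
  uint32_t pci_read_id(uint8_t bus, uint8_t dev, uint8_t fn)
  {
      uint32_t addr = (1u << 31) | (bus << 16) | (dev << 11)
                    | (fn << 8);
      outl(0xcf8, addr);
      return inl(0xcfc);
  }
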
>> +
>> +In summary, even though QEMU was first written as a way of emulating
>> +hardware memory maps in order to virtualize a guest OS, it turns out
>> +that the fastest virtualization also depends on virtual hardware: a
>> +memory map of registers with particular documented side effects that has
>> +no bare-metal counterpart.  And at the end of the day, all
>> +virtualization really means is running a particular set of assembly
>> +instructions (the guest OS) to manipulate locations within a giant
>> +memory map for causing a particular set of side effects, where QEMU is
>> +just a user-space application providing a memory map and mimicking the
>> +same side effects you would get when executing those guest instructions
>> +on the appropriate bare metal hardware.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
