qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH v2 5/8] kvmvapic: Introduce TPR access optimizat


From: Blue Swirl
Subject: Re: [Qemu-devel] [PATCH v2 5/8] kvmvapic: Introduce TPR access optimization for Windows guests
Date: Mon, 13 Feb 2012 18:50:08 +0000

On Mon, Feb 13, 2012 at 10:16, Jan Kiszka <address@hidden> wrote:
> On 2012-02-11 16:25, Blue Swirl wrote:
>> On Fri, Feb 10, 2012 at 18:31, Jan Kiszka <address@hidden> wrote:
>>> This enables acceleration for MMIO-based TPR registers accesses of
>>> 32-bit Windows guest systems. It is mostly useful with KVM enabled,
>>> either on older Intel CPUs (without flexpriority feature, can also be
>>> manually disabled for testing) or any current AMD processor.
>>>
>>> The approach introduced here is derived from the original version of
>>> qemu-kvm. It was refactored, documented, and extended by support for
>>> user space APIC emulation, both with and without KVM acceleration. The
>>> VMState format was kept compatible, so was the ABI to the option ROM
>>> that implements the guest-side para-virtualized driver service. This
>>> enables seamless migration from qemu-kvm to upstream or, one day,
>>> between KVM and TCG mode.
>>>
>>> The basic concept goes like this:
>>>  - VAPIC PV interface consisting of I/O port 0x7e and (for KVM in-kernel
>>>   irqchip) a vmcall hypercall is registered
>>>  - VAPIC option ROM is loaded into guest
>>>  - option ROM activates TPR MMIO access reporting via port 0x7e
>>>  - TPR accesses are trapped and patched in the guest to call into option
>>>   ROM instead, VAPIC support is enabled
>>>  - option ROM TPR helpers track state in memory and invoke hypercall to
>>>   poll for pending IRQs if required
>>>
>>> Signed-off-by: Jan Kiszka <address@hidden>
>>
>> I must say that I find the approach horrible, patching guests and ROMs
>> and looking up Windows internals. Taking the same approach to extreme,
>> we could for example patch Xen guest to become a KVM guest. Not that I
>> object merging.
>
> Yes, this is horrible. But there is no real better way in the absence of
> hardware assisted virtualization of the TPR. I think MS is recommending
> this patching approach as well.

Maybe instead of routing via ROM and the hypercall, the TPR accesses
could be handled directly with guest invisible breakpoints (like GDB
breakpoints, but for QEMU internal use), much like other
instrumentation could be handled.

>>> diff --git a/hw/apic.c b/hw/apic.c
>>> index 086c544..2ebf3ca 100644
>>> --- a/hw/apic.c
>>> +++ b/hw/apic.c
>>> @@ -35,6 +35,10 @@
>>>  #define MSI_ADDR_DEST_ID_SHIFT         12
>>>  #define        MSI_ADDR_DEST_ID_MASK           0x00ffff0
>>>
>>> +#define SYNC_FROM_VAPIC                 0x1
>>> +#define SYNC_TO_VAPIC                   0x2
>>> +#define SYNC_ISR_IRR_TO_VAPIC           0x4
>>
>> Enum, please.
>
> OK.
>
>>
>>> +
>>>  static APICCommonState *local_apics[MAX_APICS + 1];
>>>
>>>  static void apic_set_irq(APICCommonState *s, int vector_num, int 
>>> trigger_mode);
>>> @@ -78,6 +82,70 @@ static inline int get_bit(uint32_t *tab, int index)
>>>     return !!(tab[i] & mask);
>>>  }
>>>
>>> +/* return -1 if no bit is set */
>>> +static int get_highest_priority_int(uint32_t *tab)
>>> +{
>>> +    int i;
>>> +    for (i = 7; i >= 0; i--) {
>>> +        if (tab[i] != 0) {
>>> +            return i * 32 + fls_bit(tab[i]);
>>> +        }
>>> +    }
>>> +    return -1;
>>> +}
>>> +
>>> +static void apic_sync_vapic(APICCommonState *s, int sync_type)
>>> +{
>>> +    VAPICState vapic_state;
>>> +    size_t length;
>>> +    off_t start;
>>> +    int vector;
>>> +
>>> +    if (!s->vapic_paddr) {
>>> +        return;
>>> +    }
>>> +    if (sync_type & SYNC_FROM_VAPIC) {
>>> +        cpu_physical_memory_rw(s->vapic_paddr, (void *)&vapic_state,
>>> +                               sizeof(vapic_state), 0);
>>> +        s->tpr = vapic_state.tpr;
>>> +    }
>>> +    if (sync_type & (SYNC_TO_VAPIC | SYNC_ISR_IRR_TO_VAPIC)) {
>>> +        start = offsetof(VAPICState, isr);
>>> +        length = offsetof(VAPICState, enabled) - offsetof(VAPICState, isr);
>>> +
>>> +        if (sync_type & SYNC_TO_VAPIC) {
>>> +            assert(qemu_cpu_is_self(s->cpu_env));
>>> +
>>> +            vapic_state.tpr = s->tpr;
>>> +            vapic_state.enabled = 1;
>>> +            start = 0;
>>> +            length = sizeof(VAPICState);
>>> +        }
>>> +
>>> +        vector = get_highest_priority_int(s->isr);
>>> +        if (vector < 0) {
>>> +            vector = 0;
>>> +        }
>>> +        vapic_state.isr = vector & 0xf0;
>>> +
>>> +        vapic_state.zero = 0;
>>> +
>>> +        vector = get_highest_priority_int(s->irr);
>>> +        if (vector < 0) {
>>> +            vector = 0;
>>> +        }
>>> +        vapic_state.irr = vector & 0xff;
>>> +
>>> +        cpu_physical_memory_write_rom(s->vapic_paddr + start,
>>> +                                      ((void *)&vapic_state) + start, 
>>> length);
>>
>> This assumes that the vapic_state structure matches guest what guest
>> expect without conversion. Is this true for i386 on x86_64? I didn't
>> check the structure in question.
>
> Yes, the structure in question is a packed one, stable on both guest and
> host side (the guest side is 32-bit only anyway).
>
>>> diff --git a/hw/apic_common.c b/hw/apic_common.c
>>> index 588531b..1977da7 100644
>>> --- a/hw/apic_common.c
>>> +++ b/hw/apic_common.c
>>> @@ -20,8 +20,10 @@
>>>  #include "apic.h"
>>>  #include "apic_internal.h"
>>>  #include "trace.h"
>>> +#include "kvm.h"
>>>
>>>  static int apic_irq_delivered;
>>> +bool apic_report_tpr_access;
>>
>> This should go to APICCommonState.
>
> Nope, it is a global state, also checked in a place where the APIC is
> set up, thus have no local clue about it yet and needs to pick up the
> global view.
>
>>> @@ -238,6 +275,7 @@ static int apic_init_common(SysBusDevice *dev)
>>>  {
>>>     APICCommonState *s = APIC_COMMON(dev);
>>>     APICCommonClass *info;
>>> +    static DeviceState *vapic;
>>>     static int apic_no;
>>>
>>>     if (apic_no >= MAX_APICS) {
>>> @@ -248,10 +286,29 @@ static int apic_init_common(SysBusDevice *dev)
>>>     info = APIC_COMMON_GET_CLASS(s);
>>>     info->init(s);
>>>
>>> -    sysbus_init_mmio(&s->busdev, &s->io_memory);
>>> +    sysbus_init_mmio(dev, &s->io_memory);
>>> +
>>> +    if (!vapic && s->vapic_control & VAPIC_ENABLE_MASK) {
>>> +        vapic = sysbus_create_simple("kvmvapic", -1, NULL);
>>> +    }
>>> +    s->vapic = vapic;
>>> +    if (apic_report_tpr_access && info->enable_tpr_reporting) {
>>
>> I think you should not rely on apic_report_tpr_access being in sane
>> condition during class init.
>
> It is mandatory, e.g. for CPU hotplug, as reporting needs to be
> consistent accross all VCPUs. Therefore it is a static global, set to
> false initially. However, you are right, we lack proper clearing of  the
> access report feature on reset, not only in this variable.

I'd also set it to false initially.

>>> diff --git a/hw/kvmvapic.c b/hw/kvmvapic.c
>>> new file mode 100644
>>> index 0000000..0c4d304
>>> --- /dev/null
>>> +++ b/hw/kvmvapic.c
>>> @@ -0,0 +1,774 @@
>>> +/*
>>> + * TPR optimization for 32-bit Windows guests
>>> + *
>>> + * Copyright (C) 2007-2008 Qumranet Technologies
>>> + * Copyright (C) 2012      Jan Kiszka, Siemens AG
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL version 2, or
>>> + * (at your option) any later version. See the COPYING file in the
>>> + * top-level directory.
>>> + */
>>> +#include "sysemu.h"
>>> +#include "cpus.h"
>>> +#include "kvm.h"
>>> +#include "apic_internal.h"
>>> +
>>> +#define APIC_DEFAULT_ADDRESS    0xfee00000
>>> +
>>> +#define VAPIC_IO_PORT           0x7e
>>> +
>>> +#define VAPIC_INACTIVE          0
>>> +#define VAPIC_ACTIVE            1
>>> +#define VAPIC_STANDBY           2
>>
>> Enums, please.
>
> OK.
>
>>
>>> +
>>> +#define VAPIC_CPU_SHIFT         7
>>> +
>>> +#define ROM_BLOCK_SIZE          512
>>> +#define ROM_BLOCK_MASK          (~(ROM_BLOCK_SIZE - 1))
>>> +
>>> +typedef struct VAPICHandlers {
>>> +    uint32_t set_tpr;
>>> +    uint32_t set_tpr_eax;
>>> +    uint32_t get_tpr[8];
>>> +    uint32_t get_tpr_stack;
>>> +} QEMU_PACKED VAPICHandlers;
>>> +
>>> +typedef struct GuestROMState {
>>> +    char signature[8];
>>> +    uint32_t vaddr;
>>
>> This does not look 64 bit clean.
>
> It's packed.

I meant "virtual address could be 64 bits on a 64 bit host", not
structure packing.

>>
>>> +    uint32_t fixup_start;
>>> +    uint32_t fixup_end;
>>> +    uint32_t vapic_vaddr;
>>> +    uint32_t vapic_size;
>>> +    uint32_t vcpu_shift;
>>> +    uint32_t real_tpr_addr;
>>> +    VAPICHandlers up;
>>> +    VAPICHandlers mp;
>>> +} QEMU_PACKED GuestROMState;
>>
>> Why packed, is this passed to guest directly?
>
> It is a data field in the option ROM, see vapic_base in kvmvapic.S.
>
>>
>>> +
>>> +typedef struct VAPICROMState {
>>> +    SysBusDevice busdev;
>>> +    MemoryRegion io;
>>> +    MemoryRegion rom;
>>> +    bool rom_mapped_writable;
>>
>> I'd put this later to avoid a structure hole.
>
> Moving it after rom_state may save us a few precious bytes. Well, ok. :)
>
>>
>>> +    uint32_t state;
>>> +    uint32_t rom_state_paddr;
>>> +    uint32_t rom_state_vaddr;
>>> +    uint32_t vapic_paddr;
>>> +    uint32_t real_tpr_addr;
>>> +    GuestROMState rom_state;
>>> +    size_t rom_size;
>>> +} VAPICROMState;
>>> +
>>> +#define TPR_INSTR_IS_WRITE              0x1
>>> +#define TPR_INSTR_ABS_MODRM             0x2
>>> +#define TPR_INSTR_MATCH_MODRM_REG       0x4
>>> +
>>> +typedef struct TPRInstruction {
>>> +    uint8_t opcode;
>>> +    uint8_t modrm_reg;
>>> +    unsigned int flags;
>>> +    size_t length;
>>> +    off_t addr_offset;
>>> +} TPRInstruction;
>>
>> Also here the order is pessimized.
>
> Don't see the gain here, though.

There are two bytes' hole between modrm_reg and flags, maybe also 4
bytes between length and addr_offset (if size_t is 32 bits but off_t
64 bits). I'd reverse the order so that members with largest alignment
needs come first.

>>> +static int find_real_tpr_addr(VAPICROMState *s, CPUState *env)
>>> +{
>>> +    target_phys_addr_t paddr;
>>> +    target_ulong addr;
>>> +
>>> +    if (s->state == VAPIC_ACTIVE) {
>>> +        return 0;
>>> +    }
>>> +    for (addr = 0xfffff000; addr >= 0x80000000; addr -= TARGET_PAGE_SIZE) {
>>> +        paddr = cpu_get_phys_page_debug(env, addr);
>>> +        if (paddr != APIC_DEFAULT_ADDRESS) {
>>> +            continue;
>>> +        }
>>> +        s->real_tpr_addr = addr + 0x80;
>>> +        update_guest_rom_state(s);
>>> +        return 0;
>>> +    }
>>
>> This loop looks odd, what should it do, probe for unused address?
>
> Seems to deserve a comment: We have to scan for the guest's mapping of
> the APIC as we enter here without a hint from an TPR accessing
> instruction. So we probe the potential range, trying to find the page
> that maps to that known physical address (known in the sense that
> Windows does not remap the APIC physically - nor does QEMU support that
> so far).

Yes, more comments would be nice, especially on theory of operation.

>>> +static int evaluate_tpr_instruction(VAPICROMState *s, CPUState *env,
>>> +                                    target_ulong *pip, int access)
>>> +{
>>> +    const TPRInstruction *instr;
>>> +    target_ulong ip = *pip;
>>> +    uint8_t opcode[2];
>>> +    uint32_t real_tpr_addr;
>>> +    int i;
>>> +
>>> +    if ((ip & 0xf0000000) != 0x80000000 && (ip & 0xf0000000) != 
>>> 0xe0000000) {
>>
>> The constants should be using ULL suffix because target_ulong could be
>> 64 bit, though maybe this is more optimal.
>
> target_ulong is 64-bit unconditionally on x86. I'll add this.

No, target_phys_addr_t is now 64 bits, but target_ulong (register
size) is 32 bits for i386-softmmu.

>>> +
>>> +/*
>>> + * Tries to read the unique processor number from the Kernel Processor 
>>> Control
>>> + * Region (KPCR) of 32-bit Windows. Returns -1 if the KPCR cannot be 
>>> accessed
>>> + * or is considered invalid.
>>> + */
>>
>> Horrible hack. Is guest OS type or version checked somewhere?
>
> This is all about hacking Windows 32-bit. And this check encodes that
> even stronger. The other important binding is the expected virtual
> address of the ROM mapping under Windows. I would have preferred
> checking the version directly, but no one has a complete list of
> supported guests and their codes.

Then it would be nice to only enable this with a command line switch,
so that some random poor non-Windows OS is not patched incorrectly.

>>
>>> +static int get_kpcr_number(CPUState *env)
>>> +{
>>> +    struct kpcr {
>>> +        uint8_t  fill1[0x1c];
>>> +        uint32_t self;
>>> +        uint8_t  fill2[0x31];
>>> +        uint8_t  number;
>>> +    } QEMU_PACKED kpcr;
>>
>> KPCR. Pointers to Windows documentation would be nice.
>
> Oops, yes.
>
> Unfortunately, this is only an internal structure, not officially
> documented by MS. However, all supported OS versions a legacy by now, no
> longer changing its structure.

This and a note about the supported OS versions could be added as comment.

>>> +
>>> +static int patch_hypercalls(VAPICROMState *s)
>>> +{
>>> +    target_phys_addr_t rom_paddr = s->rom_state_paddr & ROM_BLOCK_MASK;
>>> +    static uint8_t vmcall_pattern[] = {
>>
>> const
>>
>>> +        0xb8, 0x1, 0, 0, 0, 0xf, 0x1, 0xc1
>>> +    };
>>> +    static uint8_t outl_pattern[] = {
>>
>> const
>
> Yep.
>
>>> +static void vapic_write(void *opaque, target_phys_addr_t addr, uint64_t 
>>> data,
>>> +                        unsigned int size)
>>> +{
>>> +    CPUState *env = cpu_single_env;
>>> +    target_phys_addr_t rom_paddr;
>>> +    VAPICROMState *s = opaque;
>>> +
>>> +    cpu_synchronize_state(env);
>>> +
>>> +    /*
>>> +     * The VAPIC supports two PIO-based hypercalls, both via port 0x7E.
>>> +     *  o 16-bit write access:
>>> +     *    Reports the option ROM initialization to the hypervisor. Written
>>> +     *    value is the offset of the state structure in the ROM.
>>> +     *  o 8-bit write access:
>>> +     *    Reactivates the VAPIC after a guest hibernation, i.e. after the
>>> +     *    option ROM content has been re-initialized by a guest power 
>>> cycle.
>>> +     *  o 32-bit write access:
>>> +     *    Poll for pending IRQs, considering the current VAPIC state.
>>> +     */
>>
>> Different operation depending on size? Interesting.
>
> Originally not my idea, just added the third case. :)
>
>>
>>> +    switch (size) {
>>> +    case 2:
>>> +        if (s->state != VAPIC_INACTIVE) {
>>> +            patch_hypercalls(s);
>>> +            break;
>>> +        }
>>> +
>>> +        rom_paddr = (env->segs[R_CS].base + env->eip) & ROM_BLOCK_MASK;
>>> +        s->rom_state_paddr = rom_paddr + data;
>>> +
>>> +        if (vapic_prepare(s) < 0) {
>>> +            break;
>>> +        }
>>> +        s->state = VAPIC_STANDBY;
>>> +        break;
>>> +    case 1:
>>> +        if (kvm_enabled()) {
>>> +            /*
>>> +             * Disable triggering instruction in ROM by writing a NOP.
>>> +             *
>>> +             * We cannot do this in TCG mode as the reported IP is not
>>> +             * reliable.
>>
>> Given the hack level of the whole, it would not be impossible to find
>> the IP using search PC.
>
> Is there a specific pre-existing service you have in mind? Otherwise,
> the complexity might not be worth the gain.

There's gen_intermediate_code() vs. gen_intermediate_code_pc().
Probably not worth it, but it would increase the hack level nicely.

> Thanks for having a look,
> Jan
>
> --
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux



reply via email to

[Prev in Thread] Current Thread [Next in Thread]