From: Danny Canter
Subject: Re: [PATCH] hvf: arm: Allow creating VMs with > 63GB of RAM on macOS 15+
Date: Mon, 12 Aug 2024 18:18:25 -0400

Peter, thanks for the review! Will work on splitting this up a bit to
support the plumbing you mentioned KVM does today on ARM.

> On Aug 12, 2024, at 10:52 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> 
> On Fri, 19 Jul 2024 at 00:03, Danny Canter <danny_canter@apple.com> wrote:
>> 
>> This patch's main focus is to enable creating VMs with > 63GB
>> of RAM on Apple Silicon machines by using some new HVF APIs. In
>> pursuit of this, a couple of things related to how we handle the
>> physical address range we expose to guests were altered:
>> 
>> The default IPA size on all Apple Silicon machines for HVF is
>> currently 36 bits. This bars making a VM with > 63GB of RAM
>> (as RAM starts at 1GB in the memory map). Currently, to get the
>> IPA size we were reading id_aa64mmfr0_el1's PARange field
>> from a newly made vcpu. Unfortunately, HVF just returns the
>> host's PARange directly for the initial value, not the IPA
>> size that will actually back the VM, so we end up believing we
>> have much more address space than we actually do.
> 
> So just to check my understanding, this means that with current
> QEMU, on all Apple hardware, attempting to create a VM with
> more than 63 GB of RAM will always fail in the same way,
> regardless of whether that CPU's hardware has a 36 bit IPA
> or a larger IPA? That is, we don't change the default IPA for the
> VM, so it's 36 bits, and then the hvf command to map in the RAM
> to the guest address space fails with HV_BAD_ARGUMENT, per
> https://gitlab.com/qemu-project/qemu/-/issues/1816 .

Spot on, yes. We always default to a 36-bit IPA space, and macOS 13
introduced the knobs to raise this on a per-VM basis. We aren't raising
it today, so we'll always fail when the kernel gets a hv_vm_map with an
IPA past the end of our address space. (A 36-bit IPA space is 64GB, and
with RAM starting at 1GB in the memory map that leaves at most 63GB for
RAM.)
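
For anyone following along, the failure mode is easy to reproduce
outside QEMU. A minimal sketch (untested as written; assumes Apple
Silicon, a default-configured VM, and a binary signed with the
com.apple.security.hypervisor entitlement):

#include <Hypervisor/Hypervisor.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Default VM config => 36-bit IPA space on current macOS. */
    if (hv_vm_create(NULL) != HV_SUCCESS) {
        fprintf(stderr, "hv_vm_create failed\n");
        return 1;
    }

    size_t len = 0x4000; /* one 16K page */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_PRIVATE, -1, 0);
    if (mem == MAP_FAILED) {
        return 1;
    }

    /* Try to map the page at an IPA just past the 36-bit limit. */
    hv_return_t ret = hv_vm_map(mem, 1ULL << 36, len,
                                HV_MEMORY_READ | HV_MEMORY_WRITE);
    if (ret == HV_BAD_ARGUMENT) {
        printf("hv_vm_map past the IPA limit: HV_BAD_ARGUMENT\n");
    }

    hv_vm_destroy();
    return 0;
}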

> 
>> Starting in macOS 13.0, APIs were introduced to query the maximum
>> IPA size the kernel supports and to set the IPA size for a given
>> VM. However, this still has a couple of issues on < macOS 15. Up
>> until macOS 15 (and if the hardware supported it) the max IPA size
>> was 39 bits, which is not a valid PARange value, so we can't clamp
>> down what we advertise in the vcpu's id_aa64mmfr0_el1 to our IPA
>> size. Starting in macOS 15, however, the maximum IPA size is 40
>> bits (again, if the hardware supports it), which is also a valid
>> PARange value, so we can set our IPA size to the maximum as well as
>> clamp down the PARange we advertise to the guest. This allows VMs
>> with > 63GB of RAM and should fix the oddness of the PARange
>> situation as well.
> 
> So (again to clarify that I understand what's happening here)
> for macOS 13-14 we'll effectively continue to use a 36-bit
> IPA range because we clamp the "39" value down to the next
> lowest actually-valid value of 36 ? And so if you want >63GB
> of memory you'll need all of:
> * a host CPU which supports at least a 40 bit IPA
>   (is there a definition somewhere of which these are?)
> * macOS 15
> * a QEMU with these changes
> 
> ?
> 
> (That seems fine to me: I'm happy to say "get macOS 15 if you
> want this" rather than trying to cope with the non-standard
> 39 bit IPA in QEMU. We should make sure the error message in
> the IPA-too-small case is comprehensible -- I think at the
> moment we somewhat unhelpfully assert()...)
> 

Spot on again. We didn't want to advertise a larger PA range to the
guest than what is actually backing the VM, so for a "correct" world
macOS 15 would be required. You'd get 40 bits of IPA space, and we
finally line up with a valid ARM PARange value. As for whether there's
a list of which SoCs support which IPA size, I don't believe so. All of
the Pro/Max SoCs do, IIRC, but I'd just direct folks to write a small
program that calls `hv_vm_config_get_max_ipa_size` to confirm. There's
`sysctl -a kern.hv`, but I'd avoid recommending that in case the hv API
does some extra munging of the value it reports. The hv_vm_config_ API
should truly be the source of truth on any given machine.
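
Something like this would do (untested sketch, assuming the macOS 13+
API; built with `clang probe.c -framework Hypervisor` and, I believe,
signed with the hypervisor entitlement):

#include <Hypervisor/Hypervisor.h>
#include <stdio.h>

int main(void)
{
    uint32_t max_ipa = 0;
    hv_return_t ret = hv_vm_config_get_max_ipa_size(&max_ipa);

    if (ret != HV_SUCCESS) {
        fprintf(stderr, "hv_vm_config_get_max_ipa_size failed: 0x%x\n",
                (unsigned)ret);
        return 1;
    }
    printf("max IPA size: %u bits\n", max_ipa);
    return 0;
}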

As for the error message we report, I'd have to go back and check what
it says today, but it seemed reasonably clear after this patch, if I
recall. It lives in a codepath that already existed (added for KVM, it
seems) and checks whether any part of the memory map exceeds the
maximum IPA size.

>> For the implementation of this I've decided to only bump the IPA
>> size if the amount of RAM requested is encroaching on the default IPA
>> size of 36 bits, as at 40 bits of IPA space we have to have one extra
>> level of stage2 page tables.
>> 
>> Signed-off-by: Danny Canter <danny_canter@apple.com>
>> Reviewed-by: Cameron Esfahani <dirty@apple.com>
> 
>> @@ -929,6 +977,66 @@ void hvf_arch_vcpu_destroy(CPUState *cpu)
>> {
>> }
>> 
>> +hv_return_t hvf_arch_vm_create(MachineState *ms)
>> +{
>> +    uint32_t default_ipa_size = hvf_get_default_ipa_bit_size();
>> +    uint32_t max_ipa_size = hvf_get_max_ipa_bit_size();
>> +    hv_return_t ret;
>> +
>> +    chosen_ipa_bit_size = default_ipa_size;
>> +
>> +    /*
>> +     * Set the IPA size for the VM:
>> +     *
>> +     * Starting from macOS 13 a new set of APIs was introduced that
>> +     * allows you to query the maximum IPA size supported on your
>> +     * system. macOS 13 and 14 kernels both return a value less than
>> +     * 40 bits (typically 39, but it depends on the hardware), however
>> +     * starting with macOS 15 the IPA size supported (in the kernel at
>> +     * least) is up to 40 bits. A common scheme for attempting to get
>> +     * the IPA size prior to the introduction of these new APIs was to
>> +     * read ID_AA64MMFR0.PARange from a vcpu in the hope that HVF was
>> +     * returning the maximum IPA size there. However, this was not the
>> +     * case. HVF would return the host's PARange value directly, which
>> +     * is generally larger than 40 bits.
>> +     *
>> +     * Using that value we could set up our memory map with regions
>> +     * well outside the actually supported IPA size, and also advertise
>> +     * a much larger physical address space to the guest. On the
>> +     * hardware+OS combos where the IPA size is less than 40 bits but
>> +     * greater than 36, we also don't have a valid PARange value to
>> +     * round down to before 36 bits, which is already the default.
>> +     *
>> +     * With that in mind, before we make the VM let's grab the maximum
>> +     * supported IPA size and clamp it down to the first valid PARange
>> +     * value so we can advertise the correct address size for the guest
>> +     * later on. Then, if it's >= 40, set this as the IPA size for the
>> +     * VM using the new APIs. There's a small heuristic for actually
>> +     * altering the IPA size for the VM, which is: only do so if our
>> +     * requested RAM is encroaching on the top of our default IPA size.
>> +     * This is just an optimization, as at 40 bits we need to create
>> +     * one more level of stage2 page tables.
>> +     */
>> +#if defined(MAC_OS_VERSION_13_0) && \
>> +    MAC_OS_X_VERSION_MIN_REQUIRED >= MAC_OS_VERSION_13_0
>> +    hv_vm_config_t config = hv_vm_config_create();
>> +
>> +    /* In our memory map RAM starts at 1GB. */
> 
> This is not board-specific code, so you can't assume that.
> The board gets to pick the memory map and where RAM starts in it.
> 
> You probably need to do something similar to what we do
> in hw/arm/virt.c:virt_kvm_type() where we find out what
> the best IPA the hypervisor supports is, set the board memory
> map to respect that, diagnose an error if the user asked for
> more RAM than fits into that IPA range, and then arrange for
> the actual VM/vcpu creation to be done with the required IPA.
> 
> This is unfortunately probably going to imply a bit of extra
> plumbing to be implemented for hvf -- that MachineClass::kvm_type
> method is (as the name suggests) KVM specific. (Multi-patch
> patchset for that, where we add the plumbing in as its own
> separate patch (and/or whatever other split of functionality
> into coherent chunks makes sense), rather than one-big-patch, please.)

That's perfectly fine. I'll look at how the plumbing was done for KVM
and emulate it where it makes sense for HVF. Agreed, that would
definitely push this into multi-patch territory. Curious whether you
think what's here today should also be multiple patches, or whether the
current work is fine as one?
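
Roughly, I'm imagining something shaped like virt_kvm_type(), just for
HVF. To be clear, everything below is a made-up sketch to show the
shape; none of these helpers exist yet:

/*
 * Hypothetical sketch, mirroring hw/arm/virt.c:virt_kvm_type(). The
 * helper names here are invented; the point is just the flow: ask the
 * accelerator for its max IPA, size the board memory map to fit, and
 * error out cleanly if the requested RAM can't fit.
 */
static int virt_hvf_type(MachineState *ms)
{
    int max_ipa = hvf_max_ipa_bit_size();              /* invented */
    int requested = virt_ipa_bits_for_memory_map(ms);  /* invented */

    if (requested > max_ipa) {
        error_report("memory map requires a %d-bit IPA range, but the "
                     "host only supports %d bits", requested, max_ipa);
        exit(1);
    }

    /* The chosen size then feeds into hv_vm_config_set_ipa_size(). */
    return requested;
}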

> 
>> +    uint64_t threshold = (1ull << default_ipa_size) - (1 * GiB);
>> +    if (ms->ram_size >= threshold &&
>> +        max_ipa_size >= FIRST_HIGHMEM_PARANGE) {
>> +        ret = hv_vm_config_set_ipa_size(config, max_ipa_size);
>> +        assert_hvf_ok(ret);
>> +
>> +        chosen_ipa_bit_size = max_ipa_size;
>> +    }
>> +
>> +    ret = hv_vm_create(config);
>> +    os_release(config);
>> +#else
>> +    ret = hv_vm_create(NULL);
>> +#endif
>> +
>> +    return ret;
>> +}
> 
>> +uint8_t round_down_to_parange_index(uint8_t bit_size)
>> +{
>> +    for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) {
>> +        if (pamax_map[i] <= bit_size) {
>> +            return i;
>> +        }
>> +    }
>> +    g_assert_not_reached();
>> +}
>> +
>> +uint8_t round_down_to_parange_bit_size(uint8_t bit_size)
>> +{
>> +    for (int i = ARRAY_SIZE(pamax_map) - 1; i >= 0; i--) {
>> +        if (pamax_map[i] <= bit_size) {
>> +            return pamax_map[i];
>> +        }
>> +    }
>> +    g_assert_not_reached();
> 
> We could implement this as
>       return pamax_map[round_down_to_parange_index(bit_size)];
> 
> and avoid having to code the loop twice, right?

Yes, my copy and paste seems dumb reading it back now :)
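Will fold it down to exactly that:

uint8_t round_down_to_parange_bit_size(uint8_t bit_size)
{
    return pamax_map[round_down_to_parange_index(bit_size)];
}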

> 
> thanks
> -- PMM



