
Re: [RFC PATCH 0/6] Enable shared device assignment


From: David Hildenbrand
Subject: Re: [RFC PATCH 0/6] Enable shared device assignment
Date: Wed, 31 Jul 2024 13:18:35 +0200
User-agent: Mozilla Thunderbird

Sorry for the late reply!

Current users must skip it, yes. How private memory would have to be
handled, and who would handle it, is rather unclear.

Again, maybe we'd want separate RamDiscardManager for private and shared
memory (after all, these are two separate memory backends).

We also considered distinguishing the populate and discard operations for
private and shared memory. As in method 2 above, we mentioned adding a new
argument to indicate the memory attribute to operate on. The two approaches
seem to follow a similar idea.
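For illustration only, the extra argument being discussed might look something
like the sketch below (the MemAttribute enum and the added parameter are made
up for this example, not the existing QEMU RamDiscardManager interface):

/* Hypothetical sketch: populate/discard notifications that also carry
 * the memory attribute (shared vs. private) they operate on. */
typedef enum {
    MEM_ATTR_SHARED,
    MEM_ATTR_PRIVATE,
} MemAttribute;

typedef int (*NotifyPopulateFn)(RamDiscardListener *rdl,
                                MemoryRegionSection *section,
                                MemAttribute attr);
typedef void (*NotifyDiscardFn)(RamDiscardListener *rdl,
                                MemoryRegionSection *section,
                                MemAttribute attr);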

Yes. Likely it's just some implementation detail. I think the following states would be possible:

* Discarded in shared + discarded in private (not populated)
* Discarded in shared + populated in private (private populated)
* Populated in shared + discarded in private (shared populated)

Indeed, one could map these to the states discarded/private/shared.
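As a rough sketch only (assumed names, not existing code), that mapping could
look like:

#include <assert.h>
#include <stdbool.h>

/* Sketch: combined per-block state derived from two per-attribute bitmaps. */
typedef enum {
    BLOCK_DISCARDED,  /* discarded in shared + discarded in private */
    BLOCK_PRIVATE,    /* discarded in shared + populated in private */
    BLOCK_SHARED,     /* populated in shared + discarded in private */
} BlockState;

static BlockState block_state(bool shared_populated, bool private_populated)
{
    if (shared_populated) {
        /* populated in both attributes is not one of the states listed above */
        assert(!private_populated);
        return BLOCK_SHARED;
    }
    return private_populated ? BLOCK_PRIVATE : BLOCK_DISCARDED;
}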

[...]

I've had this talk with Intel, because the 4K granularity is a pain. I
was told that ship has sailed ... and we have to cope with random 4K
conversions :(

The many mappings will likely add both memory and runtime overhead in the
kernel, but we'll only know once we measure.

In the normal case, the main runtime overhead comes from the private<->shared
flips in SWIOTLB, which defaults to 6% of memory with a maximum of 1 GiB. I
think this overhead is acceptable. In non-default cases, e.g. dynamically
allocated DMA buffers, the runtime overhead will increase. As for the memory
overhead, it is indeed unavoidable.
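To put rough numbers on that default (6% of guest memory, capped at 1 GiB),
here is a small back-of-the-envelope helper; the sizing rule is taken from the
description above, the code itself is just illustrative:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: default SWIOTLB size as described above --
 * 6% of guest memory, capped at 1 GiB. */
static uint64_t swiotlb_default_bytes(uint64_t guest_mem_bytes)
{
    uint64_t six_percent = guest_mem_bytes * 6 / 100;
    uint64_t cap = 1ULL << 30; /* 1 GiB */
    return six_percent < cap ? six_percent : cap;
}

int main(void)
{
    /* A 16 GiB guest stays just under the cap (~983 MiB of SWIOTLB). */
    printf("%llu MiB\n",
           (unsigned long long)(swiotlb_default_bytes(16ULL << 30) >> 20));
    return 0;
}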

Will these performance issues be a deal breaker for enabling shared
device assignment in this way?

I see the most problematic part being the dma_entry_limit and all of these individual MAP/UNMAP calls on 4KiB granularity.

dma_entry_limit is an "unsigned int" and defaults to U16_MAX. So the possible maximum is 4294967295 (U32_MAX), and the default is 65535.

So we should be able to have a maximum of 16 TiB shared memory all in 4KiB chunks.

sizeof(struct vfio_dma) is probably something like <= 96 bytes, implying a per-page overhead of ~2.4%, excluding the actual rbtree.
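Spelling out the arithmetic behind those two estimates (the 96-byte entry size
is an assumption, as noted above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* dma_entry_limit is an unsigned int, so at most ~2^32 tracked mappings. */
    uint64_t max_entries = UINT32_MAX;
    uint64_t page_size   = 4096;  /* 4 KiB per MAP/UNMAP */
    uint64_t entry_size  = 96;    /* assumed upper bound for struct vfio_dma */

    /* ~16 TiB of shared memory mappable in 4 KiB chunks */
    printf("max mapped: ~%.0f TiB\n",
           (double)max_entries * page_size / (1ULL << 40));

    /* per-page tracking overhead, excluding the rbtree nodes themselves */
    printf("per-page overhead: %.2f%%\n", 100.0 * entry_size / page_size);
    return 0;
}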

Tree lookup/modifications with that many nodes might also get a bit slower, but likely still tolerable as you note.

Deal breaker? Not sure. Rather "suboptimal" :) ... but maybe unavoidable for your use case?

--
Cheers,

David / dhildenb



