From: David Hildenbrand
Subject: Re: [RFC PATCH 0/6] Enable shared device assignment
Date: Wed, 31 Jul 2024 13:18:35 +0200
User-agent: Mozilla Thunderbird
Sorry for the late reply!
Current users must skip it, yes. How private memory would have to be handled, and who would handle it, is rather unclear. Again, maybe we'd want separate RamDiscardManagers for private and shared memory (after all, these are two separate memory backends).

> We also considered distinguishing the populate and discard operations for
> private and shared memory separately. As in method 2 above, we mentioned
> adding a new argument to indicate the memory attribute to operate on.
> They seem to have a similar idea.
Yes. Likely it's just some implementation detail. I think the following states would be possible:
* Discarded in shared + discarded in private (not populated)
* Discarded in shared + populated in private (private populated)
* Populated in shared + discarded in private (shared populated)

One could map these to the states discarded/private/shared indeed. [...]
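To make that mapping concrete, here is a minimal sketch of such a tri-state; the enum and its names are invented for illustration and are not from the patch set:

/*
 * Illustrative only: one state per page-sized block, derived from the
 * populate/discard status of the shared and private memory backends.
 */
enum block_state {
    BLOCK_DISCARDED,  /* discarded in shared + discarded in private */
    BLOCK_PRIVATE,    /* discarded in shared + populated in private */
    BLOCK_SHARED,     /* populated in shared + discarded in private */
    /* "populated in both" should be unreachable for a confidential guest */
};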
I've had this talk with Intel, because the 4K granularity is a pain. I was told that ship has sailed ... and we have to cope with random 4K conversions :(

The many mappings will likely add both memory and runtime overheads in the kernel. But we only know once we measure.

> In the normal case, the main runtime overhead comes from the
> private<->shared flips in SWIOTLB, which defaults to 6% of memory with a
> maximum of 1 GiB. I think this overhead is acceptable. In the non-default
> case, e.g. a dynamically allocated DMA buffer, the runtime overhead will
> increase. As for the memory overhead, it is indeed unavoidable. Will
> these performance issues be a deal breaker for enabling shared device
> assignment in this way?
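For reference, the 6% / 1 GiB default quoted above matches, to my understanding, the x86 kernel's SWIOTLB sizing for encrypted guests (see arch/x86/mm/mem_encrypt.c; exact details vary by kernel version). A paraphrased sketch, not the actual kernel code:

#define SZ_1G (1UL << 30)

/* Paraphrase of the sizing logic: 6% of total memory, capped at 1 GiB. */
static unsigned long swiotlb_size_bytes(unsigned long total_mem)
{
    unsigned long size = total_mem * 6 / 100;

    return size > SZ_1G ? SZ_1G : size;
    /* e.g. a 16 GiB guest gets ~983 MiB; >= 17 GiB hits the 1 GiB cap */
}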
I see the most problematic part being the dma_entry_limit and all of these individual MAP/UNMAP calls at 4 KiB granularity.
dma_entry_limit is an "unsigned int" and defaults to U16_MAX. So the possible maximum is 4294967295 (UINT_MAX), and the default is 65535.
So we should be able to have a maximum of ~16 TiB of shared memory, all in 4 KiB chunks.
sizeof(struct vfio_dma) is probably something like <= 96 bytes, implying a per-page metadata overhead of ~2.3% (96 / 4096), excluding the actual rbtree.
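A quick sanity check of both numbers (the 96-byte struct size is the assumption from above, not verified here):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const double page = 4096.0;               /* 4 KiB per mapping */
    const double max_entries = 4294967295.0;  /* UINT_MAX */

    /* Worst case: every shared 4 KiB page is its own VFIO mapping. */
    printf("max mapped: %.2f TiB\n", max_entries * page / (1ULL << 40));

    /* Assumed sizeof(struct vfio_dma) <= 96 bytes. */
    printf("per-page overhead: %.2f%%\n", 96.0 / page * 100.0);
    return 0;
}

This prints "max mapped: 16.00 TiB" and "per-page overhead: 2.34%", matching the figures above.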
Tree lookup/modifications with that many nodes might also get a bit slower, but likely still tolerable as you note.
Deal breaker? Not sure. Rather "suboptimal" :) ... but maybe unavoidable for your use case?
--
Cheers,

David / dhildenb