[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree
From: |
Eugenio Perez Martin |
Subject: |
Re: [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree |
Date: |
Tue, 30 Jul 2024 13:00:03 +0200 |
On Tue, Jul 30, 2024 at 10:48 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Mon, Jul 29, 2024 at 6:05 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Wed, Jul 24, 2024 at 7:00 PM Jonah Palmer <jonah.palmer@oracle.com>
> > wrote:
> > >
> > >
> > >
> > > On 5/13/24 11:56 PM, Jason Wang wrote:
> > > > On Mon, May 13, 2024 at 5:58 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > >>
> > > >> On Mon, May 13, 2024 at 10:28 AM Jason Wang <jasowang@redhat.com>
> > > >> wrote:
> > > >>>
> > > >>> On Mon, May 13, 2024 at 2:28 PM Eugenio Perez Martin
> > > >>> <eperezma@redhat.com> wrote:
> > > >>>>
> > > >>>> On Sat, May 11, 2024 at 6:07 AM Jason Wang <jasowang@redhat.com>
> > > >>>> wrote:
> > > >>>>>
> > > >>>>> On Fri, May 10, 2024 at 3:16 PM Eugenio Perez Martin
> > > >>>>> <eperezma@redhat.com> wrote:
> > > >>>>>>
> > > >>>>>> On Fri, May 10, 2024 at 6:29 AM Jason Wang <jasowang@redhat.com>
> > > >>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>> On Thu, May 9, 2024 at 3:10 PM Eugenio Perez Martin
> > > >>>>>>> <eperezma@redhat.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>> On Thu, May 9, 2024 at 8:27 AM Jason Wang <jasowang@redhat.com>
> > > >>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> On Thu, May 9, 2024 at 1:16 AM Eugenio Perez Martin
> > > >>>>>>>>> <eperezma@redhat.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Wed, May 8, 2024 at 4:29 AM Jason Wang
> > > >>>>>>>>>> <jasowang@redhat.com> wrote:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> On Tue, May 7, 2024 at 6:57 PM Eugenio Perez Martin
> > > >>>>>>>>>>> <eperezma@redhat.com> wrote:
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> On Tue, May 7, 2024 at 9:29 AM Jason Wang
> > > >>>>>>>>>>>> <jasowang@redhat.com> wrote:
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> On Fri, Apr 12, 2024 at 3:56 PM Eugenio Perez Martin
> > > >>>>>>>>>>>>> <eperezma@redhat.com> wrote:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> On Fri, Apr 12, 2024 at 8:47 AM Jason Wang
> > > >>>>>>>>>>>>>> <jasowang@redhat.com> wrote:
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> On Wed, Apr 10, 2024 at 6:03 PM Eugenio Pérez
> > > >>>>>>>>>>>>>>> <eperezma@redhat.com> wrote:
> > > >>>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>>> The guest may have overlapped memory regions, where
> > > >>>>>>>>>>>>>>>> different GPA leads
> > > >>>>>>>>>>>>>>>> to the same HVA. This causes a problem when overlapped
> > > >>>>>>>>>>>>>>>> regions
> > > >>>>>>>>>>>>>>>> (different GPA but same translated HVA) exists in the
> > > >>>>>>>>>>>>>>>> tree, as looking
> > > >>>>>>>>>>>>>>>> them by HVA will return them twice.
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>> I think I don't understand if there's any side effect for
> > > >>>>>>>>>>>>>>> shadow virtqueue?
> > > >>>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> My bad, I totally forgot to put a reference to where this
> > > >>>>>>>>>>>>>> comes from.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Si-Wei found that during initialization this sequences of
> > > >>>>>>>>>>>>>> maps /
> > > >>>>>>>>>>>>>> unmaps happens [1]:
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> HVA GPA IOVA
> > > >>>>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------
> > > >>>>>>>>>>>>>> Map
> > > >>>>>>>>>>>>>> [0x7f7903e00000, 0x7f7983e00000) [0x0, 0x80000000)
> > > >>>>>>>>>>>>>> [0x1000, 0x80000000)
> > > >>>>>>>>>>>>>> [0x7f7983e00000, 0x7f9903e00000) [0x100000000,
> > > >>>>>>>>>>>>>> 0x2080000000)
> > > >>>>>>>>>>>>>> [0x80001000, 0x2000001000)
> > > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000,
> > > >>>>>>>>>>>>>> 0xfedc0000)
> > > >>>>>>>>>>>>>> [0x2000001000, 0x2000021000)
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Unmap
> > > >>>>>>>>>>>>>> [0x7f7903ea0000, 0x7f7903ec0000) [0xfeda0000,
> > > >>>>>>>>>>>>>> 0xfedc0000) [0x1000,
> > > >>>>>>>>>>>>>> 0x20000) ???
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> The third HVA range is contained in the first one, but
> > > >>>>>>>>>>>>>> exposed under a
> > > >>>>>>>>>>>>>> different GVA (aliased). This is not "flattened" by QEMU,
> > > >>>>>>>>>>>>>> as GPA does
> > > >>>>>>>>>>>>>> not overlap, only HVA.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> At the third chunk unmap, the current algorithm finds the
> > > >>>>>>>>>>>>>> first chunk,
> > > >>>>>>>>>>>>>> not the second one. This series is the way to tell the
> > > >>>>>>>>>>>>>> difference at
> > > >>>>>>>>>>>>>> unmap time.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> [1]
> > > >>>>>>>>>>>>>> https://urldefense.com/v3/__https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html__;!!ACWV5N9M2RV99hQ!MXbGSFHVbqRf0rzyWamOdnBLHP0FUh3r3BnTvGe6Mn5VzXTsajVp3BB7VqlklkRCr5aKazC5xxTCScuR_BY$
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Thanks!
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Ok, I was wondering if we need to store GPA(GIOVA) to HVA
> > > >>>>>>>>>>>>> mappings in
> > > >>>>>>>>>>>>> the iova tree to solve this issue completely. Then there
> > > >>>>>>>>>>>>> won't be
> > > >>>>>>>>>>>>> aliasing issues.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I'm ok to explore that route but this has another problem.
> > > >>>>>>>>>>>> Both SVQ
> > > >>>>>>>>>>>> vrings and CVQ buffers also need to be addressable by
> > > >>>>>>>>>>>> VhostIOVATree,
> > > >>>>>>>>>>>> and they do not have GPA.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> At this moment vhost_svq_translate_addr is able to handle
> > > >>>>>>>>>>>> this
> > > >>>>>>>>>>>> transparently as we translate vaddr to SVQ IOVA. How can we
> > > >>>>>>>>>>>> store
> > > >>>>>>>>>>>> these new entries? Maybe a (hwaddr)-1 GPA to signal it has
> > > >>>>>>>>>>>> no GPA and
> > > >>>>>>>>>>>> then a list to go through other entries (SVQ vaddr and CVQ
> > > >>>>>>>>>>>> buffers).
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> This seems to be tricky.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> As discussed, it could be another iova tree.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> Yes but there are many ways to add another IOVATree. Let me
> > > >>>>>>>>>> expand & recap.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Option 1 is to simply add another iova tree to
> > > >>>>>>>>>> VhostShadowVirtqueue.
> > > >>>>>>>>>> Let's call it gpa_iova_tree, as opposed to the current
> > > >>>>>>>>>> iova_tree that
> > > >>>>>>>>>> translates from vaddr to SVQ IOVA. To know which one to use is
> > > >>>>>>>>>> easy at
> > > >>>>>>>>>> adding or removing, like in the memory listener, but how to
> > > >>>>>>>>>> know at
> > > >>>>>>>>>> vhost_svq_translate_addr?
> > > >>>>>>>>>
> > > >>>>>>>>> Then we won't use virtqueue_pop() at all, we need a SVQ version
> > > >>>>>>>>> of
> > > >>>>>>>>> virtqueue_pop() to translate GPA to SVQ IOVA directly?
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> The problem is not virtqueue_pop, that's out of the
> > > >>>>>>>> vhost_svq_translate_addr. The problem is the need of adding
> > > >>>>>>>> conditionals / complexity in all the callers of
> > > >>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> The easiest way for me is to rely on
> > > >>>>>>>>>> memory_region_from_host(). When
> > > >>>>>>>>>> vaddr is from the guest, it returns a valid MemoryRegion. When
> > > >>>>>>>>>> it is
> > > >>>>>>>>>> not, it returns NULL. I'm not sure if this is a valid use
> > > >>>>>>>>>> case, it
> > > >>>>>>>>>> just worked in my tests so far.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Now we have the second problem: The GPA values of the regions
> > > >>>>>>>>>> of the
> > > >>>>>>>>>> two IOVA tree must be unique. We need to be able to find
> > > >>>>>>>>>> unallocated
> > > >>>>>>>>>> regions in SVQ IOVA. At this moment there is only one
> > > >>>>>>>>>> IOVATree, so
> > > >>>>>>>>>> this is done easily by vhost_iova_tree_map_alloc. But it is
> > > >>>>>>>>>> very
> > > >>>>>>>>>> complicated with two trees.
> > > >>>>>>>>>
> > > >>>>>>>>> Would it be simpler if we decouple the IOVA allocator? For
> > > >>>>>>>>> example, we
> > > >>>>>>>>> can have a dedicated gtree to track the allocated IOVA ranges.
> > > >>>>>>>>> It is
> > > >>>>>>>>> shared by both
> > > >>>>>>>>>
> > > >>>>>>>>> 1) Guest memory (GPA)
> > > >>>>>>>>> 2) SVQ virtqueue and buffers
> > > >>>>>>>>>
> > > >>>>>>>>> And another gtree to track the GPA to IOVA.
> > > >>>>>>>>>
> > > >>>>>>>>> The SVQ code could use either
> > > >>>>>>>>>
> > > >>>>>>>>> 1) one linear mappings that contains both SVQ virtqueue and
> > > >>>>>>>>> buffers
> > > >>>>>>>>>
> > > >>>>>>>>> or
> > > >>>>>>>>>
> > > >>>>>>>>> 2) dynamic IOVA allocation/deallocation helpers
> > > >>>>>>>>>
> > > >>>>>>>>> So we don't actually need the third gtree for SVQ HVA -> SVQ
> > > >>>>>>>>> IOVA?
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> That's possible, but that scatters the IOVA handling code
> > > >>>>>>>> instead of
> > > >>>>>>>> keeping it self-contained in VhostIOVATree.
> > > >>>>>>>
> > > >>>>>>> To me, the IOVA range/allocation is orthogonal to how IOVA is
> > > >>>>>>> used.
> > > >>>>>>>
> > > >>>>>>> An example is the iova allocator in the kernel.
> > > >>>>>>>
> > > >>>>>>> Note that there's an even simpler IOVA "allocator" in NVME
> > > >>>>>>> passthrough
> > > >>>>>>> code, not sure it is useful here though (haven't had a deep look
> > > >>>>>>> at
> > > >>>>>>> that).
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>> I don't know enough about them to have an opinion. I keep seeing
> > > >>>>>> the
> > > >>>>>> drawback of needing to synchronize both allocation & adding in all
> > > >>>>>> the
> > > >>>>>> places we want to modify the IOVATree. At this moment, these are
> > > >>>>>> the
> > > >>>>>> vhost-vdpa memory listener, the SVQ vring creation and removal, and
> > > >>>>>> net CVQ buffers. But it may be more in the future.
> > > >>>>>>
> > > >>>>>> What are the advantages of keeping these separated that justifies
> > > >>>>>> needing to synchronize in all these places, compared with keeping
> > > >>>>>> them
> > > >>>>>> synchronized in VhostIOVATree?
> > > >>>>>
> > > >>>>> It doesn't need to be synchronized.
> > > >>>>>
> > > >>>>> Assuming guest and SVQ shares IOVA range. IOVA only needs to track
> > > >>>>> which part of the range has been used.
> > > >>>>>
> > > >>>>
> > > >>>> Not sure if I follow, that is what I mean with "synchronized".
> > > >>>
> > > >>> Oh right.
> > > >>>
> > > >>>>
> > > >>>>> This simplifies things, we can store GPA->IOVA mappings and SVQ ->
> > > >>>>> IOVA mappings separately.
> > > >>>>>
> > > >>>>
> > > >>>> Sorry, I still cannot see the whole picture :).
> > > >>>>
> > > >>>> Let's assume we have all the GPA mapped to specific IOVA regions, so
> > > >>>> we have the first IOVA tree (GPA -> IOVA) filled. Now we enable SVQ
> > > >>>> because of the migration. How can we know where we can place SVQ
> > > >>>> vrings without having them synchronized?
> > > >>>
> > > >>> Just allocating a new IOVA range for SVQ?
> > > >>>
> > > >>>>
> > > >>>> At this moment we're using a tree. The tree nature of the current SVQ
> > > >>>> IOVA -> VA makes all nodes ordered so it is more or less easy to look
> > > >>>> for holes.
> > > >>>
> > > >>> Yes, iova allocate could still be implemented via a tree.
> > > >>>
> > > >>>>
> > > >>>> Now your proposal uses the SVQ IOVA as tree values. Should we iterate
> > > >>>> over all of them, order them, of the two trees, and then look for
> > > >>>> holes there?
> > > >>>
> > > >>> Let me clarify, correct me if I was wrong:
> > > >>>
> > > >>> 1) IOVA allocator is still implemented via a tree, we just don't need
> > > >>> to store how the IOVA is used
> > > >>> 2) A dedicated GPA -> IOVA tree, updated via listeners and is used in
> > > >>> the datapath SVQ translation
> > > >>> 3) A linear mapping or another SVQ -> IOVA tree used for SVQ
> > > >>>
> > > >>
> > > >> Ok, so the part I was missing is that now we have 3 whole trees, with
> > > >> somehow redundant information :).
> > > >
> > > > Somehow, it decouples the IOVA usage out of the IOVA allocator. This
> > > > might be simple as guests and SVQ may try to share a single IOVA
> > > > address space.
> > > >
> > >
> > > I'm working on implementing your three suggestions above but I'm a bit
> > > confused with some of the wording and I was hoping you could clarify
> > > some of it for me when you get the chance.
> > >
> > > ---
> > > For your first suggestion (1) you mention decoupling the IOVA allocator
> > > and "don't need to store how the IOVA is used."
> > >
> > > By this, do you mean to not save the IOVA->HVA mapping and instead only
> > > save the allocated IOVA ranges?
>
> Yes.
>
> > > In other words, are you suggesting to
> > > create a dedicated "IOVA->IOVA" tree like:
> > >
> > > struct VhostIOVATree {
> > > uint64_t iova_first;
> > > uint64_t iova_last;
> > > IOVATree *iova_map;
> > > };
>
> It could be this or other. I think the point is the allocator is only
> used for IOVA allocation but it doesn't have any information about the
> mapping.
>
> > >
> > > Where the mapping might look something like (where translated_addr is
> > > given some kind of 0 value):
> > >
> > > iova_region = (DMAMap) {
> > > .iova = iova_first,
> > > .translated_addr = 0,
> > > .size = region_size - 1,
> > > .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > > };
> > >
> > > Also, if this is what you mean by decoupling the IOVA allocator, what
> > > happens to the IOVA->HVA tree?
>
> We will have two structures:
>
> 1) IOVA domain (in charge of IOVA range allocation/deallocation)
> 2) How the IOVA is used, e.g an IOVA->HVA tree
>
> > Are we no longer saving these mappings in
> > > a tree?
> > >
> > > ---
> > > In your second suggestion (2) with a dedicated GPA->IOVA tree, were you
> > > thinking something like this? Just adding on to VhostIOVATree here:
> > >
> > > struct VhostIOVATree {
> > > uint64_t iova_first;
> > > uint64_t iova_last;
> > > IOVATree *iova_map;
> > > IOVATree *gpa_map;
> > > };
>
> I'd split the mappings if it's possible.
>
> > >
> > > Where the mapping might look something like:
> > >
> > > gpa_region = (DMAMap) {
> > > .iova = iova_first,
> > > .translated_addr = gpa_first,
> > > .size = region_size - 1,
> > > .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > > };
> > >
> > > Also, when you say "used in the datapath SVQ translation", we still need
> > > to build the GPA->IOVA tree when vhost_vdpa_listener_region_add() is
> > > called, right?
>
> Right.
>
> > >
> > > ---
> > > Lastly, in your third suggestion (3) you mention implementing a
> > > SVQ->IOVA tree, making the total number of IOVATrees/GTrees 3: one for
> > > just IOVAs, one for GPA->IOVA, and the last one for SVQ->IOVA. E.g.
> > >
> > > struct VhostIOVATree {
> > > uint64_t iova_first;
> > > uint64_t iova_last;
> > > IOVATree *iova_map;
> > > IOVATree *gpa_map;
> > > IOVATree *svq_map;
> > > };
> > >
> > > ---
> > >
> > > Let me know if I'm understanding this correctly. If I am, this would
> > > require a pretty good amount of refactoring on the IOVA allocation,
> > > searching, destroying, etc. code to account for these new trees.
> > >
> >
> > Ok I think I understand your previous question better, sorry for the delay
> > :).
> >
> > If I'm not wrong, Jason did not enumerate these as alternatives but as
> > part of the same solution. Jason please correct me if I'm wrong here.
> >
> > His solution is composed of three trees:
> > 1) One for the IOVA allocations, so we know where to allocate new ranges
> > 2) One of the GPA -> SVQ IOVA translations.
> > 3) Another one for SVQ vrings translations.
> >
>
> Exactly.
>
> > In my opinion to use GPA this is problematic as we force all of the
> > callers to know if the address we want to translate comes from the
> > guest or not. Current code does now know it, as
> > vhost_svq_translate_addr is called to translate both buffer dataplane
> > addresses and control virtqueue buffers, which are also shadowed. To
> > transmit that information to the caller code would require either
> > duplicate functions, to add a boolean, etc, as it is in a different
> > layer (net specific net/vhost-vdpa.c vs generic
> > hw/virtio/vhost-shadow-virtqueue.c).
> >
> > In my opinion is easier to keep just two trees (or similar structs):
> > 1) One for the IOVA allocations, so we know where to allocate new
> > ranges. We don't actually need to store the translations here.
> > 2) One for HVA -> SVQ IOVA translations.
> >
> > This way we can accommodate both SVQ vrings, CVQ buffers, and guest
> > memory buffers, all on the second tree, and take care of the HVA
> > duplications.
>
> Probably, but how to take care of HVA duplications (I may miss the
> context, sorry)?
>
What to do with the duplicated HVA is an added problem that deserves
more discussion, right. If a duplicated entry gets deleted, the other
one should take its place in both trees.
Store them in an ordered list and look for replacements on tree
deletion maybe? It would be great to locate a similar usage and see
what is done there.
> Thanks
>
> >
> > Thanks!
> >
> > > Thanks Jason!
> > >
> > > >>
> > > >> In some sense this is simpler than trying to get all the information
> > > >> from only two trees. On the bad side, all SVQ calls that allocate some
> > > >> region need to remember to add to one of the two other threes. That is
> > > >> what I mean by synchronized. But sure, we can go that way.
> > > >
> > > > Just a suggestion, if it turns out to complicate the issue, I'm fine
> > > > to go the other way.
> > > >
> > > > Thanks
> > > >
> > > >>
> > > >>> Thanks
> > > >>>
> > > >>>>
> > > >>>>> Thanks
> > > >>>>>
> > > >>>>>>
> > > >>>>>> Thanks!
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>