From: Eduardo Habkost
Subject: Re: [Qemu-devel] [PATCH V8 1/4] mem: add share parameter to memory-backend-ram
Date: Thu, 1 Feb 2018 17:21:08 -0200
User-agent: Mutt/1.9.1 (2017-09-22)

On Thu, Feb 01, 2018 at 08:58:32PM +0200, Marcel Apfelbaum wrote:
> On 01/02/2018 20:51, Eduardo Habkost wrote:
> > On Thu, Feb 01, 2018 at 08:31:09PM +0200, Marcel Apfelbaum wrote:
> >> On 01/02/2018 20:21, Eduardo Habkost wrote:
> >>> On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
> >>>> On 01/02/2018 15:53, Eduardo Habkost wrote:
> >>>>> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
> >>>>>> On 01/02/2018 14:10, Eduardo Habkost wrote:
> >>>>>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
> >>>>>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
> >>>>>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
> >>>>>>> [...]
> >>>>>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
> >>>>>>>>>
> >>>>>>>>> It's a side effect of the kernel/userspace API which always wants
> >>>>>>>>> a single HVA/len pair to map memory for the application.
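
To make that constraint concrete, the userspace side today looks roughly like
this (a minimal sketch using the standard libibverbs call, not code from this
series; the helper name and parameters are made up):

    #include <infiniband/verbs.h>

    /* ibv_reg_mr() takes exactly one contiguous (addr, len) pair, which is
     * why the guest RAM currently has to be backed by a single, contiguous
     * HVA range. */
    static struct ibv_mr *register_guest_ram(struct ibv_pd *pd,
                                             void *hva_base, size_t len)
    {
        return ibv_reg_mr(pd, hva_base, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }
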
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Eduardo and Michael,
> >>>>>>>>
> >>>>>>>>>>  Can
> >>>>>>>>>> this be fixed?
> >>>>>>>>>
> >>>>>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
> >>>>>>>>> mapping an s/g list with actual memory. The HVA/len pair would
> >>>>>>>>> then just be used to refer to the region, without creating the
> >>>>>>>>> two mappings.
> >>>>>>>>>
> >>>>>>>>> Something like splitting the register mr into
> >>>>>>>>>
> >>>>>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
> >>>>>>>>>
> >>>>>>>>> addmemory(mr, offset, hva, len) - pin memory
> >>>>>>>>>
> >>>>>>>>> register mr - pass it to HW
> >>>>>>>>>
> >>>>>>>>> As a nice side effect we won't burn so much virtual address space.
> >>>>>>>>>
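
If it helps, the proposed split could look something like this (purely a
hypothetical sketch; none of these calls exist today, the names are made up
only to illustrate the idea above):

    /* Hypothetical kernel/userspace API, not an existing one. */
    mr = create_mr(guest_va_base, guest_len);   /* allocate handle, record va/len, no pinning yet */

    /* Guest-contiguous RAM may be scattered across several HVA ranges;
     * each chunk would be pinned and added at its guest offset. */
    for (i = 0; i < nr_chunks; i++)
        add_memory(mr, chunk[i].guest_offset, chunk[i].hva, chunk[i].len);

    register_mr(mr);                            /* hand the whole s/g list to the HW */
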
> >>>>>>>>
> >>>>>>>> We would still need a contiguous virtual address space range
> >>>>>>>> (for post-send), which we don't have, since a contiguous guest
> >>>>>>>> virtual address range will always end up as a non-contiguous
> >>>>>>>> host virtual address range.
> >>>>>>>>
> >>>>>>>> I am not sure the RDMA HW can handle a large VA with holes.
> >>>>>>>
> >>>>>>> I'm confused.  Why would the hardware see and care about virtual
> >>>>>>> addresses? 
> >>>>>>
> >>>>>> The post-send operation bypasses the kernel, and the process
> >>>>>> puts GVA addresses in the work request.
> >>>>>>
> >>>>>>> How exactly does the hardware translate VAs to
> >>>>>>> PAs? 
> >>>>>>
> >>>>>> The HW maintains a page-directory-like structure, separate from the
> >>>>>> MMU's, mapping VA -> phys pages.
> >>>>>>
> >>>>>>> What if the process page tables change?
> >>>>>>>
> >>>>>>
> >>>>>> Since the page tables the HW uses are its own, we just need the phys
> >>>>>> pages to be pinned.
> >>>>>
> >>>>> So there's no hardware-imposed requirement that the hardware VAs
> >>>>> (mapped by the HW page directory) match the VAs in QEMU
> >>>>> address-space, right? 
> >>>>
> >>>> Actually there is. Today it works exactly as you described.
> >>>
> >>> Are you sure there's such hardware-imposed requirement?
> >>>
> >>
> >> Yes.
> >>
> >>> Why would the hardware require VAs to match the ones in the
> >>> userspace address-space, if it doesn't use the CPU MMU at all?
> >>>
> >>
> >> It works like this:
> >>
> >> 1. We register a buffer from the process address space,
> >>    giving its base address and length.
> >>    This call goes to the kernel, which in turn pins the phys pages
> >>    and registers them with the device *together* with the base
> >>    address (the virtual address!).
> >> 2. The device builds its own page tables to be able to translate
> >>    the virtual addresses to the actual phys pages.
> > 
> > How would the device be able to do that?  It would require the
> > device to look at the process page tables, wouldn't it?  Isn't
> > the HW IOVA->PA translation table built by the OS?
> > 
> 
> As stated above, these are tables private to the device.
> (They even have a HW-vendor-specific layout, I think,
>  since the device holds some cache.)
> 
> The device looks at its own private page tables, not
> at the OS ones.

I'm still confused by your statement that the device builds its
own [IOVA->PA] page table.  How would the device do that if it
doesn't have access to the CPU MMU state?  Isn't the IOVA->PA
translation table built by the OS?

> 
> > 
> >> 3. The process executes post-send requests directly to HW, bypassing
> >>    the kernel and giving process virtual addresses in the work requests.
> >> 4. The device uses its own page tables to translate the virtual
> >>    addresses to phys pages and sends them.
> >>
> >> Theoretically it is possible to use any contiguous IOVA range instead
> >> of the process's one, but that is not how it works today.
> >>
> >> Makes sense?
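
For reference, step 3 above on the userspace side looks roughly like this,
together with the ibv_reg_mr() sketch earlier (a minimal sketch with standard
libibverbs calls, not code from this series; the helper name and parameters
are made up):

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* The work request carries the process virtual address directly in the
     * SGE; the HCA resolves it with its own MR page tables (built at
     * registration time), not with the CPU MMU. */
    static int post_send_va(struct ibv_qp *qp, struct ibv_mr *mr,
                            void *va, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)va,   /* the registered virtual address */
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad_wr;

        /* Kernel bypass: the WR is posted straight to the HW queue. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }
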

-- 
Eduardo


