From: Marcel Apfelbaum
Subject: Re: [Qemu-devel] [PATCH V8 1/4] mem: add share parameter to memory-backend-ram
Date: Thu, 1 Feb 2018 21:28:05 +0200
User-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.5.2

On 01/02/2018 21:21, Eduardo Habkost wrote:
> On Thu, Feb 01, 2018 at 08:58:32PM +0200, Marcel Apfelbaum wrote:
>> On 01/02/2018 20:51, Eduardo Habkost wrote:
>>> On Thu, Feb 01, 2018 at 08:31:09PM +0200, Marcel Apfelbaum wrote:
>>>> On 01/02/2018 20:21, Eduardo Habkost wrote:
>>>>> On Thu, Feb 01, 2018 at 08:03:53PM +0200, Marcel Apfelbaum wrote:
>>>>>> On 01/02/2018 15:53, Eduardo Habkost wrote:
>>>>>>> On Thu, Feb 01, 2018 at 02:29:25PM +0200, Marcel Apfelbaum wrote:
>>>>>>>> On 01/02/2018 14:10, Eduardo Habkost wrote:
>>>>>>>>> On Thu, Feb 01, 2018 at 07:36:50AM +0200, Marcel Apfelbaum wrote:
>>>>>>>>>> On 01/02/2018 4:22, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Wed, Jan 31, 2018 at 09:34:22PM -0200, Eduardo Habkost wrote:
>>>>>>>>> [...]
>>>>>>>>>>>> BTW, what's the root cause for requiring HVAs in the buffer?
>>>>>>>>>>>
>>>>>>>>>>> It's a side effect of the kernel/userspace API which always wants
>>>>>>>>>>> a single HVA/len pair to map memory for the application.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Eduardo and Michael,
>>>>>>>>>>
>>>>>>>>>>>>  Can
>>>>>>>>>>>> this be fixed?
>>>>>>>>>>>
>>>>>>>>>>> I think yes.  It'd need to be a kernel patch for the RDMA subsystem
>>>>>>>>>>> mapping an s/g list with actual memory. The HVA/len pair would then
>>>>>>>>>>> just be used to refer to the region, without creating the two mappings.
>>>>>>>>>>>
>>>>>>>>>>> Something like splitting the register mr into
>>>>>>>>>>>
>>>>>>>>>>> mr = create mr (va/len) - allocate a handle and record the va/len
>>>>>>>>>>>
>>>>>>>>>>> addmemory(mr, offset, hva, len) - pin memory
>>>>>>>>>>>
>>>>>>>>>>> register mr - pass it to HW
>>>>>>>>>>>
>>>>>>>>>>> As a nice side effect we won't burn so much virtual address space.
>>>>>>>>>>>
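A concrete sketch of the split proposed above (all names and signatures are
invented for illustration; this is not an existing verbs or kernel API):

    /* Invented prototypes sketching the proposed mr split; not a real
     * ibverbs or kernel interface. */
    #include <stdint.h>
    #include <stddef.h>

    struct mr;                       /* opaque handle from create_mr() */

    /* Allocate a handle and record the (va, len) pair; nothing is pinned
     * yet, the va/len only names the region for later work requests. */
    struct mr *create_mr(uint64_t va, size_t len);

    /* Pin the pages backing [hva, hva+len) and attach them at 'offset'
     * within the region; called once per discontiguous chunk. */
    int add_memory(struct mr *mr, uint64_t offset, void *hva, size_t len);

    /* Hand the completed region to the hardware. */
    int register_mr(struct mr *mr);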
>>>>>>>>>>
>>>>>>>>>> We would still need a contiguous virtual address space range (for 
>>>>>>>>>> post-send)
>>>>>>>>>> which we don't have since guest contiguous virtual address space
>>>>>>>>>> will always end up as non-contiguous host virtual address space.
>>>>>>>>>>
>>>>>>>>>> I am not sure the RDMA HW can handle a large VA with holes.
>>>>>>>>>
>>>>>>>>> I'm confused.  Why would the hardware see and care about virtual
>>>>>>>>> addresses? 
>>>>>>>>
>>>>>>>> Post-send operations bypass the kernel, and the process
>>>>>>>> puts guest virtual addresses (GVAs) in the work requests.
>>>>>>>>
>>>>>>>>> How exactly does the hardware translate VAs to
>>>>>>>>> PAs?
>>>>>>>>
>>>>>>>> The HW maintains a page-directory-like structure, separate from the
>>>>>>>> MMU, that maps VA -> phys pages.
>>>>>>>>
>>>>>>>>> What if the process page tables change?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Since the page tables the HW uses are its own, we just need the
>>>>>>>> phys pages to be pinned.
>>>>>>>
>>>>>>> So there's no hardware-imposed requirement that the hardware VAs
>>>>>>> (mapped by the HW page directory) match the VAs in QEMU
>>>>>>> address-space, right? 
>>>>>>
>>>>>> Actually there is. Today it works exactly as you described.
>>>>>
>>>>> Are you sure there's such hardware-imposed requirement?
>>>>>
>>>>
>>>> Yes.
>>>>
>>>>> Why would the hardware require VAs to match the ones in the
>>>>> userspace address-space, if it doesn't use the CPU MMU at all?
>>>>>
>>>>
>>>> It works like this:
>>>>
>>>> 1. We register a buffer from the process address space,
>>>>    giving its base address and length.
>>>>    This call goes to the kernel, which in turn pins the phys pages
>>>>    and registers them with the device *together* with the base
>>>>    address (a virtual address!).
>>>> 2. The device builds its own page tables to be able to translate
>>>>    those virtual addresses to the actual phys pages.
>>>
>>> How would the device be able to do that?  It would require the
>>> device to look at the process page tables, wouldn't it?  Isn't
>>> the HW IOVA->PA translation table built by the OS?
>>>
>>
>> As stated above, these tables are private to the device.
>> (They even have a hw vendor-specific layout, I think,
>>  since the device holds some cache.)
>>
>> The device looks at its own private page tables, not
>> at the OS ones.
> 
> I'm still confused by your statement that the device builds its
> own [IOVA->PA] page table.  How would the device do that if it
> doesn't have access to the CPU MMU state?  Isn't the IOVA->PA
> translation table built by the OS?
> 

Sorry about the confusion. The device gets a base virtual address,
the memory region length, and a list of phys pages.
This is enough information for it to build its own kind of tables,
which tell it, for example, that if the IOVA region starts at address
0x1000, then address 0x1001 falls in page 0 and address 0x2000 falls in page 1.
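
A minimal sketch of the kind of lookup such a device-private table enables
(the struct and helper are hypothetical, for illustration only, not pvrdma
or driver code; 4K pages assumed):

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    struct mr_table {
        uint64_t  base_va;  /* VA recorded at registration, e.g. 0x1000 */
        size_t    npages;   /* number of pinned pages */
        uint64_t *pages;    /* physical address of each pinned page */
    };

    /* Translate a device-visible VA to a physical address using only
     * the private table; the CPU MMU is never consulted. */
    static int mr_translate(const struct mr_table *t, uint64_t va,
                            uint64_t *pa)
    {
        uint64_t off = va - t->base_va;

        if (va < t->base_va || (off >> PAGE_SHIFT) >= t->npages)
            return -1;                      /* VA outside the region */
        *pa = t->pages[off >> PAGE_SHIFT] + (off & (PAGE_SIZE - 1));
        return 0;
    }

With base_va = 0x1000, va 0x1001 hits pages[0] and va 0x2000 hits pages[1],
matching the example above.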

Be aware that this base virtual address can come from any address space, not
only the process's; using the process address space is just the current
software implementation.
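
For reference, the userspace side of step 1 in the flow quoted above is the
standard libibverbs registration call, which takes exactly the (address,
length) pair being discussed. A minimal sketch (the helper name is invented
and error handling is trimmed):

    #include <infiniband/verbs.h>
    #include <stdlib.h>

    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);  /* buffer in the process address space */
        if (!buf)
            return NULL;
        /* The kernel pins the backing phys pages and hands the device the
         * base VA plus the page list; the device then builds its own
         * private translation tables from that information. */
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
    }

The returned mr->lkey/rkey are what later work requests carry, together with
virtual addresses inside [buf, buf + len).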

Thanks,
Marcel

>>
>>>
>>>> 3. The process issues post-send requests directly to the hw, bypassing
>>>>    the kernel and giving process virtual addresses in the work requests.
>>>> 4. The device uses its own page tables to translate those virtual
>>>>    addresses to phys pages and sends the data.
>>>>
>>>> Theoretically it is possible to use any contiguous IOVA instead of the
>>>> process's, but that is not how it works today.
>>>>
>>>> Makes sense?
>>>
>>
>