
Re: [Qemu-devel] [PATCHv2] block: add native support for NFS


From: ronnie sahlberg
Subject: Re: [Qemu-devel] [PATCHv2] block: add native support for NFS
Date: Wed, 18 Dec 2013 09:50:35 -0800

On Wed, Dec 18, 2013 at 9:42 AM, Peter Lieven <address@hidden> wrote:
>
> On 18.12.2013 at 18:33, ronnie sahlberg <address@hidden> wrote:
>
>> On Wed, Dec 18, 2013 at 8:59 AM, Peter Lieven <address@hidden> wrote:
>>>
>>> On 18.12.2013 at 15:42, ronnie sahlberg <address@hidden> wrote:
>>>
>>>> On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman <address@hidden> wrote:
>>>>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On 17.12.2013 at 18:32, "Daniel P. Berrange" <address@hidden> wrote:
>>>>>>>
>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>> This patch adds native support for accessing images on NFS shares
>>>>>>>> without
>>>>>>>> the requirement to actually mount the entire NFS share on the host.
>>>>>>>>
>>>>>>>> NFS images can simply be specified by a URL of the form:
>>>>>>>> nfs://<host>/<export>/<filename>
>>>>>>>>
>>>>>>>> For example:
>>>>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>>>>
>>>>>>>
>>>>>>> Does it support other config tunables, e.g. specifying which
>>>>>>> NFS version to use (2/3/4)? If so, will they be available as
>>>>>>> URI parameters in the obvious manner?
>>>>>>
>>>>>>
>>>>>> Currently only v3 is supported by libnfs. What other tunables would you
>>>>>> like to see?
>>>>>>
>>>>>
>>>>> For live migration we need the sync option (async sadly ignores O_SYNC and
>>>>> O_DIRECT).
>>>>> Will it be supported? Or will it be the default?
>>>>>
>>>>
>>>> If you use the high-level API that provides POSIX-like functions, such
>>>> as nfs_open(), then libnfs does.
>>>> nfs_open()/nfs_open_async() take a mode parameter, and libnfs checks
>>>> for the O_SYNC flag in that mode.
>>>>
>>>> By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
>>>> NFS/WRITE3+UNSTABLE, which allows the server to just write to
>>>> cache/memory.
>>>>
>>>> If you specify O_SYNC in the mode argument to nfs_open()/nfs_open_async(),
>>>> then libnfs will flag this handle as sync, and any calls to
>>>> nfs_write()/nfs_pwrite() will translate to NFS/WRITE3+FILE_SYNC.
>>>>
>>>> Calls to nfs_fsync() are translated to NFS/COMMIT3.
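
To make that mapping concrete, here is a minimal, untested sketch using the
libnfs high-level sync API. Error handling is omitted, and the nfs_pwrite()
argument order (offset, count, buf) follows the libnfs headers current at the
time of writing; later releases may reorder it, so check nfsc/libnfs.h for the
version you build against.

#include <fcntl.h>
#include <string.h>
#include <nfsc/libnfs.h>

int example(void)
{
    struct nfs_context *nfs = nfs_init_context();
    struct nfsfh *fh;
    char buf[4096];

    memset(buf, 0, sizeof(buf));
    nfs_mount(nfs, "10.0.0.1", "/qemu-images");

    /* Without O_SYNC: writes go out as WRITE3 with stable_how = UNSTABLE,
     * and nfs_fsync() becomes a COMMIT3. */
    nfs_open(nfs, "/test.qcow2", O_RDWR, &fh);
    nfs_pwrite(nfs, fh, 0, sizeof(buf), buf);   /* WRITE3 + UNSTABLE  */
    nfs_fsync(nfs, fh);                         /* COMMIT3            */
    nfs_close(nfs, fh);

    /* With O_SYNC: every write goes out as WRITE3 with stable_how =
     * FILE_SYNC, so no COMMIT3 is needed afterwards. */
    nfs_open(nfs, "/test.qcow2", O_RDWR | O_SYNC, &fh);
    nfs_pwrite(nfs, fh, 0, sizeof(buf), buf);   /* WRITE3 + FILE_SYNC */
    nfs_close(nfs, fh);

    nfs_destroy_context(nfs);
    return 0;
}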
>>>
>>> If this NFS/COMMIT3 issues a sync on the server, that would be all we
>>> actually need.
>>
>> You have that guarantee in NFS/COMMIT3.
>> NFS/COMMIT3 will not return until the server has flushed the specified
>> range to disk.
>>
>> However, while the NFS protocol allows you to specify a range for the
>> COMMIT3 call, so that you can do things like
>>   WRITE3  Offset:foo Length:bar
>>   COMMIT3 Offset:foo Length:bar
>> many/most NFS servers will ignore the offset/length arguments to the
>> COMMIT3 call and always unconditionally perform an fsync() for the whole
>> file.
>>
>> This can make the COMMIT3 call very expensive for large files.
>>
>>
>> NFSv3 also supports a FILE_SYNC write mode, which libnfs triggers if you
>> specify O_SYNC to nfs_open*().
>> In this mode every single NFS/WRITE3 is sent with the FILE_SYNC mode,
>> which means that the server guarantees to write the data to stable
>> storage before responding back to the client.
>> In this mode there is no real need to do anything at all, or even call
>> COMMIT3, since there is never any writeback data on the server that
>> needs to be destaged.
>>
>>
>> Since many servers treat COMMIT3 as "unconditionally walk all blocks
>> for the whole file and make sure they are destaged", it is not clear
>> how
>>
>>   WRITE3-normal Offset:foo Length:bar
>>   COMMIT3       Offset:foo Length:bar
>>
>> will compare to
>>
>>   WRITE3+O_SYNC Offset:foo Length:bar
>>
>> I would not be surprised if the second mode had (potentially
>> significantly) higher performance than the first.
>
> The qemu block layer currently is designed to send a bdrv_flush after every
> single write if the write cache is not enabled. This means that the unwritten
> data is just the data of the single write operation.

I understand that, there is only a single WRITE3 worth of data to
actually destage each time.

But what I meant is that for a lot of servers, and for large files, the
server might need to spend a non-trivial amount of time crunching file
metadata and checking every single page of the file in order to discover
"I only need to destage pages x,y,z".

On many NFS servers this "figure out which blocks to flush" step can take a
lot of time and affect performance greatly.
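
In local-filesystem terms, a server that honours the COMMIT3 range only needs
the equivalent of a ranged writeback, while one that ignores it ends up doing
a whole-file fsync(). A rough, Linux-only sketch of that difference follows;
the fd/offset/length are hypothetical, and sync_file_range() is only an
analogy, since it does not give COMMIT3's full durability guarantees.

#define _GNU_SOURCE          /* sync_file_range() is Linux-specific */
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Roughly what a range-aware server has to destage for
 * COMMIT3 Offset:off Length:len.  (Analogy only: sync_file_range()
 * does not flush metadata or the disk write cache.) */
static int commit_range(int fd, off64_t off, off64_t len)
{
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/* ...versus a server that ignores the range and walks every dirty
 * page of the (possibly huge) file. */
static int commit_whole_file(int fd)
{
    return fsync(fd);
}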



> However, changing this to issue a sync write call would require changing the
> whole API. The major problem is that the write cache setting can be changed
> while the device is open; otherwise we could just ignore all calls to
> bdrv_flush if the device was opened with the write cache disabled.
>
> In the very popular case of using virtio as the driver, the device is always
> opened with the write cache disabled, and the write cache is only enabled
> after the host has negotiated with the guest that the guest is able to send
> flushes.
>
> We can keep in mind for a later version of the driver to manually craft a
> write call with O_SYNC if the write cache is disabled and ignore bdrv_flush,
> and to use an async write plus a commit via bdrv_flush in the case of an
> enabled write cache.
>
> Peter
>
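
That later version could look roughly like the sketch below. This is purely
hypothetical and not part of the patch: the helper names and the bare
write_cache_enabled flag are illustrative stand-ins, and libnfs's synchronous
calls stand in for the real async code paths.

#include <fcntl.h>
#include <stdbool.h>
#include <nfsc/libnfs.h>

/* Hypothetical helpers; the names and the plain 'write_cache_enabled'
 * flag are illustrative only, not the actual qemu block-layer API. */

static int nfs_example_open(struct nfs_context *nfs, const char *path,
                            bool write_cache_enabled, struct nfsfh **fh)
{
    int flags = O_RDWR;

    /* Write cache off: ask libnfs for FILE_SYNC writes up front. */
    if (!write_cache_enabled) {
        flags |= O_SYNC;
    }
    return nfs_open(nfs, path, flags, fh);
}

static int nfs_example_flush(struct nfs_context *nfs, struct nfsfh *fh,
                             bool write_cache_enabled)
{
    if (!write_cache_enabled) {
        /* Every write already went out as WRITE3+FILE_SYNC, so there is
         * nothing left to destage and bdrv_flush can be a no-op. */
        return 0;
    }
    /* Write cache on: writes were WRITE3+UNSTABLE, so flush maps to COMMIT3. */
    return nfs_fsync(nfs, fh);
}

The catch mentioned above remains: the write cache setting can change while
the handle is open, so a decision about O_SYNC taken at open time can become
stale.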


