
Re: [Qemu-devel] [PATCHv2] block: add native support for NFS


From: Peter Lieven
Subject: Re: [Qemu-devel] [PATCHv2] block: add native support for NFS
Date: Wed, 18 Dec 2013 18:55:22 +0100

On 18.12.2013 at 18:50, ronnie sahlberg <address@hidden> wrote:

> On Wed, Dec 18, 2013 at 9:42 AM, Peter Lieven <address@hidden> wrote:
>> 
>> On 18.12.2013 at 18:33, ronnie sahlberg <address@hidden> wrote:
>> 
>>> On Wed, Dec 18, 2013 at 8:59 AM, Peter Lieven <address@hidden> wrote:
>>>> 
>>>> On 18.12.2013 at 15:42, ronnie sahlberg <address@hidden> wrote:
>>>> 
>>>>> On Wed, Dec 18, 2013 at 2:00 AM, Orit Wasserman <address@hidden> wrote:
>>>>>> On 12/18/2013 01:03 AM, Peter Lieven wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 17.12.2013 at 18:32, "Daniel P. Berrange" <address@hidden> wrote:
>>>>>>>> 
>>>>>>>>> On Tue, Dec 17, 2013 at 10:15:25AM +0100, Peter Lieven wrote:
>>>>>>>>> This patch adds native support for accessing images on NFS shares
>>>>>>>>> without the requirement to actually mount the entire NFS share on the host.
>>>>>>>>> 
>>>>>>>>> NFS images can simply be specified by a URL of the form:
>>>>>>>>> nfs://<host>/<export>/<filename>
>>>>>>>>> 
>>>>>>>>> For example:
>>>>>>>>> qemu-img create -f qcow2 nfs://10.0.0.1/qemu-images/test.qcow2
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Does it support other config tunables, e.g. specifying which
>>>>>>>> NFS version to use (2/3/4)? If so, will they be available as
>>>>>>>> URI parameters in the obvious manner?
>>>>>>> 
>>>>>>> 
>>>>>>> Currently only v3 is supported by libnfs. What other tunables would you
>>>>>>> like to see?
>>>>>>> 
>>>>>> 
>>>>>> For live migration we need the sync option (async sadly ignores O_SYNC
>>>>>> and O_DIRECT).
>>>>>> Will it be supported, or will it be the default?
>>>>>> 
>>>>> 
>>>>> If you use the high-level API that provides POSIX-like functions, such
>>>>> as nfs_open(), then libnfs does support it.
>>>>> nfs_open()/nfs_open_async() take a mode parameter, and libnfs checks
>>>>> for the O_SYNC flag in that mode.
>>>>> 
>>>>> By default libnfs will translate any nfs_write*() or nfs_pwrite*() to
>>>>> NFS/WRITE3+UNSTABLE, which allows the server to just write to
>>>>> cache/memory.
>>>>> 
>>>>> If you specify O_SYNC in the mode argument to nfs_open/nfs_open_async,
>>>>> then libnfs will flag this handle as sync and any calls to
>>>>> nfs_write/nfs_pwrite will translate to NFS/WRITE3+FILE_SYNC.
>>>>> 
>>>>> Calls to nfs_fsync are translated to NFS/COMMIT3.
>>>> 
>>>> If this NFS/COMMIT3 issues a sync on the server, that would be all we
>>>> actually need.
>>> 
>>> You have that guarantee in NFS/COMMIT3:
>>> NFS/COMMIT3 will not return until the server has flushed the specified
>>> range to disk.
>>> 
>>> However, while the NFS protocol allows you to specify a range for the
>>> COMMIT3 call so that you can do things like
>>> WRITE3 Offset:foo Length:bar
>>> COMMIT3 Offset:foo Length:bar
>>> many/most NFS servers will ignore the offset/length arguments to the
>>> COMMIT3 call and always unconditionally make an fsync() for the whole
>>> file.
>>> 
>>> This can make the COMMIT3 call very expensive for large files.
>>> 
>>> 
>>> NFSv3 also supports FILE_SYNC write mode, which libnfs triggers if you
>>> specify O_SYNC to nfs_open*().
>>> In this mode every single NFS/WRITE3 is sent with the FILE_SYNC mode,
>>> which means that the server guarantees to write the data to stable
>>> storage before responding back to the client.
>>> In this mode there is no real need to do anything at all or even call
>>> COMMIT3, since there is never any writeback data on the server that
>>> needs to be destaged.
>>> 
>>> 
>>> Since many servers treat COMMIT3 as "unconditionally walk all blocks
>>> for the whole file and make sure they are destaged", it is not clear
>>> how
>>> 
>>> WRITE3-normal Offset:foo Length:bar
>>> COMMIT3 Offset:foo Length:bar
>>> 
>>> will compare to
>>> 
>>> WRITE3+O_SYNC Offset:foo Length:bar
>>> 
>>> I would not be surprised if the second mode had (potentially
>>> significantly) higher performance than the former.
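
For reference, here is a minimal sketch of how the libnfs sync API
described above maps onto these NFS3 operations. The server, export and
file names are just examples, and the nfs_pwrite() argument order has
changed between libnfs versions, so treat this as an illustration rather
than exact code:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <nfsc/libnfs.h>

int main(void)
{
    struct nfs_context *nfs = nfs_init_context();
    struct nfsfh *fh;
    char buf[4096];

    memset(buf, 0, sizeof(buf));
    if (nfs == NULL || nfs_mount(nfs, "10.0.0.1", "/qemu-images") != 0) {
        fprintf(stderr, "mount failed\n");
        return 1;
    }

    /* Default mode: writes go out as WRITE3+UNSTABLE, COMMIT3 on fsync. */
    if (nfs_open(nfs, "/test.qcow2", O_RDWR, &fh) == 0) {
        nfs_pwrite(nfs, fh, 0, sizeof(buf), buf); /* unstable write    */
        nfs_fsync(nfs, fh);                       /* COMMIT3: destage  */
        nfs_close(nfs, fh);
    }

    /* O_SYNC mode: every write goes out as WRITE3+FILE_SYNC, so there
     * is nothing left on the server to destage and no COMMIT3 needed. */
    if (nfs_open(nfs, "/test.qcow2", O_RDWR | O_SYNC, &fh) == 0) {
        nfs_pwrite(nfs, fh, 0, sizeof(buf), buf); /* stable write */
        nfs_close(nfs, fh);
    }

    nfs_destroy_context(nfs);
    return 0;
}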
>> 
>> The QEMU block layer is currently designed to send a bdrv_flush after every
>> single write if the write cache is not enabled. This means that the unwritten
>> data is just the data of the single write operation.
> 
> I understand that; there is only a single WRITE3 worth of data to
> actually destage each time.
> 
> But what I meant is that for a lot of servers, for large files, the
> server might need to spend a non-trivial amount of time
> crunching file metadata and checking every single page of the file in
> order to discover "I only need to destage pages x,y,z".
> 
> On many NFS servers this "figure out which blocks to flush" can take a
> lot of time and affect performance greatly.

But this is only the case when the write cache is disabled, and I would not
expect great write performance in that configuration anyway.

As an improvement for the future, we could optimize writes with a disabled
write cache by sending a write + FILE_SYNC call for every write and ignoring
the bdrv_flush, if that turns out to be faster.
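
Roughly, what I have in mind for the NFS driver (just a sketch with
made-up names like NFSClient and nfs_file_flush, not code from this
patch):

#include <stdbool.h>
#include <nfsc/libnfs.h>

typedef struct NFSClient {
    struct nfs_context *context;
    struct nfsfh *fh;
    bool writes_are_file_sync; /* handle was opened with O_SYNC */
} NFSClient;

static int nfs_file_flush(NFSClient *client)
{
    if (client->writes_are_file_sync) {
        /* Every write already went out as WRITE3+FILE_SYNC, so there is
         * no writeback data to destage and we can skip the (potentially
         * whole-file) COMMIT3 on the server. */
        return 0;
    }
    /* Otherwise issue COMMIT3 via nfs_fsync() as before. */
    return nfs_fsync(client->context, client->fh);
}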

Peter



