Re: [Qemu-block] [PATCH v12 14/16] file-posix: Implement image locking

qemu-block
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Qemu-block] [PATCH v12 14/16] file-posix: Implement image locking

From:	Max Reitz
Subject:	Re: [Qemu-block] [PATCH v12 14/16] file-posix: Implement image locking
Date:	Wed, 8 Feb 2017 04:05:52 +0100
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.7.0
On 23.01.2017 13:30, Fam Zheng wrote:
> This implements open flag sensible image locking for local file
> and host device protocol.
> 
> virtlockd in libvirt locks the first byte, so we start looking at the
> file bytes from 1.
> 
> Quoting what was proposed by Kevin Wolf <address@hidden>, there are
> four locking modes by combining two bits (BDRV_O_RDWR and
> BDRV_O_SHARE_RW), and implemented by taking two locks.
> 
> The complication is in the transactional reopen.  To make the reopen
> logic managable, and allow better reuse, the code is internally
> organized with a table from old mode to the new one.
> 
> Signed-off-by: Fam Zheng <address@hidden>
> ---
>  block/file-posix.c | 681 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 678 insertions(+), 3 deletions(-)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index 28b47d9..a8c76d6 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -131,8 +131,45 @@ do { \
>  
>  #define MAX_BLOCKSIZE        4096
>  
> +/* Posix file locking bytes. Libvirt takes byte 0, we start from byte 0x10,
> + * leaving a few more bytes for its future use. */
> +#define RAW_LOCK_BYTE_MIN             0x10
> +#define RAW_LOCK_BYTE_NO_OTHER_WRITER 0x10
> +#define RAW_LOCK_BYTE_WRITE           0x11
> +#ifdef F_OFD_SETLK
> +#define RAW_LOCK_SUPPORTED 1
> +#else
> +#define RAW_LOCK_SUPPORTED 0
> +#endif
> +
> +/*
> + ** reader that can tolerate writers: Don't do anything
> + *
> + ** reader that can't tolerate writers: Take shared lock on byte 1. Test
> + *  byte 2 is unlocked.

Byte 0x10 and 0x11 now -- or you call them byte 0 and byte 1. Or "the
first byte" and "the second byte".

Also, it should probably be "Test whether byte 2 is unlocked" or "Affirm
that byte 2 is unlocked" (this is what my sense of the English language
is telling me, may be wrong).

> + *
> + ** shared writer: Take shared lock on byte 2. Test byte 1 is unlocked.
> + *
> + ** exclusive writer: Take exclusive locks on both bytes.
> + */
> +
> +typedef enum {
> +    /* Read only and accept other writers. */
> +    RAW_L_READ_SHARE_RW,
> +    /* Read only and try to forbid other writers. */
> +    RAW_L_READ,
> +    /* Read write and accept other writers. */
> +    RAW_L_WRITE_SHARE_RW,
> +    /* Read write and try to forbid other writers. */

While fully comprehensible and I didn't nag about this in the last
revision, it isn't real English so let me complain now: May be better as
"Read+write", "Read & write" or "Read/write".

("Read and write" is kind of bad because of the immediate "and" afterwards.)

> +    RAW_L_WRITE,
> +} BDRVRawLockMode;
> +
>  typedef struct BDRVRawState {
>      int fd;
> +    /* A dup of @fd to make manipulating lock easier, especially during 
> reopen,
> +     * where this will accept BDRVRawReopenState.lock_fd. */
> +    int lock_fd;
> +    bool disable_lock;
>      int type;
>      int open_flags;
>      size_t buf_align;

[...]

> @@ -393,10 +487,88 @@ static QemuOptsList raw_runtime_opts = {

[...]

> +static int raw_apply_image_lock(BlockDriverState *bs, int bdrv_flags,
> +                                Error **errp)
> +{
> +    int ret;
> +    BDRVRawState *s = bs->opaque;
> +    BDRVRawLockMode lock_mode;
> +
> +    if (!raw_lock_enabled(bs)) {
> +        return 0;
> +    }
> +    assert(s->cur_lock_mode == RAW_L_READ_SHARE_RW);
> +    lock_mode = raw_get_lock_mode(bdrv_flags);
> +    ret = raw_open_lockfd(bs->exact_filename, s->open_flags, &lock_mode,
> +                          errp);
> +    if (ret < 0) {
> +        return ret;
> +    }
> +    s->lock_fd = ret;
> +    if (lock_mode == RAW_L_READ_SHARE_RW) {
> +        return 0;
> +    }

Not really sure why this needs to be special-cased. It doesn't hurt, but
it doesn't really improve anything either, does it?

> +    ret = raw_lock_fd(s->lock_fd, lock_mode, errp);
> +    if (ret) {
> +        return ret;
> +    }
> +    s->cur_lock_mode = lock_mode;
> +    return 0;
> +}
> +
>  static int raw_open_common(BlockDriverState *bs, QDict *options,
>                             int bdrv_flags, int open_flags, Error **errp)
>  {

[...]

> @@ -538,6 +720,465 @@ static int raw_open(BlockDriverState *bs, QDict 
> *options, int flags,
>      return raw_open_common(bs, options, flags, 0, errp);
>  }
>  
> +typedef enum {
> +    RAW_LT_PREPARE,
> +    RAW_LT_COMMIT,
> +    RAW_LT_ABORT
> +} RawLockTransOp;
> +
> +typedef int (*RawReopenFunc)(RawLockTransOp op,
> +                             int old_lock_fd, int new_lock_fd,
> +                             BDRVRawLockMode old_lock,
> +                             BDRVRawLockMode new_lock,
> +                             Error **errp);
> +
> +static int raw_lt_nop(RawLockTransOp op,
> +                      int old_lock_fd, int new_lock_fd,
> +                      BDRVRawLockMode old_lock,
> +                      BDRVRawLockMode new_lock,
> +                      Error **errp)
> +{
> +    assert(old_lock == new_lock || new_lock == RAW_L_READ_SHARE_RW);
> +    return 0;
> +}
> +
> +static int raw_lt_from_unlock(RawLockTransOp op,
> +                              int old_lock_fd, int new_lock_fd,
> +                              BDRVRawLockMode old_lock,
> +                              BDRVRawLockMode new_lock,
> +                              Error **errp)
> +{
> +    assert(old_lock != new_lock);
> +    assert(old_lock == RAW_L_READ_SHARE_RW);
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        return raw_lock_fd(new_lock_fd, new_lock, errp);
> +    case RAW_LT_COMMIT:
> +        break;

Ah, that's what the break was meant for.

(It was one line above in v10.)

> +    case RAW_LT_ABORT:
> +        break;
> +    }
> +
> +    return 0;
> +}
> +
> +static int raw_lt_read_to_write_share(RawLockTransOp op,
> +                                      int old_lock_fd, int new_lock_fd,
> +                                      BDRVRawLockMode old_lock,
> +                                      BDRVRawLockMode new_lock,
> +                                      Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_READ);
> +    assert(new_lock == RAW_L_WRITE_SHARE_RW);
> +
> +    /*
> +     *        lock byte "no other writer"      lock byte "write"
> +     * old                S                           0
> +     * new                0                           S
> +     *
> +     * (0 = unlocked; S = shared; X = exclusive.)
> +     */

Thanks, these comments are nice.

> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to lock new fd (write 
> byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, 
> false);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to lock new fd (share 
> byte)");
> +            break;
> +        }
> +        ret = qemu_unlock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to unlock old fd (share 
> byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, 
> true);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to upgrade new fd (share 
> byte)");
> +            break;
> +        }
> +        ret = qemu_unlock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1);
> +        if (ret) {
> +            /* This is very unlikely, but catch it anyway. */
> +            error_setg_errno(errp, -ret, "Failed to unlock new fd (share 
> byte)");

Let's say we fail here, however unlikely it is...

> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to downgrade new fd (write 
> byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        break;
> +    case RAW_LT_ABORT:
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, 
> false);
> +        if (ret) {
> +            error_report("Failed to restore lock on old fd (share byte)");
> +        }

...will this not fail then?

(exclusive lock is still present on new_lock_fd.)

> +        break;
> +    }
> +    return ret;

By the way, couldn't this function use the same logic as
raw_lt_write_share_to_read()? (i.e. lock old_lock_fd's
RAW_LOCK_BYTE_NO_OTHER_WRITER exclusively and then lock new_lock_fd's
RAW_LOCK_BYTE_WRITE in shared mode)

> +}

[...]

> +static int raw_lt_write_share_to_read(RawLockTransOp op,
> +                                      int old_lock_fd, int new_lock_fd,
> +                                      BDRVRawLockMode old_lock,
> +                                      BDRVRawLockMode new_lock,
> +                                      Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE_SHARE_RW);
> +    assert(new_lock == RAW_L_READ);
> +    /*
> +     *        lock byte "no other writer"      lock byte "write"
> +     * old                0                           S
> +     * new                S                           0
> +     *
> +     * (0 = unlocked; S = shared; X = exclusive.)
> +     */
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        /* Make sure there are no other writers. */
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, true);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to lock old fd (write 
> byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, 
> false);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to lock new fd (share 
> byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        break;
> +    case RAW_LT_ABORT:
> +        break;

Shouldn't the abort path downgrade the exclusive lock on old_lock_fd to
a shared lock?

> +    }
> +    return ret;
> +}

[...]

> +static int raw_lt_write_to_read(RawLockTransOp op,
> +                                int old_lock_fd, int new_lock_fd,
> +                                BDRVRawLockMode old_lock,
> +                                BDRVRawLockMode new_lock,
> +                                Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE);
> +    assert(new_lock == RAW_L_READ);
> +    /*
> +     *        lock byte "no other writer"      lock byte "write"
> +     * old                X                           X
> +     * new                S                           0
> +     *
> +     * (0 = unlocked; S = shared; X = exclusive.)
> +     */
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, 
> false);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to downgrade old fd (share 
> byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, 
> false);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to lock new fd (share 
> byte)");
> +            break;
> +        }
> +        break;
> +    case RAW_LT_COMMIT:
> +        break;
> +    case RAW_LT_ABORT:
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_NO_OTHER_WRITER, 1, 
> true);
> +        if (ret) {
> +            error_setg_errno(errp, -ret, "Failed to restore old fd (share 
> byte)");
> +        }

I think you should release the lock on new_lock_fd first.

> +        break;
> +    }
> +    return ret;
> +}
> +
> +static int raw_lt_write_to_write_share(RawLockTransOp op,
> +                                       int old_lock_fd, int new_lock_fd,
> +                                       BDRVRawLockMode old_lock,
> +                                       BDRVRawLockMode new_lock,
> +                                       Error **errp)
> +{
> +    int ret = 0;
> +
> +    assert(old_lock == RAW_L_WRITE);
> +    assert(new_lock == RAW_L_WRITE_SHARE_RW);
> +    /*
> +     *        lock byte "no other writer"      lock byte "write"
> +     * old                X                           X
> +     * new                0                           S
> +     *
> +     * (0 = unlocked; S = shared; X = exclusive.)
> +     */
> +    switch (op) {
> +    case RAW_LT_PREPARE:
> +        break;
> +    case RAW_LT_COMMIT:
> +        ret = qemu_lock_fd(old_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_report("Failed to downgrade old fd (share byte)");
> +            break;
> +        }
> +        ret = qemu_lock_fd(new_lock_fd, RAW_LOCK_BYTE_WRITE, 1, false);
> +        if (ret) {
> +            error_report("Failed to unlock new fd (share byte)");
> +            break;
> +        }

The second one is not an "unlock", but a new shared lock. Which brings
me to the point that both of these commands can fail and thus should be
in the prepare path.

(This function should be a mirror of raw_lt_write_to_read, if I'm not
mistaken.)

> +        break;
> +    case RAW_LT_ABORT:
> +        break;
> +    }
> +    return ret;
> +}
> +
> +/**
> + * Transactionally moving between possible locking states is tricky and must 
> be
> + * done carefully. That is mostly because downgrading an exclusive lock to
> + * shared or unlocked is not guaranteed to be revertible. As a result, in 
> such

Interesting. Wiktionary says "revertible" means "able to be reverted",
which sounds reasonable, albeit I'm not sure I have ever heard
"revertible" before.

However, my favorite online dictionary gave me a German word I have
never heard before.

Note that Wiktionary also has the word "revertable" with the same
definition. Of course, it also has "reversible". Now I understand there
is a difference between "to revert" and "to reverse", but maybe
"reversible" is still the better choice considering it has a unique
meaning and scores more than thousand times as many results on Google.

(For anyone wondering, the German word is "heimfällig" and it means
"designated to go back to the original owner" (e.g. after death). It's
apparently related to "anheimfallen", which I do know, which means "to
become someone's property" or "to become a victim of something"
("something" being a process of some sorts, usually, such as a mishap).)

((Apparently "heimfällig" is used in Austria and Swiss, mostly.))

Max

> + * cases we have to defer the downgrading to "commit", given that no revert 
> will
> + * happen after that point, and that downgrading a lock should never fail.
> + *
> + * On the other hand, upgrading a lock (e.g. from unlocked or shared to
> + * exclusive lock) must happen in "prepare" because it may fail.
> + *
> + * Manage the operation matrix with this state transition table to make
> + * fulfilling above conditions easier.
> + */

[...]
signature.asc
Description: OpenPGP digital signature
[Prev in Thread]
Current Thread
[Next in Thread]
Re: [Qemu-block] [PATCH v12 14/16] file-posix: Implement image locking, Max Reitz <=
- Re: [Qemu-block] [PATCH v12 14/16] file-posix: Implement image locking, Fam Zheng, 2017/02/08
  - Re: [Qemu-block] [PATCH v12 14/16] file-posix: Implement image locking, Max Reitz, 2017/02/08
    - Re: [Qemu-block] [PATCH v12 14/16] file-posix: Implement image locking, Fam Zheng, 2017/02/08
Prev by Date: Re: [Qemu-block] [PATCH v12 13/16] tests: Disable image lock in test-replication
Next by Date: Re: [Qemu-block] [PATCH v12 14/16] file-posix: Implement image locking
Previous by thread: Re: [Qemu-block] [PATCH v12 13/16] tests: Disable image lock in test-replication
Next by thread: Re: [Qemu-block] [PATCH v12 14/16] file-posix: Implement image locking
Index(es):
- Date
- Thread