From: Jeff Cody
Subject: Re: [Qemu-devel] [PATCH v7 RFC] block/vxhs: Initial commit to add Veritas HyperScale VxHS block device support
Date: Wed, 28 Sep 2016 21:46:28 -0400
User-agent: Mutt/1.5.24 (2015-08-30)

On Tue, Sep 27, 2016 at 09:09:49PM -0700, Ashish Mittal wrote:
> This patch adds support for a new block device type called "vxhs".
> Source code for the library that this code loads can be downloaded from:
> https://github.com/MittalAshish/libqnio.git
> 
> Sample command line using JSON syntax:
> ./qemu-system-x86_64 -name instance-00000008 -S -vnc 0.0.0.0:0 -k en-us -vga 
> cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg 
> timestamp=on 
> 'json:{"driver":"vxhs","vdisk_id":"{c3e9095a-a5ee-4dce-afeb-2a59fb387410}","server":[{"host":"172.172.17.4","port":"9999"},{"host":"172.172.17.2","port":"9999"}]}'
> 
> Sample command line using URI syntax:
> qemu-img convert -f raw -O raw -n 
> /var/lib/nova/instances/_base/0c5eacd5ebea5ed914b6a3e7b18f1ce734c386ad 
> vxhs://192.168.0.1:9999/%7Bc6718f6b-0401-441d-a8c3-1f0064d75ee0%7D
> 
> Signed-off-by: Ashish Mittal <address@hidden>

It would be very useful for reviewing the code if there were any sort of
technical documentation (whitepaper, text file, email, etc.) that describes
the details of VXHS.

[...]


There is still a good hunk of this patch that I have yet to fully go
through, but there is enough here to address first (both from my comments
and others').


> + * IO specific flags
> + */
> +#define IIO_FLAG_ASYNC              0x00000001
> +#define IIO_FLAG_DONE               0x00000010
> +#define IIO_FLAG_SYNC               0
> +
> +#define VDISK_FD_READ               0
> +#define VDISK_FD_WRITE              1
> +#define VXHS_MAX_HOSTS              4
> +
> +#define VXHS_OPT_FILENAME           "filename"
> +#define VXHS_OPT_VDISK_ID           "vdisk_id"
> +#define VXHS_OPT_SERVER             "server."
> +#define VXHS_OPT_HOST               "host"
> +#define VXHS_OPT_PORT               "port"
> +
> +/* qnio client ioapi_ctx */
> +static void *global_qnio_ctx;
> +

The use of a global qnio_ctx means that you are only going to be able to
connect to one vxhs server.  I.e., QEMU will not be able to have multiple
drives with different VXHS servers, unless I am missing something.

Is that by design?

I don't see any reason why you could not contain this to the BDRVVXHSState
struct, so that it is limited in scope to each drive instance.
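A minimal sketch of that containment (the iio_init() signature is an
assumption based on how the patch appears to create the global context, and
vxhs_iio_callback stands in for the patch's completion callback; adjust to
the real libqnio API):

static int vxhs_open(BlockDriverState *bs, QDict *options,
                     int bdrv_flags, Error **errp)
{
    BDRVVXHSState *s = bs->opaque;

    /* Per-instance context instead of a file-scope global, so each
     * drive can talk to a different VxHS server. */
    s->qnio_ctx = iio_init(vxhs_iio_callback);  /* assumed signature */
    if (!s->qnio_ctx) {
        error_setg(errp, "Could not initialize libqnio context");
        return -ENODEV;
    }
    /* ... rest of open ... */
    return 0;
}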


[...]


> +
> +static int32_t
> +vxhs_qnio_iio_ioctl(void *apictx, uint32_t rfd, uint32_t opcode, int64_t *in,
> +                    void *ctx, uint32_t flags)
> +{

I looked at the underlying iio_ioctl() code, and it is a bit odd that
vdisk_size is always required by it, especially since it doesn't allow a
NULL parameter in case you don't care about it.

> +    int   ret = 0;
> +
> +    switch (opcode) {
> +    case VDISK_STAT:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_STAT,
> +                                     in, ctx, flags);
> +        break;
> +
> +    case VDISK_AIO_FLUSH:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_FLUSH,
> +                                     in, ctx, flags);
> +        break;
> +
> +    case VDISK_CHECK_IO_FAILOVER_READY:
> +        ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY,
> +                                     in, ctx, flags);
> +        break;
> +
> +    default:
> +        ret = -ENOTSUP;
> +        break;
> +    }

This whole function is necessary only because the underlying iio_ioctl()
returns success for unknown opcodes.  I think this is a mistake, and you
probably don't want to do it this way.

The iio_ioctl() function should return -ENOTSUP for unknown opcodes; this is
not a hard failure, and it allows us to determine what the underlying
library supports.  And then this function can disappear completely.

Since QEMU and libqnio are decoupled, you won't know at runtime what version
of libqnio is available on whatever system it is running on.  As the library
and driver evolve, there will likely come a time when you will want to know
in the QEMU driver if the QNIO version installed supports a certain feature.
If iio_ioctl (and other functions, as appropriate) return -ENOTSUP for
unsupported features, then it becomes easy to probe to see what is
supported.

I'd actually go a step further - have the iio_ioctl() function not filter on
opcodes either, and just blindly pass the opcodes to the server via
iio_ioctl_json().  That way you can let the server tell QEMU what it
supports, at least as far as ioctl operations go.
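For example, the runtime probing then becomes trivial (a sketch, using an
opcode from this patch):

int64_t size = 0;

ret = iio_ioctl(apictx, rfd, IOR_VDISK_CHECK_IO_FAILOVER_READY,
                &size, ctx, flags);
if (ret == -ENOTSUP) {
    /* installed libqnio (or server) predates failover probing; fall back */
}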


> +
> +    if (ret) {
> +        *in = 0;
> +        trace_vxhs_qnio_iio_ioctl(opcode);
> +    }
> +
> +    return ret;
> +}
> +
> +static void vxhs_qnio_iio_close(BDRVVXHSState *s, int idx)
> +{
> +    /*
> +     * Close vDisk device
> +     */
> +    if (s->vdisk_hostinfo[idx].vdisk_rfd >= 0) {
> +        iio_devclose(s->qnio_ctx, 0, s->vdisk_hostinfo[idx].vdisk_rfd);
> +        s->vdisk_hostinfo[idx].vdisk_rfd = -1;
> +    }
> +
> +    /*
> +     * Close QNIO channel against cached channel-fd
> +     */
> +    if (s->vdisk_hostinfo[idx].qnio_cfd >= 0) {
> +        iio_close(s->qnio_ctx, s->vdisk_hostinfo[idx].qnio_cfd);
> +        s->vdisk_hostinfo[idx].qnio_cfd = -1;
> +    }
> +}
> +
> +static int vxhs_qnio_iio_open(int *cfd, const char *of_vsa_addr,
> +                              int *rfd, const char *file_name)
> +{
> +    /*
> +     * Open qnio channel to storage agent if not opened before.
> +     */
> +    if (*cfd < 0) {
> +        *cfd = iio_open(global_qnio_ctx, of_vsa_addr, 0);
> +        if (*cfd < 0) {
> +            trace_vxhs_qnio_iio_open(of_vsa_addr);
> +            return -ENODEV;
> +        }
> +    }
> +
> +    /*
> +     * Open vdisk device
> +     */
> +    *rfd = iio_devopen(global_qnio_ctx, *cfd, file_name, 0);
> +
> +    if (*rfd < 0) {
> +        if (*cfd >= 0) {

This if conditional can be dropped, *cfd is always >= 0 here.

> +            iio_close(global_qnio_ctx, *cfd);
> +            *cfd = -1;
> +            *rfd = -1;
> +        }
> +
> +        trace_vxhs_qnio_iio_devopen(file_name);
> +        return -ENODEV;
> +    }
> +
> +    return 0;
> +}
> +
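With the redundant check dropped, the devopen error path reduces to just
(sketch):

if (*rfd < 0) {
    iio_close(global_qnio_ctx, *cfd);
    *cfd = -1;
    trace_vxhs_qnio_iio_devopen(file_name);
    return -ENODEV;
}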

[...]

> +
> +static int vxhs_switch_storage_agent(BDRVVXHSState *s)
> +{
> +    int res = 0;
> +    int flags = (IIO_FLAG_ASYNC | IIO_FLAG_DONE);
> +
> +    trace_vxhs_switch_storage_agent(
> +              s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip,
> +              s->vdisk_guid);
> +
> +    res = vxhs_reopen_vdisk(s, s->vdisk_ask_failover_idx);
> +    if (res == 0) {
> +        res = vxhs_qnio_iio_ioctl(s->qnio_ctx,
> +                  s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd,
> +                  VDISK_CHECK_IO_FAILOVER_READY, NULL, s, flags);

Segfault here.  The libqnio library doesn't allow NULL for the vdisk size
argument (although it should).
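Until libqnio accepts NULL there, a throwaway out-parameter on the QEMU side
would avoid the crash (sketch):

int64_t unused = 0;

res = vxhs_qnio_iio_ioctl(s->qnio_ctx,
          s->vdisk_hostinfo[s->vdisk_ask_failover_idx].vdisk_rfd,
          VDISK_CHECK_IO_FAILOVER_READY, &unused, s, flags);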


> +    } else {
> +        trace_vxhs_switch_storage_agent_failed(
> +                  s->vdisk_hostinfo[s->vdisk_ask_failover_idx].hostip,
> +                  s->vdisk_guid, res, errno);
> +        /*
> +         * Try the next host.
> +         * Calling vxhs_check_failover_status from here ties up the qnio
> +         * epoll loop if vxhs_qnio_iio_ioctl fails synchronously (-1)
> +         * for all the hosts in the IO target list.
> +         */
> +
> +        vxhs_check_failover_status(res, s);
> +    }
> +    return res;
> +}
> +
> +static void vxhs_check_failover_status(int res, void *ctx)
> +{
> +    BDRVVXHSState *s = ctx;
> +
> +    if (res == 0) {
> +        /* found failover target */
> +        s->vdisk_cur_host_idx = s->vdisk_ask_failover_idx;
> +        s->vdisk_ask_failover_idx = 0;
> +        trace_vxhs_check_failover_status(
> +                   s->vdisk_hostinfo[s->vdisk_cur_host_idx].hostip,
> +                   s->vdisk_guid);
> +        qemu_spin_lock(&s->vdisk_lock);
> +        OF_VDISK_RESET_IOFAILOVER_IN_PROGRESS(s);
> +        qemu_spin_unlock(&s->vdisk_lock);
> +        vxhs_handle_queued_ios(s);
> +    } else {
> +        /* keep looking */
> +        trace_vxhs_check_failover_status_retry(s->vdisk_guid);
> +        s->vdisk_ask_failover_idx++;
> +        if (s->vdisk_ask_failover_idx == s->vdisk_nhosts) {
> +            /* pause and cycle through list again */
> +            sleep(QNIO_CONNECT_RETRY_SECS);

Repeating my v6 review comments: this absolutely cannot happen here.
This is not just called from your callback, but also from your
.bdrv_aio_readv, .bdrv_aio_writev implementations.  Don't sleep() QEMU.

To resolve this will probably require some redesign of the failover code,
and when it is called; because of the variety of code paths that can invoke
this code, you also cannot do a coroutine yield here, either.
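One possible direction, as a sketch only (s->failover_timer is a
hypothetical field, created with aio_timer_new(..., QEMU_CLOCK_REALTIME,
SCALE_MS, vxhs_failover_timer_cb, s), and the real fix likely needs the
broader redesign mentioned above): arm a one-shot timer on the drive's
AioContext and resume the host scan from its callback instead of blocking:

static void vxhs_failover_timer_cb(void *opaque)
{
    BDRVVXHSState *s = opaque;

    /* pause expired: cycle through the host list again */
    s->vdisk_ask_failover_idx = 0;
    vxhs_switch_storage_agent(s);
}

/* instead of sleep(QNIO_CONNECT_RETRY_SECS): */
timer_mod(s->failover_timer,
          qemu_clock_get_ms(QEMU_CLOCK_REALTIME) +
          QNIO_CONNECT_RETRY_SECS * 1000);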


[...]


> +static int32_t
> +vxhs_qnio_iio_writev(void *qnio_ctx, uint32_t rfd, QEMUIOVector *qiov,
> +                     uint64_t offset, void *ctx, uint32_t flags)
> +{
> +    struct iovec cur;
> +    uint64_t cur_offset = 0;
> +    uint64_t cur_write_len = 0;
> +    int segcount = 0;
> +    int ret = 0;
> +    int i, nsio = 0;
> +    int iovcnt = qiov->niov;
> +    struct iovec *iov = qiov->iov;
> +
> +    errno = 0;
> +    cur.iov_base = 0;
> +    cur.iov_len = 0;
> +
> +    ret = iio_writev(qnio_ctx, rfd, iov, iovcnt, offset, ctx, flags);
> +
> +    if (ret == -1 && errno == EFBIG) {
> +        trace_vxhs_qnio_iio_writev(ret);
> +        /*
> +         * IO size is larger than IIO_IO_BUF_SIZE hence need to
> +         * split the I/O at IIO_IO_BUF_SIZE boundary
> +         * There are two cases here:
> +         *  1. iovcnt is 1 and IO size is greater than IIO_IO_BUF_SIZE
> +         *  2. iovcnt is greater than 1 and IO size is greater than
> +         *     IIO_IO_BUF_SIZE.
> +         *
> +         * Need to adjust the segment count, for that we need to compute
> +         * the segment count and increase the segment count in one shot
> +         * instead of setting iteratively in for loop. It is required to
> +         * prevent any race between the splitted IO submission and IO
> +         * completion.
> +         */

If I understand the for loop below correctly, it is all done to set nsio to the
correct value, right?

> +        cur_offset = offset;
> +        for (i = 0; i < iovcnt; i++) {
> +            if (iov[i].iov_len <= IIO_IO_BUF_SIZE && iov[i].iov_len > 0) {
> +                cur_offset += iov[i].iov_len;
> +                nsio++;

Is this chunk:

> +            } else if (iov[i].iov_len > 0) {
> +                cur.iov_base = iov[i].iov_base;
> +                cur.iov_len = IIO_IO_BUF_SIZE;
> +                cur_write_len = 0;
> +                while (1) {
> +                    nsio++;
> +                    cur_write_len += cur.iov_len;
> +                    if (cur_write_len == iov[i].iov_len) {
> +                        break;
> +                    }
> +                    cur_offset += cur.iov_len;
> +                    cur.iov_base += cur.iov_len;
> +                    if ((iov[i].iov_len - cur_write_len) > IIO_IO_BUF_SIZE) {
> +                        cur.iov_len = IIO_IO_BUF_SIZE;
> +                    } else {
> +                        cur.iov_len = (iov[i].iov_len - cur_write_len);
> +                    }
> +                }
> +            }

... effectively just doing this?

tmp = iov[i].iov_len / IIO_IO_BUF_SIZE;
nsio += tmp;
nsio += iov[i].iov_len % IIO_IO_BUF_SIZE ? 1 : 0;

> +        }
> +
> +        segcount = nsio - 1;
> +        vxhs_inc_acb_segment_count(ctx, segcount);
> +
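Generalized over every iovec entry, the whole counting pass reduces to
(sketch):

nsio = 0;
for (i = 0; i < iovcnt; i++) {
    nsio += iov[i].iov_len / IIO_IO_BUF_SIZE;
    nsio += (iov[i].iov_len % IIO_IO_BUF_SIZE) ? 1 : 0;
}
vxhs_inc_acb_segment_count(ctx, nsio - 1);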


[...]


> +static int vxhs_open(BlockDriverState *bs, QDict *options,
> +              int bdrv_flags, Error **errp)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    AioContext *aio_context;
> +    int qemu_qnio_cfd = -1;
> +    int device_opened = 0;
> +    int qemu_rfd = -1;
> +    int ret = 0;
> +    int i;
> +
> +    ret = vxhs_qemu_init(options, s, &qemu_qnio_cfd, &qemu_rfd, errp);
> +    if (ret < 0) {
> +        trace_vxhs_open_fail(ret);
> +        return ret;
> +    }
> +
> +    device_opened = 1;

This is still unneeded; it is always == 1 in any path that can hit errout.

> +    s->qnio_ctx = global_qnio_ctx;
> +    s->vdisk_hostinfo[0].qnio_cfd = qemu_qnio_cfd;
> +    s->vdisk_hostinfo[0].vdisk_rfd = qemu_rfd;
> +    s->vdisk_size = 0;
> +    QSIMPLEQ_INIT(&s->vdisk_aio_retryq);


[...]

> +
> +/*
> + * This is called by QEMU when a flush gets triggered from within
> + * a guest at the block layer, either for IDE or SCSI disks.
> + */
> +static int vxhs_co_flush(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t size = 0;
> +    int ret = 0;
> +
> +    /*
> +     * VDISK_AIO_FLUSH ioctl is a no-op at present and will
> +     * always return success. This could change in the future.
> +     */

Rather than always returning success, since it is a no-op how about having
it return -ENOTSUP?  That can then be filtered out here, and it gives us the
ability to determine later whether a given version of vxhs supports flush.

Or, you could just not implement vxhs_co_flush() at all; the block layer
will assume success in that case.
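If you keep the implementation, the filtering would then be just (sketch,
assuming libqnio changes as described above):

if (ret == -ENOTSUP) {
    ret = 0;   /* flush is a no-op on this libqnio/server; not an error */
}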

> +    ret = vxhs_qnio_iio_ioctl(s->qnio_ctx,
> +            s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
> +            VDISK_AIO_FLUSH, &size, NULL, IIO_FLAG_SYNC);
> +
> +    if (ret < 0) {
> +        trace_vxhs_co_flush(s->vdisk_guid, ret, errno);
> +        vxhs_close(bs);

But if we leave this here, and if you are using the vxhs_close() for cleanup
(and I like that you do), you need to make sure to set bs->drv = NULL.
Otherwise, 1) subsequent I/O will not have a good day, and 2) the final
invocation of bdrv_close() will double free resources.
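That is, something like (sketch):

if (ret < 0) {
    trace_vxhs_co_flush(s->vdisk_guid, ret, errno);
    vxhs_close(bs);
    bs->drv = NULL;   /* stop further I/O; avoid double free in bdrv_close() */
}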

> +    }
> +
> +    return ret;
> +}
> +
> +static unsigned long vxhs_get_vdisk_stat(BDRVVXHSState *s)
> +{
> +    int64_t vdisk_size = 0;
> +    int ret = 0;
> +
> +    ret = vxhs_qnio_iio_ioctl(s->qnio_ctx,
> +            s->vdisk_hostinfo[s->vdisk_cur_host_idx].vdisk_rfd,
> +            VDISK_STAT, &vdisk_size, NULL, 0);
> +
> +    if (ret < 0) {
> +        trace_vxhs_get_vdisk_stat_err(s->vdisk_guid, ret, errno);
> +        return 0;

Why return 0, rather than the error?
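Propagating it would look like this (sketch; note the function's return
type would need to become int64_t rather than unsigned long so a negative
errno survives):

if (ret < 0) {
    trace_vxhs_get_vdisk_stat_err(s->vdisk_guid, ret, errno);
    return ret;   /* propagate the error instead of masking it as 0 */
}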

> +    }
> +
> +    trace_vxhs_get_vdisk_stat(s->vdisk_guid, vdisk_size);
> +    return vdisk_size;
> +}
> +
> +/*
> + * Returns the size of vDisk in bytes. This is required
> + * by QEMU block upper block layer so that it is visible
> + * to guest.
> + */
> +static int64_t vxhs_getlength(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t vdisk_size = 0;
> +
> +    if (s->vdisk_size > 0) {
> +        vdisk_size = s->vdisk_size;
> +    } else {
> +        /*
> +         * Fetch the vDisk size using stat ioctl
> +         */
> +        vdisk_size = vxhs_get_vdisk_stat(s);
> +        if (vdisk_size > 0) {
> +            s->vdisk_size = vdisk_size;
> +        }
> +    }
> +
> +    if (vdisk_size > 0) {

If vxhs_get_vdisk_stat() returned the error rather than 0, then this check
is unnecessary (assuming vxhs_qnio_iio_ioctl() returns useful errors
itself).
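vxhs_getlength() then collapses to roughly (sketch):

static int64_t vxhs_getlength(BlockDriverState *bs)
{
    BDRVVXHSState *s = bs->opaque;

    if (s->vdisk_size == 0) {
        int64_t size = vxhs_get_vdisk_stat(s);
        if (size < 0) {
            return size;   /* already a negative errno */
        }
        s->vdisk_size = size;
    }
    return s->vdisk_size;   /* size in bytes */
}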

> +        return vdisk_size; /* return size in bytes */
> +    }
> +
> +    return -EIO;
> +}
> +
> +/*
> + * Returns actual blocks allocated for the vDisk.
> + * This is required by qemu-img utility.
> + */

This returns bytes, not blocks.

> +static int64_t vxhs_get_allocated_blocks(BlockDriverState *bs)
> +{
> +    BDRVVXHSState *s = bs->opaque;
> +    int64_t vdisk_size = 0;
> +
> +    if (s->vdisk_size > 0) {
> +        vdisk_size = s->vdisk_size;
> +    } else {
> +        /*
> +         * TODO:
> +         * Once HyperScale storage-virtualizer provides
> +         * actual physical allocation of blocks then
> +         * fetch that information and return back to the
> +         * caller but for now just get the full size.
> +         */
> +        vdisk_size = vxhs_get_vdisk_stat(s);
> +        if (vdisk_size > 0) {
> +            s->vdisk_size = vdisk_size;
> +        }
> +    }
> +
> +    if (vdisk_size > 0) {
> +        return vdisk_size; /* return size in bytes */
> +    }
> +
> +    return -EIO;
> +}

1. This is identical to vxhs_getlength(), so you could just call that, but:

2. Don't, because you are not returning what is expected by this function.
   If it is not supported yet by VXHS, just don't implement
   bdrv_get_allocated_file_size.  The block driver code will do the right
   thing, and return -ENOTSUP (which ends up being reported as the value
   being unavailable).  This is much more useful than having the wrong
   value returned.

3. Unless, of course, 100% file allocation is always true; in that case you
   can ignore point #2 and just follow #1.

-Jeff



