Re: [PATCH v3] block/nbd: use non-blocking connect: fix vm hang on conne

qemu-devel


From:	Eric Blake
Subject:	Re: [PATCH v3] block/nbd: use non-blocking connect: fix vm hang on connect()
Date:	Wed, 19 Aug 2020 12:52:55 -0500
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.11.0

On 8/12/20 9:52 AM, Vladimir Sementsov-Ogievskiy wrote:

This makes the nbd connection_co yield during reconnects, so that a
reconnect doesn't hang the main thread. This is very important when the
nbd server host is unavailable: the connect() call may take a long
time, blocking the main thread (and because of the reconnect loop, it
will hang again and again, with only small gaps of working time during
the pauses between connection attempts).
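
For readers less familiar with the distinction, here is a minimal sketch
of a connect that never blocks the caller for longer than a chosen bound.
It is written in Python rather than QEMU's C and is not the code from
this patch; the helper name connect_nonblocking and its timeout parameter
are assumptions made purely for illustration. It assumes a POSIX-like
host and a numeric IPv4 address (name resolution could itself block).

```python
import errno
import os
import select
import socket


def connect_nonblocking(host, port, timeout):
    """Connect to (host, port), waiting at most `timeout` seconds.

    Hypothetical helper for illustration only. Returns a connected
    socket (left in non-blocking mode) or raises OSError/TimeoutError.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # On a non-blocking socket, connect_ex() returns immediately,
        # typically with EINPROGRESS while the TCP handshake is pending.
        err = sock.connect_ex((host, port))
        if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            raise OSError(err, os.strerror(err))

        # The socket becomes writable once the handshake has finished,
        # whether it succeeded or failed. An event loop would register
        # the socket here and go do other work instead of waiting.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            raise TimeoutError("connect did not complete within the timeout")

        # SO_ERROR carries the final result of the connection attempt.
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err != 0:
            raise OSError(err, os.strerror(err))
        return sock
    except Exception:
        sock.close()
        raise
```

With a server that silently drops packets, a plain blocking connect()
would sit in the kernel until the TCP timeout expires, whereas the
sketch above gives up after the caller's own bound and the thread stays
free to service other events in the meantime.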

How to reproduce the bug, fixed with this commit:

1. Create an image on node1:
    qemu-img create -f qcow2 xx 100M

2. Start NBD server on node1:
    qemu-nbd xx

3. Start vm with second nbd disk on node2, like this:

   ./x86_64-softmmu/qemu-system-x86_64 -nodefaults -drive \
     file=/work/images/cent7.qcow2 -drive file=nbd+tcp://192.168.100.2 \
     -vnc :0 -qmp stdio -m 2G -enable-kvm -vga std

Where is the configuration to set up retry on the nbd connection? I
wonder if you have a non-upstream patch that turns it on by default in
your builds; for upstream, I would have expected something more along
the lines of -blockdev
driver=nbd,reconnect-delay=20,server.type=inet,server.data.hostname=192.168.100.2,server.data.port=10809
(typing off the top of my head, rather than actually tested).


4. Access the vm through vnc (or some other way?), and check that NBD
    drive works:

    dd if=/dev/sdb of=/dev/null bs=1M count=10

    - the command should succeed.

5. Now, let's trigger nbd-reconnect loop in Qemu process. For this:

5.1 Kill NBD server on node1

5.2 run "dd if=/dev/sdb of=/dev/null bs=1M count=10" in the guest
     again. The command should fail and a lot of error messages about
     failing disk may appear as well.

Why does the guest access fail when the server goes away? Shouldn't the
pending guest requests merely be queued for retry (where the guest has
not seen a failure yet, but may do so if timeouts are reached), rather
than being instant errors?


     Now NBD client driver in Qemu tries to reconnect.
     Still, VM works well.

6. Make node1 unavailable on NBD port, so connect() from node2 will
    last for a long time:

    On node1 (note that 10809 is just the default NBD port):

    sudo iptables -A INPUT -p tcp --dport 10809 -j DROP

    After some time the guest hangs, and you may check in gdb that Qemu
    hangs in connect() call, issued from the main thread. This is the
    BUG.

7. Don't forget to drop iptables rule from your node1:

    sudo iptables -D INPUT -p tcp --dport 10809 -j DROP


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
