From: Cleber Rosa
Subject: Re: [qemu-s390x] [Qemu-devel] [PATCH 15/18] Boot Linux Console Test: add a test for aarch64 + virt
Date: Mon, 10 Jun 2019 12:50:39 -0400
User-agent: Mutt/1.11.4 (2019-03-13)

On Mon, Jun 10, 2019 at 09:53:03AM +0100, Daniel P. Berrangé wrote:
> On Fri, Jun 07, 2019 at 03:58:57PM -0300, Eduardo Habkost wrote:
> > CCing Daniel, who wrote commit 6ab3fc32ea64.
> > 
> > On Fri, Jun 07, 2019 at 11:44:32AM -0400, Cleber Rosa wrote:
> > > On Fri, Jun 07, 2019 at 12:42:14AM -0300, Eduardo Habkost wrote:
> > > > On Fri, Jun 07, 2019 at 12:26:48AM -0300, Eduardo Habkost wrote:
> > > > > On Fri, Feb 01, 2019 at 11:10:31AM -0500, Cleber Rosa wrote:
> > > > > > 
> > > > > > 
> > > > > > On 1/31/19 4:26 PM, Cleber Rosa wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 1/31/19 3:21 PM, Cleber Rosa wrote:
> > > > > > >>
> > > > > > >>
> > > > > > >> On 1/31/19 3:02 PM, Wainer dos Santos Moschetta wrote:
> > > > > > >>>
> > > > > > >>> On 01/17/2019 04:56 PM, Cleber Rosa wrote:
> > > > > > >>>> Just like the previous tests, boots a Linux kernel on an
> > > > > > >>>> aarch64 target using the virt machine.
> > > > > > >>>>
> > > > > > >>>> One special option added is the CPU type, given that the kernel
> > > > > > >>>> selected fails to boot on the virt machine's default CPU (cortex-a15).
> > > > > > >>>>
> > > > > > >>>> Signed-off-by: Cleber Rosa <address@hidden>
> > > > > > >>>> ---
> > > > > > >>>>   .travis.yml                            |  2 +-
> > > > > > >>>>   tests/acceptance/boot_linux_console.py | 20 ++++++++++++++++++++
> > > > > > >>>>   2 files changed, 21 insertions(+), 1 deletion(-)
> > > > > > >>>>
> > > > > > >>>> diff --git a/.travis.yml b/.travis.yml
> > > > > > >>>> index 54100eea5a..595e8c0b6c 100644
> > > > > > >>>> --- a/.travis.yml
> > > > > > >>>> +++ b/.travis.yml
> > > > > > >>>> @@ -187,7 +187,7 @@ matrix:
> > > > > > >>>>         # Acceptance (Functional) tests
> > > > > > >>>>       - env:
> > > > > > >>>> -        - CONFIG="--python=/usr/bin/python3 --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu"
> > > > > > >>>> +        - CONFIG="--python=/usr/bin/python3 --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu"
> > > > > > >>>>           - TEST_CMD="make check-acceptance"
> > > > > > >>>>         addons:
> > > > > > >>>>           apt:
> > > > > > >>>> diff --git a/tests/acceptance/boot_linux_console.py b/tests/acceptance/boot_linux_console.py
> > > > > > >>>> index f3ccd23a7a..107700b517 100644
> > > > > > >>>> --- a/tests/acceptance/boot_linux_console.py
> > > > > > >>>> +++ b/tests/acceptance/boot_linux_console.py
> > > > > > >>>> @@ -138,3 +138,23 @@ class BootLinuxConsole(Test):
> > > > > > >>>>           self.vm.launch()
> > > > > > >>>>           console_pattern = 'Kernel command line: %s' % kernel_command_line
> > > > > > >>>>           self.wait_for_console_pattern(console_pattern)
> > > > > > >>>> +
> > > > > > >>>> +    def test_aarch64_virt(self):
> > > > > > >>>
> > > > > > >>> That test case fails on my system (Fedora 29 x86_64).  Avocado
> > > > > > >>> seems unable to kill the VM, so it reaches the timeout.
> > > > > > >>>
> > > > > > >>> I compiled QEMU with this configuration:
> > > > > > >>>
> > > > > > >>> $ configure --python=/usr/bin/python3 --target-list=x86_64-softmmu,mips-softmmu,mips64el-softmmu,ppc64-softmmu,aarch64-softmmu
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> A snippet of Avocado's job.log file follows:
> > > > > > >>> ----
> > > > > > >>> 2019-01-31 14:41:34,912 test             L0602 INFO | START 07-/root/src/qemu/tests/acceptance/boot_linux_console.py:BootLinuxConsole.test_aarch64_virt
> > > > > > >>> 2019-01-31 14:41:34,912 test             L0298 DEBUG| DATA (filename=output.expected) => NOT FOUND (data sources: variant, test, file)
> > > > > > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=arch, path=*, default=aarch64) => 'aarch64'
> > > > > > >>> 2019-01-31 14:41:34,913 parameters       L0146 DEBUG| PARAMS (key=qemu_bin, path=*, default=aarch64-softmmu/qemu-system-aarch64) => 'aarch64-softmmu/qemu-system-aarch64'
> > > > > > >>> 2019-01-31 14:41:34,915 download         L0070 INFO | Fetching https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz -> /var/lib/avocado/data/cache/by_name/vmlinuz.3upct2pr
> > > > > > >>> 2019-01-31 14:41:35,490 download         L0054 DEBUG| Retrieved URL "https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/releases/29/Server/aarch64/os/images/pxeboot/vmlinuz": content-length 8623423, date: "Thu, 31 Jan 2019 19:41:35 GMT", last-modified: "Sun, 21 Oct 2018 00:43:09 GMT"
> > > > > > >>> 2019-01-31 14:41:41,765 qemu             L0317 DEBUG| VM launch command: 'aarch64-softmmu/qemu-system-aarch64 -chardev socket,id=mon,path=/var/tmp/tmpizirkcud/qemu-32609-monitor.sock -mon chardev=mon,mode=control -display none -vga none -machine virt -chardev socket,id=console,path=/var/tmp/tmpizirkcud/qemu-32609-console.sock,server,nowait -serial chardev:console -cpu cortex-a53 -kernel /var/lib/avocado/data/cache/by_name/vmlinuz -append console=ttyAMA0'
> > > > > > >>> 2019-01-31 14:41:41,779 qmp              L0167 DEBUG| >>> {'execute': 'qmp_capabilities'}
> > > > > > >>> 2019-01-31 14:41:41,931 qmp              L0175 DEBUG| <<< {'return': {}}
> > > > > > >>> 2019-01-31 14:41:42,830 boot_linux_conso L0041 DEBUG| [    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
> > > > > > >>>
> > > > > > >>> (...)
> > > > > > >>>
> > > > > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000] Policy zone: DMA32
> > > > > > >>> 2019-01-31 14:41:42,833 boot_linux_conso L0041 DEBUG| [    0.000000] Kernel command line: console=ttyAMA0
> > > > > > >>> 2019-01-31 14:41:42,833 qmp              L0167 DEBUG| >>> {'execute': 'quit'}
> > > > > > >>
> > > > > > >> Here, a QMP response like "<<< {'return': {}}" would be expected.
> > > > > > >>
> > > > > > >> Since I cannot reproduce this on my system (or on the Travis-CI
> > > > > > >> jobs I've sent), can you tell me on top of which commit you've
> > > > > > >> applied these patches?
> > > > > > >>
> > > > > > > 
> > > > > > > I spoke too soon:
> > > > > > > 
> > > > > > > https://travis-ci.org/clebergnu/qemu/jobs/487121425#L3033
> > > > > > > 
> > > > > > > This looks like a recent regression, and I'm guessing it's not on 
> > > > > > > the
> > > > > > > test's side.  I'll try to bisect it and let you know.
> > > > > > > 
> > > > > > 
> > > > > > On a fresh environment, I am able to reproduce this roughly every
> > > > > > other run.  When I hit it, I attached GDB to the QEMU process, and
> > > > > > the backtrace shows:
> > > > > > 
> > > > > > [Thread debugging using libthread_db enabled]
> > > > > > Using host libthread_db library "/lib64/libthread_db.so.1".
> > > > > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > > > > warning: Loadable section ".note.gnu.property" outside of ELF segments
> > > > > > __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > > > > 103     2:      movl    %edx, %eax
> > > > > > (gdb) bt
> > > > > > #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:103
> > > > > > #1  0x00007fc6ba1a2e09 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x5615a233d020 <qemu_global_mutex>) at ../nptl/pthread_mutex_lock.c:80
> > > > > > #2  0x00005615a1bb7593 in qemu_mutex_lock_impl (mutex=0x5615a233d020 <qemu_global_mutex>, file=0x5615a1db2d4c "util/main-loop.c", line=236) at util/qemu-thread-posix.c:66
> > > > > > #3  0x00005615a171125e in qemu_mutex_lock_iothread_impl (file=file@entry=0x5615a1db2d4c "util/main-loop.c", line=line@entry=236) at /home/cleber/src/qemu/cpus.c:1849
> > > > > > #4  0x00005615a1bb415d in os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:236
> > > > > > #5  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:497
> > > > > > #6  0x00005615a18fdd39 in main_loop () at vl.c:1928
> > > > > > #7  0x00005615a16c9ee9 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4665
> > > > > 
> > > > > Tip: run "thread apply all bt" so you can get a backtrace of all
> > > > > threads.
> > > > > 
> > > > > 
> > > > > > 
> > > > > > Running it with `taskset -c 1` prevents this issue from happening,
> > > > > > which, AFAICT, points even further to this being a QEMU race
> > > > > > condition.
> > > > > > 
> > > > > > I'm CC'ing Peter and Claudio (listed maintainers of aarch64), as this
> > > > > > seems to be limited to that target.  Any tips on what to do here?
> > > > > 
> > > > > I am hitting this on Travis, too, and I finally could reproduce
> > > > > it locally.
> > > > > 
> > > > > The guest is still writing to the serial console, but nobody is
> > > > > reading the data on the other side.  A VCPU thread is stuck
> > > > > inside the EAGAIN/nanosleep loop at qemu_chr_write_buffer(),
> > > > > holding the QEMU global lock.
> > > > 
> > > > Experimental fix below.
> > > > 
> > > > Signed-off-by: Eduardo Habkost <address@hidden>
> > > > ---
> > > >  python/qemu/__init__.py | 12 ++++++++----
> > > >  1 file changed, 8 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/python/qemu/__init__.py b/python/qemu/__init__.py
> > > > index 81d9657ec0..4a691f34da 100644
> > > > --- a/python/qemu/__init__.py
> > > > +++ b/python/qemu/__init__.py
> > > > @@ -274,10 +274,6 @@ class QEMUMachine(object):
> > > >  
> > > >          self._qemu_log_path = None
> > > >  
> > > > -        if self._console_socket is not None:
> > > > -            self._console_socket.close()
> > > > -            self._console_socket = None
> > > > -
> > > >          if self._temp_dir is not None:
> > > >              shutil.rmtree(self._temp_dir)
> > > >              self._temp_dir = None
> > > > @@ -336,6 +332,14 @@ class QEMUMachine(object):
> > > >          """
> > > >          Terminate the VM and clean up
> > > >          """
> > > > +
> > > > +        # If we keep the console socket open, we may deadlock waiting
> > > > +        # for QEMU to exit, while QEMU is waiting for the socket to
> > > > +        # become writeable.
> > > > +        if self._console_socket is not None:
> > > > +            self._console_socket.close()
> > > > +            self._console_socket = None
> > > > +
> > > 
> > > Right, this is somewhat equivalent to the following "workaround":
> > > 
> > >    
> > > https://github.com/clebergnu/qemu/commit/e1713f3b91972ad57c089f276c54db3f3fa63423
> > > 
> > > I cannot tell at the moment, though, whether closing (or shutting down)
> > > the console socket is the appropriate fix.
> > > 
> > > What I see is that Rick's commit, pointed to by Laszlo, dates from
> > > 2016, and it says that a virtio-console fix is necessary.  I'd imagine
> > > that, if a fix was ever proposed, it would have been incorporated into
> > > the F29 kernel used in this test... and that this workaround/fix would
> > > not be the best solution.
> > 
> > If I understood this correctly, fixing the guest driver wouldn't
> > help here.  The commit mentioned by Laszlo (6ab3fc32ea64 "hw:
> > replace most use of qemu_chr_fe_write with qemu_chr_fe_write_all")
> > fixes data loss bugs but introduces the potential for deadlocks
> > like the one we are seeing.
> > 
> > Unless we replace the existing ~30 qemu_chr_fe_write_all() calls
> > in device code, we need to make sure we are constantly reading
> > data from the console socket.
> 
> Yes, you must *always* have something reading from the chardev backend.
> This really sucks, but it is preferable to letting data end up in
> /dev/null.
> 
> If this is a problem for the tests, then consider it motivation to
> fix the root cause and make the devices properly do async I/O for
> chardevs.  Only a handful of them do this properly right now.
>
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Daniel,

Thanks for the explanation.  I don't feel completely capable of tackling
the root cause, but I'll give it a go anyway.
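
In the meantime, for the tests, keeping something reading from the console
socket at all times seems to be the only safe pattern.  A rough sketch of
what I have in mind (drain_console() is a hypothetical helper, not existing
QEMUMachine API):

import threading

def drain_console(console_socket):
    """Read and discard guest console output in the background, so
    QEMU's chardev write side never blocks on a full socket buffer
    while holding the global lock."""
    def _drain(sock):
        while True:
            try:
                data = sock.recv(4096)
            except OSError:
                break          # socket closed under us, stop draining
            if not data:
                break          # EOF: QEMU closed its end
    thread = threading.Thread(target=_drain, args=(console_socket,))
    thread.daemon = True       # don't keep the test process alive
    thread.start()
    return thread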

Eduardo,

Looks like we'll need this type of workaround now anyway.  As seen in
the reproducer I wrote, a test can cause this deadlock before it even
gets to the test's tearDown(), which then, implicitly or explicitly,
ends up in QEMUMachine::shutdown().
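
A minimal sketch of that kind of reproducer (simplified; it assumes the
in-tree python/qemu module is on PYTHONPATH, and the kernel path is a
placeholder for any kernel that writes to the serial console):

from qemu import QEMUMachine

vm = QEMUMachine('aarch64-softmmu/qemu-system-aarch64')
vm.set_console()
vm.add_args('-machine', 'virt', '-cpu', 'cortex-a53',
            '-kernel', '/path/to/vmlinuz', '-append', 'console=ttyAMA0')
vm.launch()
# Nothing ever reads from vm.console_socket here.  Once the guest
# fills up the socket buffer, a vCPU thread blocks inside
# qemu_chr_write_buffer() while holding the global lock...
vm.shutdown()  # ...so the QMP 'quit' never completes: deadlock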

I would also like to avoid exposing the user (test writer) to that
kind of situation, though, which seems to be what you tried to do.  But
maybe the most honest approach right now is to make an explicit call to
read from and close the socket, paired with a documentation comment about
why that is being done?  This would be similar to the "XXX" comments in
6ab3fc32ea64.
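
Concretely, something along these lines (just a sketch, method name
hypothetical, mirroring that "XXX" comment style):

def drain_and_close_console(self):
    """
    XXX: read whatever is pending on the console socket and close it
    before asking QEMU to quit.  Since commit 6ab3fc32ea64, device
    code may block in qemu_chr_fe_write_all() while holding the global
    lock if nobody drains the console, which would deadlock shutdown().
    Drop this once chardev I/O in the devices is properly asynchronous.
    """
    if self._console_socket is None:
        return
    self._console_socket.setblocking(False)
    try:
        while self._console_socket.recv(4096):
            pass
    except OSError:
        pass  # EAGAIN and friends: nothing left to read
    self._console_socket.close()
    self._console_socket = None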

Regards!
- Cleber.



