
Re: [Qemu-devel] [PATCH 3/3] hw/char/serial: Don't retry on serial_xmit


From: Sergio Lopez
Subject: Re: [Qemu-devel] [PATCH 3/3] hw/char/serial: Don't retry on serial_xmit if errno == EPIPE
Date: Fri, 1 Jun 2018 18:20:11 +0200
User-agent: NeoMutt/20180512

On Fri, Jun 01, 2018 at 01:43:56PM +0100, Peter Maydell wrote:
> On 1 June 2018 at 13:28, Sergio Lopez <address@hidden> wrote:
> > On Fri, Jun 01, 2018 at 02:16:59PM +0200, Paolo Bonzini wrote:
> >> On 01/06/2018 14:05, Sergio Lopez wrote:
> >> >>> Instead of adding explicit handling of EPIPE, shouldn't the code be
> >> >>> rewritten to treat -1 return && errno != EAGAIN as fatal?
> >> >> Yes, exactly this code is already broken for every single errno
> >> >> value, not simply EPIPE. It needs fixing to treat '-1' return code
> >> >> correctly instead of retrying everything.
> >> > Given that EAGAIN is already taken care of in
> >> > chardev/char.c:qemu_chr_write_buffer, in which cases should we retry? Or
> >> > should we just drop all the tsr_retry logic?
> >>
> >> It's not handled there, the call from qemu_chr_fe_write has write_all ==
> >> false.
> >
> > You're right. So should we just retry only for -1 && errno == EAGAIN and
> > just ignore the error otherwise?
> 
> I don't think we should make all of qemu_chr_fe_write()'s callers
> have to handle EAGAIN. That function already has a way to tell
> the caller that it didn't consume any data, which is to return 0.
> 
> (EAGAIN is a curse and we should strive to avoid it spreading its
> poison into areas of the codebase that we can keep it out of.)
> 
> In general there are three cases I think qemu_chr_fe_write() callers
> might care about:
>  * data was consumed (return >0)
>  * data wasn't consumed, try again later (return 0)
>  * data wasn't consumed, and a later attempt won't work either
>    (return -1)
> 
> NB that hw/char/serial.c is not the only serial device using
> this pattern and probably needing adjustment (though I think it's
> the only one with the complicated tsr_retry logic).
> 
> In the patch that started this thread the report is that we
> use lots of CPU. Can you explain why that happens? I was expecting
> that we would set up the qemu_chr_fe_add_watch() on the dead
> chr FE and it would just never fire since the FE can never
> accept further data...

The callback is indeed triggered (I haven't yet looked into why), and
both the main thread and the vCPU that issued the write to the serial
port (which is now polling the device to see if the transmitter is
empty) are fighting for the qemu_global_mutex.

This issue is heavily amplified if both threads share the same pCPU (by
chance or by pinning).

Some debugging info from a simulation:


 - vCPU thread waiting to acquire qemu_global_mutex:

Thread 3 (Thread 0x7fb6c7fff700 (LWP 31641)):
#0  0x00007fb6e591c4cd in __lll_lock_wait () at 
../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fb6e5917dcb in _L_lock_812 () at /lib64/libpthread.so.0
#2  0x00007fb6e5917c98 in __GI___pthread_mutex_lock (address@hidden 
<qemu_global_mutex>) at ../nptl/pthread_mutex_lock.c:79
#3  0x0000556dfaebcfd7 in qemu_mutex_lock_impl (address@hidden 
<qemu_global_mutex>, address@hidden "/root/Projects/qemu/cpus.c", 
address@hidden)
    at util/qemu-thread-posix.c:67
#4  0x0000556dfaacfe58 in qemu_mutex_lock_iothread () at 
/root/Projects/qemu/cpus.c:1765
#5  0x0000556dfaa90bcd in prepare_mmio_access (mr=0x556dfe71c0e0, 
mr=0x556dfe71c0e0)
    at /root/Projects/qemu/exec.c:3068
#6  0x0000556dfaa95f98 in flatview_read_continue (address@hidden, 
address@hidden, attrs=..., address@hidden, address@hidden "", address@hidden, 
addr1=5, l=1, mr=0x556dfe71c0e0)
    at /root/Projects/qemu/exec.c:3189
#7  0x0000556dfaa9617c in flatview_read (fv=0x7fb6c028aa80, addr=1021, 
attrs=..., buf=0x7fb6e739f000 "", len=1) at /root/Projects/qemu/exec.c:3255
#8  0x0000556dfaa9629f in address_space_read_full (as=<optimized out>, 
address@hidden, attrs=..., buf=<optimized out>, address@hidden) at 
/root/Projects/qemu/exec.c:3268
#9  0x0000556dfaa963fa in address_space_rw (as=<optimized out>, address@hidden, 
attrs=...,
    address@hidden, buf=<optimized out>, address@hidden, address@hidden)
    at /root/Projects/qemu/exec.c:3298
#10 0x0000556dfaaf6536 in kvm_cpu_exec (count=1, size=1, direction=<optimized 
out>, data=<optimized out>, attrs=..., port=1021) at 
/root/Projects/qemu/accel/kvm/kvm-all.c:1730
#11 0x0000556dfaaf6536 in kvm_cpu_exec (address@hidden)
    at /root/Projects/qemu/accel/kvm/kvm-all.c:1970
#12 0x0000556dfaad0006 in qemu_kvm_cpu_thread_fn (arg=0x556dfd3d8020) at 
/root/Projects/qemu/cpus.c:1215
#13 0x00007fb6e5915dd5 in start_thread (arg=0x7fb6c7fff700) at 
pthread_create.c:308
#14 0x00007fb6d9e97b3d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:113


 - The owner is LWP 31634, the main thread

 (gdb) p qemu_global_mutex
$1 = {lock = {__data = {__lock = 2, __count = 0, __owner = 31634, __nusers = 1, 
__kind = 0, __spins = 0,
      __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
    __size = "\002\000\000\000\000\000\000\000\222{\000\000\001", '\000' 
<repeats 26 times>,
    __align = 2}, initialized = true}


 - The main thread is in the callback

 Thread 1 (Thread 0x7fb6e7355cc0 (LWP 31634)):
#0  0x0000556dfac59420 in serial_watch_cb (chan=0x556dfd321400, cond=G_IO_OUT, 
opaque=0x556dfe71c000)
    at hw/char/serial.c:233
#1  0x00007fb6e68638f9 in g_main_context_dispatch (context=0x556dfd314210) at 
gmain.c:3146
#2  0x00007fb6e68638f9 in g_main_context_dispatch (address@hidden) at 
gmain.c:3811
#3  0x0000556dfaeba126 in main_loop_wait () at util/main-loop.c:215
#4  0x0000556dfaeba126 in main_loop_wait (timeout=<optimized out>) at 
util/main-loop.c:263
#5  0x0000556dfaeba126 in main_loop_wait (address@hidden) at 
util/main-loop.c:522
#6  0x0000556dfaa89a2f in main () at vl.c:1943
#7  0x0000556dfaa89a2f in main (argc=<optimized out>, argv=<optimized out>, 
envp=<optimized out>)
    at vl.c:4679


Given your question, I was wondering what would happen if the callback
never fired. The result is much, much worse: QEMU never sets
UART_LSR_TEMT, so the Guest OS (RHEL7.5) keeps polling the serial
device for ~10 seconds. Of course this can be regarded as an issue in
the Guest's serial driver, but I'm pretty sure we don't want to break
it ;-)

This is not the first time I've fought the retry logic in serial.c.
Some versions back, serial_xmit would happily pile up watchers with no
limit, eventually exceeding the FD_SETSIZE limit. I'm starting to think
that the tsr_retry logic does more harm than good.

In any case, I think the question on the table is when serial_xmit
should retry writing to the FE, and we seem to have two options:

 1. When ret != 1 and errno == EAGAIN.

 2. When ret == 0.

I'm good with either, so I leave the decision to you.

Sergio.


