From: Anthony Liguori
Subject: Re: [Qemu-devel] [PATCH] kvm: First step to push iothread lock out of inner run loop
Date: Thu, 28 Jun 2012 09:10:39 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:11.0) Gecko/20120329 Thunderbird/11.0.1

On 06/26/2012 02:34 PM, Marcelo Tosatti wrote:
On Sat, Jun 23, 2012 at 12:55:49AM +0200, Jan Kiszka wrote:
Should have declared this [RFC] in the subject and CC'ed kvm...

On 2012-06-23 00:45, Jan Kiszka wrote:
This sketches a possible path to get rid of the iothread lock on vmexits
in KVM mode. On x86, the in-kernel irqchip has to be used because
we otherwise need to synchronize APIC and other per-cpu state accesses
that could be changed concurrently. Not yet fully analyzed is the NMI
injection path in the absence of an APIC.

s390x should be fine without specific locking as its pre/post-run
callbacks are empty. Power requires locking for the pre-run callback.

This patch is untested, but a similar version was successfully used in
an x86 setup with a network I/O path that needed no central iothread
locking anymore (it required special MMIO exit handling).
---
  kvm-all.c         |   18 ++++++++++++++++--
  target-i386/kvm.c |    7 +++++++
  target-ppc/kvm.c  |    4 ++++
  3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index f8e4328..9c3e26f 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1460,6 +1460,8 @@ int kvm_cpu_exec(CPUArchState *env)
          return EXCP_HLT;
      }

+    qemu_mutex_unlock_iothread();
+
      do {
          if (env->kvm_vcpu_dirty) {
              kvm_arch_put_registers(env, KVM_PUT_RUNTIME_STATE);
@@ -1476,14 +1478,16 @@ int kvm_cpu_exec(CPUArchState *env)
               */
              qemu_cpu_kick_self();
          }
-        qemu_mutex_unlock_iothread();

          run_ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);

-        qemu_mutex_lock_iothread();
          kvm_arch_post_run(env, run);

+        /* TODO: push coalesced mmio flushing to the point where we access
+         * devices that are using it (currently VGA and E1000). */
+        qemu_mutex_lock_iothread();
          kvm_flush_coalesced_mmio_buffer();
+        qemu_mutex_unlock_iothread();

          if (run_ret < 0) {
              if (run_ret == -EINTR || run_ret == -EAGAIN) {
@@ -1499,19 +1503,23 @@ int kvm_cpu_exec(CPUArchState *env)
          switch (run->exit_reason) {
          case KVM_EXIT_IO:
              DPRINTF("handle_io\n");
+            qemu_mutex_lock_iothread();
              kvm_handle_io(run->io.port,
                            (uint8_t *)run + run->io.data_offset,
                            run->io.direction,
                            run->io.size,
                            run->io.count);
+            qemu_mutex_unlock_iothread();
              ret = 0;
              break;
          case KVM_EXIT_MMIO:
              DPRINTF("handle_mmio\n");
+            qemu_mutex_lock_iothread();
              cpu_physical_memory_rw(run->mmio.phys_addr,
                                     run->mmio.data,
                                     run->mmio.len,
                                     run->mmio.is_write);
+            qemu_mutex_unlock_iothread();
              ret = 0;
              break;
          case KVM_EXIT_IRQ_WINDOW_OPEN:
@@ -1520,7 +1528,9 @@ int kvm_cpu_exec(CPUArchState *env)
              break;
          case KVM_EXIT_SHUTDOWN:
              DPRINTF("shutdown\n");
+            qemu_mutex_lock_iothread();
              qemu_system_reset_request();
+            qemu_mutex_unlock_iothread();
              ret = EXCP_INTERRUPT;
              break;
          case KVM_EXIT_UNKNOWN:
@@ -1533,11 +1543,15 @@ int kvm_cpu_exec(CPUArchState *env)
              break;
          default:
              DPRINTF("kvm_arch_handle_exit\n");
+            qemu_mutex_lock_iothread();
              ret = kvm_arch_handle_exit(env, run);
+            qemu_mutex_unlock_iothread();
              break;
          }
      } while (ret == 0);

+    qemu_mutex_lock_iothread();
+
      if (ret < 0) {
          cpu_dump_state(env, stderr, fprintf, CPU_DUMP_CODE);
          vm_stop(RUN_STATE_INTERNAL_ERROR);
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 0d0d8f6..0ad64d1 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -1631,7 +1631,10 @@ void kvm_arch_pre_run(CPUX86State *env, struct kvm_run *run)

      /* Inject NMI */
      if (env->interrupt_request & CPU_INTERRUPT_NMI) {
+        qemu_mutex_lock_iothread();
          env->interrupt_request &= ~CPU_INTERRUPT_NMI;
+        qemu_mutex_unlock_iothread();
+
          DPRINTF("injected NMI\n");
          ret = kvm_vcpu_ioctl(env, KVM_NMI);
          if (ret < 0) {
@@ -1641,6 +1644,8 @@ void kvm_arch_pre_run(CPUX86State *env, struct kvm_run *run)
      }

      if (!kvm_irqchip_in_kernel()) {
+        qemu_mutex_lock_iothread();
+
          /* Force the VCPU out of its inner loop to process any INIT requests
           * or pending TPR access reports. */
          if (env->interrupt_request &
@@ -1682,6 +1687,8 @@ void kvm_arch_pre_run(CPUX86State *env, struct kvm_run *run)

          DPRINTF("setting tpr\n");
          run->cr8 = cpu_get_apic_tpr(env->apic_state);
+
+        qemu_mutex_unlock_iothread();
      }
  }

diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index c09cc39..60d91a5 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -471,6 +471,8 @@ void kvm_arch_pre_run(CPUPPCState *env, struct kvm_run *run)
      int r;
      unsigned irq;

+    qemu_mutex_lock_iothread();
+
      /* PowerPC QEMU tracks the various core input pins (interrupt, critical
       * interrupt, reset, etc) in PPC-specific env->irq_input_state. */
      if (!cap_interrupt_level &&
@@ -497,6 +499,8 @@ void kvm_arch_pre_run(CPUPPCState *env, struct kvm_run *run)
      /* We don't know if there are more interrupts pending after this. However,
       * the guest will return to userspace in the course of handling this one
       * anyways, so we will get a chance to deliver the rest. */
+
+    qemu_mutex_unlock_iothread();
  }

  void kvm_arch_post_run(CPUPPCState *env, struct kvm_run *run)


The following plan would allow a progressive conversion to parallel
operation.

Jan mentioned the MMIO handler->MMIO handler deadlock in a private message.

Jan: if there are recursive MMIO accesses, can you detect that and skip
such MMIO handlers in dev_can_use_lock()? Or blacklist them.

What I mentioned earlier about "unsolved issues" refers to the
possibilities in the "iothread flow" section below (my belief is that
the best choice is better determined with execution experience / tuning).

OK, so each plan section below starts with an xyz.txt heading. This is
compatible with TCG; the critical part for decent performance is not
dropping/acquiring locks all the time (that is, only drop memmap_lock
from the TCG VCPU thread if the IOTHREAD requests it).

This came out from discussions with Avi.

immediate-plan.txt
------------------

1.  read_lock(memmap_lock)
2.  MemoryRegionSection mrs = lookup(addr)
3.  qom_ref(mrs.mr->dev)
4.  read_unlock(memmap_lock)

5.  mutex_lock(dev->lock)
6.  dispatch(&mrs, addr, data, size)
7.  mutex_unlock(dev->lock)

Just a detail: I don't think we should acquire a device-specific lock in global code. Rather, I think we should acquire the global lock before dispatch unless a MemoryRegion is marked as being unlocked.
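
Something like the following sketch, where the "unlocked" flag on the
MemoryRegion is made up for illustration (not existing API):

/* Take the global lock around dispatch by default; only regions that
 * have explicitly opted in to fine-grained locking skip it. */
static void dispatch_mmio(MemoryRegionSection *mrs, target_phys_addr_t addr,
                          uint8_t *data, unsigned size)
{
    bool biglock = !mrs->mr->unlocked;   /* default: locked dispatch */

    if (biglock) {
        qemu_mutex_lock_iothread();
    }
    dispatch(mrs, addr, data, size);     /* step 6 above */
    if (biglock) {
        qemu_mutex_unlock_iothread();
    }
}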

Regards,

Anthony Liguori


8.  qom_unref(mrs.mr->dev)


A) Add Device object to MemoryRegions.

memory_region_add_subregion
memory_region_add_eventfd
memory_region_add_subregion_overlap
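
For A), a minimal sketch; the mr->dev field and the setter are
assumptions here, matching the mrs.mr->dev reference in the steps above:

/* Record the owning device when its regions are registered, so the
 * dispatch path has something to qom_ref(). */
static void memory_region_set_owner(MemoryRegion *mr, DeviceState *dev)
{
    mr->dev = dev;
}

/* ...called from memory_region_add_subregion() and friends. */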


B) qom_ref details

Z) see device-hotplug.txt items



lock-model.txt
--------------


lock_device(dev) {
     if (dev_can_use_lock(dev))
         mutex_lock(dev->lock)
     else
         mutex_lock(qemu_global_mutex)
}

dev_can_use_lock(dev)
{
     if (every service used by dev's mmio handler, while holding
         qemu_global_mutex, at most does trylock(dev->lock)
         and tolerates its failure)
         then yes
     else
         then no
}


That is, from the moment a device uses the order
dev->lock -> qemu_global_mutex, the reverse order
qemu_global_mutex -> dev->lock is forbidden for the IOTHREAD and
everyone else.
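
In C with pthreads, lock_device()/unlock_device() could look like the
sketch below; the fine_grained flag caches the dev_can_use_lock()
answer, which must not change while the device is active:

#include <pthread.h>
#include <stdbool.h>

typedef struct Device {
    pthread_mutex_t lock;        /* per-device lock */
    bool fine_grained;           /* dev_can_use_lock() result */
} Device;

extern pthread_mutex_t qemu_global_mutex;

static void lock_device(Device *dev)
{
    if (dev->fine_grained) {
        pthread_mutex_lock(&dev->lock);
    } else {
        pthread_mutex_lock(&qemu_global_mutex);
    }
}

static void unlock_device(Device *dev)
{
    if (dev->fine_grained) {
        pthread_mutex_unlock(&dev->lock);
    } else {
        pthread_mutex_unlock(&qemu_global_mutex);
    }
}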


convert-to-unlocked.txt
-----------------------


1.  read_lock(memmap_lock)
2.  MemoryRegionSection mrs = lookup(addr)
3.  qom_ref(mrs.mr->dev)
4.  read_unlock(memmap_lock)

5.  mutex_lock(dev->lock)
6.  dispatch(&mrs, addr, data, size)
7.  mutex_unlock(dev->lock)

8.  qom_unref(mrs.mr->dev)
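
Put together, steps 1-8 might read as follows in C; memmap_lock is
assumed to be a pthread rwlock, and lookup(), qom_ref()/qom_unref() and
lock_device() carry the signatures implied above:

extern pthread_rwlock_t memmap_lock;

static void mmio_access(target_phys_addr_t addr, uint8_t *data,
                        unsigned size)
{
    MemoryRegionSection mrs;

    pthread_rwlock_rdlock(&memmap_lock);     /* 1 */
    mrs = lookup(addr);                      /* 2 */
    qom_ref(mrs.mr->dev);                    /* 3: pin the device */
    pthread_rwlock_unlock(&memmap_lock);     /* 4: map may change now */

    lock_device(mrs.mr->dev);                /* 5 */
    dispatch(&mrs, addr, data, size);        /* 6 */
    unlock_device(mrs.mr->dev);              /* 7 */

    qom_unref(mrs.mr->dev);                  /* 8 */
}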

THE ITEMS BELOW SHOULD BECOME DOCUMENTATION FOR CONVERTING A DEVICE
TO UNLOCKED OPERATION.

1) qom_ref guarantees that the Device object does not disappear. However,
these operations are allowed in parallel:

     a) after 4. device memory and ioports might change or be unregistered.

        Concurrent accesses to a device that is being hot-unplugged or
        whose mappings change are the responsibility of the guest OS, so
        incorrect emulation is not QEMU's problem. However, device
        emulation is now subject to:
             - cpu_physical_memory_map failing.
             - stl_phys* failing.
             - cpu_physical_memory_rw failing/reading from unassigned mem.

     b) what should happen with a pending cpu_physical_memory_map pointer
        when the corresponding memory region is deleted?
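
For a), device code would need failure checks along these lines
(2012-era cpu_physical_memory_map()/unmap() signatures; dma_read() is
just an illustrative helper):

#include <string.h>

static bool dma_read(target_phys_addr_t addr, void *dst,
                     target_phys_addr_t len)
{
    target_phys_addr_t plen = len;
    void *buf = cpu_physical_memory_map(addr, &plen, 0 /* !is_write */);

    if (!buf || plen < len) {
        /* Region was unmapped or truncated under us: fail the DMA. */
        if (buf) {
            cpu_physical_memory_unmap(buf, plen, 0, 0);
        }
        return false;
    }
    memcpy(dst, buf, len);
    cpu_physical_memory_unmap(buf, plen, 0, len);
    return true;
}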


net.txt
--------


iothread flow
=============

1) Skip-work-if-device-locked

select(tap fd ready)
tap_send
     if (trylock(TAPState->NetClientState->dev))
         proceed;
     else
         continue; (data remains in queue)
     tap_read_packet
     qemu_send_packet_async

DRAWBACK: lock intensive.
DRAWBACK: VCPU threads can starve the IOTHREAD (can be fixed with a
scheme where a failed trylock() sets a flag saying "iothread wants lock").
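
As a sketch with pthreads (the per-device dev_lock member is an
assumption, and the real tap_read_packet()/qemu_send_packet_async()
signatures differ):

static void tap_send(void *opaque)
{
    TAPState *s = opaque;

    /* Skip-work-if-device-locked: never block the IOTHREAD here. */
    if (pthread_mutex_trylock(&s->nc.peer->dev_lock) != 0) {
        return;   /* data stays queued in the tap fd; select() fires again */
    }

    tap_read_packet(s);            /* simplified, as in the flow above */
    qemu_send_packet_async(s);

    pthread_mutex_unlock(&s->nc.peer->dev_lock);
}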

Checking each subsystem for feasibility:

- tap_send: OK
     - user,slirp: if not possible, use device->lock.
- qemu-char: OK
- interrupts: lock with qemu_global_mutex.
- qemutimer:
     - read time: qemu_get_clock_ns (see later).


2) Queue-work-to-vcpu-context

select(tap fd ready)
tap_send
     if (trylock(TAPState->NetClientState->dev)) {
         proceed;
     } else {
         AddToDeviceWorkQueue(work);
     }
     tap_read_packet
     qemu_send_packet_async

DRAWBACK: lock intensive
DRAWBACK: cache effects
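
AddToDeviceWorkQueue() could be as small as the sketch below (all names
hypothetical; assumes Device grows a work list protected by a leaf
work_lock that is never held across dispatch). The VCPU thread drains
the list just before it releases dev->lock:

typedef struct DeviceWork {
    void (*fn)(void *opaque);
    void *opaque;
    struct DeviceWork *next;
} DeviceWork;

static void add_to_device_work_queue(Device *dev, DeviceWork *w)
{
    pthread_mutex_lock(&dev->work_lock);
    w->next = dev->work;    /* LIFO for brevity; real code keeps order */
    dev->work = w;
    pthread_mutex_unlock(&dev->work_lock);
}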

3) VCPU-thread-notify-device-unlocked

select(tap fd ready)
tap_send
     if (trylock(TAPState->NetClientState->dev)) {
         proceed;
     } else {
         signal VCPU thread to notify IOTHREAD when unlocked.
     }
     tap_read_packet


GENERAL OBJECTIVE: avoid lock inversion. The IOTHREAD should follow the
same lock order as the VCPU threads.


PCI HOTPLUG EJECT FLOW
----------------------

1) Monitor initiated.

monitor -> qmp_device_del -> qdev_unplug -> pci_unplug_device ->
piix4_device_hotplug -> SCI interrupt -> OS removes
application users from device -> OS runs _EJ0 ->

For hot removal, the device must be immediately ejected when OSPM calls
the _EJ0 control method. The _EJ0 control method does not return until
ejection is complete. After calling _EJ0, OSPM verifies the device no
longer exists to determine if the eject succeeded. For _HID devices,
OSPM evaluates the _STA method. For _ADR devices, OSPM checks with the
bus driver for that device.


Avi: A PCI eject guarantees that all accesses started after the eject
completes do not see the device.

Can you do:

- qobject_remove()

and still have device dispatch running safely? NO; therefore
pci_unplug_device must hold dev->lock.
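
In sketch form, with unregister_mmio() standing in for whatever unmaps
the device's regions:

static void pci_unplug_device(Device *dev)
{
    lock_device(dev);      /* waits for in-flight MMIO dispatch */
    unregister_mmio(dev);  /* new lookups no longer find the device */
    unlock_device(dev);

    qom_unref(dev);        /* freed once dispatchers drop their refs */
}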

