From: Jan Kiszka
Subject: Re: [Qemu-devel] [PATCH v2 6/6] i8259: add -no-spurious-interrupt-hack option
Date: Fri, 24 Aug 2012 07:40:36 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); de; rv:1.8.1.12) Gecko/20080226 SUSE/2.0.0.12-1.1 Thunderbird/2.0.0.12 Mnenhy/0.7.5.666

On 2012-08-23 08:24, Matthew Ogilvie wrote:
> This patch provides a way to optionally suppress spurious interrupts,
> as a workaround for systems described below:
> 
> Some old operating systems do not handle spurious interrupts well,
> and qemu tends to generate them significantly more often than
> real hardware.
> 
> Examples:
>   - Microport UNIX System V/386 v 2.1 (ca 1987)
>     (The main problem I'm fixing: Without this patch, it panics
>     sporadically when accessing the hard disk.)
>   - AT&T UNIX System V/386 Release 4.0 Version 2.1a (ca 1991)
>     See screenshot in "QEMU Official OS Support List":
>     http://www.claunia.com/qemu/objectManager.php?sClass=application&iId=9
>     (I don't have this system to test.)
>   - A report about OS/2 boot lockup from 2004 by Hampa Hug:
>     http://lists.nongnu.org/archive/html/qemu-devel/2004-09/msg00367.html
>     (My patch was partially inspired by his.)
>     Also:
>     http://lists.nongnu.org/archive/html/qemu-devel/2005-06/msg00243.html
>     (I don't have this system to test.)
> 
> Signed-off-by: Matthew Ogilvie <address@hidden>
> ---
> 
> Note: checkpatch.pl gives an error about initializing the global
> "int no_spurious_interrupt_hack = 0;", even though existing lines
> near it are doing the same thing.  Should I give precedence to
> checkpatch.pl, or nearby code?
> 
> There was no version 1 of this patch; this was the last thing I had to
> work around to get UNIX running.
> 
> High level symptoms:
>    1. Despite using this UNIX system for nearly 10 years (ca 1987-1996)
>       on an early 80386, I don't remember ever seeing any crash like
>       this.  I vaguely remember I may have had one or two crashes for
>       which I don't have other explanations that perhaps could have
>       been this, but I don't remember the error messages to confirm it.
>    2. It is somewhat random when UNIX crashes when running in qemu.
>        - Sometimes it crashes the first time the floppy-based installer
>          tries to access the hard disk (partition table?).
>        - Other times (though fairly rarely), it actually finishes
>          formatting and copying the first disk's files to the
>          hard disk without crashing.
>        - On the other hand, I've never seen it successfully boot from
>          the hard disk without this patch.  An attempt to boot from
>          the hard drive always panics quite early.
>    3. I tried -win2k-hack instead, thinking maybe the hard disk is just
>       responding faster than UNIX expected.  But it doesn't seem
>       to have any effect.  UNIX still panics sporadically the same way.
>        - TANGENT: I was going to see if my patch provides an
>          alternative fix for installing Windows 2000, but
>          I was unable to reproduce the original -win2k-hack problem at
>          all (with neither -win2k-hack NOR this patch).  Maybe
>          some other change has fixed it some other way?  Or maybe
>          it is only an issue in configurations I didn't test?
>          (KVM instead of TCG?  Less RAM?  Something else?)
>             It might be worth doing a little more investigation,
>          and eliminating the -win2k-hack option if appropriate.
>    4. If I enable KVM, I get a different error very early in
>       bootup (in splx function instead of splint), and this patch
>       doesn't help.
> 
> ============
> My low level analysis of what is going on:
> 
> It is hard to track down all the details, but based on logging a
> lot of qemu IRQ stuff, and setting a breakpoint in the earliest
> panic-related UNIX function using gdb, it looks like:
> 
>    1. It is near the end of servicing a previous IRQ14 from the
>       hard disk.
>    2. The processor has interrupts disabled (I think), while UNIX
>       clears the slave 8259's IMR (mask) register (sets it to 0), allowing
>       all interrupts to be passed on to the master.
>    3. While in that state, IRQ14 is raised (on the slave), which
>       gets propagated to the master (IRQ2), but the CPU
>       is not interrupted yet.
>    4. UNIX then masks the slave 8259's IMR register
>       completely (sets to 0xff).
>    5. Because the master elcr register is set (by BIOS; UNIX never
>       touches it) to edge trigger for IRQ2, the master latched on
>       to IRQ2 earlier, and continues to assert the processor's INT line
>       (the env->interrupt_request&CPU_INTERRUPT_HARD bit) even
>       after all slave IRQs have been masked off (clearing the input
>       IRQ2).
>    6. Finally, UNIX enables CPU interrupts and the interrupt is delivered
>       to the CPU, which ends up as a spurious IRQ15 due to the
>       slave's imr register.  UNIX doesn't know what to do with
>       that, and panics/halts.
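
The six steps above can be sketched as a standalone toy model. This is not QEMU code: `ToyPic`, `toy_set_irq`, `toy_get_irq`, `toy_update_cascade`, `toy_read_irq` and `toy_scenario` are hypothetical names made up for illustration, priorities are fixed, and ISR handling is left out. It only assumes what the analysis describes: an edge-triggered master input that stays latched after the slave output drops.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one 8259; only the state needed for this scenario. */
typedef struct {
    uint8_t irr;      /* latched interrupt requests */
    uint8_t imr;      /* interrupt mask register */
    uint8_t last_irr; /* previous input levels, used for edge detection */
} ToyPic;

/* Edge-triggered input: latch into irr only on a 0 -> 1 transition. */
static void toy_set_irq(ToyPic *p, int irq, int level)
{
    assert(irq >= 0 && irq < 8);
    uint8_t mask = (uint8_t)(1u << irq);
    if (level) {
        if (!(p->last_irr & mask)) {
            p->irr |= mask;
        }
        p->last_irr |= mask;
    } else {
        p->last_irr &= (uint8_t)~mask; /* irr stays latched */
    }
}

/* Lowest-numbered pending, unmasked request, or -1 if there is none. */
static int toy_get_irq(const ToyPic *p)
{
    uint8_t pending = p->irr & (uint8_t)~p->imr;
    for (int i = 0; i < 8; i++) {
        if (pending & (1u << i)) {
            return i;
        }
    }
    return -1;
}

/* The slave's output line drives the master's IRQ2 input. */
static void toy_update_cascade(ToyPic *master, const ToyPic *slave)
{
    toy_set_irq(master, 2, toy_get_irq(slave) >= 0);
}

/* Interrupt acknowledge: returns the IRQ number (0..15) the CPU sees. */
static int toy_read_irq(ToyPic *master, ToyPic *slave)
{
    int irq = toy_get_irq(master);
    if (irq < 0) {
        return 7;                 /* spurious IRQ on the master */
    }
    master->irr &= (uint8_t)~(1u << irq);
    if (irq == 2) {
        int irq2 = toy_get_irq(slave);
        if (irq2 < 0) {
            return 8 + 7;         /* spurious IRQ on the slave: IRQ15 */
        }
        slave->irr &= (uint8_t)~(1u << irq2);
        return 8 + irq2;
    }
    return irq;
}

/* Replays steps 2-6 with CPU interrupts notionally disabled until the end. */
int toy_scenario(void)
{
    ToyPic master = {0}, slave = {0};

    slave.imr = 0x00;                     /* step 2: unmask everything */
    toy_set_irq(&slave, 6, 1);            /* step 3: IRQ14 is slave input 6 */
    toy_update_cascade(&master, &slave);  /* master latches IRQ2 */
    slave.imr = 0xff;                     /* step 4: mask everything */
    toy_update_cascade(&master, &slave);  /* step 5: input drops, irr stays */
    return toy_read_irq(&master, &slave); /* step 6: CPU takes the interrupt */
}
```

In this model `toy_scenario()` returns 15: the master still holds the latched IRQ2, but by the time it is acknowledged the slave has nothing unmasked to report.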
> 
> I'm not sure why it only sporadically hits this sequence of events.
> There don't seem to be any other IRQs asserted or serviced anywhere
> in the near past; the last several were all IRQ14's.  But I can't
> help feeling I'm not reading the log output correctly or something,
> because that doesn't make sense.  Maybe there is some kind
> of a-few-instructions delay before a CPU interrupt is actually
> delivered after interrupts are enabled, or some delay in raising
> IRQ14 after a hard drive operation is requested, and such delays
> need to fall into a narrow window of opportunity left by UNIX?
> 
> I can get a disassembly of the UNIX kernel using a "coff"-enabled
> build of GNU objdump, giving function names but not much else.
> But I haven't studied it in enough detail to actually find the
> relevant code path that is manipulating imr as described above.
> However, this old post outlines some of the high level theory
> of UNIX spl*() functions:
> http://www.linuxmisc.com/29-unix-internals/4e6c1f6fa2e41670.htm
> 
> If anyone wants to look into this further, I can provide access to the
> initial boot install floppy, at least.  Email me.  (Without the rest
> of the install disks, it isn't much use for anything except testing
> virtual machines like qemu against rare corner cases...)
> 
> ============
> Alternative Approaches:
> 
> An alternative to this patch that might work (I haven't tried) would
> be to have BIOS set the master's elcr register 0x04 bit, making IRQ2
> level triggered instead of edge triggered.  I'm not sure what other
> effects this might have.  Maybe it would actually be a more accurate
> model (I haven't checked documentation; maybe "slave mode" of an
> IRQ line into the master is supposed to be level triggered?)
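
To make the edge versus level distinction concrete, here is a minimal sketch of how a per-line ELCR bit could change what happens to a request once the input drops again. It is a toy model with hypothetical names (`ToyMaster`, `toy_master_set_irq`, `toy_irq2_pending_after_pulse`), not the QEMU implementation, and it says nothing about whether real hardware permits IRQ2 to be level triggered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy master PIC input stage; names and layout are illustrative only. */
typedef struct {
    uint8_t elcr;     /* one bit per line: 1 = level, 0 = edge triggered */
    uint8_t irr;      /* pending requests */
    uint8_t last_irr; /* previous input levels, used for edge detection */
} ToyMaster;

static void toy_master_set_irq(ToyMaster *s, int irq, bool level)
{
    assert(irq >= 0 && irq < 8);
    uint8_t mask = (uint8_t)(1u << irq);

    if (s->elcr & mask) {
        /* Level triggered: the request simply follows the input. */
        if (level) {
            s->irr |= mask;
            s->last_irr |= mask;
        } else {
            s->irr &= (uint8_t)~mask;
            s->last_irr &= (uint8_t)~mask;
        }
    } else {
        /* Edge triggered: a rising edge latches the request, and a
         * falling input does not withdraw it. */
        if (level) {
            if (!(s->last_irr & mask)) {
                s->irr |= mask;
            }
            s->last_irr |= mask;
        } else {
            s->last_irr &= (uint8_t)~mask;
        }
    }
}

/* Raise and then drop IRQ2; report whether a request is still pending. */
bool toy_irq2_pending_after_pulse(uint8_t elcr)
{
    ToyMaster m = { .elcr = elcr };

    toy_master_set_irq(&m, 2, true);   /* slave asserts its output */
    toy_master_set_irq(&m, 2, false);  /* slave is masked, output drops */
    return (m.irr & 0x04) != 0;
}
```

With `elcr = 0x00` the request survives the pulse, which matches step 5 of the analysis; with `elcr = 0x04` it is withdrawn as soon as the slave output drops, so no interrupt would be left to turn spurious.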
> 
> Or perhaps find a way to model the minimum timescale that an interrupt
> request needs to be active to be recognized?
> 
> Or maybe my analysis isn't correct; I wasn't able to find the
> relevant code path in the UNIX kernel.
> 
> ============
> 
>  cpu-exec.c      | 12 +++++++-----
>  hw/i8259.c      | 18 ++++++++++++++++++
>  qemu-options.hx | 12 ++++++++++++
>  sysemu.h        |  1 +
>  vl.c            |  4 ++++
>  5 files changed, 42 insertions(+), 5 deletions(-)
> 
> diff --git a/cpu-exec.c b/cpu-exec.c
> index 134b3c4..c309847 100644
> --- a/cpu-exec.c
> +++ b/cpu-exec.c
> @@ -329,11 +329,15 @@ int cpu_exec(CPUArchState *env)
>                                                            0);
>                              env->interrupt_request &= ~(CPU_INTERRUPT_HARD | CPU_INTERRUPT_VIRQ);
>                              intno = cpu_get_pic_interrupt(env);
> -                            qemu_log_mask(CPU_LOG_TB_IN_ASM, "Servicing hardware INT=0x%02x\n", intno);
> -                            do_interrupt_x86_hardirq(env, intno, 1);
> -                            /* ensure that no TB jump will be modified as
> -                               the program flow was changed */
> -                            next_tb = 0;
> +                            if (intno >= 0) {
> +                                qemu_log_mask(CPU_LOG_TB_IN_ASM,
> +                                              "Servicing hardware INT=0x%02x\n",
> +                                              intno);
> +                                do_interrupt_x86_hardirq(env, intno, 1);
> +                                /* ensure that no TB jump will be modified as
> +                                   the program flow was changed */
> +                                next_tb = 0;
> +                            }
>  #if !defined(CONFIG_USER_ONLY)
>                          } else if ((interrupt_request & CPU_INTERRUPT_VIRQ) &&
>                                     (env->eflags & IF_MASK) && 
> diff --git a/hw/i8259.c b/hw/i8259.c
> index 6587666..7ecb7e1 100644
> --- a/hw/i8259.c
> +++ b/hw/i8259.c
> @@ -26,6 +26,7 @@
>  #include "isa.h"
>  #include "monitor.h"
>  #include "qemu-timer.h"
> +#include "sysemu.h"
>  #include "i8259_internal.h"
>  
>  /* debug PIC */
> @@ -193,6 +194,20 @@ int pic_read_irq(DeviceState *d)
>                  pic_intack(slave_pic, irq2);
>              } else {
>                  /* spurious IRQ on slave controller */
> +                if (no_spurious_interrupt_hack) {
> +                    /* Pretend it was delivered and acknowledged.  If
> +                     * it was spurious due to slave_pic->imr, then
> +                     * as soon as the mask is cleared, the slave will
> +                     * re-trigger IRQ2 on the master.  If it is spurious for
> +                     * some other reason, make sure we don't keep trying
> +                     * to half-process the same spurious interrupt over
> +                     * and over again.
> +                     */
> +                    s->irr &= ~(1<<irq);
> +                    s->last_irr &= ~(1<<irq);
> +                    s->isr &= ~(1<<irq);
> +                    return -1;
> +                }
>                  irq2 = 7;
>              }
>              intno = slave_pic->irq_base + irq2;
> @@ -202,6 +217,9 @@ int pic_read_irq(DeviceState *d)
>          pic_intack(s, irq);
>      } else {
>          /* spurious IRQ on host controller */
> +        if (no_spurious_interrupt_hack) {
> +            return -1;
> +        }
>          irq = 7;
>          intno = s->irq_base + irq;
>      }
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 03e13ec..57bb0b4 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -1188,6 +1188,18 @@ Windows 2000 is installed, you no longer need this option (this option
>  slows down the IDE transfers).
>  ETEXI
>  
> +DEF("no-spurious-interrupt-hack", 0, QEMU_OPTION_no_spurious_interrupt_hack,
> +    "-no-spurious-interrupt-hack     disable delivery of spurious interrupts\n",
> +    QEMU_ARCH_I386)
> +STEXI
> +@item -no-spurious-interrupt-hack
> +@findex -no-spurious-interrupt-hack
> +Use it as a workaround for operating systems that drive PICs in a way that
> +can generate spurious interrupts, but the OS doesn't handle spurious
> +interrupts gracefully.  (e.g. late 80s/early 90s versions of ATT UNIX
> +and derivatives)

This has to mention, or even actively warn, that the option doesn't work
with KVM and its in-kernel irqchip (as that PIC model lacks your hack).

However, I strongly suspect you are nastily papering over an issue in
some device model. So I would prefer to dig deeper before installing
this in upstream (also due to its dependency on the userspace PIC model).

Jan
