From: Gleb Natapov
Subject: Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
Date: Sat, 29 May 2010 17:38:57 +0300

On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
> >> There is no code, because we're still at architecture design stage.
> >>
> > Try to write test code to understand the problem better.
> 
> I will.
> 
Please do ASAP. This discussion shows that you don't understand the
problem that we are dealing with.

> >> >> >> guests could also be assisted with special handling (like win2k
> >> >> >> install hack), for example guest instructions could be counted
> >> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> >> after at least N instructions have passed.
> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> >> > or less reliably using perf counters, maybe).
> >> >>
> >> >> Aren't there any debug registers or perf counters, which can generate
> >> >> an interrupt after some number of instructions have been executed?
> >> > Don't think debug registers have something like that and they are
> >> > available for guest use anyway. Perf counters differ greatly from CPU
> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
> >> > keep using them for profiling guests. And I don't see what problem it
> >> > will solve anyway that can be solved by simple delay between irq
> >> > reinjection.
> >>
> >> This would allow counting the executed instructions and limit it. Thus
> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> >>
> > Why would you want to limit the number of instructions executed by the guest
> > if the CPU has nothing else to do anyway? The problem occurs not when we have
> > spare cycles to give to a guest, but in the opposite case.
> 
> I think one problem is that the guest has executed too much compared
> to what would happen with real HW with a lesser CPU. That explains the
> RTC frequency reprogramming case.
You think wrong. The problem is exactly the opposite: the guest hasn't
had enough execution time between two timer interrupts. I don't know what
RTC frequency reprogramming case you are talking about here.

> 
> >
> >> >>
> >> >> >>
> >> >> >> > And even if the rate did not matter, the APIC would still have to know
> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> >> > as such for a certain interval. This really does not sound like
> >> >> >> > simplifying things or even making them cleaner.
> >> >> >>
> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> >> would probably eventually spread to many other devices. The logical
> >> >> >> conclusion of that would be a system where all devices would be
> >> >> >> careful not to disturb the guest at wrong moment because that would
> >> >> >> trigger a bug.
> >> >> >>
> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> >> > pale in comparison (and I still don't see how you are going to make it
> >> >> > actually work).
> >> >>
> >> >> Implement everything inside APIC: only coalescing and reinjection.
> >> > APIC has zero info needed to implement reinjection correctly as was
> >> > shown to you several times in this thread and you simply keep ignoring
> >> > it.
> >>
> >> On the contrary, APIC is actually the only source of the IRQ ack
> >> information. RTC hack would not work without APIC (or the
> >> bidirectional IRQ) passing this info to RTC.
> >>
> >> What APIC doesn't have now is the timer frequency or period info. This
> >> is known by RTC and also higher levels managing the clocks.
> >>
> > So APIC has one bit of information and RTC everything else.
> 
> The information known by RTC (timer period) is also known by higher levels.
> 
What do you mean by higher level here? vl.c or APIC?

> > The current
> > approach (and proposed patch) brings this one bit of information to RTC,
> > you are arguing that RTC should be able to communicate all its info to
> > APIC. Sorry I don't see that your way has any advantage. Just more
> > complex interface and it is much easier to get it wrong for other time
> > sources.
> 
> I don't think anymore that APIC should be handling this but the
> generic stuff, like vl.c or exec.c. Then there would be only
> information passing from APIC to higher levels.
Handling reinjection in general timer code makes infinitely more sense
than handling it in APIC. One thing (off the top of my head) that can't
be implemented at that level is back-to-back interrupt injection (i.e.
injecting the next interrupt immediately after the guest acknowledges the
previous one to RTC).
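
To make the back-to-back case concrete, here is a toy Python model. It is
not QEMU code, and every name in it (ToyIrqLine, ToyRtc, drain) is made up
for illustration. It only sketches the idea: the device keeps a backlog of
ticks that could not be delivered and injects the next one at the moment
the guest acknowledges the previous one to the device, which is an event
that generic timer code would not see.

```python
class ToyIrqLine:
    """Hypothetical interrupt line with a single pending bit."""

    def __init__(self):
        self.pending = False
        self.handled = 0

    def raise_irq(self):
        # Delivery feedback: False means the interrupt was coalesced.
        if self.pending:
            return False
        self.pending = True
        return True

    def guest_handles(self):
        # The guest finally runs its handler for the pending interrupt.
        if not self.pending:
            return False
        self.pending = False
        self.handled += 1
        return True


class ToyRtc:
    """Hypothetical periodic timer that remembers undelivered ticks."""

    def __init__(self, inject):
        self.inject = inject  # callable returning True if the IRQ was accepted
        self.backlog = 0      # ticks the guest has not seen yet

    def on_timer(self):
        if not self.inject():
            self.backlog += 1

    def on_guest_ack(self):
        # The guest acknowledged the interrupt to the device:
        # reinject one missed tick right away (back to back).
        if self.backlog and self.inject():
            self.backlog -= 1


def drain(line, rtc):
    """Guest gets CPU time: handle and acknowledge until nothing is pending."""
    while line.guest_handles():
        rtc.on_guest_ack()


line = ToyIrqLine()
rtc = ToyRtc(line.raise_irq)
for _ in range(3):        # three ticks fire while the guest is not running
    rtc.on_timer()
drain(line, rtc)          # the guest then sees all three ticks in a row
print(line.handled, rtc.backlog)
```

In this model the guest ends up handling every tick, but only because the
device itself sees the acknowledge and can react to it immediately.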

> 
> >> I keep ignoring the idea that the current model, where both RTC and
> >> APIC must somehow work together to make coalescing work, is the only
> >> possible just because it is committed and it happens to work in some
> >> cases. It would be much better to concentrate this to one place, APIC
> >> or preferably higher level where it may benefit other timers too.
> >> Provided of course that the other models can be made to work.
> >>
> > So write the code and show us. You haven't shown any evidence that RTC is
> > the wrong place. RTC knows when an interrupt was acknowledged to RTC, it
> > knows when clock frequency changes, it knows when device reset happened.
> > APIC knows only that an interrupt was coalesced. It doesn't even know that
> > it may be masked by a guest in IOAPIC (interrupts delivered while they
> > are masked are not considered coalesced).
> 
> Oh, I thought interrupt masking was the reason for coalescing! What
> exactly is the reason then?
> 
The reason is that the guest has no time to process the previous interrupt
before it is time to inject the next one.
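
A toy Python simulation of that situation, as a sketch only. It is not
QEMU code; the function name, the one-bit pending latch and the
microsecond timestamps are all assumptions made for illustration. A tick
that arrives while the previous one is still pending has nowhere to go and
is counted as coalesced.

```python
def simulate(tick_times, service_times):
    """Return (delivered, coalesced) for a one-bit pending latch.

    tick_times:    times at which the emulated timer raises its interrupt
    service_times: times at which the guest actually gets to run its handler
    """
    # On a tie, let the guest service the old interrupt before the new tick.
    events = sorted(
        [(t, 1, "tick") for t in tick_times]
        + [(t, 0, "service") for t in service_times]
    )
    pending = False
    delivered = 0
    coalesced = 0
    for _, _, kind in events:
        if kind == "tick":
            if pending:
                coalesced += 1   # previous tick not handled yet: merged
            else:
                pending = True
        elif pending:            # "service" with something to handle
            pending = False
            delivered += 1
    return delivered, coalesced


ticks = list(range(0, 10000, 1000))       # a 1 ms timer for 10 ms (in us)

# Idle host: the guest runs shortly after every tick, nothing is lost.
idle = simulate(ticks, [t + 100 for t in ticks])

# Loaded host: the guest is scheduled only four times in the same 10 ms.
loaded = simulate(ticks, [500, 3500, 4500, 9500])

print(idle, loaded)
```

With these made-up numbers the idle case delivers all ten ticks, while the
loaded case delivers four and coalesces six. Without reinjection, those six
ticks are simply missing from the guest's clock.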

> > Time source knows only when
> > frequency changes and maybe when device reset happens if timer is
> > stopped by device on reset. So RTC is actually a sweet spot if you want
> > to minimize amount of info you need to pass between various layers.
> >
> >> >> Maybe that version would not bend backwards as much as the current to
> >> >> cater for buggy hosts.
> >> >>
> >> > You mean "buggy guests"?
> >>
> >> Yes, sorry.
> >>
> >> > What guests are not buggy in your opinion?
> >> > Linux tries hard to be smart and as a result the only way to have stable
> >> > clock with it is to go paravirt.
> >>
> >> I'm not an OS designer, but I think an OS should never crash, even if
> >> a burst of IRQs is received. Reprogramming the timer should consider
> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> >> cause of the problem.
> > OS should never crash in the absence of HW bugs? I doubt you can design
> > an OS that can run in the face of any HW failure. Anyway here we are
> > trying to solve guests' timekeeping problem, not crashes. Do you think
> > you can design an OS that can keep time accurately no matter how crazily
> > all the HW clocks behave?
> 
> I think my OS design skills are not relevant in this discussion, but
> IIRC there are fault tolerant operating systems for extreme conditions
> so it can be done.
> 
> >
> >>
> >> >> > The fact is that timer device is not "just like any
> >> >> > other device" in virtual world. Any other device is easy: you just
> >> >> > implement spec as close as possible and everything works. For time
> >> >> > source device this is not enough. You can implement RTC+HPET to the
> >> >> > letter and your guest will drift like crazy.
> >> >>
> >> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> >> without any voodoo. The interrupts would come after executing the same
> >> >> instruction as the real HW. For emulating any sufficiently buggy
> >> >> guests in any sufficiently desperate low resource conditions, this may
> >> >> be the only option that will always work.
> >> >>
> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> >> > to be one. On the contrary KVM runs at native host CPU speed most of the
> >> > time, so any emulation done between two instructions is theoretically
> >> > noticeable for a guest. TSC is bypassed directly to a guest too, so
> >> > keeping all time sources in perfect sync is also impossible.
> >>
> >> That is actually another cause of the problem. KVM gives the guest an
> >> illusion that the VCPU speed is equal to host speed. When they don't
> >> match, especially in critical code, there can be problems. It would be
> >> better to tell the guest a lower speed, which also can be guaranteed.
> >>
> > Not possible. It's that simple. You should take it into account in your
> > architecture design stage. In case of KVM real physical CPU executes guest
> > instruction and it does this as fast as it can. The only way we can hide
> > that from a guest is by intercepting each access to TSC and at that
> > point we can use bochs instead.
> 
> Well, as Paul pointed out, there's also icount option.
> 
icount is not an option for KVM.

> >> Maybe we should also offline the device emulation to another host CPU
> >> with threading. A load from a device will always be much slower than
> >> on real HW though.
> > The time drift problem starts to happen on loaded servers, so you do not
> > have a spare CPU to offload device emulation to.
> >
> > --
> >                        Gleb.
> >

--
                        Gleb.


