From: Gleb Natapov
Subject: Re: [Qemu-devel] Re: [RFT][PATCH 07/15] qemu_irq: Add IRQ handlers with delivery feedback
Date: Sat, 29 May 2010 19:32:35 +0300

On Sat, May 29, 2010 at 04:03:22PM +0000, Blue Swirl wrote:
> 2010/5/29 Gleb Natapov <address@hidden>:
> > On Sat, May 29, 2010 at 09:15:11AM +0000, Blue Swirl wrote:
> >> >> There is no code, because we're still at architecture design stage.
> >> >>
> >> > Try to write test code to understand the problem better.
> >>
> >> I will.
> >>
> > Please do ASAP. This discussion shows that you don't understand the
> > problem that we are dealing with.
> 
> Which part of the problem you think I don't understand?
> 
It seems to me you don't understand how Windows uses the RTC for
timekeeping and how QEMU solves the problem today.

> >> >> >> >> guests could also be assisted with special handling (like win2k
> >> >> >> >> install hack), for example guest instructions could be counted
> >> >> >> >> (approximately, for example using TB size or TSC) and only inject
> >> >> >> >> after at least N instructions have passed.
> >> >> >> > Guest instructions cannot be easily counted in KVM (it can be done more
> >> >> >> > or less reliably using perf counters, maybe).
> >> >> >>
> >> >> >> Aren't there any debug registers or perf counters, which can generate
> >> >> >> an interrupt after some number of instructions have been executed?
> >> >> > Don't think debug registers have something like that and they are
> >> >> > available for guest use anyway. Perf counters differ greatly from CPU
> >> >> > to CPU (even between two CPUs of the same manufacturer), and we want to
> >> >> > keep using them for profiling guests. And I don't see what problem it
> >> >> > will solve anyway that can be solved by a simple delay between irq
> >> >> > reinjections.
> >> >>
> >> >> This would allow counting the executed instructions and limit it. Thus
> >> >> we could emulate a 500MHz CPU on a 2GHz CPU more accurately.
> >> >>
> >> > Why would you want to limit the number of instructions executed by a guest if
> >> > the CPU has nothing else to do anyway? The problem occurs not when we have
> >> > spare cycles to give to a guest, but in the opposite case.
> >>
> >> I think one problem is that the guest has executed too much compared
> >> to what would happen with real HW with a lesser CPU. That explains the
> >> RTC frequency reprogramming case.
> > You think wrong. The problem is exactly the opposite: the guest hasn't
> > had enough execution time between two timer interrupts. I don't know what
> > RTC frequency reprogramming case you are talking about here.
> 
> The case you told me where N pending tick IRQs exist but the guest
> wants to change the RTC frequency from 64Hz to 1024Hz.
> 
> Let's make this more concrete. 1 GHz CPU, initially 100Hz RTC, so
> 10Mcycles/tick or 10ms/tick. At T = 30Mcycles, guest wants to change
> the frequency to 1000Hz.
> 
> The problem for emulation is that for the same 3 ticks, there has been
> so little execution power that the ticks have been coalesced. But
> isn't the guest cycle count then much lower than 30Mcyc?
> 
> Isn't it so that the guest must be above 30Mcyc to be able to want the
> change? But if we reach that point, the problem must not have been
> too little execution time, but too much.
> 
Sorry, I tried hard to understand what you have said above but failed.
What do you mean by "to be able to want the change"? The guest sometimes
wants to get 64 timer interrupts per second and sometimes it wants to get
1024 timer interrupts per second. It wants this not as a result of time
drift or anything. It's just how the guest behaves. You seem to be too
fixated on the guest frequency change. It's just something you have to
take into account when you reinject interrupts.
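
To make "take into account" concrete, here is a minimal sketch of the bookkeeping a timer device might do. It is not QEMU's actual RTC code: the class and method names are hypothetical, and the policy on a frequency change (rescale the backlog so the amount of owed guest time stays constant) is an assumption chosen for illustration.

```python
class PeriodicTimerModel:
    """Toy model of a periodic timer that tracks coalesced (lost) ticks."""

    def __init__(self, freq_hz):
        self.freq_hz = freq_hz
        self.coalesced = 0  # ticks the guest is still owed

    def tick(self, delivered):
        """Called once per period; `delivered` is the delivery feedback."""
        if not delivered:
            self.coalesced += 1

    def set_frequency(self, new_freq_hz):
        """Guest reprograms the rate (for example 64 Hz -> 1024 Hz).

        Assumed policy: the backlog represents owed time, so convert it
        from old-rate ticks to new-rate ticks.
        """
        self.coalesced = self.coalesced * new_freq_hz // self.freq_hz
        self.freq_hz = new_freq_hz

    def reinject_one(self):
        """Return True if a backlog tick was consumed for reinjection."""
        if self.coalesced > 0:
            self.coalesced -= 1
            return True
        return False


timer = PeriodicTimerModel(64)
for _ in range(3):
    timer.tick(delivered=False)   # three ticks coalesced at 64 Hz
timer.set_frequency(1024)         # backlog is now expressed in 1024 Hz ticks
print(timer.coalesced)
```

With three ticks lost at 64 Hz, the same owed time corresponds to 48 ticks at 1024 Hz. Only the device that sees the frequency write can do this conversion at the right moment.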


> >
> >>
> >> >
> >> >> >>
> >> >> >> >>
> >> >> >> >> > And even if the rate did not matter, the APIC would still have to know
> >> >> >> >> > about the fact that an IRQ is really periodic and does not only appear
> >> >> >> >> > as such for a certain interval. This really does not sound like
> >> >> >> >> > simplifying things or even making them cleaner.
> >> >> >> >>
> >> >> >> >> It would, the voodoo would be contained only in APIC, RTC would be
> >> >> >> >> just like any other device. With the bidirectional irqs, this voodoo
> >> >> >> >> would probably eventually spread to many other devices. The logical
> >> >> >> >> conclusion of that would be a system where all devices would be
> >> >> >> >> careful not to disturb the guest at the wrong moment because that would
> >> >> >> >> trigger a bug.
> >> >> >> >>
> >> >> >> > This voodoo will be so complex and unreliable that it will make RTC hack
> >> >> >> > pale in comparison (and I still don't see how you are going to make it
> >> >> >> > actually work).
> >> >> >>
> >> >> >> Implement everything inside APIC: only coalescing and reinjection.
> >> >> > APIC has zero info needed to implement reinjection correctly as was
> >> >> > shown to you several times in this thread and you simply keep ignoring
> >> >> > it.
> >> >>
> >> >> On the contrary, APIC is actually the only source of the IRQ ack
> >> >> information. RTC hack would not work without APIC (or the
> >> >> bidirectional IRQ) passing this info to RTC.
> >> >>
> >> >> What APIC doesn't have now is the timer frequency or period info. This
> >> >> is known by RTC and also higher levels managing the clocks.
> >> >>
> >> > So APIC has one bit of information and RTC everything else.
> >>
> >> The information known by RTC (timer period) is also known by higher levels.
> >>
> > What do you mean by higher level here? vl.c or APIC?
> 
> vl.c, qemu-timer.c.
> 
> >> > The current
> >> > approach (and proposed patch) brings this one bit of information to RTC,
> >> > you are arguing that RTC should be able to communicate all its info to
> >> > APIC. Sorry I don't see that your way has any advantage. Just more
> >> > complex interface and it is much easier to get it wrong for other time
> >> > sources.
> >>
> >> I don't think anymore that APIC should be handling this but the
> >> generic stuff, like vl.c or exec.c. Then there would be only
> >> information passing from APIC to higher levels.
> > Handling reinjection by general timer code makes infinitely more sense
> > than handling it in APIC.
> 
> I'm glad you agree, or did you mean 'less'?
> 
Compared to APIC I would agree that even putting it in IDE is a better idea :)

> > One thing (off the top of my head) that can't
> > be implemented at that level is injection of interrupts back to back (i.e.
> > injecting the next interrupt immediately after the guest acknowledges the
> > previous one to RTC).
> 
> But Jan said this confuses some buggy OSes.
> 
You keep calling them buggy, but I don't agree. They are written with
certain assumptions that are true on real HW, but hard to achieve on
virtual. Anyway, we use this technique (back to back reinjection) because
otherwise you can't solve the drift problem if the guest wants to receive
timer interrupts at the max frequency that the host time source supports.
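
A minimal sketch of what back to back reinjection means, under stated assumptions: the class and handler names are hypothetical and this is a toy model, not the QEMU implementation. The point is that the catch-up interrupt is raised from the acknowledge path, so it does not need the host timer to fire any faster than it already does.

```python
class BackToBackTimer:
    """Toy periodic timer that reinjects lost ticks on guest acknowledge."""

    def __init__(self):
        self.irq_pending = False  # an interrupt is raised and not yet acked
        self.coalesced = 0        # ticks lost because one was still pending
        self.delivered = 0        # interrupts actually raised to the guest

    def _raise_irq(self):
        self.irq_pending = True
        self.delivered += 1

    def on_period_elapsed(self):
        """Host timer fired for one guest timer period."""
        if self.irq_pending:
            # Guest had no time to handle the previous tick: remember it.
            self.coalesced += 1
        else:
            self._raise_irq()

    def on_guest_ack(self):
        """Guest acknowledged the interrupt at the timer device."""
        self.irq_pending = False
        if self.coalesced > 0:
            # Back to back: raise the next one right away.
            self.coalesced -= 1
            self._raise_irq()


timer = BackToBackTimer()
for _ in range(3):
    timer.on_period_elapsed()     # guest is starved for three periods
for _ in range(3):
    timer.on_guest_ack()          # guest finally runs and acks
print(timer.delivered, timer.coalesced)
```

After three starved periods and three acknowledges, all three ticks have been delivered and the backlog is empty, even though no extra host timer events were scheduled.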


> >
> >>
> >> >> I keep ignoring the idea that the current model, where both RTC and
> >> >> APIC must somehow work together to make coalescing work, is the only
> >> >> possible just because it is committed and it happens to work in some
> >> >> cases. It would be much better to concentrate this to one place, APIC
> >> >> or preferably higher level where it may benefit other timers too.
> >> >> Provided of course that the other models can be made to work.
> >> >>
> >> > So write the code and show us. You haven't shown any evidence that RTC is
> >> > the wrong place. RTC knows when the interrupt was acknowledged to RTC, it
> >> > knows when the clock frequency changes, it knows when device reset happened.
> >> > APIC knows only that the interrupt was coalesced. It doesn't even know that
> >> > it may be masked by a guest in IOAPIC (interrupts delivered while they
> >> > are masked are not considered coalesced).
> >>
> >> Oh, I thought interrupt masking was the reason for coalescing! What
> >> exactly is the reason then?
> >>
> > The reason is that guest has no time to process previous interrupt
> > before it is time to inject next one.
> 
> Because of other host load or other emulation done by the same QEMU
> process, I suppose?
Yes, both.

> 
> >> > Time source knows only when
> >> > frequency changes and maybe when device reset happens if timer is
> >> > stopped by device on reset. So RTC is actually a sweet spot if you want
> >> > to minimize amount of info you need to pass between various layers.
> >> >
> >> >> >> Maybe that version would not bend backwards as much as the current to
> >> >> >> cater for buggy hosts.
> >> >> >>
> >> >> > You mean "buggy guests"?
> >> >>
> >> >> Yes, sorry.
> >> >>
> >> >> > What guests are not buggy in your opinion?
> >> >> > Linux tries hard to be smart and as a result the only way to have a stable
> >> >> > clock with it is to go paravirt.
> >> >>
> >> >> I'm not an OS designer, but I think an OS should never crash, even if
> >> >> a burst of IRQs is received. Reprogramming the timer should consider
> >> >> the pending IRQ situation (0 or 1 with real HW). Those bugs are one
> >> >> cause of the problem.
> >> > OS should never crash in the absence of HW bugs? I doubt you can design
> >> > an OS that can run in the face of any HW failure. Anyway, here we are
> >> > trying to solve the guest's time keeping problem, not crashes. Do you think
> >> > you can design an OS that can keep time accurately no matter how crazily all
> >> > the HW clocks behave?
> >>
> >> I think my OS design skills are not relevant in this discussion, but
> >> IIRC there are fault tolerant operating systems for extreme conditions
> >> so it can be done.
> >>
> >> >
> >> >>
> >> >> >> > The fact is that timer device is not "just like any
> >> >> >> > other device" in virtual world. Any other device is easy: you just
> >> >> >> > implement spec as close as possible and everything works. For time
> >> >> >> > source device this is not enough. You can implement RTC+HPET to the
> >> >> >> > letter and your guest will drift like crazy.
> >> >> >>
> >> >> >> It's doable: a cycle accurate emulator will not cause any drift,
> >> >> >> without any voodoo. The interrupts would come after executing the same
> >> >> >> instruction as the real HW. For emulating any sufficiently buggy
> >> >> >> guests in any sufficiently desperate low resource conditions, this may
> >> >> >> be the only option that will always work.
> >> >> >>
> >> >> > Yes, but qemu and kvm are not cycle accurate emulators and don't strive
> >> >> > to be one. On the contrary, KVM runs at native host CPU speed most of the
> >> >> > time, so any emulation done between two instructions is theoretically
> >> >> > noticeable for a guest. TSC is passed through directly to a guest too, so
> >> >> > keeping all time sources in perfect sync is also impossible.
> >> >>
> >> >> That is actually another cause of the problem. KVM gives the guest an
> >> >> illusion that the VCPU speed is equal to host speed. When they don't
> >> >> match, especially in critical code, there can be problems. It would be
> >> >> better to tell the guest a lower speed, which also can be guaranteed.
> >> >>
> >> > Not possible. It's that simple. You should take it into account in your
> >> > architecture design stage. In the case of KVM a real physical CPU executes
> >> > guest instructions and it does this as fast as it can. The only way we can hide
> >> > that from a guest is by intercepting each access to TSC and at that
> >> > point we can use bochs instead.
> >>
> >> Well, as Paul pointed out, there's also icount option.
> >>
> > icount is not an option for KVM.
> 
> I think icount timer adjustment model might make sense for this work
> too. We'd then just need some figure of executed CPU instructions, TSC
> cycles or even kernel scheduler time slice information (how much time
> the process got).
>
And then? icount makes guest time flow dependent on the number of emulated
instructions. It relies on the fact that all time sources are
synchronized for a guest during emulation (including TSC). This is not
true for virtualization.
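
For illustration, a toy model of that dependency. It assumes the documented -icount N semantics (one instruction per 2^N ns of virtual time); the function names are hypothetical and nothing here is taken from the QEMU sources.

```python
def icount_virtual_ns(executed_instructions, shift):
    """Virtual time elapsed, at 2**shift ns per emulated instruction."""
    return executed_instructions << shift


def timer_ticks_due(executed_instructions, shift, period_ns):
    """Periodic timer ticks owed to the guest at this instruction count."""
    return icount_virtual_ns(executed_instructions, shift) // period_ns


# Under emulation every guest-visible clock is derived from the same
# instruction counter, so they cannot disagree with each other.
instructions = 1_000_000
print(icount_virtual_ns(instructions, 3))           # virtual nanoseconds
print(timer_ticks_due(instructions, 3, 1_000_000))  # ticks of a 1 kHz timer
```

With virtualization the guest reads the TSC from the hardware, which advances with host time regardless of how many guest instructions ran, so a single counter like this no longer describes every clock the guest can see.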
 
--
                        Gleb.


