qemu-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migrat


From: Sameeh Jubran
Subject: Re: [Qemu-devel] [PATCH] e1000e: Fix a bug where guest hangs upon migration
Date: Mon, 22 May 2017 14:05:50 +0300

On Mon, May 22, 2017 at 5:46 AM, Jason Wang <address@hidden> wrote:

>
>
> On 2017年05月19日 22:04, Sameeh Jubran wrote:
>
>> On Fri, May 19, 2017 at 9:25 AM, Jason Wang <address@hidden> wrote:
>>
>>
>>> On 2017年05月17日 19:46, Sameeh Jubran wrote:
>>>
>>> The bug was caused by the "receive overrun" (bit #6 of the ICR register)
>>>> interrupt
>>>> which would be triggered post migration in a heavy traffic environment.
>>>> Even though the
>>>> "receive overrun" bit (#6) is masked out by the IMS register (refer to
>>>> the log below)
>>>> the driver still receives an interrupt as the "receive overrun" bit (#6)
>>>> causes the
>>>> "Other" - bit #24 of the ICR register - bit to be set as documented
>>>> below. The driver
>>>> handles the interrupt and clears the "Other" bit (#24) but doesn't clear
>>>> the
>>>> "receive overrun" bit (#6) which leads to an infinite loop. Apparently
>>>> the Windows
>>>> driver expects that the "receive overrun" bit and other ones -
>>>> documented
>>>> below - to be
>>>> cleared when the "Other" bit (#24) is cleared.
>>>>
>>>> So to sum that up:
>>>> 1. Bit #6 of the ICR register is set by heavy traffic
>>>> 2. As a results of setting bit #6, bit #24 is set
>>>> 3. The driver receives an interrupt for bit 24 (it doesn't receieve an
>>>> interrupt for bit #6 as it is masked out by IMS)
>>>> 4. The driver handles and clears the interrupt of bit #24
>>>> 5. Bit #6 is still set.
>>>> 6. 2 happens all over again
>>>>
>>>> The Interrupt Cause Read - ICR register:
>>>>
>>>> The ICR has the "Other" bit - bit #24 - that is set when one or more of
>>>> the following
>>>> ICR register's bits are set:
>>>>
>>>> LSC - bit #2, RXO - bit #6, MDAC - bit #9, SRPD - bit #16, ACK - bit
>>>> #17,
>>>> MNG - bit #18
>>>>
>>>> Log sample of the storm:
>>>>
>>>> address@hidden:e1000e_irq_pending_interrupts ICR PENDING:
>>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>>> address@hidden:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> address@hidden:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> address@hidden:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> address@hidden:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> address@hidden:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> address@hidden:e1000e_irq_pending_interrupts ICR PENDING: 0x0
>>>> (ICR: 0x815000c2, IMS: 0xa00004)
>>>> address@hidden:e1000e_irq_pending_interrupts ICR PENDING:
>>>> 0x1000000 (ICR: 0x815000c2, IMS: 0x1a00004)
>>>>
>>>> This commit solves:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1447935
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1449490
>>>>
>>>> Signed-off-by: Sameeh Jubran <address@hidden>
>>>> ---
>>>>    hw/net/e1000e_core.c | 7 +++++--
>>>>    1 file changed, 5 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/hw/net/e1000e_core.c b/hw/net/e1000e_core.c
>>>> index 28c5be1..8174b53 100644
>>>> --- a/hw/net/e1000e_core.c
>>>> +++ b/hw/net/e1000e_core.c
>>>> @@ -2454,14 +2454,17 @@ e1000e_set_ics(E1000ECore *core, int index,
>>>> uint32_t val)
>>>>    static void
>>>>    e1000e_set_icr(E1000ECore *core, int index, uint32_t val)
>>>>    {
>>>> +    uint32_t icr = 0;
>>>>        if ((core->mac[ICR] & E1000_ICR_ASSERTED) &&
>>>>            (core->mac[CTRL_EXT] & E1000_CTRL_EXT_IAME)) {
>>>>            trace_e1000e_irq_icr_process_iame();
>>>>            e1000e_clear_ims_bits(core, core->mac[IAM]);
>>>>        }
>>>>    -    trace_e1000e_irq_icr_write(val, core->mac[ICR], core->mac[ICR]
>>>> &
>>>> ~val);
>>>> -    core->mac[ICR] &= ~val;
>>>> +    icr = core->mac[ICR] & ~val;
>>>> +    icr = (val & E1000_ICR_OTHER) ? (icr & ~E1000_ICR_OTHER_CAUSES) :
>>>> icr;
>>>> +    trace_e1000e_irq_icr_write(val, core->mac[ICR], icr);
>>>> +    core->mac[ICR] = icr;
>>>>        e1000e_update_interrupt_state(core);
>>>>    }
>>>>
>>>>
>>>> Thanks for the patch.
>>>
>>> So this is an undocumented behavior, we must be careful on this. Several
>>> question below:
>>>
>>> - have you verified this on real hardware?
>>>
>>> No I haven't
>>
>
> Any chance to verify the behavior on real hardware?

This can be very challenging to reproduce on a real hardware as the Windows
source code isn't available and the bug doesn't
reproduce with the Linux driver; Moreover, the Linux driver configures the
device differently than the Windows driver.

>
>
>
>> - is MSIX enabled in this case?
>>>
>>> Yes it is, I have tested the patch with msi disabled too.
>>
>
> Does the problem exist in no msi case. If not, is this better to hack EIAC
> behavior?

Even though it doesn't reproduce in no msi case, this certainly can occur
as the ICR register is present in both msi and
no msi cases while the EIAC register is for MSI only.

>
>> - according to the steps you've summed up above, it's not specific to
>>> migration?
>>>
>>> True
>>
>
> Then we probably need tweak the title.
>
No problem, I'll send a v2 soon.

> Thanks.
>
>
>> Thanks
>>>
>>>
>>
>>
>


-- 
Respectfully,
*Sameeh Jubran*
*Linkedin <https://il.linkedin.com/pub/sameeh-jubran/87/747/a8a>*
*Software Engineer @ Daynix <http://www.daynix.com>.*


reply via email to

[Prev in Thread] Current Thread [Next in Thread]