Re: [PATCH] e1000: Fix the unexpected assumption that the receive buffer

On Wed, Jul 10, 2024 at 3:36 PM Jason Wang <jasowang@redhat.com> wrote:

On Wed, Jul 10, 2024 at 2:26 PM Yong Huang <yong.huang@smartx.com> wrote:
>
>
>
> On Wed, Jul 10, 2024 at 11:44 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On Tue, Jul 9, 2024 at 10:56 AM Yong Huang <yong.huang@smartx.com> wrote:
>> >
>> >
>> >
>> > On Tue, Jul 9, 2024 at 10:41 AM Jason Wang <jasowang@redhat.com> wrote:
>> >>
>> >> On Mon, Jul 8, 2024 at 1:17 PM Yong Huang <yong.huang@smartx.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Mon, Jul 8, 2024 at 11:21 AM Jason Wang <jasowang@redhat.com> wrote:
>> >> >>
>> >> >> On Sat, Jul 6, 2024 at 4:30 AM Hyman Huang <yong.huang@smartx.com> wrote:
>> >> >> >
>> >> >> > Unexpected work by certain Windows guests equipped with the e1000
>> >> >> > interface can cause the network to go down and never come back up
>> >> >> > again unless the guest's interface is reset.
>> >> >> >
>> >> >> > To reproduce the failure:
>> >> >> > 1. Set up two guests with a Windows 2016 or 2019 server operating
>> >> >> > system.
>> >> >>
>> >> >> I vaguely remember e1000 support for Windows has been deprecated for
>> >> >> several years...
>> >> >>
>> >> >> That's why e1000e or igb is implemented in Qemu.
>> >> >>
>> >> >> > 2. Set up the e1000 interface for the guests.
>> >> >> > 3. Pressurize the network slightly between two guests using the iPerf tool.
>> >> >> >
>> >> >> > The network goes down after a few days (2-5days), and the issue
>> >> >> > is the result of not adhering to the e1000 specification. Refer
>> >> >> > to the details of the specification at the following link:
>> >> >> > https://www.intel.com/content/dam/doc/manual/pci-pci-x-family-gbe-controllers-software-dev-manual.pdf
>> >> >> >
>> >> >> > Chapter 3.2.6 describe the Receive Descriptor Tail register(RDT)
>> >> >> > as following:
>> >> >> > This register holds a value that is an offset from the base, and
>> >> >> > identifies the location beyond the last descriptor hardware can
>> >> >> > process. Note that tail should still point to an area in the
>> >> >> > descriptor ring (somewhere between RDBA and RDBA + RDLEN).
>> >> >> > This is because tail points to the location where software writes
>> >> >> > the first new descriptor.
>> >> >> >
>> >> >> > This means that if the provider—in this case, QEMU—has not yet
>> >> >> > loaded the packet,
>> >> >>
>> >> >> What do you mean by "load" here?
>> >> >
>> >> >
>> >> > Sorry for failing to describe the details.
>> >> >
>> >> > The guest driver retrieves the packet from the receive ring buffer
>> >> > after QEMU forwards it from the tun/tap interface in the e1000
>> >> > emulation.
>> >> >
>> >> > I used "load" to express "putting packets into the receive ring buffer."
>> >> >
>> >> >>
>> >> >>
>> >> >> > RDT should never point to that place.
>> >> >>
>> >> >> And "that place"?
>> >> >
>> >> > If a descriptor in the receive ring buffer has not been filled with a
>> >> > packet address by QEMU, the descriptor therefore doesn't have any
>> >> > available packets. The location of the descriptor should not be referred
>> >> > to by RDT because the location is in the range that "hardware" handles.
>> >> >
>> >> > "that place" means the location of the descriptor in the ring buffer
>> >> > that QEMU hasn't set any available packets related to.
>> >> >
>> >> >>
>> >> >>
>> >> >> > When
>> >> >> > implementing the emulation of the e1000 interface, QEMU evaluates
>> >> >> > if the receive ring buffer is full once the RDT equals the RDH,
>> >> >> > based on the assumption that guest drivers adhere to this
>> >> >> > criterion strictly.
>> >> >> >
>> >> >> > We applied the following log patch to assist in analyzing the
>> >> >> > issue and eventually obtained the unexpected information.
>> >> >> >
>> >> >> > Log patch:
>> >> >> > -----------------------------------------------------------------
>> >> >> > |--- a/hw/net/e1000.c
>> >> >> > |+++ b/hw/net/e1000.c
>> >> >> > |@@ -836,6 +836,9 @@ e1000_set_link_status(NetClientState *nc)
>> >> >> > | static bool e1000_has_rxbufs(E1000State *s, size_t total_size)
>> >> >> > | {
>> >> >> > | int bufs;
>> >> >> > |+ DBGOUT(RX, "rxbuf_size = %u, s->mac_reg[RDLEN] = %u, s->mac_reg[RDH] = %u, s->mac_reg[RDT] = %u\n",
>> >> >> > |+ s->rxbuf_size, s->mac_reg[RDLEN], s->mac_reg[RDH], s->mac_reg[RDT]);
>> >> >> > |+
>> >> >> > | /* Fast-path short packets */
>> >> >> > | if (total_size <= s->rxbuf_size) {
>> >> >> > | if (s->mac_reg[RDH] == s->mac_reg[RDT] && s->last_overrun)
>> >> >> > |@@ -1022,6 +1025,9 @@ e1000_receive_iov(NetClientState *nc, const struct iovec *iov, int iovcnt)
>> >> >> > | s->rxbuf_min_shift)
>> >> >> > | n |= E1000_ICS_RXDMT0;
>> >> >> > |
>> >> >> > |+ DBGOUT(RX, "rxbuf_size = %u, s->mac_reg[RDLEN] = %u, s->mac_reg[RDH] = %u, s->mac_reg[RDT] = %u\n",
>> >> >> > |+ s->rxbuf_size, s->mac_reg[RDLEN], s->mac_reg[RDH], s->mac_reg[RDT]);
>> >> >> > |+
>> >> >> > -----------------------------------------------------------------
>> >> >> >
>> >> >> > The last few logs of information when the network is down:
>> >> >> >
>> >> >> > e1000: total_size = 1, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, s->mac_reg[RDH] = 897, s->mac_reg[RDT] = 885
>> >> >> > <- the receive ring buffer is checked for fullness in the
>> >> >> > e1000_has_rxbufs function, not full.
>> >> >> >
>> >> >> > e1000: total_size = 64, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, s->mac_reg[RDH] = 898, s->mac_reg[RDT] = 885
>> >> >> > <- RDT stays the same, RDH updates to 898, and 1 descriptor
>> >> >> > utilized after putting the packet to ring buffer.
>> >> >> >
>> >> >> > e1000: total_size = 1, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, s->mac_reg[RDH] = 898, s->mac_reg[RDT] = 885
>> >> >> > <- the receive ring buffer is checked for fullness in the
>> >> >> > e1000_has_rxbufs function, not full.
>> >> >> >
>> >> >> > e1000: total_size = 64, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, s->mac_reg[RDH] = 899, s->mac_reg[RDT] = 885
>> >> >> > <- RDT stays the same, RDH updates to 899, and 1 descriptor
>> >> >> > utilized after putting the packet to ring buffer.
>> >> >> >
>> >> >> > e1000: total_size = 1, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, s->mac_reg[RDH] = 899, s->mac_reg[RDT] = 885
>> >> >> > <- the receive ring buffer is checked for fullness in the
>> >> >> > e1000_has_rxbufs function, not full.
>> >> >> >
>> >> >> > e1000: total_size = 64, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, s->mac_reg[RDH] = 900, s->mac_reg[RDT] = 885
>> >> >> > <- RDT stays the same, RDH updates to 900 , and 1 descriptor
>> >> >> > utilized after putting the packet to ring buffer.
>> >> >> >
>> >> >> > e1000: total_size = 1, rxbuf_size = 2048, s->mac_reg[RDLEN] = 16384, s->mac_reg[RDH] = 900, s->mac_reg[RDT] = 900
>> >> >> > <- The ring is full, according to e1000_has_rxbufs, because
>> >> >> > of the RDT update to 900 and equals RDH !
>> >> >>
>> >> >> Just to make sure I understand this, RDT==RDH means the ring is empty I think?
>> >> >>
>> >> >>
>> >> >> See commit:
>> >> >>
>> >> >> commit e5b8b0d4ba29fe1268ba049519a1b0cf8552a21a
>> >> >> Author: Dmitry Fleytman <dmitry@daynix.com>
>> >> >> Date: Fri Oct 19 07:56:55 2012 +0200
>> >> >>
>> >> >> e1000: drop check_rxov, always treat RX ring with RDH == RDT as empty
>> >> >>
>> >> >> Real HW always treats RX ring with RDH == RDT as empty.
>> >> >> Emulation is supposed to behave the same.
>> >> >
>> >> >
>> >> > Indeed, I'm confused :(, the description in the comment claims that RX
>> >> > rings with RDH == RDT as empty, but in implementation, it treats that as
>> >> > overrun.
>> >> >
>> >> > See the following 2 contexts:
>> >> >
>> >> > 1. e1000_can_receive:
>> >> > static bool e1000_can_receive(NetClientState *nc)
>> >> > {
>> >> > E1000State *s = qemu_get_nic_opaque(nc);
>> >> > // e1000_has_rxbufs return true means ring buffer has
>> >> > // available descriptors to use for QEMU.
>> >> > // false means ring buffer overrun and QEMU should queue the packet
>> >> > // and wait for the RDT update and available descriptors can be used.
>> >> >
>> >> > return e1000x_rx_ready(&s->parent_obj, s->mac_reg) &&
>> >> > e1000_has_rxbufs(s, 1) && !timer_pending(s->flush_queue_timer);
>> >> > }
>> >>
>> >> Well we had in e1000_has_rx_bufs
>> >>
>> >> if (total_size <= s->rxbuf_size) {
>> >> return s->mac_reg[RDH] != s->mac_reg[RDT];
>> >> }
>> >>
>> >> RDT!=RDH means RX ring has available descriptors for hardware?
>> >
>> >
>> > IMHO, Yes.
>>
>> Just to make sure we are on the same page, so
>>
>> RDT!=RDH, descriptors available for hardware
>> RDT==RDH, descriptor ring is empty for hardware
>>
>>
>> That is currently what the code did. Seems nothing wrong, or anything
>> I missed here?
>
>
> There are two cases for RDT == RDH.
>
> 1. Hardware has filled all available descriptors and overrun.
> In this case, hardware cannot add any new packets to the ring.
>
> 2. Software has consumed all descriptors, and all the descriptors
> on the ring can be used by hardware. (Let's name this case "empty.")
> In this case, hardware should keep putting new packets to the ring

Well this seems not what spec said. See Figure 3-2, when RDT==RDH,
nothing is owned by hardware. And this is what Dmitry said in the
commit mentioned above.

Yes, this is the main cause of network interruptions. Hardware should

never touch the location of the descriptor that RDT points to. This is

what the specification declares. IMHO, it also implies that the descriptor

referred to by the RDT has available packets.

In our case, hardware has not added any new packets to the descriptor,

and the expected behavior is that software never updates the RDT to

the location of that descriptor. But now, it does.

If hardware and software both work as expected under the e1000

specification, the issue does not exist. I have no objection that the root

cause is the Windows driver bug in e1000, and I'm not insisting that

QEMU should take the responsibility for fixing that.

Which version of the driver did you use in the guest? (Or have you
tried to download the one from Intel website) I'm asking since e1000
support has been deprecated by Microsoft for years.

I use Windows Server 2019 and the driver version is 8.4.13.0.

Since Microsoft no longer supports e1000, as you already mentioned,

this patch merely offers a workaround. Since some users still use

e1000 in production environments, it doesn't seem to have any side

effects.

It would be really appreciated if the patch was given some thought.

Thanks,

Yong

Thanks

>
> But at the moment, the logic of e1000_has_rx_bufs acts exactly like it was
> the first case, unable to differentiate between the two scenarios.
>
>>
>>
>> Thanks
>>
>> >
>> >>
>> >> Adding more people.
>> >>
>> >> Thanks
>> >>
>> >
>> >
>> > --
>> > Best regards
>>
>
> Yong
>
> --
> Best regards

Best regards

From:	Yong Huang
Subject:	Re: [PATCH] e1000: Fix the unexpected assumption that the receive buffer is full
Date:	Wed, 10 Jul 2024 17:04:37 +0800