[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PULL 02/10] pci-bridge/cxl_downstream: Add a CXL switch downstream
From: |
Jonathan Cameron |
Subject: |
Re: [PULL 02/10] pci-bridge/cxl_downstream: Add a CXL switch downstream port |
Date: |
Wed, 7 Dec 2022 13:21:19 +0000 |
On Mon, 05 Dec 2022 14:59:39 +0000
Alex Bennée <alex.bennee@linaro.org> wrote:
> Jonathan Cameron via <qemu-devel@nongnu.org> writes:
>
> > On Mon, 5 Dec 2022 10:54:03 +0000
> > Jonathan Cameron via <qemu-devel@nongnu.org> wrote:
> >
> >> On Sun, 4 Dec 2022 08:23:55 +0100
> >> Thomas Huth <thuth@redhat.com> wrote:
> >>
> >> > On 04/11/2022 07.47, Thomas Huth wrote:
> >> > > On 16/06/2022 18.57, Michael S. Tsirkin wrote:
> >> > >> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> > >>
> >> > >> Emulation of a simple CXL Switch downstream port.
> >> > >> The Device ID has been allocated for this use.
> >> > >>
> >> > >> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> > >> Message-Id: <20220616145126.8002-3-Jonathan.Cameron@huawei.com>
> >> > >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> > >> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
> >> > >> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> > >> ---
> >> > >> hw/cxl/cxl-host.c | 43 +++++-
> >> > >> hw/pci-bridge/cxl_downstream.c | 249
> >> > >> +++++++++++++++++++++++++++++++++
> >> > >> hw/pci-bridge/meson.build | 2 +-
> >> > >> 3 files changed, 291 insertions(+), 3 deletions(-)
> >> > >> create mode 100644 hw/pci-bridge/cxl_downstream.c
> >> > >
> >> > > Hi!
> >> > >
> >> > > There is a memory problem somewhere in this new device. I can make
> >> > > QEMU
> >> > > crash by running something like this:
> >> > >
> >> > > $ MALLOC_PERTURB_=59 ./qemu-system-x86_64 -M x-remote \
> >> > > -display none -monitor stdio
> >> > > QEMU 7.1.50 monitor - type 'help' for more information
> >> > > (qemu) device_add cxl-downstream
> >> > > ./qemu/qom/object.c:1188:5: runtime error: member access within
> >> > > misaligned
> >> > > address 0x3b3b3b3b3b3b3b3b for type 'struct Object', which requires 8
> >> > > byte
> >> > > alignment
> >> > > 0x3b3b3b3b3b3b3b3b: note: pointer points here
> >> > > <memory cannot be printed>
> >> > > Bus error (core dumped)
> >> > >
> >> > > Could you have a look if you've got some spare minutes?
> >> >
> >> > Ping! Jonathan, Michael, any news on this bug?
> >> >
> >> > (this breaks one of my local tests, that's why it's annoying for me)
> >> Sorry, my email filters ate your earlier message.
> >>
> >> Looking into this now. I'll note that it also happens on
> >> device_add xio3130-downstream so not specific to this new device.
> >>
> >> So far all I've managed to do is track it to something rcu related
> >> as failing in a call to drain_call_rcu() in qmp_device_add()
> >>
> >> Will continue digging.
> >
> > Assuming I'm seeing the same thing...
> >
> > Problem is g_free() on the PCIBridge windows:
> > https://elixir.bootlin.com/qemu/latest/source/hw/pci/pci_bridge.c#L235
> >
> > Is called before we get an rcu_call() to flatview_destroy() as a
> > result of the final call of flatview_unref() in address_space_set_flatview()
> > so we get a use after free.
> >
> > As to what the fix is... Suggestions welcome!
>
> It sounds like this is the wrong place to free the value then. I guess
> the PCI aliases into &w->alias_io() don't get dealt with until RCU
> clean-up time.
>
> I *think* using g_free_rcu() should be enough to ensure the free occurs
> after the rest of the RCU cleanups but maybe you should only be cleaning
> up the windows at device unrealize time? Is this a dynamic piece of
> memory which gets updated during the lifetime of the device?
There is unfortunately code that swaps it for an updated structure
in pci_bridge_update_mappings()
>
> If the memory is being cleared with RCU then the access to the base
> pointer should be done with the appropriate qatomic_rcu_[set|read]
> functions.
>
I'm annoyingly snowed under this week with other things, but hopefully
can get to in a few days. Note we are looking at an old problem
here, just one that's happening for an additional device, not sure
if that really affects urgency of fix though.
Jonathan