qemu-devel

Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide


From: Michael S. Tsirkin
Subject: Re: [Qemu-devel] [PULL 14/28] exec: make address spaces 64-bit wide
Date: Tue, 14 Jan 2014 13:50:49 +0200

On Tue, Jan 14, 2014 at 12:24:24PM +0200, Avi Kivity wrote:
> On 01/14/2014 12:48 AM, Alex Williamson wrote:
> >On Mon, 2014-01-13 at 22:48 +0100, Alexander Graf wrote:
> >>>Am 13.01.2014 um 22:39 schrieb Alex Williamson <address@hidden>:
> >>>
> >>>>On Sun, 2014-01-12 at 16:03 +0100, Alexander Graf wrote:
> >>>>>On 12.01.2014, at 08:54, Michael S. Tsirkin <address@hidden> wrote:
> >>>>>
> >>>>>>On Fri, Jan 10, 2014 at 08:31:36AM -0700, Alex Williamson wrote:
> >>>>>>>On Fri, 2014-01-10 at 14:55 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>On Thu, Jan 09, 2014 at 03:42:22PM -0700, Alex Williamson wrote:
> >>>>>>>>>On Thu, 2014-01-09 at 23:56 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>On Thu, Jan 09, 2014 at 12:03:26PM -0700, Alex Williamson wrote:
> >>>>>>>>>>>On Thu, 2014-01-09 at 11:47 -0700, Alex Williamson wrote:
> >>>>>>>>>>>>On Thu, 2014-01-09 at 20:00 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>On Thu, Jan 09, 2014 at 10:24:47AM -0700, Alex Williamson wrote:
> >>>>>>>>>>>>>>On Wed, 2013-12-11 at 20:30 +0200, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>From: Paolo Bonzini <address@hidden>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>As an alternative to commit 818f86b (exec: limit system memory
> >>>>>>>>>>>>>>size, 2013-11-04) let's just make all address spaces 64-bit wide.
> >>>>>>>>>>>>>>This eliminates problems with phys_page_find ignoring bits above
> >>>>>>>>>>>>>>TARGET_PHYS_ADDR_SPACE_BITS and address_space_translate_internal
> >>>>>>>>>>>>>>consequently messing up the computations.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>In Luiz's reported crash, at startup gdb attempts to read from address
> >>>>>>>>>>>>>>0xffffffffffffffe6 to 0xffffffffffffffff inclusive.  The region it gets
> >>>>>>>>>>>>>>is the newly introduced master abort region, which is as big as the PCI
> >>>>>>>>>>>>>>address space (see pci_bus_init).  Due to a typo that's only 2^63-1,
> >>>>>>>>>>>>>>not 2^64.  But we get it anyway because phys_page_find ignores the upper
> >>>>>>>>>>>>>>bits of the physical address.  In address_space_translate_internal then
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>   diff = int128_sub(section->mr->size, int128_make64(addr));
> >>>>>>>>>>>>>>   *plen = int128_get64(int128_min(diff, int128_make64(*plen)));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>diff becomes negative, and int128_get64 booms.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>The size of the PCI address space region should be fixed anyway.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>Reported-by: Luiz Capitulino <address@hidden>
> >>>>>>>>>>>>>>Signed-off-by: Paolo Bonzini <address@hidden>
> >>>>>>>>>>>>>>Signed-off-by: Michael S. Tsirkin <address@hidden>
> >>>>>>>>>>>>>>---
> >>>>>>>>>>>>>>exec.c | 8 ++------
> >>>>>>>>>>>>>>1 file changed, 2 insertions(+), 6 deletions(-)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>diff --git a/exec.c b/exec.c
> >>>>>>>>>>>>>>index 7e5ce93..f907f5f 100644
> >>>>>>>>>>>>>>--- a/exec.c
> >>>>>>>>>>>>>>+++ b/exec.c
> >>>>>>>>>>>>>>@@ -94,7 +94,7 @@ struct PhysPageEntry {
> >>>>>>>>>>>>>>#define PHYS_MAP_NODE_NIL (((uint32_t)~0) >> 6)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>/* Size of the L2 (and L3, etc) page tables.  */
> >>>>>>>>>>>>>>-#define ADDR_SPACE_BITS TARGET_PHYS_ADDR_SPACE_BITS
> >>>>>>>>>>>>>>+#define ADDR_SPACE_BITS 64
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>#define P_L2_BITS 10
> >>>>>>>>>>>>>>#define P_L2_SIZE (1 << P_L2_BITS)
> >>>>>>>>>>>>>>@@ -1861,11 +1861,7 @@ static void memory_map_init(void)
> >>>>>>>>>>>>>>{
> >>>>>>>>>>>>>>    system_memory = g_malloc(sizeof(*system_memory));
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>-    assert(ADDR_SPACE_BITS <= 64);
> >>>>>>>>>>>>>>-
> >>>>>>>>>>>>>>-    memory_region_init(system_memory, NULL, "system",
> >>>>>>>>>>>>>>-                       ADDR_SPACE_BITS == 64 ?
> >>>>>>>>>>>>>>-                       UINT64_MAX : (0x1ULL << ADDR_SPACE_BITS));
> >>>>>>>>>>>>>>+    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
> >>>>>>>>>>>>>>    address_space_init(&address_space_memory, system_memory, "memory");
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>    system_io = g_malloc(sizeof(*system_io));
> >>>>>>>>>>>>>This seems to have some unexpected consequences around sizing 64bit PCI
> >>>>>>>>>>>>>BARs that I'm not sure how to handle.
> >>>>>>>>>>>>BARs are often disabled during sizing. Maybe you
> >>>>>>>>>>>>don't detect BAR being disabled?
> >>>>>>>>>>>See the trace below, the BARs are not disabled.  QEMU pci-core is doing
> >>>>>>>>>>>the sizing and memory region updates for the BARs, vfio is just a
> >>>>>>>>>>>pass-through here.
> >>>>>>>>>>Sorry, not in the trace below, but yes the sizing seems to be happening
> >>>>>>>>>>while I/O & memory are enabled in the command register.  Thanks,
> >>>>>>>>>>
> >>>>>>>>>>Alex
> >>>>>>>>>OK then from QEMU POV this BAR value is not special at all.
> >>>>>>>>Unfortunately
> >>>>>>>>
> >>>>>>>>>>>>>After this patch I get vfio
> >>>>>>>>>>>>>traces like this:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) febe0004
> >>>>>>>>>>>>>(save lower 32bits of BAR)
> >>>>>>>>>>>>>vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xffffffff, len=0x4)
> >>>>>>>>>>>>>(write mask to BAR)
> >>>>>>>>>>>>>vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>>(memory region gets unmapped)
> >>>>>>>>>>>>>vfio: vfio_pci_read_config(0000:01:10.0, @0x10, len=0x4) ffffc004
> >>>>>>>>>>>>>(read size mask)
> >>>>>>>>>>>>>vfio: vfio_pci_write_config(0000:01:10.0, @0x10, 0xfebe0004, len=0x4)
> >>>>>>>>>>>>>(restore BAR)
> >>>>>>>>>>>>>vfio: region_add febe0000 - febe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>>(memory region re-mapped)
> >>>>>>>>>>>>>vfio: vfio_pci_read_config(0000:01:10.0, @0x14, len=0x4) 0
> >>>>>>>>>>>>>(save upper 32bits of BAR)
> >>>>>>>>>>>>>vfio: vfio_pci_write_config(0000:01:10.0, @0x14, 0xffffffff, len=0x4)
> >>>>>>>>>>>>>(write mask to BAR)
> >>>>>>>>>>>>>vfio: region_del febe0000 - febe3fff
> >>>>>>>>>>>>>(memory region gets unmapped)
> >>>>>>>>>>>>>vfio: region_add fffffffffebe0000 - fffffffffebe3fff [0x7fcf3654d000]
> >>>>>>>>>>>>>(memory region gets re-mapped with new address)
> >>>>>>>>>>>>>qemu-system-x86_64: vfio_dma_map(0x7fcf38861710, 0xfffffffffebe0000, 0x4000, 0x7fcf3654d000) = -14 (Bad address)
> >>>>>>>>>>>>>(iommu barfs because it can only handle 48bit physical addresses)
> >>>>>>>>>>>>Why are you trying to program BAR addresses for dma in the iommu?
> >>>>>>>>>>>Two reasons, first I can't tell the difference between RAM and MMIO.
> >>>>>>>>>Why can't you? Generally the memory core lets you find out easily.
> >>>>>>>>My MemoryListener is set up for &address_space_memory and I then filter
> >>>>>>>>out anything that's not memory_region_is_ram().  This still gets
> >>>>>>>>through, so how do I easily find out?
> >>>>>>>>
> >>>>>>>>>But in this case it's vfio device itself that is sized so for sure you
> >>>>>>>>>know it's MMIO.
> >>>>>>>>How so?  I have a MemoryListener as described above and pass everything
> >>>>>>>>through to the IOMMU.  I suppose I could look through all the
> >>>>>>>>VFIODevices and check if the MemoryRegion matches, but that seems really
> >>>>>>>>ugly.
> >>>>>>>>
> >>>>>>>>>Maybe you will have same issue if there's another device with a 64 bit
> >>>>>>>>>bar though, like ivshmem?
> >>>>>>>>Perhaps, I suspect I'll see anything that registers their BAR
> >>>>>>>>MemoryRegion from memory_region_init_ram or memory_region_init_ram_ptr.
> >>>>>>>Must be a 64 bit BAR to trigger the issue though.
> >>>>>>>
> >>>>>>>>>>>Second, it enables peer-to-peer DMA between devices, which is something
> >>>>>>>>>>>that we might be able to take advantage of with GPU passthrough.
> >>>>>>>>>>>
> >>>>>>>>>>>>>Prior to this change, there was no re-map with the fffffffffebe0000
> >>>>>>>>>>>>>address, presumably because it was beyond the address space of the PCI
> >>>>>>>>>>>>>window.  This address is clearly not in a PCI MMIO space, so why are we
> >>>>>>>>>>>>>allowing it to be realized in the system address space at this location?
> >>>>>>>>>>>>>Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Alex
> >>>>>>>>>>>>Why do you think it is not in PCI MMIO space?
> >>>>>>>>>>>>True, CPU can't access this address but other pci devices can.
> >>>>>>>>>>>What happens on real hardware when an address like this is programmed to
> >>>>>>>>>>>a device?  The CPU doesn't have the physical bits to access it.  I have
> >>>>>>>>>>>serious doubts that another PCI device would be able to access it
> >>>>>>>>>>>either.  Maybe in some limited scenario where the devices are on the
> >>>>>>>>>>>same conventional PCI bus.  In the typical case, PCI addresses are
> >>>>>>>>>>>always limited by some kind of aperture, whether that's explicit in
> >>>>>>>>>>>bridge windows or implicit in hardware design (and perhaps made explicit
> >>>>>>>>>>>in ACPI).  Even if I wanted to filter these out as noise in vfio, how
> >>>>>>>>>>>would I do it in a way that still allows real 64bit MMIO to be
> >>>>>>>>>>>programmed?  PCI has this knowledge, I hope.  VFIO doesn't.  Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>>Alex
> >>>>>>>>>AFAIK PCI doesn't have that knowledge as such. PCI spec is explicit that
> >>>>>>>>>full 64 bit addresses must be allowed and hardware validation
> >>>>>>>>>test suites normally check that it actually does work
> >>>>>>>>>if it happens.
> >>>>>>>>Sure, PCI devices themselves, but the chipset typically has defined
> >>>>>>>>routing, that's more what I'm referring to.  There are generally only
> >>>>>>>>fixed address windows for RAM vs MMIO.
> >>>>>>>The physical chipset? Likely - in the presence of IOMMU.
> >>>>>>>Without that, devices can talk to each other without going
> >>>>>>>through chipset, and bridge spec is very explicit that
> >>>>>>>full 64 bit addressing must be supported.
> >>>>>>>
> >>>>>>>So as long as we don't emulate an IOMMU,
> >>>>>>>guest will normally think it's okay to use any address.
> >>>>>>>
> >>>>>>>>>Yes, if there's a bridge somewhere on the path that bridge's
> >>>>>>>>>windows would protect you, but pci already does this filtering:
> >>>>>>>>>if you see this address in the memory map this means
> >>>>>>>>>your virtual device is on root bus.
> >>>>>>>>>
> >>>>>>>>>So I think it's the other way around: if VFIO requires specific
> >>>>>>>>>address ranges to be assigned to devices, it should give this
> >>>>>>>>>info to qemu and qemu can give this to guest.
> >>>>>>>>>Then anything outside that range can be ignored by VFIO.
> >>>>>>>>Then we get into deficiencies in the IOMMU API and maybe VFIO.  There's
> >>>>>>>>currently no way to find out the address width of the IOMMU.  We've been
> >>>>>>>>getting by because it's safely close enough to the CPU address width to
> >>>>>>>>not be a concern until we start exposing things at the top of the 64bit
> >>>>>>>>address space.  Maybe I can safely ignore anything above
> >>>>>>>>TARGET_PHYS_ADDR_SPACE_BITS for now.  Thanks,
> >>>>>>>>
> >>>>>>>>Alex
> >>>>>>>I think it's not related to target CPU at all - it's a host limitation.
> >>>>>>>So just make up your own constant, maybe depending on host architecture.
> >>>>>>>Long term add an ioctl to query it.
> >>>>>>It's a hardware limitation which I'd imagine has some loose ties to the
> >>>>>>physical address bits of the CPU.
> >>>>>>
> >>>>>>>Also, we can add a fwcfg interface to tell bios that it should avoid
> >>>>>>>placing BARs above some address.
> >>>>>>That doesn't help this case, it's a spurious mapping caused by sizing
> >>>>>>the BARs with them enabled.  We may still want such a thing to feed into
> >>>>>>building ACPI tables though.
> >>>>>Well the point is that if you want BIOS to avoid
> >>>>>specific addresses, you need to tell it what to avoid.
> >>>>>But neither BIOS nor ACPI actually cover the range above
> >>>>>2^48 ATM so it's not a high priority.
> >>>>>
> >>>>>>>Since it's a vfio limitation I think it should be a vfio API, along the
> >>>>>>>lines of vfio_get_addr_space_bits(void).
> >>>>>>>(Is this true btw? legacy assignment doesn't have this problem?)
> >>>>>>It's an IOMMU hardware limitation, legacy assignment has the same
> >>>>>>problem.  It looks like legacy will abort() in QEMU for the failed
> >>>>>>mapping and I'm planning to tighten vfio to also kill the VM for failed
> >>>>>>mappings.  In the short term, I think I'll ignore any mappings above
> >>>>>>TARGET_PHYS_ADDR_SPACE_BITS,
> >>>>>That seems very wrong. It will still fail on an x86 host if we are
> >>>>>emulating a CPU with full 64 bit addressing. The limitation is on the
> >>>>>host side there's no real reason to tie it to the target.
> >>>I doubt vfio would be the only thing broken in that case.
> >>>
> >>>>>>long term vfio already has an IOMMU info
> >>>>>>ioctl that we could use to return this information, but we'll need to
> >>>>>>figure out how to get it out of the IOMMU driver first.
> >>>>>>Thanks,
> >>>>>>
> >>>>>>Alex
> >>>>>Short term, just assume 48 bits on x86.
> >>>I hate to pick an arbitrary value since we have a very specific mapping
> >>>we're trying to avoid.  Perhaps a better option is to skip anything
> >>>where:
> >>>
> >>>        MemoryRegionSection.offset_within_address_space >
> >>>        ~MemoryRegionSection.offset_within_address_space
> >>>
> >>>>>We need to figure out what's the limitation on ppc and arm -
> >>>>>maybe there's none and it can address full 64 bit range.
> >>>>IIUC on PPC and ARM you always have BAR windows where things can get
> >>>>mapped into. Unlike x86 where the full physical address range can be
> >>>>overlaid by BARs.
> >>>>
> >>>>Or did I misunderstand the question?
> >>>Sounds right, if either BAR mappings outside the window will not be
> >>>realized in the memory space or the IOMMU has a full 64bit address
> >>>space, there's no problem.  Here we have an intermediate step in the BAR
> >>>sizing producing a stray mapping that the IOMMU hardware can't handle.
> >>>Even if we could handle it, it's not clear that we want to.  On AMD-Vi
> >>>the IOMMU page tables can grow to 6 levels deep.  A stray mapping like
> >>>this then causes space and time overhead until the tables are pruned
> >>>back down.  Thanks,
> >>I thought sizing is hard defined as a set to
> >>-1? Can't we check for that one special case and treat it as "not mapped,
> >>but tell the guest the size in config space"?
> >PCI doesn't want to handle this as anything special to differentiate a
> >sizing mask from a valid BAR address.  I agree though, I'd prefer to
> >never see a spurious address like this in my MemoryListener.
> >
> >
> 
> Can't you just ignore regions that cannot be mapped?  Oh, and teach
> the bios and/or linux to disable memory access while sizing.


I know Linux won't disable memory access while sizing because
there are some broken devices where you can't re-enable it afterwards.

It should be harmless to set BAR to any silly value as long
as you are careful not to access it.
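
To make the sizing sequence from the trace concrete, here is a minimal
sketch in C. It is not QEMU or vfio code: Bar64, bar_write_lo,
bar_write_hi, bar_addr, size_bar and addr_in_upper_half are all
hypothetical names, and the 16 KiB BAR size is taken from the trace above.
It models a 64-bit memory BAR being sized with decode still enabled, and
the "skip anything where offset > ~offset" check suggested earlier in the
thread.

```c
#include <stdint.h>

/* Hypothetical model of one 64-bit memory BAR (not a QEMU structure). */
#define BAR_SIZE  0x4000ULL   /* 16 KiB, matching febe0000 - febe3fff */
#define BAR_FLAGS 0x4ULL      /* low type bits: 64-bit memory BAR */

typedef struct {
    uint32_t lo;              /* config space offset 0x10 */
    uint32_t hi;              /* config space offset 0x14 */
} Bar64;

/* Address bits below the BAR size are hard-wired to zero by the device. */
static void bar_write_lo(Bar64 *b, uint32_t v)
{
    b->lo = (v & (uint32_t)~(BAR_SIZE - 1)) | (uint32_t)BAR_FLAGS;
}

static void bar_write_hi(Bar64 *b, uint32_t v)
{
    b->hi = v;
}

/* Address the BAR currently decodes, with the low flag bits masked off. */
static uint64_t bar_addr(const Bar64 *b)
{
    return ((uint64_t)b->hi << 32) | (b->lo & ~0xfULL);
}

/*
 * Size the BAR one 32-bit half at a time, as in the trace.  Because
 * decode stays enabled, the address seen after writing all ones to the
 * upper half is briefly live; it is returned through *transient.
 */
static uint64_t size_bar(Bar64 *b, uint64_t *transient)
{
    uint32_t save_lo = b->lo;
    uint32_t save_hi = b->hi;
    uint32_t mask_lo, mask_hi;
    uint64_t mask;

    bar_write_lo(b, 0xffffffffu);   /* write mask to lower half */
    mask_lo = b->lo;                /* read size mask */
    bar_write_lo(b, save_lo);       /* restore lower half */

    bar_write_hi(b, 0xffffffffu);   /* write mask to upper half */
    *transient = bar_addr(b);       /* stray mapping appears here */
    mask_hi = b->hi;
    bar_write_hi(b, save_hi);       /* restore upper half */

    mask = ((uint64_t)mask_hi << 32) | (mask_lo & ~0xfULL);
    return ~mask + 1;               /* size is the two's complement */
}

/* True when the top address bit is set, i.e. addr > ~addr. */
static int addr_in_upper_half(uint64_t addr)
{
    return addr > ~addr;
}
```

Starting from a BAR at 0xfebe0000, size_bar() returns 0x4000 and reports
0xfffffffffebe0000 as the transient address. addr_in_upper_half() flags
that address but not the restored one. The check is only a heuristic for
this particular stray mapping; it says nothing about what the host IOMMU
can actually address.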


-- 
MST
