Re: [Qemu-devel] Qemu and 32 PCIe devices


From: Laszlo Ersek
Subject: Re: [Qemu-devel] Qemu and 32 PCIe devices
Date: Wed, 9 Aug 2017 03:06:29 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1

On 08/08/17 17:51, Laszlo Ersek wrote:
> On 08/08/17 12:39, Marcin Juszkiewicz wrote:

>> Anyway, beyond the things written in that comment, there is one very
>> interesting symptom that makes me think another (milder?) bottleneck
>> could be in QEMU:
>>
>> When having a large number of PCI(e) devices (to the tune of 100+),
>> even the printing of DEBUG messages slows down extremely, to the
>> point where I can basically follow, with the naked eye, the printing
>> of *individual* characters, on the QEMU debug port. (Obviously such
>> printing is unrelated to PCI devices; the QEMU debug port is a simple
>> platform device on Q35 and i440fx). This suggests to me that the high
>> number of MemoryRegion objects in QEMU, used for the BARs of PCI(e)
>> devices and bridges, could slow down the dispatching of the
>> individual IO or MMIO accesses. I don't know what data structure QEMU
>> uses for representing the "flat view" of the address space, but I
>> think it *could* be a bottleneck. At least I tried to profile a few
>> bits in the firmware, and found nothing related specifically to the
>> slowdown of DEBUG prints.
>
> (Paolo says the data structure is a radix tree, so the bottleneck
> being there would be surprising. Also Paolo has given me tips for
> profiling, so I'm looking into it.)

I have some results here. They are probably not as well put together as
they would be by someone who profiles stuff every morning for breakfast,
but it's a start, I hope.


(1) My command script was the following. I used virtio-scsi-pci
controllers because they don't need backends and also have no pflash
impact in the firmware.

The PCI hierarchy is basically a DMI-PCI bridge plugged into the PCI
Express root complex, with up to five PCI-PCI bridges on that, and then
31 virtio-scsi-pci devices on each PCI-PCI bridge. By moving the "exit
0" around, I could control the number of virtio PCI devices. Pretty
crude, but it worked.

Also, I used PCI (and not PCI Express) bridges and devices because I
didn't want to run out of IO space for this test. (That's a different
question, related to other points in this (and other) threads.)

> CODE=/home/virt-images/OVMF_CODE.4m.fd
> TMPL=/home/virt-images/OVMF_VARS.4m.fd
>
> cd ~/tmp/
>
> if ! [ -e vars16.fd ]; then
>   cp $TMPL vars16.fd
> fi
>
> qemu-system-x86_64 \
>   \
>   -machine q35,vmport=off,accel=kvm \
>   -drive if=pflash,readonly,format=raw,file=$CODE \
>   -drive if=pflash,format=raw,file=vars16.fd \
>   -net none \
>   -display none \
>   -fw_cfg name=opt/ovmf/PcdResizeXterm,string=y \
>   -m 2048 \
>   -debugcon file:debug16.log \
>   -global isa-debugcon.iobase=0x402 \
>   -name debug-threads=on \
>   -s \
>   -chardev stdio,signal=off,mux=on,id=char0 \
>   -mon chardev=char0,mode=readline \
>   -serial chardev:char0 \
>   \
>   -trace enable='pci_update_mappings*' \
>   \
>   -device i82801b11-bridge,id=dmi-pci-bridge \
>   \
>   -device pci-bridge,id=bridge-1,chassis_nr=1,bus=dmi-pci-bridge \
>   -device pci-bridge,id=bridge-2,chassis_nr=2,bus=dmi-pci-bridge \
>   -device pci-bridge,id=bridge-3,chassis_nr=3,bus=dmi-pci-bridge \
>   -device pci-bridge,id=bridge-4,chassis_nr=4,bus=dmi-pci-bridge \
>   -device pci-bridge,id=bridge-5,chassis_nr=5,bus=dmi-pci-bridge \
>   \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x1.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x2.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x3.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x4.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x5.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x6.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x7.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x8.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x9.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0xa.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0xb.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0xc.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0xd.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0xe.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0xf.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x10.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x11.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x12.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x13.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x14.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x15.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x16.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x17.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x18.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x19.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x1a.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x1b.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x1c.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x1d.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x1e.0 \
>   -device virtio-scsi-pci,bus=bridge-1,addr=0x1f.0 \
>   \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x1.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x2.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x3.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x4.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x5.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x6.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x7.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x8.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x9.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0xa.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0xb.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0xc.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0xd.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0xe.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0xf.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x10.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x11.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x12.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x13.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x14.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x15.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x16.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x17.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x18.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x19.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x1a.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x1b.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x1c.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x1d.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x1e.0 \
>   -device virtio-scsi-pci,bus=bridge-2,addr=0x1f.0 \
>   \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x1.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x2.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x3.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x4.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x5.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x6.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x7.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x8.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x9.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0xa.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0xb.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0xc.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0xd.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0xe.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0xf.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x10.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x11.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x12.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x13.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x14.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x15.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x16.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x17.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x18.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x19.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x1a.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x1b.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x1c.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x1d.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x1e.0 \
>   -device virtio-scsi-pci,bus=bridge-3,addr=0x1f.0 \
>
> exit 0
>
>   \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x1.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x2.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x3.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x4.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x5.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x6.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x7.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x8.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x9.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0xa.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0xb.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0xc.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0xd.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0xe.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0xf.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x10.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x11.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x12.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x13.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x14.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x15.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x16.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x17.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x18.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x19.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x1a.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x1b.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x1c.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x1d.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x1e.0 \
>   -device virtio-scsi-pci,bus=bridge-4,addr=0x1f.0 \
>   \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x1.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x2.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x3.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x4.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x5.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x6.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x7.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x8.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x9.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0xa.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0xb.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0xc.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0xd.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0xe.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0xf.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x10.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x11.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x12.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x13.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x14.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x15.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x16.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x17.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x18.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x19.0 \
>   -device virtio-scsi-pci,bus=bridge-5,addr=0x1a.0
>
> exit 0
>
>   -drive id=iso,if=none,format=raw,readonly,file=/usr/share/OVMF/UefiShell.iso \
>   -device ide-cd,drive=iso,bootindex=0 \
>   \
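
(For the record, the long -device list above could also be generated
with a small loop, instead of moving "exit 0" around; a minimal sketch,
assuming bash and the same bridge/slot layout as in the script:)

  for bridge in 1 2 3 4 5; do
    for slot in $(seq 1 31); do
      # emits e.g. "-device virtio-scsi-pci,bus=bridge-1,addr=0x1.0 "
      printf -- '-device virtio-scsi-pci,bus=bridge-%d,addr=0x%x.0 ' \
        "$bridge" "$slot"
    done
  done

(The output can be captured with command substitution and spliced,
unquoted, into the qemu invocation; the device count is then controlled
by the loop bounds rather than by the position of "exit 0".)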


(2) At first I did some super crude measurements, basically from VM
start to UEFI shell prompt:

- 31 devices (a single PCI-PCI bridge):                          14 seconds
- 46 devices (two PCI-PCI bridges, the second partially filled): 31 seconds
- 62 devices (two full PCI-PCI bridges):                         64 seconds
- 93 devices (three full PCI-PCI bridges):                      203 seconds

The device count and runtime multipliers are, relative to the first
line:
- 31/31  = 1;     14/14  =  1
- 46/31 ~= 1.48;  31/14 ~=  2.21 (note: 1.48 * 1.48 ~= 2.19)
- 62/31  = 2;     64/14 ~=  4.57 (note: 2    * 2     = 4)
- 93/31  = 3;    203/14 ~= 14.5  (note: 3    * 3     = 9)

So the boot time seems to be a super-quadratic function of the virtio
PCI device count.
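
(A quick log-log estimate of the scaling exponent, from the same numbers
as above; plain shell/awk arithmetic, nothing QEMU-specific:)

  awk 'BEGIN {
    # exponent = log(boot time ratio) / log(device count ratio),
    # relative to the 31-device / 14-second case
    print log( 31/14) / log(46/31)   # ~2.0
    print log( 64/14) / log(62/31)   # ~2.2
    print log(203/14) / log(93/31)   # ~2.4
  }'

(The apparent exponent grows from about 2 towards 2.4 as devices are
added, which matches the super-quadratic impression.)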


(3) Paolo told me to grab a "perf top" screenshot while the firmware was
in one of the parts that seemed to slow down as more devices were added.
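
(Something along the following lines is what I mean by a "perf top"
screenshot -- the exact options are only an illustration, attaching to
the already running QEMU process:)

  perf top -p "$(pgrep -f qemu-system-x86_64 | head -n 1)"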

I took several such screenshots (from different slowed-down parts, with
different device counts), and they were all close variants of the
following:

>   20.14%  qemu-system-x86_64                  [.] render_memory_region
>   17.14%  qemu-system-x86_64                  [.] subpage_register
>   10.31%  qemu-system-x86_64                  [.] int128_add
>    7.86%  qemu-system-x86_64                  [.] addrrange_end
>    7.30%  qemu-system-x86_64                  [.] int128_ge
>    4.89%  qemu-system-x86_64                  [.] int128_nz
>    3.94%  qemu-system-x86_64                  [.] phys_page_compact
>    2.73%  qemu-system-x86_64                  [.] phys_map_node_alloc

and

>   25.53%  qemu-system-x86_64                  [.] subpage_register
>   15.14%  qemu-system-x86_64                  [.] render_memory_region
>    7.99%  qemu-system-x86_64                  [.] int128_add
>    6.17%  qemu-system-x86_64                  [.] addrrange_end
>    5.61%  qemu-system-x86_64                  [.] int128_ge
>    3.94%  qemu-system-x86_64                  [.] int128_nz
>    3.82%  qemu-system-x86_64                  [.] phys_page_compact
>    2.73%  qemu-system-x86_64                  [.] phys_map_node_alloc

and

>   30.16%  qemu-system-x86_64             [.] subpage_register
>   14.44%  qemu-system-x86_64             [.] render_memory_region
>    7.92%  qemu-system-x86_64             [.] int128_add
>    6.02%  qemu-system-x86_64             [.] addrrange_end
>    5.57%  qemu-system-x86_64             [.] int128_ge
>    3.86%  qemu-system-x86_64             [.] int128_nz
>    2.20%  libc-2.17.so                   [.] __memset_sse2

(Note, this was an unoptimized, debug build of QEMU, if that matters.)

In the middle case, I even probed into the firmware (using QEMU's GDB
server), which gave me:

> #0  0x000000007fce71cd in MmioRead16 (Address=2148630532)
>     at MdePkg/Library/BaseIoLibIntrinsic/IoLib.c:154
> #1  0x000000007fce670f in PciExpressRead16 (Address=1146884)
>     at MdePkg/Library/BasePciExpressLib/PciExpressLib.c:483
> #2  0x000000007fce1faa in PciRead16 (Address=1146884)
>     at OvmfPkg/Library/DxePciLibI440FxQ35/PciLib.c:454
> #3  0x000000007fce2e0d in PciSegmentRead16 (Address=1146884)
>     at MdePkg/Library/BasePciSegmentLibPci/PciSegmentLib.c:428
> #4  0x000000007fce3068 in PciSegmentReadBuffer (StartAddress=1146884,
>     Size=2, Buffer=0x7feef8fe) at
>     MdePkg/Library/BasePciSegmentLibPci/PciSegmentLib.c:1174
> #5  0x000000007fce011b in RootBridgeIoPciAccess (This=0x7f6051c0,
>     Read=1 '\001', Width=EfiPciWidthUint16,  Address=1146884, Count=1,
>     Buffer=0x7feef8fe) at
>     MdeModulePkg/Bus/Pci/PciHostBridgeDxe/PciRootBridgeIo.c:964
> #6  0x000000007fce01a9 in RootBridgeIoPciRead (This=0x7f6051c0,
>     Width=EfiPciWidthUint16, Address=16973828, Count=1,
>     Buffer=0x7feef8fe) at
>     MdeModulePkg/Bus/Pci/PciHostBridgeDxe/PciRootBridgeIo.c:996
> #7  0x000000007ea9a8e0 in PciIoConfigRead (This=0x7f59e828,
>     Width=EfiPciIoWidthUint16, Offset=4, Count=1,  Buffer=0x7feef8fe)
>     at MdeModulePkg/Bus/Pci/PciBusDxe/PciIo.c:753
> #8  0x000000007ea93240 in PciOperateRegister (PciIoDevice=0x7f59e818,
>     Command=0, Offset=4 '\004',  Operation=3 '\003', PtrCommand=0x0)
>     at MdeModulePkg/Bus/Pci/PciBusDxe/PciCommand.c:46
> #9  0x000000007ea9bad1 in PciIoAttributes (This=0x7f59e828,
>     Operation=EfiPciIoAttributeOperationEnable,  Attributes=0,
>     Result=0x0) at MdeModulePkg/Bus/Pci/PciBusDxe/PciIo.c:1743
> #10 0x000000007ea9bb48 in PciIoAttributes (This=0x7f59c428,
>     Operation=EfiPciIoAttributeOperationEnable,  Attributes=1792,
>     Result=0x0) at MdeModulePkg/Bus/Pci/PciBusDxe/PciIo.c:1753
> #11 0x000000007eb1a98d in DetectAndPreparePlatformPciDevicePath
>     (Handle=0x7f120418, PciIo=0x7f59c428, Pci=0x7feefaa0) at
>     OvmfPkg/Library/PlatformBootManagerLib/BdsPlatform.c:834
> #12 0x000000007eb1a92e in VisitingAPciInstance (Handle=0x7f120418,
>     Instance=0x7f59c428,  Context=0x7eb1a958
>     <DetectAndPreparePlatformPciDevicePath>) at
>     OvmfPkg/Library/PlatformBootManagerLib/BdsPlatform.c:789
> #13 0x000000007eb1a894 in VisitAllInstancesOfProtocol (Id=0x7eb2dd40
>     <gEfiPciIoProtocolGuid>,  CallBackFunction=0x7eb1a8c2
>     <VisitingAPciInstance>, Context=0x7eb1a958
>     <DetectAndPreparePlatformPciDevicePath>) at
>     OvmfPkg/Library/PlatformBootManagerLib/BdsPlatform.c:748
> #14 0x000000007eb1a956 in VisitAllPciInstances
>     (CallBackFunction=0x7eb1a958
>     <DetectAndPreparePlatformPciDevicePath>) at
>     OvmfPkg/Library/PlatformBootManagerLib/BdsPlatform.c:804
> #15 0x000000007eb1ab5a in DetectAndPreparePlatformPciDevicePaths
>     (DetectVgaOnly=0 '\000') at
>     OvmfPkg/Library/PlatformBootManagerLib/BdsPlatform.c:904
> #16 0x000000007eb1abba in PlatformInitializeConsole
>     (PlatformConsole=0x7eb2e280) at
>     OvmfPkg/Library/PlatformBootManagerLib/BdsPlatform.c:938
> #17 0x000000007eb1a157 in PlatformBootManagerBeforeConsole ()
>     at OvmfPkg/Library/PlatformBootManagerLib/BdsPlatform.c:397
> #18 0x000000007eb05d20 in BdsEntry (This=0x7eb2de08)
>     at MdeModulePkg/Universal/BdsDxe/BdsEntry.c:908
> #19 0x000000007fef1263 in DxeMain (HobStart=0x7fca3018)
>     at MdeModulePkg/Core/Dxe/DxeMain/DxeMain.c:521
> #20 0x000000007fef0589 in ProcessModuleEntryPointList
>     (HobStart=0x7bf56000) at
>     
> Build/OvmfX64/NOOPT_GCC48/X64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/AutoGen.c:417
> #21 0x000000007fef0260 in _ModuleEntryPoint (HobStart=0x7bf56000)
>     at MdePkg/Library/DxeCoreEntryPoint/DxeCoreEntryPoint.c:54

To flag it in advance: the interesting frames are the ones with
"EfiPciIoAttributeOperationEnable".


(4) Two very visible slowdowns occur: (a) when
"OvmfPkg/AcpiPlatformDxe/PciDecoding.c" temporarily enables, in a loop,
IO and MMIO decoding for all PCI devices that were enumerated earlier --
so that the AML generation in QEMU can take their BARs into account --,
and (b) when the (legacy) virtio driver is bound to the virtio PCI
devices.

Loop (a) consists almost purely of PCI command register massaging.

Part (b) needed a bit ^W whole lot more investigation, so I learned how
to enable edk2's built-in profiling for OVMF, added a few "perf points"
to the VirtioPciDeviceBindingStart() function in
"OvmfPkg/VirtioPciDeviceDxe/VirtioPciDevice.c", and printed the
following measurements at the UEFI shell prompt with the DP (display
perf) utility:

When booting with 31 devices:

> ==[ Cumulative ]========
> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>    Name         Count     Duration    Duration    Duration    Duration
> -------------------------------------------------------------------------------
> DB:Support:      25558      137584           5           2         242
>   DB:Start:        175     7021314       40121           0     1221468
>   OpenProto         31         124           3           0           5
>  GetPciAttr         31          84           2           0           3
>    EnaPciIo         31     2671566       86179           0      102605
>    VPciInit         31        1927          62           0          72
>  InstVProto         31       12302         396           0         543

When booting with 93 devices:

> ==[ Cumulative ]========
> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>    Name         Count     Duration    Duration    Duration    Duration
> -------------------------------------------------------------------------------
> DB:Support:      37214      274119           7           2         414
>   DB:Start:        423    87530041      206926           0     3730887
>   OpenProto         93         510           5           0           9
>  GetPciAttr         93         302           3           0          11
>    EnaPciIo         93    62163154      668420           0     1008274
>    VPciInit         93        6564          70           0         239
>  InstVProto         93       42478         456           0        1128

The DB:Support and DB:Start intervals are generic (they cover all other
drivers as well); they reflect the call counts and durations for when any
driver is asked whether it supports a given device, and for when -- after
a positive answer -- the same driver is asked to actually bind the device.

The other five rows are specific actions in
VirtioPciDeviceBindingStart().

We see the following, when going from 31 to 93 virtio PCI devices:

- the call counts (less than) triple on all rows, good

- the average, shortest and longest durations remain mostly unchanged
  for the DB:Support, OpenProto, GetPciAttr, VPciInit, InstVProto rows,

- on the "EnaPciIo" row (which stands for enabling IO decoding in the
  legacy or transitional virtio device's command register), the average
  duration gets multiplied by ~7.8 (668420/86179), while the max
  duration is multiplied by ~9.8 (1008274/102605),

- the DB:Start row is similar: it includes "EnaPciIo", but it is
  smoothed by UEFI drivers that don't bind to PciIo protocol instances,

- the cumulative duration for EnaPciIo goes from ~2.7 seconds to ~62.2
  seconds (a factor of ~23.3, corresponding to three times as many
  devices bound, times the avg. EnaPciIo duration getting multiplied by
  ~7.8).
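
(Just double-checking the factors quoted above:)

  awk 'BEGIN {
    print   668420 /   86179   # average EnaPciIo duration factor, ~7.8
    print 62163154 / 2671566   # cumulative EnaPciIo factor,       ~23.3
    print 3 * 668420 / 86179   # 3x call count * average factor,   ~23.3
  }'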


(5) After this, I turned to QEMU, and applied the following patch:

> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 258fbe51e2ee..a3c250483211 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -1289,19 +1289,27 @@ static void pci_update_mappings(PCIDevice *d)
>              trace_pci_update_mappings_del(d, pci_bus_num(d->bus),
>                                            PCI_SLOT(d->devfn),
>                                            PCI_FUNC(d->devfn),
>                                            i, r->addr, r->size);
>              memory_region_del_subregion(r->address_space, r->memory);
> +            trace_pci_update_mappings_del_done(d, pci_bus_num(d->bus),
> +                                               PCI_SLOT(d->devfn),
> +                                               PCI_FUNC(d->devfn),
> +                                               i, r->addr, r->size);
>          }
>          r->addr = new_addr;
>          if (r->addr != PCI_BAR_UNMAPPED) {
>              trace_pci_update_mappings_add(d, pci_bus_num(d->bus),
>                                            PCI_SLOT(d->devfn),
>                                            PCI_FUNC(d->devfn),
>                                            i, r->addr, r->size);
>              memory_region_add_subregion_overlap(r->address_space,
>                                                  r->addr, r->memory, 1);
> +            trace_pci_update_mappings_add_done(d, pci_bus_num(d->bus),
> +                                               PCI_SLOT(d->devfn),
> +                                               PCI_FUNC(d->devfn),
> +                                               i, r->addr, r->size);
>          }
>      }
>
>      pci_update_vga(d);
>  }
> diff --git a/hw/pci/trace-events b/hw/pci/trace-events
> index f68c178afc2b..ffb5c473e048 100644
> --- a/hw/pci/trace-events
> +++ b/hw/pci/trace-events
> @@ -1,10 +1,12 @@
>  # See docs/devel/tracing.txt for syntax documentation.
>
>  # hw/pci/pci.c
>  pci_update_mappings_del(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>  pci_update_mappings_add(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
> +pci_update_mappings_del_done(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
> +pci_update_mappings_add_done(void *d, uint32_t bus, uint32_t slot, uint32_t func, int bar, uint64_t addr, uint64_t size) "d=%p %02x:%02x.%x %d,0x%"PRIx64"+0x%"PRIx64
>
>  # hw/pci/pci_host.c
>  pci_cfg_read(const char *dev, unsigned devid, unsigned fnid, unsigned offs, unsigned val) "%s %02u:%u @0x%x -> 0x%x"
>  pci_cfg_write(const char *dev, unsigned devid, unsigned fnid, unsigned offs, unsigned val) "%s %02u:%u @0x%x <- 0x%x"

(This is matched by the "-trace enable='pci_update_mappings*'" option
under (1).)

With this patch I only tested the "93 devices" case, because the
slowdown became visible to the naked eye from the trace messages as the
firmware enabled more and more BARs / command registers (and conversely,
the speedup was perceptible while the firmware disabled them again).

I will attach the trace output (compressed), from the launch of the VM
to reaching the UEFI shell. It would be interesting to plot the time
differences between the add/del and matching add_done/del_done trace
points (the difference grows continuously as more and more BARs are
enabled); for now I'll quote two extremes from the BAR-enablement loop:

Lines 2311-2312:

> address@hidden:pci_update_mappings_add       d=0x55567cfd4620  02:01.0  0,0xa780+0x40
> address@hidden:pci_update_mappings_add_done  d=0x55567cfd4620  02:01.0  0,0xa780+0x40

Lines 2863-2864:

> address@hidden:pci_update_mappings_add       d=0x555702401bf0  04:1f.0  0,0x8000+0x40
> address@hidden:pci_update_mappings_add_done  d=0x555702401bf0  04:1f.0  0,0x8000+0x40

Enabling IO decoding (hence BAR 0) for the first virtio-scsi-pci device,
02:01.0, took 3.762 msecs.

Enabling IO decoding (hence BAR 0), at the end of the same firmware
loop, for the last virtio-scsi-pci device, 04:1f.0, took 408.275 msecs,
which is approx. 108 times as long. (Almost half a second -- that is
perceptible to a human.)
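
(The per-event delays can be extracted from the attached trace with
something like the following; a sketch that assumes the stderr trace
backend's usual "pid@seconds.microseconds:event" line prefix, which is
presumably what shows up as "address@hidden" in the excerpts quoted
above:)

  xz -dc trace-column-t.txt.xz | awk -F'[@: ]+' '
    # with this field separator, $2 is the timestamp in seconds and $3
    # the event name; $4..$7 carry the device pointer, BDF and BAR info
    $3 == "pci_update_mappings_add" || $3 == "pci_update_mappings_del" {
      start[$4 $5 $6 $7] = $2
    }
    $3 ~ /_done$/ && (($4 $5 $6 $7) in start) {
      printf "%s:%s %s %10.3f ms\n", $5, $6, $7, ($2 - start[$4 $5 $6 $7]) * 1000
      delete start[$4 $5 $6 $7]
    }'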


I didn't try to follow QEMU more deeply. Hopefully this is useful for
something.

Thanks
Laszlo

Attachment: trace-column-t.txt.xz
Description: application/xz

