From: Marcel Apfelbaum
Subject: Re: [Qemu-devel] [PATCH V2 RESEND] docs: add PCIe devices placement guidelines
Date: Thu, 27 Oct 2016 14:27:11 +0300
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.1.1

On 10/14/2016 02:36 PM, Laszlo Ersek wrote:
On 10/13/16 16:05, Marcel Apfelbaum wrote:
On 10/13/2016 04:52 PM, Marcel Apfelbaum wrote:
Proposes best practices on how to use PCI Express/PCI devices
in PCI Express based machines and explains the reasoning behind them.

Signed-off-by: Marcel Apfelbaum <address@hidden>
---

Hi,

I am sending the doc twice; it appears the first time it didn't make it
to the qemu-devel list.

Hi,

Adding people to CC. Sorry for the earlier noise.

Thanks,
Marcel



Hi Laszlo,
Thanks for the review, I'll do my best to address all
the comments in the next version.

[...]

+
+2.2.1 Plugging a PCI Express device into a PCI Express Root Port:
+          -device ioh3420,id=root_port1,chassis=x[,bus=pcie.0][,slot=y][,addr=z]  \
+          -device <dev>,bus=root_port1
+      Note that chassis parameter is compulsory, and must be unique
+      for each PCI Express Root Port.

{7} Hmmm, I think it's rather that the (chassis, slot) *pair* must be
unique. You can get away with leaving the default chassis=0 unspecified,
and spell out just the slots, I think.


Yes you can; I wanted to leave the "slot" out of the equation and use the chassis
as a parameter already known from the PCI bridge - easier to digest.
However, your comment makes sense, I'll update the doc.
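For example, something along these lines should work (an untested sketch; the
IDs, slot values and the e1000 device are just placeholder examples), relying on
distinct slots with the default chassis=0:

          -device ioh3420,id=root_port1,slot=1,bus=pcie.0 \
          -device ioh3420,id=root_port2,slot=2,bus=pcie.0 \
          -device e1000,bus=root_port1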

+2.2.2 Using multi-function PCI Express Root Ports:
+      -device ioh3420,id=root_port1,multifunction=on,chassis=x[,bus=pcie.0][,slot=y][,addr=z.0] \
+      -device ioh3420,id=root_port2,,chassis=x1[,bus=pcie.0][,slot=y1][,addr=z.1] \
+      -device ioh3420,id=root_port3,,chassis=x2[,bus=pcie.0][,slot=y2][,addr=z.2] \

{8} This looks good to me, except for the double-comma typos: ",,".
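With the extra commas dropped, I'd expect these to read (same x/y/z placeholders
as in the patch):

      -device ioh3420,id=root_port1,multifunction=on,chassis=x[,bus=pcie.0][,slot=y][,addr=z.0] \
      -device ioh3420,id=root_port2,chassis=x1[,bus=pcie.0][,slot=y1][,addr=z.1] \
      -device ioh3420,id=root_port3,chassis=x2[,bus=pcie.0][,slot=y2][,addr=z.2] \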

+2.2.3 Plugging a PCI Express device into a Switch:
+      -device ioh3420,id=root_port1,chassis=x[,bus=pcie.0][,slot=y][,addr=z]  \
+      -device x3130-upstream,id=upstream_port1,bus=root_port1[,addr=x]          \
+      -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=x1[,slot=y1][,addr=z1]] \
+      -device <dev>,bus=downstream_port1
+

{9} For all of these command lines, can you specify if z=0 (that is,
device#0) is valid or not in the addr=z properties?



Address 0 is valid for all PCI bridges (we see PCI Express
Root/Upstream/Downstream Ports as PCI bridges)
that do not have an SHPC component.
Even the "regular" PCI bridge can use slot 0 if we pass shpc=off as a parameter.
For PCI Express, slot 0 is always valid, which is why I didn't add it to the doc.
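For instance (an untested sketch; the NIC devices are just placeholder
examples), both of these should accept address 0:

    -device ioh3420,id=root_port1,chassis=1,bus=pcie.0                           \
    -device e1000e,bus=root_port1,addr=0x0                                       \
    -device i82801b11-bridge,id=dmi_pci_bridge1,bus=pcie.0                       \
    -device pci-bridge,id=pci_bridge1,chassis_nr=2,shpc=off,bus=dmi_pci_bridge1  \
    -device e1000,bus=pci_bridge1,addr=0x0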


Earlier discussion on this question:

On 10/04/16 18:25, Laine Stump wrote:
On 10/04/2016 11:45 AM, Alex Williamson wrote:

 Same with the restriction from using slot
0 on PCI bridges, there's no basis for that except on the root bus.

I tried allowing devices to be plugged into slot 0 of a pci-bridge in
libvirt - qemu barfed, so I moved the "minSlot" for pci-bridge back up
to 1. Slot 0 is completely usable on a dmi-to-pci-bridge though (and
libvirt allows it). At this point, even if qemu enabled using slot 0
of a pci-bridge, libvirt wouldn't be able to expose that to users
(unless the min/max slot of each PCI controller was made visible
somewhere via QMP)

On 10/05/16 12:03, Marcel Apfelbaum wrote:
The reason for not being able to plug a device into slot 0 of a PCI
Bridge is the SHPC (Hot-plug controller) device embedded in the PCI
bridge by default. The SHPC spec requires this. If one disables it
with shpc=false, he should be able to use the slot 0.

For simplicity's sake I guess we should just recommend >=1 slot numbers.


It is a pity to skip slot 0 for PCI Express bridges.
- Root Ports and Downstream Ports have only slot 0 :)
- Root Complexes (pcie.0 and pxb-pcie devices) can use slot 0

Back to the patch:

On 10/13/16 16:05, Marcel Apfelbaum wrote:
On 10/13/2016 04:52 PM, Marcel Apfelbaum wrote:

+
+2.3 PCI only hierarchy
+======================
+Legacy PCI devices can be plugged into pcie.0 as Integrated Devices.
+Besides that, use DMI-PCI bridges (i82801b11-bridge) to start PCI hierarchies.
+
+Prefer flat hierarchies. For most scenarios a single DMI-PCI bridge (having 32 slots)
+and several PCI-PCI bridges attached to it (each also supporting 32 slots) will support
+hundreds of legacy devices. The recommendation is to populate one PCI-PCI bridge
+under the DMI-PCI bridge until it is full and then plug a new PCI-PCI bridge...
+
+   pcie.0 bus
+   ----------------------------------------------
+        |                            |
+   -----------               ------------------
+   | PCI Dev |               | DMI-PCI BRIDGE |
+   ----------                ------------------
+                               |            |
+                        -----------    ------------------
+                        | PCI Dev |    | PCI-PCI Bridge |
+                        -----------    ------------------
+                                         |           |
+                                  -----------     -----------
+                                  | PCI Dev |     | PCI Dev |
+                                  -----------     -----------
+
+2.3.1 To plug a PCI device into a pcie.0 as Integrated Device use:
+      -device <dev>[,bus=pcie.0]

(This is repeated from 2.1.1, but I guess it doesn't hurt, for
completeness in this chapter.)

+2.3.2 Plugging a PCI device into a DMI-PCI bridge:
+      -device i82801b11-bridge,id=dmi_pci_bridge1,[,bus=pcie.0]    \
+      -device <dev>,bus=dmi_pci_bridge1[,addr=x]

{10} I recall that we discussed this at length, that is, placing PCI
(non-express) devices directly on the DMI-PCI bridge. IIRC the argument
was that it's technically possible, it just won't support hot-plug.


Exactly

I'd like it if we removed this use case from the document. It might be
possible, but for simplicity's sake, we shouldn't advertise it, in my
opinion. (Unless someone strongly disagrees with this idea, of course.)
I recall esp. Laine, Alex and Daniel focusing on this, but I don't
remember if (and what for) they wanted this option. Personally I'd like
to see it disappear (unless convinced otherwise).

... Actually, going through the RFC thread again, it seems that the use
case that I'm arguing against -- i.e., "use DMI-PCI as a generic
PCIe-to-PCI bridge" -- makes Alex cringe as well. So there's that :)


I can do that, even though for non-hotplug scenarios we don't have any reason
to avoid it. On the other hand, not advertising it is not a big crime.

{11} Syntax remark, should we keep this section: the commas are not
right in ",[,bus=pcie.0]".

+2.3.3 Plugging a PCI device into a PCI-PCI bridge:
+      -device i82801b11-bridge,id=dmi_pci_bridge1,[,bus=pcie.0]                     \

{12} double comma again

+      -device pci-bridge,id=pci_bridge1,bus=dmi_pci_bridge1[,chassis_nr=x][,addr=y] \
+      -device <dev>,bus=pci_bridge1[,addr=x]

{13} It would be nice to spell out the valid device addresses (y and x)
here too -- can we use 0 for them? SHPC again?

Can we / should we simply go with >=1 device addresses?


For pci-bridges only - yes. A better idea (I think) is to disable SHPC by
default from the next QEMU version, making slot 0 usable. Sounds OK?

+
+
+3. IO space issues
+===================
+The PCI Express Root Ports and PCI Express Downstream ports are seen by
+Firmware/Guest OS as PCI-PCI bridges and, as required by PCI spec,
+should reserve a 4K IO range for each even if only one (multifunction)
+device can be plugged into them, resulting in poor IO space utilization.

{14} I completely agree with this paragraph, I'd just recommend allowing
the reader more time to digest it. Something like:

----
The PCI Express Root Ports and PCI Express Downstream ports are seen by
Firmware/Guest OS as PCI-PCI bridges. As required by the PCI spec, each
such Port should be reserved a 4K IO range, even though only one
(multifunction) device can be plugged into each Port. This results in
poor IO space utilization.
----

+
+The firmware used by QEMU (SeaBIOS/OVMF) may try further optimizations
+by not allocating IO space if possible:
+    (1) - For empty PCI Express Root Ports/PCI Express Downstream ports.
+    (2) - If the device behind the PCI Express Root Ports/PCI Express
+          Downstream has no IO BARs.

{15} I'd say:

... by not allocating IO space for each PCI Express Root / PCI Express
Downstream port if:
(1) the port is empty, or
(2) the device behind the port has no IO BARs.

+
+The IO space is very limited, 65536 byte-wide IO ports, but it's fragmented

{16} Suggestion: "The IO space is very limited, to 65536 byte-wide IO
ports, and may even be fragmented by fixed IO ports owned by platform
devices."

+resulting in ~10 PCI Express Root Ports (or PCI Express

{17} s/resulting in/resulting in at most/

Downstream/Upstream ports)

{18} Whoa, upstream ports? Why do we need to spell them out here?
Upstream ports need their own bus numbers, but their IO space assignment
only collects the IO space assignments of their downstream ports. Isn't
that right?


Yes, I was referring to the pair: if you have an Upstream Port you obviously
have at least a Downstream one.

(I believe spelling out upstream ports here was recommended by Alex, but
I don't understand why.)


Maybe to understand the connection between them? Usage?

+ports per system if devices with IO BARs are used in the PCI Express

{19} The word "ports" is unnecessary here.

hierarchy.
+
+Using the proposed device placing strategy solves this issue
+by using only PCI Express devices within PCI Express hierarchy.
+
+The PCI Express spec requires the PCI Express devices to work without using IO.
+The PCI hierarchy has no such limitations.

{20} There should be no empty line (or even a line break) between the
two above paragraphs: the second paragraph explains the first one.

+
+
+4. Bus numbers issues
+======================
+Each PCI domain can have up to only 256 buses and the QEMU PCI Express
+machines do not support multiple PCI domains even if extra Root
+Complexes (pxb-pcie) are used.
+
+Each element of the PCI Express hierarchy (Root Complexes,
+PCI Express Root Ports, PCI Express Downstream/Upstream ports)
+takes up bus numbers. Since only one (multifunction) device
+can be attached to a PCI Express Root Port or PCI Express Downstream
+Port it is advised to plan in advance for the expected number of
+devices to prevent bus numbers starvation.

{21} Please add:

"""
In particular:

- Avoiding PCI Express Switches (and thereby striving for a flat PCI
Express hierarchy) enables the hierarchy to not spend bus numbers on
Upstream Ports.

- The bus_nr properties of the pxb-pcie devices partition the 0..255 bus
number space. All bus numbers assigned to the buses recursively behind a
given pxb-pcie device's root bus must fit between the bus_nr property of
that pxb-pcie device, and the lowest of the higher bus_nr properties
that the command line sets for other pxb-pcie devices.
"""

(You do mention switches being more bus number-hungry, under chapter 5,
hotplug; however, saving bus numbers makes sense for a purely
cold-plugged scenario as well, especially in combination with pxb-pcie,
where the partitioning can become a limiting factor.)
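To illustrate the partitioning (just a sketch, with arbitrary bus_nr values),
with

    -device pxb-pcie,id=pcie.1,bus_nr=64,bus=pcie.0  \
    -device pxb-pcie,id=pcie.2,bus_nr=128,bus=pcie.0 \

all buses behind pcie.1 have to fit into the 64..127 range, and all buses
behind pcie.2 into 128..255.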

+
+
+5. Hot Plug
+============
+The PCI Express root buses (pcie.0 and the buses exposed by pxb-pcie devices)
+do not support hot-plug, so any devices plugged into Root Complexes
+cannot be hot-plugged/hot-unplugged:
+    (1) PCI Express Integrated Devices
+    (2) PCI Express Root Ports
+    (3) DMI-PCI bridges
+    (4) pxb-pcie
+
+PCI devices can be hot-plugged into PCI-PCI bridges, however cannot
+be hot-plugged into DMI-PCI bridges.

{22} If you agree with my suggestion to remove 2.3.2, that is, the "PCI
device cold-plugged directly into DMI-PCI bridge" case, then the second
half of the sentence can be dropped.

+The PCI hotplug is ACPI based and can work side by side with the
+PCI Express native hotplug.
+
+PCI Express devices can be natively hot-plugged/hot-unplugged into/from
+PCI Express Root Ports (and PCI Express Downstream Ports).
+
+5.1 Planning for hotplug:
+    (1) PCI hierarchy
+        Leave enough PCI-PCI bridge slots empty or add one
+        or more empty PCI-PCI bridges to the DMI-PCI bridge.
+
+        For each such bridge the Guest Firmware is expected to reserve 4K IO
+        space and 2M MMIO range to be used for all devices behind it.
+
+        Because of the hard IO limit of around 10 PCI bridges (~ 40K space) per system
+        don't use more than 9 bridges, leaving 4K for the Integrated devices
+        and none for the PCI Express Hierarchy.

{23} s/9 bridges/9 PCI-PCI bridges/

+
+    (2) PCI Express hierarchy:
+        Leave enough PCI Express Root Ports empty. Use multifunction
+        PCI Express Root Ports to prevent going out of PCI bus numbers.

{24} I agree, but I'd put it a bit differently: use multifunction PCI
Express Root Ports on the Root Complex(es), for keeping the hierarchy as
flat as possible, thereby saving PCI bus numbers.

+        Don't use PCI Express Switches if you don't have too, they use
+        an extra PCI bus that may handy to plug another device id it comes to it.
+

{25} I'd put it as: Don't use PCI Express Switches if you don't have
to; each one of those uses an extra PCI bus (for its Upstream Port)
that could be put to better use with another Root Port or Downstream
Port, which may come in handy for hotplugging another device.

{26} Another remark (important to me) in this section: the document
doesn't state firmware expectations. It's clear the firmware is expected
to reserve no IO space for PCI Express Downstream Ports and Root Ports,
but what about MMIO?

We discussed this at length with Alex, but I think we didn't conclude
anything. It would be nice if firmware received some instructions from
this document in this regard, even before we implement our own ports and
bridges in QEMU.


Hmm, I have no idea what to add here, except:
The firmware is expected to reserve at least 2M for each pci bridge?

<digression>

If we think such recommendations are out of scope at this point, *and*
noone disagrees strongly (Gerd?), then I could add some experimental
fw_cfg knobs to OVMF for this, such as (units in MB):

-fw_cfg opt/org.tianocore.ovmf/X-ReservePciE/PrefMmio32Mb,string=...
-fw_cfg opt/org.tianocore.ovmf/X-ReservePciE/NonPrefMmio32Mb,string=...
-fw_cfg opt/org.tianocore.ovmf/X-ReservePciE/PrefMmio64Mb,string=..
-fw_cfg opt/org.tianocore.ovmf/X-ReservePciE/NonPrefMmio64Mb,string=..

Under this idea, I would reserve no resources at all for Downstream
Ports and Root Ports in OVMF by default; but users could influence those
reservations. I think that would be enough to kick things off. It also
needs no modifications for QEMU.

</digression>

+5.3 Hot plug example:
+Using HMP: (add -monitor stdio to QEMU command line)
+  device_add <dev>,id=<id>,bus=<pcie.0/PCI Express Root Port Id/PCI-PCI bridge Id/pxb-pcie Id>

{27} I think the bus=<...> part is incorrect here. Based on the rest of
the guidelines, we have to specify the ID of:
- a PCI Express Root Port, or
- a PCI Express Downstream Port, or
- a PCI-PCI bridge.


I don't get it, you specify what you wrote above as the bus, right?
For example if you start the machine with
    .... -device ioh3420,id=root_port1,
you hotplug with: device_add e1000,bus=root_port1.
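A fuller sketch of the flow (untested, placeholder values):

    qemu-system-x86_64 -M q35 -monitor stdio ... \
        -device ioh3420,id=root_port1,chassis=1,bus=pcie.0

    (qemu) device_add e1000,id=net1,bus=root_port1
    (qemu) device_del net1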

+
+
+6. Device assignment
+====================
+Host devices are mostly PCI Express and should be plugged only into
+PCI Express Root Ports or PCI Express Downstream Ports.
+PCI-PCI bridge slots can be used for legacy PCI host devices.
+
+6.1 How to detect if a device is PCI Express:
+  > lspci -s 03:00.0 -v (as root)
+
+    03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 83)
+    Subsystem: Intel Corporation Dual Band Wireless-AC 7260
+    Flags: bus master, fast devsel, latency 0, IRQ 50
+    Memory at f0400000 (64-bit, non-prefetchable) [size=8K]
+    Capabilities: [c8] Power Management version 3
+    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
+    Capabilities: [40] Express Endpoint, MSI 00
+
+    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+    Capabilities: [100] Advanced Error Reporting
+    Capabilities: [140] Device Serial Number 7c-7a-91-ff-ff-90-db-20
+    Capabilities: [14c] Latency Tolerance Reporting
+    Capabilities: [154] Vendor Specific Information: ID=cafe Rev=1 Len=014
+
+
+7. Virtio devices
+=================
+Virtio devices plugged into the PCI hierarchy or as Integrated Devices
+will remain PCI and have transitional behaviour as default.
+Transitional virtio devices work in both IO and MMIO modes depending on
+the guest support.

{28} Suggest to add: firmware will assign both IO and MMIO resources to
transitional virtio devices.

+
+Virtio devices plugged into PCI Express ports are PCI Express devices and
+have "1.0" behavior by default without IO support.
+In both case disable-* properties can be used to override the behaviour.

{29} s/case/cases/; also, please spell out the disable-* properties fully.
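For reference, I assume the properties in question are disable-legacy and
disable-modern. E.g. (just a sketch) a PCI Express virtio device can be forced
into transitional mode with something like:

    -device virtio-net-pci,disable-legacy=off,disable-modern=off,bus=root_port1

which is exactly the IO-hungry configuration that the quoted note just below
warns against.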

+
+Note that setting disable-legacy=off will enable legacy mode (enabling
+legacy behavior) for PCI Express virtio devices causing them to
+require IO space, which, given our PCI Express hierarchy, may quickly

{30} s/given our PCI Express hierarchy/given the limited available IO space/

+lead to resource exhaustion, and is therefore strongly discouraged.
+
+
+8. Conclusion
+==============
+The proposal offers a usage model that is easy to understand and follow
+and at the same time overcomes the PCI Express architecture limitations.
+




I think this version has seen big improvements, and I think it's
structurally complete. While composing this review, I went through the
entire RFC thread again, and I *think* you didn't miss anything from
that. Great job again!


Thanks!

My comments vary in importance. I trust you to take each comment with an
appropriately sized grain of salt ;)


Sure, thanks
Marcel

Thank you!
Laszlo




