On 08/18/2017 01:14 AM, David Gibson wrote:
On Thu, Aug 17, 2017 at 06:31:28PM -0300, Daniel Henrique Barboza wrote:
On 08/17/2017 04:52 AM, David Gibson wrote:
On Tue, Aug 15, 2017 at 05:28:46PM -0300, Daniel Henrique Barboza wrote:
This patch is a follow up on the discussions that started with
Laurent's patch series "spapr: disable hotplugging without OS" [1]
and discussions made at patch "spapr: reset DRCs on migration
pre_load" [2].
At this moment, we do not support CPU/memory hotplug in early
boot stages, before CAS. The reason is that the hotplug event
can't be handled at SLOF level (or even in PRELAUNCH runstate) and
at the same time can't be canceled. This leads to devices being
unable to be hot unplugged and, in some cases, guest kernel Ooops.
After CAS, with the FDT in place, the guest can handle the hotplug
events and everything works as usual.
An attempt to try to support hotplug before CAS was made, but not
successful. The key difference in the current code flow between a
coldplugged and a hotplugged device, in the PRELAUNCH state, is that
the coldplugged device is registered at the base FDT, allowing its
DRC to go straight to CONFIGURED state. In theory, this can also be
done with a hotplugged device if we can add it to the base of the
existing FDT. However, tampering with the FDT after writing in the
guest memory, besides being a dubitable idea, is also not
possible. The FDT is written in ppc_spapr_reset and there is no
way to retrieve it - we can calculate the fdt_address but the
fdt_size isn't stored. Storing the fdt_size to allow for
later retrieval is yet another state that would need to be
migrated. In short, it is not worth the trouble.
All this said, this patch opted to disable CPU/mem hotplug at early
boot stages. CAS detection is made by checking if there are
any bits set in ov5_cas to avoid adding an extra state that
would need tracking/migration. The patch also makes sure that
it doesn't interfere with hotplug in the INMIGRATE state.
[1] https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg05226.html
[2] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg01989.html
Signed-off-by: Daniel Henrique Barboza <address@hidden>
I don't think this is a good idea.
1) After my DRC cleanups, early hotplug works just fine for me. I'm
not sure why it isn't for you: we need to understand that before
proceeding.
2) libvirt actually uses early hotplug fairly often (before even
starting the firmware). At the moment this works - at least in some
cases (see above), though there are some wrinkles to work out. This
will break it completely and require an entirely different approach to
fix again.
Now that you mentioned I remember having this same discussion with you,
about the same topic. Back then we decided to leave it alone, since you
couldn't
reproduce the behavior but I could.
I still can reproduce this bug and ended up investigating a bit more today:
- one difference in QEMU between hotplugging before and after CAS is here:
hw/ppc/spapr_events.c - rtas_event_log_to_source
switch (log_type) {
case RTAS_LOG_TYPE_HOTPLUG:
source = spapr_event_sources_get_source(spapr->event_sources,
EVENT_CLASS_HOT_PLUG);
if (spapr_ovec_test(spapr->ov5_cas, OV5_HP_EVT)) {
g_assert(source->enabled);
break;
}
/* fall back to epow for legacy hotplug interrupt source */
case RTAS_LOG_TYPE_EPOW:
source = spapr_event_sources_get_source(spapr->event_sources,
EVENT_CLASS_EPOW);
break;
default:
Note the ovec_test for OV5_HP_EVT. When hotplugging a CPU In early boot,
ov5_cas
doesn't have anything set, making this check fails and due to the 'break'
position there
(that I believe it is intended), it falls back to log the event as EPOW
instead of HOT_PLUG.
Ah.. I'm not sure if this is the cause of the problem you're seeing,
but it's a good point regardless. We're queuing an event before we
know how to queue events, which is definitely wrong.
Urgh.. how to fix this. Banning hotplug until after CAS seems
logically sound, but will break the libvirt use case. So we could..
1) Queue events internally, but only deliver them to the right actual
RTAS queue after CAS. Blech.
2) At CAS time, reset all the DRCs, and emit extra DT information for
anything hotplugged since init. I've suggested this before, but
dropped it when early hotplug seemed to be working for me.
Update: I've been trying to do the (2) solution here by adding extra
DT info of the hotplugged CPU, while resetting its DRC. The problem I am
facing is that the kernel panics at parse_numa_properties when trying to
find the node id of the added CPU. This is intriguing because the extra DT
info is enough to make the kernel start the added CPU at smp_release_cpus().
The DT information is being added at the CAS response, in
spapr_fixup_cpu_dt,
and it's the same information that is added for the coldplugged CPUs. I have
a
theory that the guest (SLOF?) isn't prepared to handle a new CPU being added
at
this point, which might be leading to this kernel bug I am seeing. I
appreciate if
anyone with good CAS/SLOF knowledge can weight in here - there is a good
chance
that adding a new CPU element at this point in the DT isn't supported at
all.
Now, in other news ....
3) Work around in libvirt by having it issue a system_reset after the
hotplugs.
... talking with Mike about this bug he proposed a similar solution to (3),
but with
a CAS induced reset. If there are hotplugged devs at that point, set
spapr->cas_reboot
to true. Doing that, QEMU reboots automatically and the hotplugged devs are
considered
coldplugged. It works.
I am mentioning this because this is a simpler solution than relying on
libvirt issuing a system_reset. This is also solution that I would like to
avoid though - I
am not looking forward to see people bugs complaining about the machine
rebooting in
the middle of the boot. I will exhaust my attempts at (2) before proposing
going
with this CAS reboot solution.