|
From: | Maciej S. Szmigiero |
Subject: | Re: [PATCH v3 08/24] migration: Add thread pool of optional load threads |
Date: | Thu, 12 Dec 2024 23:53:42 +0100 |
User-agent: | Mozilla Thunderbird |
On 12.12.2024 17:55, Peter Xu wrote:
On Wed, Dec 11, 2024 at 12:05:03AM +0100, Maciej S. Szmigiero wrote:On 4.12.2024 23:43, Peter Xu wrote:On Thu, Nov 28, 2024 at 01:11:53PM +0100, Maciej S. Szmigiero wrote:+static int qemu_loadvm_load_thread(void *thread_opaque) +{ + struct LoadThreadData *data = thread_opaque; + int ret; + + ret = data->function(&load_threads_abort, data->opaque); + if (ret && !qatomic_read(&load_threads_ret)) { + /* + * Racy with the above read but that's okay - which thread error + * return we report is purely arbitrary anyway. + */ + qatomic_set(&load_threads_ret, ret); + }Can we use cmpxchg instead? E.g.: if (ret) { qatomic_cmpxchg(&load_threads_ret, 0, ret); }cmpxchg always forces sequentially consistent ordering while qatomic_read() and qatomic_set() have relaxed ordering. As the comment above describes, there's no need for sequential consistency since which thread error is returned is arbitrary anyway.IMHO this is not a hot path, so mem ordering isn't an issue. If we could avoid any data race we still should try to. I do feel uneasy on the current design where everybody shares the "whether to quit" via one bool, and any thread can set it... meanwhile we can't stablize the first error to report later. E.g., ideally we want to capture the first error no matter where it came from, then keep it with migrate_set_error() so that "query-migrate" on dest later can tell us what was wrong. I think libvirt generally uses that. So as to support a string error, at least we'll need to allow Error** in the thread fn: typedef bool (*MigrationLoadThread)(void *opaque, bool *should_quit, Error **errp); I also changed retval to bool, as I mentioned elsewhere QEMU tries to stick with "bool SOME_FUNCTION(..., Error **errp)" kind of error reporting. Then any thread should only report error to qemu_loadvm_load_thread(), and the report should always be a local Error**, then it further reports to the global error. Something like: static int qemu_loadvm_load_thread(void *thread_opaque) { MigrationIncomingState *mis = migration_incoming_get_current(); struct LoadThreadData *data = thread_opaque; Error *error = NULL; if (!data->function(data->opaque, &mis->should_quit, &error)) { migrate_set_error(migrate_get_current(), error); } return 0; } migrate_set_error() is thread-safe, and it'll only record the 1st error. Then the thread should only read &should_quit, and only set &error. If we want, migrate_set_error() can set &should_quit. PS: I wished we have an unified place to tell whether we should quit incoming migration - we already have multifd_recv_state->exiting, we could have had a global flag like that then we can already use. But I know I'm asking too much.. However would you think it make sense to still have at least Error** report the error and record it?This could work with the following changes/caveats: * Needs g_autoptr(Error) otherwise these Error objects will leak.True.. or just error_free() it after set.* "1st error" here is as arbitrary as with my current code since which thread first acquires the mutex in migrate_set_error() is unspecified.Yes that's still a step forward on being verbose of errors, which is almost always more helpful than a bool.. Not exactly the 1st error in time sequence is ok - we don't strongly ask for that, e.g. if two threads error at merely the same time it's ok we only record one of them no matter which one is first. That's unusual to start with. OTOH it matters on that we fail other threads only _after_ we set_error() for the first error. If so it's mostly always the case the captured error will be valid and the real 1st error.* We still need to test this new error flag (now as migrate_has_error()) in qemu_loadvm_state() to see whether we proceed forward with the migration.Yes, or just to work like what this patch does: set mis->should_quit within the 1st setup of migrate_set_error(). For the longer term, maybe we need to do more to put together all error setup/detection for migration.. but for now we can at least do that for this series to set should_quit=true there only. It should work like your series, only that the boolean won't be writable to data->function() but read-only there, for the sake of capturing the Error string.
migrate_set_error() wouldn't be called until qemu_loadvm_state() exits into process_incoming_migration_co(). Also this does not account other qemu_loadvm_state() callers like qmp_xen_load_devices_state() or load_snapshot(). While these other callers might not use load threads currently, it feels wrong to wait for these threads in qemu_loadvm_state() but set their termination/abort flag as a side effect of completely different function (migrate_set_error()). Having a dedicated abort flag also makes the semantics easy to infer from code since once can simply grep for this flag name (load_threads_abort) to see where it is being written. Its name is also pretty descriptive making it easy to immediately tell what it does.
------------------------------------------------------------------- Also, I am not in favor of replacing load_threads_abort with something else since we still want to ask threads to quit for other reasons, like earlier (non-load threads related) failure in the migration process. That's why we set this flag unconditionally in qemu_loadvm_state() - see also my answer about that flag in the next message.I'm not against having a boolean to say quit, maybe we should have that for !vfio use case too, and I'm ok we introduce one. But I hope two things can work out: - Capture Error* and persist it in query-migrate (aka, use migrate_set_error).
Will do.
- Avoid setting load_threads_abort explicitly in vmstate load path. It should really be part of destroy(), IMHO, as I mentioned in the other email, to recycle load threads in a failure case.
That's the same thread abort flag issue as in the first block of my reply above.
Thanks,
Thanks, Maciej
[Prev in Thread] | Current Thread | [Next in Thread] |