Gleb Natapov wrote:
> With unreliable socket it doesn't matter what write() returns data may
> or may not reach the destination regardless, with reliable sockets
> write() succeeds only after data was acked by the receiver, but it still
> doesn't mean that data will be read from destination socket.

You are correct, and we handle both of these cases appropriately. In the
event that we think we completed a migration successfully and we really
didn't because of a lost network connection, the result is that both the
source and destination are stopped. A third party can then resume the
source and continue along happily.

The case being debated is whether write() can ever actually complete and
yet still return an error. In this case, since we automatically resume
the source on error, the result would be two copies of the VM running.
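
To make the four combinations concrete, here is a rough sketch of the
policy as I described it above. This is not the actual migration code;
the names (vm_state, migration_outcome, split_brain) are made up for
illustration, and it assumes the destination starts running only when it
has received the complete stream.

```c
#include <stdbool.h>

/* Hypothetical model of the policy discussed here, not QEMU code. */
enum vm_state { VM_STOPPED, VM_RUNNING };

struct outcome {
    enum vm_state source;
    enum vm_state destination;
};

/*
 * delivered:   the destination actually received the whole stream
 * write_error: the final write() on the source reported an error
 */
static struct outcome migration_outcome(bool delivered, bool write_error)
{
    struct outcome o;

    /* The source is resumed automatically whenever it sees an error. */
    o.source = write_error ? VM_RUNNING : VM_STOPPED;

    /* Assumption: the destination runs only with a complete stream. */
    o.destination = delivered ? VM_RUNNING : VM_STOPPED;

    return o;
}

/* True when both copies of the VM would be running at once. */
static bool split_brain(struct outcome o)
{
    return o.source == VM_RUNNING && o.destination == VM_RUNNING;
}
```

Only delivered=true with write_error=true gives two running copies. The
lost-connection case (delivered=false, write_error=false) leaves both
sides stopped, which is the recoverable situation.
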
I haven't seen any evidence that this case could actually happen, other
than theoretical speculation. I just looked at the migration code, and
it's not a simple change to try and be conservative wrt this case because
of the way we do buffering.