[PATCH v4 0/5] Improve robustness of migration
From Red Hat internal testing we've had some reports that if attempting
to migrate without passt-repair, the failure mode is uglier than we'd
like. The migration fails, which is somewhat expected, but we don't
correctly roll things back on the source, so it breaks network there as
well. Address this and several other fragilities in the migration.

Everything tested with the basic test suite, plus some additional
testing for the specific functionality of the patches.

For patches 1..2:
 * I get a clean migration if there are no active flows
 * Migration completes, although connections are broken if passt-repair
   isn't connected

For patches 3..5:
 * Migration completes ok if the source and destination hosts have
   different IPs
 * Target correctly sees the bind() failure and abandons the flow
 * Unfortunately, the target-side client doesn't get a reset; it just
   sits there not working.  This is because a) the RST we try to send is
   lost because the queue is still inactive over the migration and b) we
   don't send RSTs or ICMPs for packets from the guest which no longer
   match a flow (I hope to tackle this soon)
 * After manually quitting the stuck client on the target, other
   connectivity works

There are more fragile cases that I'm looking to fix, particularly the
die()s in flow_migrate_source_rollback() and elsewhere.

David Gibson (5):
  migrate, flow: Trivially succeed if migrating with no flows
  migrate, flow: Don't attempt to migrate TCP flows without passt-repair
  tcp: Correct error code handling from tcp_flow_repair_socket()
  tcp: Unconditionally move to CLOSED state on tcp_rst()
  migrate, tcp: Don't flow_alloc_cancel() during incoming migration

 flow.c | 17 +++++++++++++++--
 tcp.c  | 28 +++++++++++++++++++++-------
 2 files changed, 36 insertions(+), 9 deletions(-)

-- 
2.48.1
We could get a migration request when we have no active flows, or at least
none that we need or are able to migrate. In this case, after sending or
receiving the number of flows, we continue to step through various lists.
In the target case, this could include communication with passt-repair. If
passt-repair wasn't started that could cause further errors, but of course
they shouldn't matter if we have nothing to repair.
Make it more obvious that there's nothing to do and avoid such errors by
short-circuiting flow_migrate_{source,target}() if there are no migratable
flows.
Signed-off-by: David Gibson
Migrating TCP flows requires passt-repair in order to use TCP_REPAIR. If
passt-repair is not started, our failure mode is pretty ugly though: we'll
attempt the migration, hitting various problems when we can't enter repair
mode. In some cases we may not roll back these changes properly, meaning
we break network connections on the source.
Our general approach is not to completely block migration if there are
problems, but simply to break any flows we can't migrate. So, if we have
no connection from passt-repair, carry on with the migration, but don't
attempt to migrate any TCP connections.
Signed-off-by: David Gibson
There are two small bugs in error returns from tcp_flow_repair_socket(),
which is supposed to return a negative errno code:
1) On bind() failures, we directly pass on the return code from bind(),
   which is just 0 or -1, instead of a meaningful error code.
2) In the caller, tcp_flow_migrate_target(), we call strerror_() directly
   on the negative error code, but strerror() requires a positive error
   code.
Correct both of these.
Signed-off-by: David Gibson
tcp_rst() attempts to send an RST packet to the guest, and if that succeeds
moves the flow to CLOSED state. However, even if tcp_send_flag() fails,
the flow is still dead: we've usually closed the socket already, and
something has already gone irretrievably wrong. So we should still mark
the flow as CLOSED. That will cause it to be cleaned up, meaning any
future packets from the guest for it won't match a flow, so should generate
new RSTs (they don't at the moment, but that's a separate bug).
Signed-off-by: David Gibson
In tcp_flow_migrate_target(), if we're unable to create and bind the new
socket, we print an error, cancel the flow and carry on. This seems to
make sense based on our policy of generally letting the migration complete
even if some or all flows are lost in the process. But it doesn't quite
work: the flow_alloc_cancel() means that the flows in the target's flow
table are no longer a one-to-one match for the flows the source is
sending data for. This means that data for later flows will be mismatched
to a different flow. Most likely that will cause some nasty error later,
but even worse it might appear to succeed but lead to data corruption due
to incorrectly restoring one of the flows.
Instead, we should leave the flow in the table until we've read all the
data for it, *then* discard it. Technically removing the
flow_alloc_cancel() would be enough for this: if tcp_flow_repair_socket()
fails it leaves conn->sock == -1, which will cause the restore functions
in tcp_flow_migrate_target_ext() to fail, discarding the flow. To make
what's going on clearer (and to avoid extraneous error messages), add
explicit tests for a missing socket later in the migration path: we still
read the data associated with the flow, but explicitly discard it.
Signed-off-by: David Gibson
On Thu, 27 Feb 2025 16:55:12 +1100
David Gibson wrote:
> [...]
Applied. I didn't manage to try out the setup you proposed yet, but I
tried things out in my partial flow (with migration succeeding but
connections closing right away) and everything looks fine there.

-- 
Stefano