[PATCH v5 0/6] Draft, incomplete series introducing state migration
This switches to the new facility with stages and introduces per-flow
migration, with a simple structure roughly based on struct tcp_tap_conn.

I tested this: a simple connection between socat and nc with some text
works (again) for me. Sequence numbers (of course), window parameters,
and MSS all match.

However, it *is* draft quality and *it might contain bugs*. It also
contains a lot of TODO items, missing comments, probably some
leftovers, and so on.

David Gibson (2):
  migrate: Make more handling common rather than vhost-user specific
  migrate: Don't handle the migration channel through epoll

Stefano Brivio (4):
  Introduce facilities for guest migration on top of vhost-user
    infrastructure
  Add interfaces and configuration bits for passt-repair
  vhost_user: Make source quit after reporting migration state
  Implement source and target sides of migration

 Makefile     |  14 +-
 conf.c       |  44 +++++-
 epoll_type.h |   6 +-
 flow.c       | 226 +++++++++++++++++++++++++++++++
 flow.h       |   7 +
 isolation.c  |   2 +-
 migrate.c    | 291 ++++++++++++++++++++++++++++++++++++++++
 migrate.h    |  54 ++++++++
 passt.1      |  11 ++
 passt.c      |  17 ++-
 passt.h      |  17 +++
 repair.c     | 193 ++++++++++++++++++++++++++
 repair.h     |  16 +++
 tap.c        |  65 +--------
 tcp.c        | 372 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tcp_conn.h   |  59 ++++++++
 util.c       |  62 +++++++++
 util.h       |  27 ++++
 vhost_user.c |  69 +++-------
 virtio.h     |   4 -
 vu_common.c  |  49 +------
 vu_common.h  |   2 +-
 22 files changed, 1427 insertions(+), 180 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h
 create mode 100644 repair.c
 create mode 100644 repair.h

-- 
2.43.0
Add migration facilities based on top of the current vhost-user
infrastructure, moving vu_migrate() to migrate.c.
Versioned migration stages define function pointers to be called on
source or target, or data sections that need to be transferred.
The migration header consists of a magic number and a version
identifier.
Co-authored-by: David Gibson
On Wed, Feb 05, 2025 at 01:38:59AM +0100, Stefano Brivio wrote:
[...]
Co-authored-by: David Gibson
Given this, it should also have my S-o-b,
Signed-off-by: David Gibson
Signed-off-by: Stefano Brivio
---
 Makefile    |  12 +--
 migrate.c   | 210 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 migrate.h   |  51 +++++++++++++
 passt.c     |   2 +-
 util.h      |  26 +++++++
 vu_common.c |  58 +++++----------
 vu_common.h |   2 +-
 7 files changed, 315 insertions(+), 46 deletions(-)
 create mode 100644 migrate.c
 create mode 100644 migrate.h

diff --git a/Makefile b/Makefile
index d3d4b78..be89b07 100644
--- a/Makefile
+++ b/Makefile
@@ -38,8 +38,8 @@ FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)
 PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
 	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c packet.c passt.c pasta.c pcap.c pif.c tap.c tcp.c \
-	tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
+	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c tap.c \
+	tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c udp_vu.c util.c \
 	vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
@@ -49,10 +49,10 @@ MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1
 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
 	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
-	lineread.h log.h ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h \
-	siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h tcp_splice.h \
-	tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h vhost_user.h \
-	virtio.h vu_common.h
+	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
+	pcap.h pif.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h tcp_internal.h \
+	tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h udp_vu.h util.h \
+	vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h

C := \#include \nint main(){int a=getrandom(0, 0, 0);}

diff --git a/migrate.c b/migrate.c
new file mode 100644
index 0000000..a7031f9
--- /dev/null
+++ b/migrate.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+/* PASST - Plug A Simple Socket Transport
+ *  for qemu/UNIX domain socket mode
+ *
+ * PASTA - Pack A Subtle Tap Abstraction
+ *  for network namespace/tap device mode
+ *
+ * migrate.c - Migration sections, layout, and routines
+ *
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio
+ */
+
+#include 
+#include 
+
+#include "util.h"
+#include "ip.h"
+#include "passt.h"
+#include "inany.h"
+#include "flow.h"
+#include "flow_table.h"
+
+#include "migrate.h"
+
+/* Current version of migration data */
+#define MIGRATE_VERSION		1
+
+/* Magic identifier for migration data */
+#define MIGRATE_MAGIC		0xB1BB1D1B0BB1D1B0
+
+/* Migration header to send from source */
+static struct migrate_header header = {
+	.magic		= htonll_constant(MIGRATE_MAGIC),
+	.version	= htonl_constant(MIGRATE_VERSION),
+};
+
+/**
+ * migrate_send_block() - Migration stage handler to send verbatim data
+ * @c:		Execution context
+ * @stage:	Migration stage
+ * @fd:		Migration fd
+ *
+ * Sends the buffer in @stage->iov over the migration channel.
+ */
+__attribute__((__unused__))
+static int migrate_send_block(struct ctx *c,
+			      const struct migrate_stage *stage, int fd)
+{
+	(void)c;
+
+	if (write_remainder(fd, &stage->iov, 1, 0) < 0)
+		return errno;
+
+	return 0;
+}
+
+/**
+ * migrate_recv_block() - Migration stage handler to receive verbatim data
+ * @c:		Execution context
+ * @stage:	Migration stage
+ * @fd:		Migration fd
+ *
+ * Reads the buffer in @stage->iov from the migration channel.
+ *
+ * #syscalls:vu readv
+ */
+__attribute__((__unused__))
+static int migrate_recv_block(struct ctx *c,
+			      const struct migrate_stage *stage, int fd)
+{
+	(void)c;
+
+	if (read_remainder(fd, &stage->iov, 1, 0) < 0)
+		return errno;
+
+	return 0;
+}
+
+#define DATA_STAGE(v)					\
+	{						\
+		.name	= #v,				\
+		.source	= migrate_send_block,		\
+		.target	= migrate_recv_block,		\
+		.iov	= { &(v), sizeof(v) },		\
+	}
+
+/* Stages for version 1 */
+static const struct migrate_stage stages_v1[] = {
+	{
+		.name = "flow pre",
+		.target = NULL,
+	},
+	{
+		.name = "flow post",
+		.source = NULL,
+	},
+	{ 0 },
+};
+
+/* Set of data versions */
+static const struct migrate_version versions[] = {
+	{
+		1, stages_v1,
+	},
+	{ 0 },
+};
+
+/**
+ * migrate_source() - Migration as source, send state to hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int migrate_source(struct ctx *c, int fd)
+{
+	const struct migrate_version *v = versions + ARRAY_SIZE(versions) - 1;
+	const struct migrate_stage *s;
+	int ret;
+
+	ret = write_all_buf(fd, &header, sizeof(header));
+	if (ret) {
+		err("Can't send migration header: %s, abort", strerror_(ret));
+		return ret;
+	}
+
+	for (s = v->s; *s->name; s++) {
+		if (!s->source)
+			continue;
+
+		debug("Source side migration: %s", s->name);
+
+		if ((ret = s->source(c, s, fd))) {
+			err("Source migration stage %s: %s, abort", s->name,
+			    strerror_(ret));
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * migrate_target_read_header() - Read header in target
+ * @fd:		Descriptor for state transfer
+ *
+ * Return: version number on success, 0 on failure with errno set
+ */
+static uint32_t migrate_target_read_header(int fd)
+{
+	struct migrate_header h;
+
+	if (read_all_buf(fd, &h, sizeof(h)))
+		return 0;
+
+	debug("Source magic: 0x%016" PRIx64 ", version: %u",
+	      be64toh(h.magic), ntohl_constant(h.version));
+
+	if (ntohll_constant(h.magic) != MIGRATE_MAGIC || !ntohl(h.version)) {
+		errno = EINVAL;
+		return 0;
+	}
+
+	return ntohl(h.version);
+}
+
+/**
+ * migrate_target() - Migration as target, receive state from hypervisor
+ * @c:		Execution context
+ * @fd:		File descriptor for state transfer
+ *
+ * Return: 0 on success, positive error code on failure
+ */
+int migrate_target(struct ctx *c, int fd)
+{
+	const struct migrate_version *v;
+	const struct migrate_stage *s;
+	uint32_t id;
+	int ret;
+
+	id = migrate_target_read_header(fd);
+	if (!id) {
+		ret = errno;
+		err("Migration header check failed: %s, abort", strerror_(ret));
+		return ret;
+	}
+
+	for (v = versions; v->id && v->id == id; v++);
+	if (!v->id) {
+		err("Unsupported version: %u", id);
+		return -ENOTSUP;
+	}
+
+	for (s = v->s; *s->name; s++) {
+		if (!s->target)
+			continue;
+
+		debug("Target side migration: %s", s->name);
+
+		if ((ret = s->target(c, s, fd))) {
+			err("Target migration stage %s: %s, abort", s->name,
+			    strerror_(ret));
+			return ret;
+		}
+	}
+
+	return 0;
+}
diff --git a/migrate.h b/migrate.h
new file mode 100644
index 0000000..3093b6e
--- /dev/null
+++ b/migrate.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright (c) 2025 Red Hat GmbH
+ * Author: Stefano Brivio
+ */
+
+#ifndef MIGRATE_H
+#define MIGRATE_H
+
+/**
+ * struct migrate_header - Migration header from source
+ * @magic:	0xB1BB1D1B0BB1D1B0, network order
+ * @version:	Highest known, target aborts if too old, network order
+ */
+struct migrate_header {
+	uint64_t magic;
+	uint32_t version;
+} __attribute__((packed));
+
+/**
+ * struct migrate_stage - Callbacks and parameters for one stage of migration
+ * @name:	Stage name (for debugging)
+ * @source:	Callback to implement this stage on the source
+ * @target:	Callback to implement this stage on the target
+ * @iov:	Optional data section to transfer
+ */
+struct migrate_stage {
+	const char *name;
+	int (*source)(struct ctx *c,
+		      const struct migrate_stage *stage, int fd);
+	int (*target)(struct ctx *c,
+		      const struct migrate_stage *stage, int fd);
+
+	/* FIXME: rollback callbacks? */
+
+	struct iovec iov;
+};
+
+/**
+ * struct migrate_version - Stages for a particular protocol version
+ * @id:		Version number, host order
+ * @s:		Ordered array of stages, NULL-terminated
+ */
+struct migrate_version {
+	uint32_t id;
+	const struct migrate_stage *s;
+};
+
+int migrate_source(struct ctx *c, int fd);
+int migrate_target(struct ctx *c, int fd);
+
+#endif /* MIGRATE_H */
diff --git a/passt.c b/passt.c
index b1c8ab6..184d4e5 100644
--- a/passt.c
+++ b/passt.c
@@ -358,7 +358,7 @@ loop:
 		vu_kick_cb(c.vdev, ref, &now);
 		break;
 	case EPOLL_TYPE_VHOST_MIGRATION:
-		vu_migrate(c.vdev, eventmask);
+		vu_migrate(&c, eventmask);
 		break;
 	default:
 		/* Can't happen */
diff --git a/util.h b/util.h
index 23b165c..1aed629 100644
--- a/util.h
+++ b/util.h
@@ -122,12 +122,38 @@
 	 (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))
 #endif
 
+#ifndef __bswap_constant_32
+#define __bswap_constant_32(x)						\
+	((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >>  8) |	\
+	 (((x) & 0x0000ff00) <<  8) | (((x) & 0x000000ff) << 24))
+#endif
+
+#ifndef __bswap_constant_64
+#define __bswap_constant_64(x)			\
+	((((x) & 0xff00000000000000ULL) >> 56) |	\
+	 (((x) & 0x00ff000000000000ULL) >> 40) |	\
+	 (((x) & 0x0000ff0000000000ULL) >> 24) |	\
+	 (((x) & 0x000000ff00000000ULL) >>  8) |	\
+	 (((x) & 0x00000000ff000000ULL) <<  8) |	\
+	 (((x) & 0x0000000000ff0000ULL) << 24) |	\
+	 (((x) & 0x000000000000ff00ULL) << 40) |	\
+	 (((x) & 0x00000000000000ffULL) << 56))
+#endif
+
 #if __BYTE_ORDER == __BIG_ENDIAN
 #define	htons_constant(x)	(x)
 #define	htonl_constant(x)	(x)
+#define	htonll_constant(x)	(x)
+#define	ntohs_constant(x)	(x)
+#define	ntohl_constant(x)	(x)
+#define	ntohll_constant(x)	(x)
 #else
 #define	htons_constant(x)	(__bswap_constant_16(x))
 #define	htonl_constant(x)	(__bswap_constant_32(x))
+#define	htonll_constant(x)	(__bswap_constant_64(x))
+#define	ntohs_constant(x)	(__bswap_constant_16(x))
+#define	ntohl_constant(x)	(__bswap_constant_32(x))
+#define	ntohll_constant(x)	(__bswap_constant_64(x))
 #endif
 
 /**
diff --git a/vu_common.c b/vu_common.c
index ab04d31..3d41824 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -5,6 +5,7 @@
  * common_vu.c - vhost-user common UDP and TCP functions
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -17,6 +18,7 @@
 #include "vhost_user.h"
 #include "pcap.h"
 #include "vu_common.h"
+#include "migrate.h"
 
 #define VU_MAX_TX_BUFFER_NB	2
 
@@ -305,48 +307,28 @@ err:
 }
 
 /**
- * vu_migrate() - Send/receive passt insternal state to/from QEMU
- * @vdev:	vhost-user device
+ * vu_migrate() - Send/receive passt internal state to/from QEMU
+ * @c:		Execution context
  * @events:	epoll events
  */
-void vu_migrate(struct vu_dev *vdev, uint32_t events)
+void vu_migrate(struct ctx *c, uint32_t events)
 {
-	int ret;
+	struct vu_dev *vdev = c->vdev;
+	int rc = EIO;
 
-	/* TODO: collect/set passt internal state
-	 * and use vdev->device_state_fd to send/receive it
-	 */
 	debug("vu_migrate fd %d events %x", vdev->device_state_fd, events);
-	if (events & EPOLLOUT) {
-		debug("Saving backend state");
-
-		/* send some stuff */
-		ret = write(vdev->device_state_fd, "PASST", 6);
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		vdev->device_state_result = ret == -1 ? -1 : 0;
-		/* Closing the file descriptor signals the end of transfer */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	} else if (events & EPOLLIN) {
-		char buf[6];
-
-		debug("Loading backend state");
-		/* read some stuff */
-		ret = read(vdev->device_state_fd, buf, sizeof(buf));
-		/* value to be returned by VHOST_USER_CHECK_DEVICE_STATE */
-		if (ret != sizeof(buf)) {
-			vdev->device_state_result = -1;
-		} else {
-			ret = strncmp(buf, "PASST", sizeof(buf));
-			vdev->device_state_result = ret == 0 ? 0 : -1;
-		}
-	} else if (events & EPOLLHUP) {
-		debug("Closing migration channel");
-		/* The end of file signals the end of the transfer. */
-		epoll_del(vdev->context, vdev->device_state_fd);
-		close(vdev->device_state_fd);
-		vdev->device_state_fd = -1;
-	}
+	if (events & EPOLLOUT)
+		rc = migrate_source(c, vdev->device_state_fd);
+	else if (events & EPOLLIN)
+		rc = migrate_target(c, vdev->device_state_fd);
+
+	/* EPOLLHUP without EPOLLIN/EPOLLOUT, or EPOLLERR? Migration failed */
+
+	vdev->device_state_result = rc;
+
+	epoll_ctl(c->epollfd, EPOLL_CTL_DEL, vdev->device_state_fd, NULL);
+	debug("Closing migration channel");
+	close(vdev->device_state_fd);
+	vdev->device_state_fd = -1;
 }
diff --git a/vu_common.h b/vu_common.h
index d56c021..69c4006 100644
--- a/vu_common.h
+++ b/vu_common.h
@@ -57,5 +57,5 @@ void vu_flush(const struct vu_dev *vdev, struct vu_virtq *vq,
 void vu_kick_cb(struct vu_dev *vdev, union epoll_ref ref,
 		const struct timespec *now);
 int vu_send_single(const struct ctx *c, const void *buf, size_t size);
-void vu_migrate(struct vu_dev *vdev, uint32_t events);
+void vu_migrate(struct ctx *c, uint32_t events);
 #endif /* VU_COMMON_H */
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
From: David Gibson
In vhost-user mode, by default, create a second UNIX domain socket
accepting connections from passt-repair, with the usual listener
socket.
When we need to set or clear TCP_REPAIR on sockets, we'll send them
via SCM_RIGHTS to passt-repair, which sets the socket option values we
ask for.
To that end, introduce batched functions to request TCP_REPAIR
settings on sockets, so that we don't have to send one message per
socket on migration. When needed, repair_flush() sends the message
and checks for the reply.
Signed-off-by: Stefano Brivio
On migration, the source process asks passt-helper to set TCP sockets
in repair mode, dumps the information we need to migrate connections,
and closes them.
At this point, we can't pass them back to passt-helper using
SCM_RIGHTS, because they are closed, from that perspective, and
sendmsg() will give us EBADF. But if we don't clear repair mode, the
port they are bound to will not be available for binding in the
target.
Terminate once we're done with the migration and we reported the
state. This is equivalent to clearing repair mode on the sockets we
just closed.
Signed-off-by: Stefano Brivio
On Wed, Feb 05, 2025 at 01:39:03AM +0100, Stefano Brivio wrote:
[...]
As we've discussed, quitting still makes sense, but the description
above is not really accurate. Perhaps,

===
Once we've passed the migration's "point of no return", there's no way
to resume the guest on the source side, because we no longer own the
connections. There's not really anything we can do except exit.
===

Except.. thinking about it, I'm not sure that's technically true.
After migration, the source qemu enters a kind of limbo state. I
suppose for the case of to-disk migration (savevm) the guest can
actually be resumed. Which for us is not really compatible with
completing at least a local migration properly. Not really sure what
to do about that.

I think it's also technically possible to use monitor commands to boot
up essentially an entirely new guest instance in the original qemu, in
which case for us it would make sense to basically reset ourselves
(flush the flow table).

Hrm.. we really need to know the sequence of events in a bit more
detail to get this right (not that this stops improving the guts of
the logic in the meantime). I'm asking around to see if I can find who
did the migration stuff for virtiofsd, so we can compare notes.
Signed-off-by: Stefano Brivio
---
 vhost_user.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/vhost_user.c b/vhost_user.c
index b107d0f..70773d6 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -997,6 +997,8 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
 	return false;
 }
 
+static bool quit_on_device_state = false;
+
 /**
  * vu_set_device_state_fd_exec() - Set the device state migration channel
  * @vdev:	vhost-user device
@@ -1024,6 +1026,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 	migrate_request(vdev->context, msg->fds[0],
 			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);
 
+	if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE)
+		quit_on_device_state = true;
+
 	/* We don't provide a new fd for the data transfer */
 	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);
 
@@ -1201,4 +1206,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)
 
 	if (reply_requested)
 		vu_send_reply(fd, &msg);
+
+	if (quit_on_device_state &&
+	    msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) {
+		info("Migration complete, exiting");
+		exit(EXIT_SUCCESS);
+	}
 }
To: German whom I feel okay to To: now that
https://github.com/kubevirt/kubevirt/pull/13756 is merged, and Hanna who
knows one thing or two about vhost-user based migration.
This is Stefano from your neighbouring mailing list, passt-dev. David
is wondering:
On Wed, 5 Feb 2025 13:09:42 +1100
David Gibson wrote:
On Wed, Feb 05, 2025 at 01:39:03AM +0100, Stefano Brivio wrote:
On migration, the source process asks passt-helper to set TCP sockets in repair mode, dumps the information we need to migrate connections, and closes them.
At this point, we can't pass them back to passt-helper using SCM_RIGHTS, because they are closed, from that perspective, and sendmsg() will give us EBADF. But if we don't clear repair mode, the port they are bound to will not be available for binding in the target.
Terminate once we're done with the migration and we reported the state. This is equivalent to clearing repair mode on the sockets we just closed.
As we've discussed, quitting still makes sense, but the description above is not really accurate. Perhaps,
=== Once we've passed the migration's "point of no return", there's no way to resume the guest on the source side, because we no longer own the connections. There's not really anything we can do except exit. ===
Except.. thinking about it, I'm not sure that's technically true. After migration, the source qemu enters a kind of limbo state. I suppose for the case of to-disk migration (savevm) the guest can actually be resumed. Which for us is not really compatible with completing at least a local migration properly. Not really sure what to do about that.
I think it's also technically possible to use monitor commands to boot up essentially an entirely new guest instance in the original qemu, in which case for us it would make sense to basically reset ourselves (flush the flow table).
Hrm.. we really need to know the sequence of events in a bit more detail to get this right (not that this stops improving the guts of the logic in the meantime).
I'm asking around to see if I can find who did the migration stuff for virtiofsd, so we can compare notes.
...what we should be doing in the source passt at different stages of
moving our TCP connections (or failing to move them) over to the
target, which might be inspired by what you're doing with your...
filesystem things in virtiofsd.

We're taking for granted that as long as we have a chance to detect
failure (e.g. we can't dump sequence numbers from a TCP socket in the
source), we should use that to abort the whole thing.

Once we're past that point, we have several options. And actually,
that's regardless of failure: because we also have the question of
what to do if we see that nothing went wrong.

We can exit in the source, for example (this is what the patch
implements): wait for VHOST_USER_CHECK_DEVICE_STATE, report that, and
quit.

Or we can just clear up all our connections and resume (start from a
blank state). Or do that, only if we didn't smell failure.

Would you have some pointers, general guidelines, ideas? I know that
the topic is a bit broad, but I'm hopeful that you have a lot of clear
answers for us. :)

Thanks.
Signed-off-by: Stefano Brivio
[...]
-- Stefano
On 05.02.25 06:47, Stefano Brivio wrote:
To: German whom I feel okay to To: now that https://github.com/kubevirt/kubevirt/pull/13756 is merged, and Hanna who knows one thing or two about vhost-user based migration.
This is Stefano from your neighbouring mailing list, passt-dev. David is wondering:
On Wed, 5 Feb 2025 13:09:42 +1100 David Gibson
wrote: On Wed, Feb 05, 2025 at 01:39:03AM +0100, Stefano Brivio wrote:
[...]
Terminate once we're done with the migration and we reported the state. This is equivalent to clearing repair mode on the sockets we just closed. As we've discussed, quitting still makes sense, but the description above is not really accurate. Perhaps,
=== Once we've passed the migration's "point of no return", there's no way to resume the guest on the source side, because we no longer own the connections. There's not really anything we can do except exit. ===
Except.. thinking about it, I'm not sure that's technically true. After migration, the source qemu enters a kind of limbo state. I suppose for the case of to-disk migration (savevm) the guest can actually be resumed. Which for us is not really compatible with completing at least a local migration properly. Not really sure what to do about that.
I wouldn’t call it a limbo state, AFAIU the source side just continues to be the VM that it was, just paused. So in case of migration failure on the destination (which is where virtiofsd migration errors are reported), you can (and should, I believe?) continue the source VM. But even in case of success, you could have it continue, and then you’d have cloned the VM (which I think has always been a bit of a problem for networking specifically).
I think it's also technically possible to use monitor commands to boot up essentially an entirely new guest instance in the original qemu, in which case for us it would make sense to basically reset ourselves (flush the low table).
Hrm.. we really need to know the sequence of events in a bit more detail to get this right (not that this stops improving the guts of the logic in the meantime).
I'm asking around to see if I can find who did the migration stuff or virtiofsd, so we can compare notes. ...what we should be doing in the source passt at different stages of moving our TCP connections (or failing to move them) over to the target, which might be inspired by what you're doing with your... filesystem things in virtiofsd.
As far as I understand, with passt, there’s ownership that needs to
be… passet. (ha ha) Which leads to that “point of no return”. We don’t
have such a thing in virtiofsd, because ownership can be and is shared
between source and destination instance once the migration state has
started to be generated (and we know that the guest is stopped during
vhost-user migration).

So how it works in virtiofsd is:

- Depending on how it’s done, collecting our state can take too much
  time for VM downtime. So once we see F_LOG_ALL set, we begin
  collecting it. At this point, the guest is still running, so we need
  to be able to change the collected state when the guest does
  something that’d affect it.

- Once we receive SET_DEVICE_STATE, we serialize our state.*

- Errors are returned through CHECK_DEVICE_STATE.

- The source instance remains open and operational. If the (source)
  guest is unpaused, it can continue to use the filesystem.

The destination instance meanwhile receives that state (while the
guest is stopped), deserializes and applies it, and is then ready to
begin operation whenever requests start coming in from the guest.

There is no point of no return for us, because the source instance
remains operational throughout; nothing is moved, just shared. Until
execution continues on the destination (which I think you can only
detect by seeing the virtqueue being started?), anything can still
report an error and thus fail migration, requiring continuing on the
source.

Besides waiting for which instance will eventually continue execution,
the only thing that comes to mind that might be helpful is the
return-path feature; with it, the source instance will at least know
whether the destination was successful in migrating or not. It isn’t
reflected in vhost-user (yet), though.

* Note that in case of snapshotting (migration to file) with the
  background-snapshot migration flag set, F_LOG_ALL won’t be set. So
  we must also prepare for receiving SET_DEVICE_STATE without
  F_LOG_ALL set, in which case we must then first collect our state
  before we can serialize it.

Hanna
[...]
Thanks for the prompt answer!
We'll go through it in more detail in a moment and probably come back
with *a few* more questions, but at a first glance it seems to clarify
things exactly as we needed.
--
Stefano
On Wed, 5 Feb 2025 09:58:11 +0100
Hanna Czenczek wrote:
[...]
On Wed, Feb 05, 2025 at 09:58:11AM +0100, Hanna Czenczek wrote:
On 05.02.25 06:47, Stefano Brivio wrote:
[...]
I wouldn’t call it a limbo state, AFAIU the source side just continues to be the VM that it was, just paused. So in case of migration failure on the destination (which is where virtiofsd migration errors are reported), you can (and should, I believe?) continue the source VM. But even in case of success, you could have it continue, and then you’d have cloned the VM (which I think has always been a bit of a problem for networking specifically).
Right, I kind of realised this later in my thinking.
I think it's also technically possible to use monitor commands to boot up essentially an entirely new guest instance in the original qemu, in which case for us it would make sense to basically reset ourselves (flush the flow table).
Hrm.. we really need to know the sequence of events in a bit more detail to get this right (not that this stops improving the guts of the logic in the meantime).
I'm asking around to see if I can find who did the migration stuff for virtiofsd, so we can compare notes. ...what we should be doing in the source passt at different stages of moving our TCP connections (or failing to move them) over to the target, which might be inspired by what you're doing with your... filesystem things in virtiofsd.
As far as I understand, with passt, there’s ownership that needs to be… passet. (ha ha) Which leads to that “point of no return”. We don’t have such a thing in virtiofsd, because ownership can be and is shared between source and destination instance once the migration state has started to be generated (and we know that the guest is stopped during vhost-user migration).
Right, exactly. Because we're a network device with active connections, those connections can't be continued by both the source and destination guests. Therefore, there has to be some point at which ownership is irrevocably transferred. This shows up most obviously for a local to local migration, because on the target we fail to connect() if the source hasn't yet closed its corresponding sockets. I guess with a more normal L2 network device the problem is still there, but appears at a different level: if you resume both guests that'll somewhat work, but one or both will shortly get network errors (TCP RSTs, most likely) as the packets they're trying to send on the same connection conflict.
So how it works in virtiofsd is:
- Depending on how it's done, collecting our state can take too much time for the VM downtime. So once we see F_LOG_ALL set, we begin collecting it. At this point, the guest is still running, so we need to be able to change the collected state when the guest does something that'd affect it.
Oh! That's a huge insight, thanks. We also (might) have trouble with having too much state to transfer (large socket buffers, for example). I was thinking there was nothing we could do about it, because we didn't know about the migration until the guest was already stopped - but I'd forgotten that we get the F_LOG_ALL notification. That might make a number of improvements possible.
- Once we receive SET_DEVICE_STATE, we serialize our state.*
- Errors are returned through CHECK_DEVICE_STATE.
- The source instance remains open and operational. If the (source) guest is unpaused, it can continue to use the filesystem.
The destination instance meanwhile receives that state (while the guest is stopped), deserializes and applies it, and is then ready to begin operation whenever requests start coming in from the guest.
There is no point of no return for us, because the source instance remains operational throughout; nothing is moved, just shared. Until execution continues on the destination (which I think you can only detect by seeing the virtqueue being started?), anything can still report an error and thus fail migration, requiring continuing on the source.
Besides waiting for which instance will eventually continue execution, the only thing that comes to mind that might be helpful is the return-path feature; with it, the source instance will at least know whether the destination was successful in migrating or not. It isn’t reflected in vhost-user (yet), though.
Right, that's kind of what we'd be looking for, but it's kind of worse, because ideally we'd want a two-way handshake. The target says "I've got all the information" to the source, then the source drops the connections and says to the target "I've dropped all the connections", and only then does the target re-establish the connections.
* Note that in case of snapshotting (migration to file) with the background-snapshot migration flag set, F_LOG_ALL won’t be set. So we must also prepare for receiving SET_DEVICE_STATE without F_LOG_ALL set, in which case we must then first collect our state before we can serialize it.
Oh, ouch. For snapshotting, we'd be expecting to resume on the source, even if it's after a long delay. On the target there's really no hope of maintaining active connections. Maybe we should not transfer TCP connections if F_LOG_ALL isn't set.
Hanna
We're taking for granted that as long as we have a chance to detect failure (e.g. we can't dump sequence numbers from a TCP socket in the source), we should use that to abort the whole thing.
Once we're past that point, we have several options, and they actually apply regardless of failure: we also have the question of what to do if we see that nothing went wrong.
We can exit in the source, for example (this is what this patch implements): wait for VHOST_USER_CHECK_DEVICE_STATE, report the state, and quit.
Or we can just clear up all our connections and resume (start from a blank state). Or do that, only if we didn't smell failure.
Would you have some pointers, general guidelines, ideas? I know that the topic is a bit broad, but I'm hopeful that you have a lot of clear answers for us. :) Thanks.
Signed-off-by: Stefano Brivio
---
 vhost_user.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/vhost_user.c b/vhost_user.c
index b107d0f..70773d6 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -997,6 +997,8 @@ static bool vu_send_rarp_exec(struct vu_dev *vdev,
 	return false;
 }

+static bool quit_on_device_state = false;
+
 /**
  * vu_set_device_state_fd_exec() - Set the device state migration channel
  * @vdev:	vhost-user device
@@ -1024,6 +1026,9 @@ static bool vu_set_device_state_fd_exec(struct vu_dev *vdev,
 	migrate_request(vdev->context, msg->fds[0],
 			direction == VHOST_USER_TRANSFER_STATE_DIRECTION_LOAD);

+	if (direction == VHOST_USER_TRANSFER_STATE_DIRECTION_SAVE)
+		quit_on_device_state = true;
+
 	/* We don't provide a new fd for the data transfer */
 	vmsg_set_reply_u64(msg, VHOST_USER_VRING_NOFD_MASK);

@@ -1201,4 +1206,10 @@ void vu_control_handler(struct vu_dev *vdev, int fd, uint32_t events)

 	if (reply_requested)
 		vu_send_reply(fd, &msg);
+
+	if (quit_on_device_state &&
+	    msg.hdr.request == VHOST_USER_CHECK_DEVICE_STATE) {
+		info("Migration complete, exiting");
+		exit(EXIT_SUCCESS);
+	}
 }
--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
This implements flow preparation on the source, transfer of data with
a format roughly inspired by struct tcp_tap_conn, and flow insertion
on the target, with all the appropriate window options, window
scaling, MSS, etc.
The target side is rather convoluted because we first need to create
sockets and switch them to repair mode, before we can apply options
that are *not* stored in the flow table. However, we don't want to
request repair mode for sockets one by one. So we need to do this in
several steps.
A hack to issue connect() on the "RARP" message should be easy to
enable; I left a couple of comments to that effect.
This is very much draft quality, but I tested the whole flow, and it
works for me. Window parameters and MSS match, too.
Signed-off-by: Stefano Brivio
On Wed, Feb 05, 2025 at 01:39:04AM +0100, Stefano Brivio wrote:
This implements flow preparation on the source, transfer of data with a format roughly inspired by struct tcp_tap_conn, and flow insertion on the target, with all the appropriate window options, window scaling, MSS, etc.
The target side is rather convoluted because we first need to create sockets and switch them to repair mode, before we can apply options that are *not* stored in the flow table. However, we don't want to request repair mode for sockets one by one. So we need to do this in several steps.
A hack to issue connect() on the "RARP" message should be easy to enable; I left a couple of comments to that effect.
This is very much draft quality, but I tested the whole flow, and it works for me. Window parameters and MSS match, too.
Signed-off-by: Stefano Brivio
[snip]
diff --git a/isolation.c b/isolation.c
index c944fb3..df58bb8 100644
--- a/isolation.c
+++ b/isolation.c
@@ -377,7 +377,7 @@ void isolate_postfork(const struct ctx *c)
 {
 	struct sock_fprog prog;

-	prctl(PR_SET_DUMPABLE, 0);
+//	prctl(PR_SET_DUMPABLE, 0);
Looks like a stray debugging change made it in here. Fwiw, I keep a branch around with debugging hacks including exactly this. To make it harder to accidentally leak debug hacks into "real" series, I usually rebase between my debugging branch and main. In this case it conflicted with the patch from my debugging branch, of course, which is why I spotted it.
 	switch (c->mode) {
 	case MODE_PASST:
--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson