Some concerns came to my mind during the weekend and now I tried
a bit quickly to fix them. A couple of functions became horrible
as a result.
More tests: IPv6 too, iperf3 inbound and outbound.
From another read of:
https://github.com/checkpoint-restore/criu/blob/criu-dev/soccr/soccr.c
I noticed that the way I was dumping and restoring queues
was almost entirely bogus. It should be fixed now. Handling of
FIN might be needed, though: we might have an off-by-one when we
restore queues, I guess.
Riddle of this version: enabling the flow_migrate_source_early()
callback (shrinking the window early on) breaks things if I run
the source without strace: the source (I think?) sends a reset
when we close the sockets after we switch repair mode on for the
second time (we flip that twice).
So it's commented out for the moment. By itself, it works, and
effectively limits what the peer sends during migration.
There must be some other race or issue in passt-repair or in the
matching interface, but I couldn't figure it out.
David Gibson (1):
migrate: Migrate guest observed addresses
Stefano Brivio (5):
migrate: Skeleton of live migration logic
Add interfaces and configuration bits for passt-repair
vhost_user: Make source quit after reporting migration state
migrate: Migrate TCP flows
test: Add migration tests
Makefile | 14 +-
conf.c | 44 ++-
epoll_type.h | 6 +-
flow.c | 248 ++++++++++++
flow.h | 8 +
migrate.c | 309 +++++++++++++++
migrate.h | 54 +++
passt.1 | 11 +
passt.c | 21 +-
passt.h | 15 +
repair.c | 211 ++++++++++
repair.h | 16 +
tap.c | 65 +--
tcp.c | 789 +++++++++++++++++++++++++++++++++++++
tcp_conn.h | 95 +++++
test/lib/layout | 55 ++-
test/lib/setup | 134 +++++++
test/lib/test | 48 +++
test/migrate/basic | 59 +++
test/migrate/bidirectional | 64 +++
test/migrate/iperf3_in4 | 50 +++
test/migrate/iperf3_in6 | 58 +++
test/migrate/iperf3_out4 | 50 +++
test/migrate/iperf3_out6 | 58 +++
test/run | 10 +
util.c | 62 +++
util.h | 30 ++
vhost_user.c | 68 +---
virtio.h | 4 -
vu_common.c | 49 +--
vu_common.h | 2 +-
31 files changed, 2524 insertions(+), 183 deletions(-)
create mode 100644 migrate.c
create mode 100644 migrate.h
create mode 100644 repair.c
create mode 100644 repair.h
create mode 100644 test/migrate/basic
create mode 100644 test/migrate/bidirectional
create mode 100644 test/migrate/iperf3_in4
create mode 100644 test/migrate/iperf3_in6
create mode 100644 test/migrate/iperf3_out4
create mode 100644 test/migrate/iperf3_out6
--
2.43.0
This should be a relatively common case and I'm a bit surprised it's
been broken since I added the "gateway mapping" functionality, but it
doesn't happen with Podman, and not with systemd-resolved or similar
local proxies, and also not with servers where typically the gateway
is just a router and not a DNS resolver. That could be the reason why
nobody noticed until now.
By default, we'll map the address of the default gateway, in
containers and guests, to represent "the host", so that we have a
well-defined way to reach the host. Say:
0.0029: NAT to host 127.0.0.1: 192.168.100.1
But if the host gateway is also a DNS resolver:
0.0029: DNS:
0.0029: 192.168.100.1
then we'll send DNS queries directed to it to the host instead:
0.0372: Flow 0 (INI): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => ?
0.0372: Flow 0 (TGT): INI -> TGT
0.0373: Flow 0 (TGT): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
0.0373: Flow 0 (UDP flow): TGT -> TYPED
0.0373: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
0.0373: Flow 0 (UDP flow): Side 0 hash table insert: bucket: 31049
0.0374: Flow 0 (UDP flow): TYPED -> ACTIVE
0.0374: Flow 0 (UDP flow): TAP [192.168.100.157]:41892 -> [192.168.100.1]:53 => HOST [0.0.0.0]:41892 -> [127.0.0.1]:53
which doesn't quite work, of course:
0.0374: pasta: epoll event on UDP reply socket 95 (events: 0x00000008)
0.0374: ICMP error on UDP socket 95: Connection refused
unless the host is a resolver itself... but then we wouldn't find the
address of the gateway in its /etc/resolv.conf, presumably.
Fix this by making an exception for DNS traffic: if the default
gateway is a resolver, match on DNS traffic going to the default
gateway, and explicitly forward it to the configured resolver.
Reported-by: Prafulla Giri <prafulla.giri(a)protonmail.com>
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
conf.c | 16 ++++++++++------
passt.1 | 14 ++++++++++----
2 files changed, 20 insertions(+), 10 deletions(-)
diff --git a/conf.c b/conf.c
index df2b016..360ce21 100644
--- a/conf.c
+++ b/conf.c
@@ -426,10 +426,12 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.dns_host))
c->ip4.dns_host = ns4;
- /* Guest or container can only access local addresses via
- * redirect
+ /* Special handling if guest or container can only access local
+ * addresses via redirect, or if the host gateway is also a
+ * resolver and we shadow its address
*/
- if (IN4_IS_ADDR_LOOPBACK(&ns4)) {
+ if (IN4_IS_ADDR_LOOPBACK(&ns4) ||
+ IN4_ARE_ADDR_EQUAL(&ns4, &c->ip4.map_host_loopback)) {
if (IN4_IS_ADDR_UNSPECIFIED(&c->ip4.map_host_loopback))
return;
@@ -445,10 +447,12 @@ static void add_dns_resolv(struct ctx *c, const char *nameserver,
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.dns_host))
c->ip6.dns_host = ns6;
- /* Guest or container can only access local addresses via
- * redirect
+ /* Special handling if guest or container can only access local
+ * addresses via redirect, or if the host gateway is also a
+ * resolver and we shadow its address
*/
- if (IN6_IS_ADDR_LOOPBACK(&ns6)) {
+ if (IN6_IS_ADDR_LOOPBACK(&ns6) ||
+ IN6_ARE_ADDR_EQUAL(&ns6, &c->ip6.map_host_loopback)) {
if (IN6_IS_ADDR_UNSPECIFIED(&c->ip6.map_host_loopback))
return;
diff --git a/passt.1 b/passt.1
index d9cd33e..7b6eabd 100644
--- a/passt.1
+++ b/passt.1
@@ -941,10 +941,16 @@ with destination 127.0.0.10, and the default IPv4 gateway is 192.0.2.1, while
the last observed source address from guest or namespace is 192.0.2.2, this will
be translated to a connection from 192.0.2.1 to 192.0.2.2.
-Similarly, for traffic coming from guest or namespace, packets with
-destination address corresponding to the \fB\-\-map-host-loopback\fR
-address will have their destination address translated to a loopback
-address.
+Similarly, for traffic coming from guest or namespace, packets with destination
+address corresponding to the \fB\-\-map-host-loopback\fR address will have their
+destination address translated to a loopback address.
+
+As an exception, traffic identified as DNS, originally directed to the
+\fB\-\-map-host-loopback\fR address, if this address matches a resolver address
+on the host, is \fBnot\fR translated to loopback, but rather handled in the same
+way as if specified as \-\-dns-forward address, if no such option was given.
+In the common case where the host gateway also acts a resolver, this avoids that
+the host mapping shadows the gateway/resolver itself.
.SS Handling of local traffic in pasta
--
2.43.0
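For illustration only, the IPv4 condition changed by the patch above can be sketched as a standalone predicate. The function name ns4_needs_redirect() is hypothetical and not part of passt; the sketch uses plain struct in_addr instead of the context structure, and covers only the condition, not what add_dns_resolv() does once it holds.

```c
#include <stdbool.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* Hypothetical stand-in for the IPv4 condition in add_dns_resolv():
 * a host resolver needs special handling if it is a loopback address,
 * or if it equals the address we map to the host (by default, the
 * default gateway), because that mapping would otherwise shadow it.
 */
static bool ns4_needs_redirect(const struct in_addr *ns,
			       const struct in_addr *map_host_loopback)
{
	/* 127.0.0.0/8 check, done on the host-order address */
	bool loopback = (ntohl(ns->s_addr) >> 24) == 127;
	bool shadowed = ns->s_addr == map_host_loopback->s_addr;

	return loopback || shadowed;
}
```

With the addresses from the log above, a resolver at 192.168.100.1 and a host mapping at 192.168.100.1 match the second half of the condition, which the old loopback-only check missed.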
v12:
- clean up, add comments, complete error handling
- add iperf3 test with 6 concurrent flows and migration under flood
This looks reasonably stable and polished to me, probably
enough to be merged. The behaviour now looks solid under flood,
too.
Still to do, I guess:
1. more iperf3 tests, with IPv6, in the other direction, and
with mixed flows
2. support for other types of flow (assuming that we care at
all... things already work)
3. find a way to close the window socket-side early on, and
reopen it in the target, to entirely avoid retransmissions.
I can set the socket in repair mode in the source, fetch
TCP_REPAIR_WINDOW parameters, just change rcv_wnd to 0, set
them back, then disable repair mode. Nothing bad happens and
the window probe from TCP_REPAIR_OFF is visible, but the
window isn't updated, because it's not actually recalculated
using rcv_wnd meanwhile. Dummy send/recv() calls don't really
change things, either. But there must be some other way.
I haven't tried doing this guest-side but that part should
be trivial.
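A minimal sketch of the sequence described in item 3, for reference. The helper names are hypothetical and this is not the code from the series; it assumes a connected TCP socket and CAP_NET_ADMIN, and as noted above it does not actually change the advertised window.

```c
#include <errno.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

/* Hypothetical helper: copy of the window parameters with the receive
 * window closed, everything else untouched.
 */
static struct tcp_repair_window window_closed(struct tcp_repair_window wnd)
{
	wnd.rcv_wnd = 0;
	return wnd;
}

/* Hypothetical helper: enter repair mode on socket s, fetch the
 * TCP_REPAIR_WINDOW parameters, set them back with rcv_wnd at 0, then
 * leave repair mode. Returns 0 on success, negative errno on failure.
 */
static int sock_try_close_window(int s)
{
	struct tcp_repair_window wnd;
	socklen_t sl = sizeof(wnd);
	int on = 1, off = 0;	/* repair mode on, repair mode off */
	int ret = 0;

	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on)))
		return -errno;

	if (getsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW, &wnd, &sl)) {
		ret = -errno;
	} else {
		wnd = window_closed(wnd);
		if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR_WINDOW,
			       &wnd, sizeof(wnd)))
			ret = -errno;
	}

	/* Always try to leave repair mode: this is what sends the probe */
	if (setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off)) && !ret)
		ret = -errno;

	return ret;
}
```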
David Gibson (1):
migrate: Migrate guest observed addresses
Stefano Brivio (5):
migrate: Skeleton of live migration logic
Add interfaces and configuration bits for passt-repair
vhost_user: Make source quit after reporting migration state
migrate: Migrate TCP flows
test: Add migration tests
Makefile | 14 +-
conf.c | 44 ++-
epoll_type.h | 6 +-
flow.c | 202 +++++++++++++
flow.h | 6 +
migrate.c | 300 +++++++++++++++++++
migrate.h | 54 ++++
passt.1 | 11 +
passt.c | 15 +-
passt.h | 15 +
repair.c | 211 ++++++++++++++
repair.h | 16 +
tap.c | 65 +----
tcp.c | 583 +++++++++++++++++++++++++++++++++++++
tcp_conn.h | 87 ++++++
test/lib/layout | 55 +++-
test/lib/setup | 128 ++++++++
test/lib/test | 48 +++
test/migrate/basic | 54 ++++
test/migrate/bidirectional | 59 ++++
test/migrate/iperf3_out4 | 42 +++
test/run | 10 +
util.c | 62 ++++
util.h | 30 ++
vhost_user.c | 68 ++---
virtio.h | 4 -
vu_common.c | 49 +---
vu_common.h | 2 +-
28 files changed, 2060 insertions(+), 180 deletions(-)
create mode 100644 migrate.c
create mode 100644 migrate.h
create mode 100644 repair.c
create mode 100644 repair.h
create mode 100644 test/migrate/basic
create mode 100644 test/migrate/bidirectional
create mode 100644 test/migrate/iperf3_out4
--
2.43.0
I've spent today trying to debug this failure. I've gathered a bunch
of information, but no breakthroughs, alas. At this point I suspect a
kernel bug, though I hope I'm wrong.
# Background.
I think these are as you described it on your system:
* Most (but not every) time I run migrate/bidirectional it fails,
with the "outbound" stream only getting the before migration piece
* I can't reproduce if I put strace on the guest 2 passt.
Possibly unlike you:
* I'm able to use TRACE=1, and the problem still reproduces
* I can put strace on the outer pasta and the problem still
reproduces
The specific anomalies I was focused on were:
* The passt_2 pcap shows "and from guest 2" coming _inbound_ a bit
after it (correctly) went outbound
* The pasta_1 pcap doesn't seem to show "and from guest 2" in either
direction
# Observations
* I added a hack (see other series) that let me log comments to the
pcap file as ethertype 0xffff, this was so I could have debugging
messages in order with the captured packets.
* I used that to pin down exactly where the bogus output "and from
guest 2" was being recorded, and it's in tcp_vu_data_from_sock()
* I traced back from there, and passt_2 really does seem to be
getting "and from guest 2" from a recvmsg() on the socket. I see
from my pcap comments that we're getting 17 bytes from recvmsg()
right before capturing the inbound packet, at any rate.
* As noted, I couldn't reproduce with an strace on passt_2, so I
couldn't confirm that piece that way
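To illustrate the pcap comment trick from the first observation: a rough sketch of building such a frame, assuming a zeroed Ethernet header with ethertype 0xffff followed by the comment text. The helper name and the frame layout are assumptions made for this example, not the actual hack from the other series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN		14
#define ETHERTYPE_COMMENT	0xffff	/* value mentioned above */

/* Hypothetical helper: build an Ethernet frame carrying a text comment,
 * so that it can be written to the capture in order with real packets.
 * Returns the frame length, or 0 if buf is too small.
 */
static size_t comment_frame(uint8_t *buf, size_t size, const char *text)
{
	size_t len = strlen(text);

	if (size < ETH_HDR_LEN || len > size - ETH_HDR_LEN)
		return 0;

	memset(buf, 0, 12);			/* dummy dst and src MAC */
	buf[12] = ETHERTYPE_COMMENT >> 8;	/* network byte order */
	buf[13] = ETHERTYPE_COMMENT & 0xff;
	memcpy(buf + ETH_HDR_LEN, text, len);

	return ETH_HDR_LEN + len;
}
```

The resulting buffer would then go through whatever function already writes frames to the pcap file, and a display filter on the ethertype separates comments from traffic.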
It kind of seemed like we were sendmsg()ing "and from guest 2" and it
was bouncing straight back to our socket, instead of being delivered
to the outer pasta.
* I tried putting a dumpcap on 'lo' in the pasta namespace, thinking
I might see this weird passt->passt packet. But, nothing. There
are thousands of packets of the qemu migration stream, and
absolutely nothing else.
* I also tried dumpcap on the external interface in the pasta
namespace, and I didn't see anything different from what pasta
captured (although I didn't check super carefully). In particular
I didn't seem to see "and from guest 2" in either direction there
either
* Since I couldn't strace() passt_2, I instead tried logging
TCP sendmsg() and recvmsg() calls of length 17 using systemtap
(script attached). At this point it gets even weirder:
On a working run (achieved by adding the strace), I get this:
BEGIN
tcp sendmsg(-129530279294592) len=17 - ./passt -s /tmp/passt-tests-niICXS/migrate/passt_2.socket -P /tmp/passt-tests-niICXS/migrate/passt_2.pid -f --vhost-user -p /home/dwg/src/passt/test/test_logs/passt_2.pcap --trace -t 10004 -u 10004
tcp sendmsg(-129489810388608) len=17 - ./pasta -p /home/dwg/src/passt/test/test_logs/pasta_1.pcap --trace --trace -l /tmp/pasta1.log -P /tmp/passt-tests-niICXS/migrate/pasta_1.pid -t 10001,10002,10004 -T 10003 -u 10001,10002,10004 -U 10003 --map-guest-addr 169.254.1.1 --config-net /home/dwg/src/passt/test/nstool hold /tmp/passt-tests-niICXS/migrate/ns1.hold
END
This mostly makes sense. passt_2 sends the expected outbound packet
to the namespace, then pasta_1 forwards it on to the host. I don't
know why I'm not seeing the recvmsg() from the socat server, though.
In the failing case, though, I get this:
BEGIN
tcp sendmsg(-129471392995840) len=17 - ./passt -s /tmp/passt-tests-CV71zo/migrate/passt_2.socket -P /tmp/passt-tests-CV71zo/migrate/passt_2.pid -f --vhost-user -p /home/dwg/src/passt/test/test_logs/passt_2.pcap --trace -t 10004 -u 10004
tcp recvmsg(-129476447043584) len=17 - ./pasta -p /home/dwg/src/passt/test/test_logs/pasta_1.pcap --trace --trace -l /tmp/pasta1.log -P /tmp/passt-tests-CV71zo/migrate/pasta_1.pid -t 10001,10002,10004 -T 10003 -u 10001,10002,10004 -U 10003 --map-guest-addr 169.254.1.1 --config-net /home/dwg/src/passt/test/nstool hold /tmp/passt-tests-CV71zo/migrate/ns1.hold
END
First event seems the same: passt_2 sending the outbound packet, as
expected. The second, though, is weird: the outer pasta seems to
receive the data from a socket, not from tap as we'd expect. That
might explain the other symptoms, if pasta received it on its socket,
it would send inwards.
But... I don't see pasta sending that "and from guest 2" inbound in
its packet capture. And, weirder still, although I see that recvmsg()
with systemtap, I don't see it in an strace of pasta.
...and.. that's where I'm at. Attaching my systemtap script and a
ball of logs. Hoping they're helpful :/.
--
David Gibson (he or they) | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
| around.
http://www.ozlabs.org/~dgibson
It looks like me, myself and I couldn't agree on the "simple" protocol
between passt and passt-repair. The man page and passt say it's one
confirmation per command, but the passt-repair implementation had one
confirmation per socket instead.
This caused all sorts of mysterious issues with repair mode
pseudo-randomly enabled, and leading to hours of fun (mostly not
mine). Oops.
Switch to one confirmation per command (of course).
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
passt-repair.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/passt-repair.c b/passt-repair.c
index 322066a..614cee0 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -63,6 +63,7 @@ int main(int argc, char **argv)
struct cmsghdr *cmsg;
struct msghdr msg;
struct iovec iov;
+ int op;
prctl(PR_SET_DUMPABLE, 0);
@@ -150,25 +151,24 @@ loop:
_exit(1);
}
- for (i = 0; i < n; i++) {
- int o = cmd;
+ op = cmd;
- if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &o, sizeof(o))) {
+ for (i = 0; i < n; i++) {
+ if (setsockopt(fds[i], SOL_TCP, TCP_REPAIR, &op, sizeof(op))) {
fprintf(stderr,
- "Setting TCP_REPAIR to %i on socket %i: %s", o,
+ "Setting TCP_REPAIR to %i on socket %i: %s", op,
fds[i], strerror(errno));
_exit(1);
}
/* Close _our_ copy */
close(fds[i]);
+ }
- /* Confirm setting by echoing the command back */
- if (send(s, &cmd, sizeof(cmd), 0) < 0) {
- fprintf(stderr, "Reply to command %i: %s\n",
- o, strerror(errno));
- _exit(1);
- }
+ /* Confirm setting by echoing the command back */
+ if (send(s, &cmd, sizeof(cmd), 0) < 0) {
+ fprintf(stderr, "Reply to %i: %s\n", op, strerror(errno));
+ _exit(1);
}
goto loop;
--
2.43.0
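As a sketch of the protocol the patch above settles on, assuming a one-byte command echoed back over a stream socket: the helper applies the command to every socket first, then confirms once. Both function names are hypothetical, and the passing of file descriptors as ancillary data is left out.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Hypothetical helper side: apply cmd to all n sockets, then send a
 * single confirmation for the whole command, not one per socket.
 * Returns 0 on success, negative errno on failure.
 */
static int handle_command(int s, int8_t cmd, const int *fds, int n)
{
	int op = cmd, i;

	for (i = 0; i < n; i++) {
		if (setsockopt(fds[i], IPPROTO_TCP, TCP_REPAIR,
			       &op, sizeof(op)))
			return -errno;
	}

	if (send(s, &cmd, sizeof(cmd), 0) < 0)
		return -errno;

	return 0;
}

/* Hypothetical caller side: wait for exactly one echo of cmd */
static int wait_confirmation(int s, int8_t cmd)
{
	int8_t reply;
	ssize_t n = recv(s, &reply, sizeof(reply), 0);

	if (n < 0)
		return -errno;
	if (n != (ssize_t)sizeof(reply) || reply != cmd)
		return -EIO;

	return 0;
}
```

With one reply per socket instead, a caller that reads a single byte per command leaves stale confirmations queued, and the next command appears to be confirmed before it was applied.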
If we use glibc's perror(), we need to allow dup() and fcntl() in our
seccomp profiles, which are a bit too much for this simple helper. On
top of that, we would probably need a wrapper to avoid allocation for
translated messages.
While at it: ECONNRESET is just a close() from passt, treat it like
EOF.
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
passt-repair.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/passt-repair.c b/passt-repair.c
index 3c3247b..d137a18 100644
--- a/passt-repair.c
+++ b/passt-repair.c
@@ -95,7 +95,7 @@ int main(int argc, char **argv)
}
if ((s = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
- perror("Failed to create AF_UNIX socket");
+ fprintf(stderr, "Failed to create AF_UNIX socket: %i\n", errno);
_exit(1);
}
@@ -108,8 +108,12 @@ int main(int argc, char **argv)
loop:
ret = recvmsg(s, &msg, 0);
if (ret < 0) {
- perror("Failed to receive message");
- _exit(1);
+ if (errno == ECONNRESET) {
+ ret = 0;
+ } else {
+ fprintf(stderr, "Failed to read message: %i\n", errno);
+ _exit(1);
+ }
}
if (!ret) /* Done */
--
2.43.0
Here are a couple of hacky patches I was using for debugging the
failures in the migrate/bidirectional test. Further details on where
I'm at coming in another email.
David Gibson (2):
pcap comment hacks
debug
pcap.c | 20 ++++++++++++++++++++
pcap.h | 2 ++
tap.c | 1 +
tcp_vu.c | 6 ++++++
test/lib/setup | 8 +++++---
test/migrate/bidirectional | 15 ++++++++++++++-
vu_common.c | 1 +
7 files changed, 49 insertions(+), 4 deletions(-)
--
2.48.1