It looks like we need it as workaround for this situation, readily reproducible at least with a 6.5 Linux kernel, with default rmem_max and wmem_max values: - an iperf3 client on the host sends about 160 KiB, typically segmented into five frames by passt. We read this data using MSG_PEEK - the iperf3 server on the guest starts receiving - meanwhile, the host kernel advertised a zero-sized window to the receiver, as expected - eventually, the guest acknowledges all the data sent so far, and we drop it from the buffer, courtesy of tcp_sock_consume(), using recv() with MSG_TRUNC - the client, however, doesn't get an updated window value, and even keepalive packets are answered with zero-window segments, until the connection is closed It looks like dropping data from a socket using MSG_TRUNC doesn't cause a recalculation of the window, which would be expected as a result of any receiving operation that invalidates data on a buffer (that is, not with MSG_PEEK). Strangely enough, setting TCP_WINDOW_CLAMP via setsockopt(), even to the previous value we clamped to, forces a recalculation of the window which is advertised to the guest. I couldn't quite confirm this issue by following all the possible code paths in the kernel, yet. If confirmed, this should be fixed in the kernel, but meanwhile this workaround looks robust to me (and it will be needed for backward compatibility anyway). Reported-by: Matej Hrica <mhrica(a)redhat.com> Link: https://bugs.passt.top/show_bug.cgi?id=74 Analysed-by: David Gibson <david(a)gibson.dropbear.id.au> Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com> --- tcp.c | 29 ++++++++++++++++++++++++----- 1 file changed, 24 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index dff5e79..32917c8 100644 --- a/tcp.c +++ b/tcp.c @@ -1780,7 +1780,23 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn, wnd <<= conn->ws_from_tap; wnd = MIN(MAX_WINDOW, wnd); - if (conn->flags & WND_CLAMPED) { + /* TODO: With (at least) Linux kernel versions 6.1 to 6.5, if we end up + * with a zero-sized window on a TCP socket, dropping data (once + * acknowledged by the guest) with recv() and MSG_TRUNC doesn't appear + * to be enough to make the kernel advertise a non-zero window to the + * receiver. Forcing a TCP_WINDOW_CLAMP setting, even with the existing + * value, fixes this. + * + * The STALLED flag on a connection is a sufficient indication that we + * might have a zero-sized window on the socket, because it's set if we + * exhausted the tap-side window, or if everything we receive from a + * socket is already in flight to the guest. + * + * So, if STALLED is set, and we received a window value from the tap, + * force a TCP_WINDOW_CLAMP setsockopt(). This should be investigated + * further and fixed in the kernel instead, if confirmed. + */ + if (!(conn->flags & STALLED) && conn->flags & WND_CLAMPED) { if (prev_scaled == wnd) return; @@ -2409,12 +2425,12 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, i = keep - 1; } - tcp_clamp_window(c, conn, max_ack_seq_wnd); - /* On socket flush failure, pretend there was no ACK, try again later */ if (ack && !tcp_sock_consume(conn, max_ack_seq)) tcp_update_seqack_from_tap(c, conn, max_ack_seq); + tcp_clamp_window(c, conn, max_ack_seq_wnd); + if (retr) { trace("TCP: fast re-transmit, ACK: %u, previous sequence: %u", max_ack_seq, conn->seq_to_tap); @@ -2572,8 +2588,6 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, if (th->ack && !(conn->events & ESTABLISHED)) tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); - conn_flag(c, conn, ~STALLED); - /* Establishing connection from socket */ if (conn->events & SOCK_ACCEPTED) { if (th->syn && th->ack && !th->fin) { @@ -2628,6 +2642,11 @@ int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, if (count == -1) goto reset; + /* Note: STALLED matters for tcp_clamp_window(): unset it only after + * processing data (and window) from the tap side + */ + conn_flag(c, conn, ~STALLED); + if (conn->seq_ack_to_tap != conn->seq_from_tap) ack_due = 1; -- 2.39.2