[PATCH v3 0/7] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels

newer
[PATCH v2] build: Fix errors of...

Stefano Brivio

29 Aug 2025 29 Aug '25

10:11 p.m.

Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks"), window limits are enforced more aggressively with a bigger amount of zero-window updates compared to what happened with e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone, and occasional duplicate ACKs can now be seen also for local transfers with default (208 KiB) socket buffer sizes. Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the pasta integration occasionally fail on the "TCP/IPv4 large transfer, tap" case. While playing with a reproducer that seems to be matching those failures: while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done and a kernel including that commit, I hit a few different failures, that should be fixed by this series. Paul tested v1 of this series and found an additional failure (transfer timeout), which I could reproduce with a slightly different command: while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 EXEC:./write.sh & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done where write.sh is simply: #!/bin/sh cat > /tmp/large.rcv so that the connection is not half-closed starting from the beginning, because socat can't make assumptions about the unidirectional nature of the traffic. This should now be fixed as well by the new version of patch 3/7. v3: - add patch 6/7 - in 7/7, check dlen <= 1 for keep-alive segments, instead of len <= 1 v2: in 3/6, rewind sequence also if the zero-window update comes in the middle of a batch with non-zero window updates Stefano Brivio (7): tcp: FIN flags have to be retransmitted as well tcp: Factor sequence rewind for retransmissions into a new function tcp: Rewind sequence when guest shrinks window to zero tcp: Fix closing logic for half-closed connections tcp: Don't try to transmit right after the peer shrank the window to zero tcp: Cast operands of sequence comparison macros to uint32_t before using them tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent tcp.c | 181 ++++++++++++++++++++++++++++++++++--------------- tcp_internal.h | 12 ++-- 2 files changed, 136 insertions(+), 57 deletions(-) -- 2.43.0

Show replies by date

Stefano Brivio

29 Aug 29 Aug

10:11 p.m.

New subject: [PATCH v3 1/7] tcp: FIN flags have to be retransmitted as well

If we're retransmitting any data, and we sent a FIN segment to our peer, regardless of whether it was received, we obviously have to retransmit it as well, given that it can only come with the last data segment, or after it. Unconditionally clear the internal TAP_FIN_SENT flag whenever we re-transmit, so that we know we have to send it again, in case. Reported-by: Paul Holzinger Signed-off-by: Stefano Brivio Reviewed-by: David Gibson --- tcp.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/tcp.c b/tcp.c index 957b498..7c1f237 100644 --- a/tcp.c +++ b/tcp.c @@ -1759,6 +1759,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, "fast re-transmit, ACK: %u, previous sequence: %u", max_ack_seq, conn->seq_to_tap); conn->seq_to_tap = max_ack_seq; + conn->events &= ~TAP_FIN_SENT; if (tcp_set_peek_offset(conn, 0)) { tcp_rst(c, conn); return -1; @@ -2286,6 +2287,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) flow_dbg(conn, "ACK timeout, retry"); conn->retrans++; conn->seq_to_tap = conn->seq_ack_from_tap; + conn->events &= ~TAP_FIN_SENT; if (!conn->wnd_from_tap) conn->wnd_from_tap = 1; /* Zero-window probe */ if (tcp_set_peek_offset(conn, 0)) { -- 2.43.0

Stefano Brivio

10:11 p.m.

New subject: [PATCH v3 2/7] tcp: Factor sequence rewind for retransmissions into a new function

...as I'm going to need a third occurrence of this in the next change. This introduces a small functional change in tcp_data_from_tap(): the sequence was previously rewound to the highest ACK number we found in the current packet batch, and not to the current value of seq_ack_from_tap. The two might differ in case tcp_sock_consume() failed, because in that case we're ignoring that ACK altogether. But if we're ignoring it, it looks more correct to me to start retransmitting from an earlier sequence anyway. Signed-off-by: Stefano Brivio Reviewed-by: David Gibson --- tcp.c | 47 ++++++++++++++++++++++++++++++++--------------- 1 file changed, 32 insertions(+), 15 deletions(-) diff --git a/tcp.c b/tcp.c index 7c1f237..1402ca2 100644 --- a/tcp.c +++ b/tcp.c @@ -1097,6 +1097,26 @@ static void tcp_update_seqack_from_tap(const struct ctx *c, } } +/** + * tcp_rewind_seq() - Rewind sequence to tap and socket offset to current ACK + * @c: Execution context + * @conn: Connection pointer + * + * Return: 0 on success, -1 on failure, with connection reset + */ +static int tcp_rewind_seq(const struct ctx *c, struct tcp_tap_conn *conn) +{ + conn->seq_to_tap = conn->seq_ack_from_tap; + conn->events &= ~TAP_FIN_SENT; + + if (tcp_set_peek_offset(conn, 0)) { + tcp_rst(c, conn); + return -1; + } + + return 0; +} + /** * tcp_prepare_flags() - Prepare header for flags-only segment (no payload) * @c: Execution context @@ -1757,13 +1777,11 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, if (retr) { flow_trace(conn, "fast re-transmit, ACK: %u, previous sequence: %u", - max_ack_seq, conn->seq_to_tap); - conn->seq_to_tap = max_ack_seq; - conn->events &= ~TAP_FIN_SENT; - if (tcp_set_peek_offset(conn, 0)) { - tcp_rst(c, conn); + conn->seq_ack_from_tap, conn->seq_to_tap); + + if (tcp_rewind_seq(c, conn)) return -1; - } + tcp_data_from_sock(c, conn); } @@ -2285,17 +2303,16 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_rst(c, conn); } else { flow_dbg(conn, "ACK timeout, retry"); - conn->retrans++; - conn->seq_to_tap = conn->seq_ack_from_tap; - conn->events &= ~TAP_FIN_SENT; + if (!conn->wnd_from_tap) conn->wnd_from_tap = 1; /* Zero-window probe */ - if (tcp_set_peek_offset(conn, 0)) { - tcp_rst(c, conn); - } else { - tcp_data_from_sock(c, conn); - tcp_timer_ctl(c, conn); - } + + conn->retrans++; + if (tcp_rewind_seq(c, conn)) + return; + + tcp_data_from_sock(c, conn); + tcp_timer_ctl(c, conn); } } else { struct itimerspec new = { { 0 }, { ACT_TIMEOUT, 0 } }; -- 2.43.0

Stefano Brivio

10:11 p.m.

New subject: [PATCH v3 3/7] tcp: Rewind sequence when guest shrinks window to zero

A window shrunk to zero means by definition that anything else that might be in flight is now out of window. Restart from the currently acknowledged sequence. We need to do that both in tcp_tap_window_update(), where we already check for zero-window updates, as well as in tcp_data_from_tap(), because we might get one of those updates in a batch of packets that also contains a non-zero window update. Suggested-by: Jon Maloy Signed-off-by: Stefano Brivio --- tcp.c | 34 +++++++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 9 deletions(-) diff --git a/tcp.c b/tcp.c index 1402ca2..11c9c84 100644 --- a/tcp.c +++ b/tcp.c @@ -1257,19 +1257,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn, /** * tcp_tap_window_update() - Process an updated window from tap side + * @c: Execution context * @conn: Connection pointer * @wnd: Window value, host order, unscaled */ -static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) +static void tcp_tap_window_update(const struct ctx *c, + struct tcp_tap_conn *conn, unsigned wnd) { wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); /* Work-around for bug introduced in peer kernel code, commit - * e2142825c120 ("net: tcp: send zero-window ACK when no memory"). - * We don't update if window shrank to zero. + * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't + * update the window if it shrank to zero, so that we'll eventually + * retry to send data, but rewind the sequence as that obviously implies + * that no data beyond the updated window will ever be acknowledged. */ - if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) + if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) { + tcp_rewind_seq(c, conn); return; + } conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX); @@ -1694,7 +1700,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn); if (p->count == 1) { - tcp_tap_window_update(conn, ntohs(th->window)); + tcp_tap_window_update(c, conn, + ntohs(th->window)); return 1; } @@ -1713,6 +1720,15 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, ack_seq == max_ack_seq && ntohs(th->window) == max_ack_seq_wnd; + /* See tcp_tap_window_update() for details. On + * top of that, we also need to check here if a + * zero-window update is contained in a batch of + * packets that includes a non-zero window as + * well. + */ + if (!ntohs(th->window)) + tcp_rewind_seq(c, conn); + max_ack_seq_wnd = ntohs(th->window); max_ack_seq = ack_seq; } @@ -1772,7 +1788,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, if (ack && !tcp_sock_consume(conn, max_ack_seq)) tcp_update_seqack_from_tap(c, conn, max_ack_seq); - tcp_tap_window_update(conn, max_ack_seq_wnd); + tcp_tap_window_update(c, conn, max_ack_seq_wnd); if (retr) { flow_trace(conn, @@ -1861,7 +1877,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c, const struct tcphdr *th, const char *opts, size_t optlen) { - tcp_tap_window_update(conn, ntohs(th->window)); + tcp_tap_window_update(c, conn, ntohs(th->window)); tcp_get_tap_ws(conn, opts, optlen); /* First value is not scaled */ @@ -2059,7 +2075,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (!th->ack) goto reset; - tcp_tap_window_update(conn, ntohs(th->window)); + tcp_tap_window_update(c, conn, ntohs(th->window)); tcp_data_from_sock(c, conn); @@ -2071,7 +2087,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (conn->events & TAP_FIN_RCVD) { tcp_sock_consume(conn, ntohl(th->ack_seq)); tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); - tcp_tap_window_update(conn, ntohs(th->window)); + tcp_tap_window_update(c, conn, ntohs(th->window)); tcp_data_from_sock(c, conn); if (conn->events & SOCK_FIN_RCVD && -- 2.43.0

David Gibson

1 Sep 1 Sep

6:28 a.m.

New subject: [PATCH v3 3/7] tcp: Rewind sequence when guest shrinks window to zero

On Fri, Aug 29, 2025 at 10:11:28PM +0200, Stefano Brivio wrote:

...

A window shrunk to zero means by definition that anything else that might be in flight is now out of window. Restart from the currently acknowledged sequence.

We need to do that both in tcp_tap_window_update(), where we already check for zero-window updates, as well as in tcp_data_from_tap(), because we might get one of those updates in a batch of packets that also contains a non-zero window update.

Suggested-by: Jon Maloy Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

...

--- tcp.c | 34 +++++++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/tcp.c b/tcp.c index 1402ca2..11c9c84 100644 --- a/tcp.c +++ b/tcp.c @@ -1257,19 +1257,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,

/** * tcp_tap_window_update() - Process an updated window from tap side + * @c: Execution context * @conn: Connection pointer * @wnd: Window value, host order, unscaled */ -static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) +static void tcp_tap_window_update(const struct ctx *c, + struct tcp_tap_conn *conn, unsigned wnd) { wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);

/* Work-around for bug introduced in peer kernel code, commit - * e2142825c120 ("net: tcp: send zero-window ACK when no memory"). - * We don't update if window shrank to zero. + * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't + * update the window if it shrank to zero, so that we'll eventually + * retry to send data, but rewind the sequence as that obviously implies + * that no data beyond the updated window will ever be acknowledged.

Nit: Arguably "no data...will ever" is not quite right. It presumbly won't be acknowledged until we resend it at least once, but we certainly hope it will be acknowledged after that point.

...

*/ - if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) + if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) { + tcp_rewind_seq(c, conn); return; + }

conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);

@@ -1694,7 +1700,8 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn);

if (p->count == 1) { - tcp_tap_window_update(conn, ntohs(th->window)); + tcp_tap_window_update(c, conn, + ntohs(th->window)); return 1; }

@@ -1713,6 +1720,15 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, ack_seq == max_ack_seq && ntohs(th->window) == max_ack_seq_wnd;

+ /* See tcp_tap_window_update() for details. On + * top of that, we also need to check here if a + * zero-window update is contained in a batch of + * packets that includes a non-zero window as + * well. + */ + if (!ntohs(th->window)) + tcp_rewind_seq(c, conn); + max_ack_seq_wnd = ntohs(th->window); max_ack_seq = ack_seq; } @@ -1772,7 +1788,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, if (ack && !tcp_sock_consume(conn, max_ack_seq)) tcp_update_seqack_from_tap(c, conn, max_ack_seq);

- tcp_tap_window_update(conn, max_ack_seq_wnd); + tcp_tap_window_update(c, conn, max_ack_seq_wnd);

if (retr) { flow_trace(conn, @@ -1861,7 +1877,7 @@ static void tcp_conn_from_sock_finish(const struct ctx *c, const struct tcphdr *th, const char *opts, size_t optlen) { - tcp_tap_window_update(conn, ntohs(th->window)); + tcp_tap_window_update(c, conn, ntohs(th->window)); tcp_get_tap_ws(conn, opts, optlen);

/* First value is not scaled */ @@ -2059,7 +2075,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (!th->ack) goto reset;

- tcp_tap_window_update(conn, ntohs(th->window)); + tcp_tap_window_update(c, conn, ntohs(th->window));

tcp_data_from_sock(c, conn);

@@ -2071,7 +2087,7 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (conn->events & TAP_FIN_RCVD) { tcp_sock_consume(conn, ntohl(th->ack_seq)); tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); - tcp_tap_window_update(conn, ntohs(th->window)); + tcp_tap_window_update(c, conn, ntohs(th->window)); tcp_data_from_sock(c, conn);

if (conn->events & SOCK_FIN_RCVD && -- 2.43.0

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

10:07 p.m.

New subject: [PATCH v3 3/7] tcp: Rewind sequence when guest shrinks window to zero

On Mon, 1 Sep 2025 14:28:33 +1000 David Gibson wrote:

...

On Fri, Aug 29, 2025 at 10:11:28PM +0200, Stefano Brivio wrote:

...
A window shrunk to zero means by definition that anything else that might be in flight is now out of window. Restart from the currently acknowledged sequence.

We need to do that both in tcp_tap_window_update(), where we already check for zero-window updates, as well as in tcp_data_from_tap(), because we might get one of those updates in a batch of packets that also contains a non-zero window update.

Suggested-by: Jon Maloy Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

...
--- tcp.c | 34 +++++++++++++++++++++++++--------- 1 file changed, 25 insertions(+), 9 deletions(-)

diff --git a/tcp.c b/tcp.c index 1402ca2..11c9c84 100644 --- a/tcp.c +++ b/tcp.c @@ -1257,19 +1257,25 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,

/** * tcp_tap_window_update() - Process an updated window from tap side + * @c: Execution context * @conn: Connection pointer * @wnd: Window value, host order, unscaled */ -static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) +static void tcp_tap_window_update(const struct ctx *c, + struct tcp_tap_conn *conn, unsigned wnd) { wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);

/* Work-around for bug introduced in peer kernel code, commit - * e2142825c120 ("net: tcp: send zero-window ACK when no memory"). - * We don't update if window shrank to zero. + * e2142825c120 ("net: tcp: send zero-window ACK when no memory"): don't + * update the window if it shrank to zero, so that we'll eventually + * retry to send data, but rewind the sequence as that obviously implies + * that no data beyond the updated window will ever be acknowledged.

Nit: Arguably "no data...will ever" is not quite right. It presumbly won't be acknowledged until we resend it at least once, but we certainly hope it will be acknowledged after that point.

Hmm, yes, it's "ever" in a rather relative sense (we'll time out, eventually, but that takes a while), which makes it not "ever", strictly speaking. Let me just drop "ever", unless you have better ideas on how to improve the comment. -- Stefano

Stefano Brivio

29 Aug 29 Aug

10:11 p.m.

New subject: [PATCH v3 4/7] tcp: Fix closing logic for half-closed connections

First off, don't close connections half-closed by the guest before our own FIN is acknowledged by the guest itself. That is, after we receive a FIN from the guest (TAP_FIN_RCVD), if we don't have any data left to send from the socket (SOCK_FIN_RCVD, or EPOLLHUP), we send a FIN segment to the guest (TAP_FIN_SENT), but we need to actually have it acknowledged (and have no pending retransmissions) before we can close the connection: check for TAP_FIN_ACKED, first. Then, if we set TAP_FIN_SENT, and we receive an ACK segment from the guest, set TAP_FIN_ACKED. This was entirely missing for the TAP_FIN_RCVD case, and as we fix the problem described above, this becomes relevant as well. Signed-off-by: Stefano Brivio Reviewed-by: David Gibson --- tcp.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/tcp.c b/tcp.c index 11c9c84..fccea35 100644 --- a/tcp.c +++ b/tcp.c @@ -2090,9 +2090,14 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, tcp_tap_window_update(c, conn, ntohs(th->window)); tcp_data_from_sock(c, conn); - if (conn->events & SOCK_FIN_RCVD && - conn->seq_ack_from_tap == conn->seq_to_tap) - conn_event(c, conn, CLOSED); + if (conn->seq_ack_from_tap == conn->seq_to_tap) { + if (th->ack && conn->events & TAP_FIN_SENT) + conn_event(c, conn, TAP_FIN_ACKED); + + if (conn->events & SOCK_FIN_RCVD && + conn->events & TAP_FIN_ACKED) + conn_event(c, conn, CLOSED); + } return 1; } @@ -2372,7 +2377,7 @@ void tcp_sock_handler(const struct ctx *c, union epoll_ref ref, return; } - if ((conn->events & TAP_FIN_SENT) && (events & EPOLLHUP)) { + if ((conn->events & TAP_FIN_ACKED) && (events & EPOLLHUP)) { conn_event(c, conn, CLOSED); return; } -- 2.43.0

Stefano Brivio

10:11 p.m.

New subject: [PATCH v3 5/7] tcp: Don't try to transmit right after the peer shrank the window to zero

If the peer shrinks the window to zero, we'll skip storing the new window, as a convenient way to cause window probes (which exceed any zero-sized window, strictly speaking) if we don't get window updates in a while. As we do so, though, we need to ensure we don't try to queue more data from the socket right after we process this window update, as the entire point of a zero-window advertisement is to keep us from sending more data. Signed-off-by: Stefano Brivio Reviewed-by: David Gibson --- tcp.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/tcp.c b/tcp.c index fccea35..051550d 100644 --- a/tcp.c +++ b/tcp.c @@ -1260,8 +1260,10 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn, * @c: Execution context * @conn: Connection pointer * @wnd: Window value, host order, unscaled + * + * Return: false on zero window (not stored to wnd_from_tap), true otherwise */ -static void tcp_tap_window_update(const struct ctx *c, +static bool tcp_tap_window_update(const struct ctx *c, struct tcp_tap_conn *conn, unsigned wnd) { wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); @@ -1274,13 +1276,14 @@ static void tcp_tap_window_update(const struct ctx *c, */ if (!wnd && SEQ_LT(conn->seq_ack_from_tap, conn->seq_to_tap)) { tcp_rewind_seq(c, conn); - return; + return false; } conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX); /* FIXME: reflect the tap-side receiver's window back to the sock-side * sender by adjusting SO_RCVBUF? */ + return true; } /** @@ -2075,9 +2078,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (!th->ack) goto reset; - tcp_tap_window_update(c, conn, ntohs(th->window)); - - tcp_data_from_sock(c, conn); + if (tcp_tap_window_update(c, conn, ntohs(th->window))) + tcp_data_from_sock(c, conn); if (p->count - idx == 1) return 1; @@ -2087,8 +2089,8 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, if (conn->events & TAP_FIN_RCVD) { tcp_sock_consume(conn, ntohl(th->ack_seq)); tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); - tcp_tap_window_update(c, conn, ntohs(th->window)); - tcp_data_from_sock(c, conn); + if (tcp_tap_window_update(c, conn, ntohs(th->window))) + tcp_data_from_sock(c, conn); if (conn->seq_ack_from_tap == conn->seq_to_tap) { if (th->ack && conn->events & TAP_FIN_SENT) -- 2.43.0

Stefano Brivio

10:11 p.m.

New subject: [PATCH v3 6/7] tcp: Cast operands of sequence comparison macros to uint32_t before using them

Otherwise, passing signed types causes automatic promotion of the result of the subtractions as well, which is not what we want, as these macros rely on unsigned 32-bit arithmetic. The next patch introduces a ssize_t operand for SEQ_LE, illustrating the issue. Signed-off-by: Stefano Brivio --- tcp_internal.h | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/tcp_internal.h b/tcp_internal.h index 36c6533..c80ba40 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -18,10 +18,14 @@ sizeof(struct ipv6hdr), \ sizeof(uint32_t)) -#define SEQ_LE(a, b) ((b) - (a) < MAX_WINDOW) -#define SEQ_LT(a, b) ((b) - (a) - 1 < MAX_WINDOW) -#define SEQ_GE(a, b) ((a) - (b) < MAX_WINDOW) -#define SEQ_GT(a, b) ((a) - (b) - 1 < MAX_WINDOW) +#define SEQ_LE(a, b) \ + ((uint32_t)(b) - (uint32_t)(a) < MAX_WINDOW) +#define SEQ_LT(a, b) \ + ((uint32_t)(b) - (uint32_t)(a) - 1 < MAX_WINDOW) +#define SEQ_GE(a, b) \ + ((uint32_t)(a) - (uint32_t)(b) < MAX_WINDOW) +#define SEQ_GT(a, b) \ + ((uint32_t)(a) - (uint32_t)(b) - 1 < MAX_WINDOW) #define FIN (1 << 0) #define SYN (1 << 1) -- 2.43.0

David Gibson

1 Sep 1 Sep

6:29 a.m.

New subject: [PATCH v3 6/7] tcp: Cast operands of sequence comparison macros to uint32_t before using them

On Fri, Aug 29, 2025 at 10:11:31PM +0200, Stefano Brivio wrote:

...

Otherwise, passing signed types causes automatic promotion of the result of the subtractions as well, which is not what we want, as these macros rely on unsigned 32-bit arithmetic.

The next patch introduces a ssize_t operand for SEQ_LE, illustrating the issue.

Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

...

--- tcp_internal.h | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/tcp_internal.h b/tcp_internal.h index 36c6533..c80ba40 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -18,10 +18,14 @@ sizeof(struct ipv6hdr), \ sizeof(uint32_t))

-#define SEQ_LE(a, b) ((b) - (a) < MAX_WINDOW) -#define SEQ_LT(a, b) ((b) - (a) - 1 < MAX_WINDOW) -#define SEQ_GE(a, b) ((a) - (b) < MAX_WINDOW) -#define SEQ_GT(a, b) ((a) - (b) - 1 < MAX_WINDOW) +#define SEQ_LE(a, b) \ + ((uint32_t)(b) - (uint32_t)(a) < MAX_WINDOW) +#define SEQ_LT(a, b) \ + ((uint32_t)(b) - (uint32_t)(a) - 1 < MAX_WINDOW) +#define SEQ_GE(a, b) \ + ((uint32_t)(a) - (uint32_t)(b) < MAX_WINDOW) +#define SEQ_GT(a, b) \ + ((uint32_t)(a) - (uint32_t)(b) - 1 < MAX_WINDOW)

#define FIN (1 << 0) #define SYN (1 << 1) -- 2.43.0

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

29 Aug 29 Aug

10:11 p.m.

New subject: [PATCH v3 7/7] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent

We currently have a number of discrepancies in the tcp_tap_handler() path between the half-closed connection path and the regular one, and they are mostly a result of code duplication, which comes in turn from the fact that tcp_data_from_tap() deals with data transfers as well as general connection bookkeeping, so we can't use it for half-closed connections. This suggests that we should probably rework it into two or more functions, in the long term, but for the moment being I'm just fixing one obvious issue, which is the lack of fast retransmissions in the TAP_FIN_RCVD path, and a potential one, which is the fact we don't handle socket flush failures. Add fast re-transmit for half-closed connections, and extract the logic to determine the TCP payload length from tcp_data_from_tap() into the new tcp_packet_data_len() helper to decrease a bit the amount of resulting code duplication. Handle the case of socket flush (tcp_sock_consume()) flush failure in the same way as tcp_data_from_tap() handles it. Signed-off-by: Stefano Brivio --- tcp.c | 81 +++++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 57 insertions(+), 24 deletions(-) diff --git a/tcp.c b/tcp.c index 051550d..35d101e 100644 --- a/tcp.c +++ b/tcp.c @@ -1640,6 +1640,22 @@ static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn) return tcp_buf_data_from_sock(c, conn); } +/** + * tcp_packet_data_len() - Get data (TCP payload) length for a TCP packet + * @th: Pointer to TCP header + * @l4len: TCP packet length, including TCP header + * + * Return: data length of TCP packet, -1 on invalid value of Data Offset field + */ +static ssize_t tcp_packet_data_len(const struct tcphdr *th, size_t l4len) +{ + size_t off = th->doff * 4UL; + + if (off < sizeof(*th) || off > l4len) + return -1; + + return l4len - off; +} /** * tcp_data_from_tap() - tap/guest data for established connection @@ -1671,30 +1687,24 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, for (i = idx, iov_i = 0; i < (int)p->count; i++) { uint32_t seq, seq_offset, ack_seq; const struct tcphdr *th; + ssize_t dlen; char *data; - size_t off; th = packet_get(p, i, 0, sizeof(*th), &len); if (!th) return -1; len += sizeof(*th); - off = th->doff * 4UL; - if (off < sizeof(*th) || off > len) - return -1; - if (th->rst) { conn_event(c, conn, CLOSED); return 1; } - len -= off; - data = packet_get(p, i, off, len, NULL); - if (!data) - continue; + if ((dlen = tcp_packet_data_len(th, len)) < 0) + return -1; seq = ntohl(th->seq); - if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { + if (SEQ_LT(seq, conn->seq_from_tap) && dlen <= 1) { flow_trace(conn, "keep-alive sequence: %u, previous: %u", seq, conn->seq_from_tap); @@ -1719,7 +1729,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, if (SEQ_GE(ack_seq, conn->seq_ack_from_tap) && SEQ_GE(ack_seq, max_ack_seq)) { /* Fast re-transmit */ - retr = !len && !th->fin && + retr = !dlen && !th->fin && ack_seq == max_ack_seq && ntohs(th->window) == max_ack_seq_wnd; @@ -1740,33 +1750,37 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, if (th->fin) fin = 1; - if (!len) + if (!dlen) + continue; + + data = packet_get(p, i, th->doff * 4UL, dlen, NULL); + if (!data) continue; seq_offset = seq_from_tap - seq; /* Use data from this buffer only in these two cases: * - * , seq_from_tap , seq_from_tap - * |--------| <-- len |--------| <-- len + * , seq_from_tap , seq_from_tap + * |--------| <-- dlen |--------| <-- dlen * '----' <-- offset ' <-- offset * ^ seq ^ seq - * (offset >= 0, seq + len > seq_from_tap) + * (offset >= 0, seq + dlen > seq_from_tap) * * discard in these two cases: - * , seq_from_tap , seq_from_tap - * |--------| <-- len |--------| <-- len + * , seq_from_tap , seq_from_tap + * |--------| <-- dlen |--------| <-- dlen * '--------' <-- offset '-----| <- offset * ^ seq ^ seq - * (offset >= 0, seq + len <= seq_from_tap) + * (offset >= 0, seq + dlen <= seq_from_tap) * * keep, look for another buffer, then go back, in this case: * , seq_from_tap - * |--------| <-- len + * |--------| <-- dlen * '===' <-- offset * ^ seq * (offset < 0) */ - if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + len, seq_from_tap)) + if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + dlen, seq_from_tap)) continue; if (SEQ_LT(seq_offset, 0)) { @@ -1776,7 +1790,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, } tcp_iov[iov_i].iov_base = data + seq_offset; - tcp_iov[iov_i].iov_len = len - seq_offset; + tcp_iov[iov_i].iov_len = dlen - seq_offset; seq_from_tap += tcp_iov[iov_i].iov_len; iov_i++; @@ -2087,9 +2101,28 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, /* Established connections not accepting data from tap */ if (conn->events & TAP_FIN_RCVD) { - tcp_sock_consume(conn, ntohl(th->ack_seq)); - tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); - if (tcp_tap_window_update(c, conn, ntohs(th->window))) + bool retr; + + retr = th->ack && !tcp_packet_data_len(th, len) && !th->fin && + ntohl(th->ack_seq) == conn->seq_ack_from_tap && + ntohs(th->window) == conn->wnd_from_tap; + + /* On socket flush failure, pretend there was no ACK, try again + * later + */ + if (th->ack && !tcp_sock_consume(conn, ntohl(th->ack_seq))) + tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); + + if (retr) { + flow_trace(conn, + "fast re-transmit, ACK: %u, previous sequence: %u", + ntohl(th->ack_seq), conn->seq_to_tap); + + if (tcp_rewind_seq(c, conn)) + return -1; + } + + if (tcp_tap_window_update(c, conn, ntohs(th->window)) || retr) tcp_data_from_sock(c, conn); if (conn->seq_ack_from_tap == conn->seq_to_tap) { -- 2.43.0

David Gibson

1 Sep 1 Sep

6:31 a.m.

New subject: [PATCH v3 7/7] tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent

On Fri, Aug 29, 2025 at 10:11:32PM +0200, Stefano Brivio wrote:

...

We currently have a number of discrepancies in the tcp_tap_handler() path between the half-closed connection path and the regular one, and they are mostly a result of code duplication, which comes in turn from the fact that tcp_data_from_tap() deals with data transfers as well as general connection bookkeeping, so we can't use it for half-closed connections.

This suggests that we should probably rework it into two or more functions, in the long term, but for the moment being I'm just fixing one obvious issue, which is the lack of fast retransmissions in the TAP_FIN_RCVD path, and a potential one, which is the fact we don't handle socket flush failures.

Add fast re-transmit for half-closed connections, and extract the logic to determine the TCP payload length from tcp_data_from_tap() into the new tcp_packet_data_len() helper to decrease a bit the amount of resulting code duplication.

Handle the case of socket flush (tcp_sock_consume()) flush failure in the same way as tcp_data_from_tap() handles it.

Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

...

--- tcp.c | 81 +++++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 57 insertions(+), 24 deletions(-)

diff --git a/tcp.c b/tcp.c index 051550d..35d101e 100644 --- a/tcp.c +++ b/tcp.c @@ -1640,6 +1640,22 @@ static int tcp_data_from_sock(const struct ctx *c, struct tcp_tap_conn *conn)

return tcp_buf_data_from_sock(c, conn); } +/** + * tcp_packet_data_len() - Get data (TCP payload) length for a TCP packet + * @th: Pointer to TCP header + * @l4len: TCP packet length, including TCP header + * + * Return: data length of TCP packet, -1 on invalid value of Data Offset field + */ +static ssize_t tcp_packet_data_len(const struct tcphdr *th, size_t l4len) +{ + size_t off = th->doff * 4UL; + + if (off < sizeof(*th) || off > l4len) + return -1; + + return l4len - off; +}

/** * tcp_data_from_tap() - tap/guest data for established connection @@ -1671,30 +1687,24 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, for (i = idx, iov_i = 0; i < (int)p->count; i++) { uint32_t seq, seq_offset, ack_seq; const struct tcphdr *th; + ssize_t dlen; char *data; - size_t off;

th = packet_get(p, i, 0, sizeof(*th), &len); if (!th) return -1; len += sizeof(*th);

- off = th->doff * 4UL; - if (off < sizeof(*th) || off > len) - return -1; - if (th->rst) { conn_event(c, conn, CLOSED); return 1; }

- len -= off; - data = packet_get(p, i, off, len, NULL); - if (!data) - continue; + if ((dlen = tcp_packet_data_len(th, len)) < 0) + return -1;

seq = ntohl(th->seq); - if (SEQ_LT(seq, conn->seq_from_tap) && len <= 1) { + if (SEQ_LT(seq, conn->seq_from_tap) && dlen <= 1) { flow_trace(conn, "keep-alive sequence: %u, previous: %u", seq, conn->seq_from_tap); @@ -1719,7 +1729,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, if (SEQ_GE(ack_seq, conn->seq_ack_from_tap) && SEQ_GE(ack_seq, max_ack_seq)) { /* Fast re-transmit */ - retr = !len && !th->fin && + retr = !dlen && !th->fin && ack_seq == max_ack_seq && ntohs(th->window) == max_ack_seq_wnd;

@@ -1740,33 +1750,37 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, if (th->fin) fin = 1;

- if (!len) + if (!dlen) + continue; + + data = packet_get(p, i, th->doff * 4UL, dlen, NULL); + if (!data) continue;

seq_offset = seq_from_tap - seq; /* Use data from this buffer only in these two cases: * - * , seq_from_tap , seq_from_tap - * |--------| <-- len |--------| <-- len + * , seq_from_tap , seq_from_tap + * |--------| <-- dlen |--------| <-- dlen * '----' <-- offset ' <-- offset * ^ seq ^ seq - * (offset >= 0, seq + len > seq_from_tap) + * (offset >= 0, seq + dlen > seq_from_tap) * * discard in these two cases: - * , seq_from_tap , seq_from_tap - * |--------| <-- len |--------| <-- len + * , seq_from_tap , seq_from_tap + * |--------| <-- dlen |--------| <-- dlen * '--------' <-- offset '-----| <- offset * ^ seq ^ seq - * (offset >= 0, seq + len <= seq_from_tap) + * (offset >= 0, seq + dlen <= seq_from_tap) * * keep, look for another buffer, then go back, in this case: * , seq_from_tap - * |--------| <-- len + * |--------| <-- dlen * '===' <-- offset * ^ seq * (offset < 0) */ - if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + len, seq_from_tap)) + if (SEQ_GE(seq_offset, 0) && SEQ_LE(seq + dlen, seq_from_tap)) continue;

if (SEQ_LT(seq_offset, 0)) { @@ -1776,7 +1790,7 @@ static int tcp_data_from_tap(const struct ctx *c, struct tcp_tap_conn *conn, }

tcp_iov[iov_i].iov_base = data + seq_offset; - tcp_iov[iov_i].iov_len = len - seq_offset; + tcp_iov[iov_i].iov_len = dlen - seq_offset; seq_from_tap += tcp_iov[iov_i].iov_len; iov_i++;

@@ -2087,9 +2101,28 @@ int tcp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af,

/* Established connections not accepting data from tap */ if (conn->events & TAP_FIN_RCVD) { - tcp_sock_consume(conn, ntohl(th->ack_seq)); - tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); - if (tcp_tap_window_update(c, conn, ntohs(th->window))) + bool retr; + + retr = th->ack && !tcp_packet_data_len(th, len) && !th->fin && + ntohl(th->ack_seq) == conn->seq_ack_from_tap && + ntohs(th->window) == conn->wnd_from_tap; + + /* On socket flush failure, pretend there was no ACK, try again + * later + */ + if (th->ack && !tcp_sock_consume(conn, ntohl(th->ack_seq))) + tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); + + if (retr) { + flow_trace(conn, + "fast re-transmit, ACK: %u, previous sequence: %u", + ntohl(th->ack_seq), conn->seq_to_tap); + + if (tcp_rewind_seq(c, conn)) + return -1; + } + + if (tcp_tap_window_update(c, conn, ntohs(th->window)) || retr) tcp_data_from_sock(c, conn);

if (conn->seq_ack_from_tap == conn->seq_to_tap) { -- 2.43.0

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Paul Holzinger

12:02 p.m.

Hi, On 29/08/2025 22:11, Stefano Brivio wrote:

...

Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks"), window limits are enforced more aggressively with a bigger amount of zero-window updates compared to what happened with e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone, and occasional duplicate ACKs can now be seen also for local transfers with default (208 KiB) socket buffer sizes.

Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the pasta integration occasionally fail on the "TCP/IPv4 large transfer, tap" case.

While playing with a reproducer that seems to be matching those failures:

while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done

and a kernel including that commit, I hit a few different failures, that should be fixed by this series.

Paul tested v1 of this series and found an additional failure (transfer timeout), which I could reproduce with a slightly different command:

while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 EXEC:./write.sh & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done

where write.sh is simply:

#!/bin/sh

cat > /tmp/large.rcv

so that the connection is not half-closed starting from the beginning, because socat can't make assumptions about the unidirectional nature of the traffic. This should now be fixed as well by the new version of patch 3/7.

v3: - add patch 6/7 - in 7/7, check dlen <= 1 for keep-alive segments, instead of len <= 1

v2: in 3/6, rewind sequence also if the zero-window update comes in the middle of a batch with non-zero window updates

Stefano Brivio (7): tcp: FIN flags have to be retransmitted as well tcp: Factor sequence rewind for retransmissions into a new function tcp: Rewind sequence when guest shrinks window to zero tcp: Fix closing logic for half-closed connections tcp: Don't try to transmit right after the peer shrank the window to zero tcp: Cast operands of sequence comparison macros to uint32_t before using them tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path consistent

tcp.c | 181 ++++++++++++++++++++++++++++++++++--------------- tcp_internal.h | 12 ++-- 2 files changed, 136 insertions(+), 57 deletions(-) I am afraid I have to give bad news that it is still broken. My reproducer failed after 70 mins (without logs) which means it took longer this time but I only have one run so far so hard to tell. I can enable logs again and see how long it takes then.

-- Paul Holzinger

Paul Holzinger

7:36 p.m.

On 01/09/2025 12:02, Paul Holzinger wrote:

...

Hi,

On 29/08/2025 22:11, Stefano Brivio wrote:

...
Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks"), window limits are enforced more aggressively with a bigger amount of zero-window updates compared to what happened with e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone, and occasional duplicate ACKs can now be seen also for local transfers with default (208 KiB) socket buffer sizes.

Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the pasta integration occasionally fail on the "TCP/IPv4 large transfer, tap" case.

While playing with a reproducer that seems to be matching those failures:

   while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done

and a kernel including that commit, I hit a few different failures, that should be fixed by this series.

Paul tested v1 of this series and found an additional failure (transfer timeout), which I could reproduce with a slightly different command:

   while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 EXEC:./write.sh & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done

where write.sh is simply:

   #!/bin/sh       cat > /tmp/large.rcv

so that the connection is not half-closed starting from the beginning, because socat can't make assumptions about the unidirectional nature of the traffic. This should now be fixed as well by the new version of patch 3/7.

v3:    - add patch 6/7    - in 7/7, check dlen <= 1 for keep-alive segments, instead of len <= 1

v2: in 3/6, rewind sequence also if the zero-window update comes in      the middle of a batch with non-zero window updates

Stefano Brivio (7):    tcp: FIN flags have to be retransmitted as well    tcp: Factor sequence rewind for retransmissions into a new function    tcp: Rewind sequence when guest shrinks window to zero    tcp: Fix closing logic for half-closed connections    tcp: Don't try to transmit right after the peer shrank the window to      zero    tcp: Cast operands of sequence comparison macros to uint32_t before      using them    tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path      consistent

tcp.c          | 181 ++++++++++++++++++++++++++++++++++--------------- tcp_internal.h | 12 ++-- 2 files changed, 136 insertions(+), 57 deletions(-) I am afraid I have to give bad news that it is still broken. My reproducer failed after 70 mins (without logs) which means it took longer this time but I only have one run so far so hard to tell. I can enable logs again and see how long it takes then.

Ok, my logs reproducer is running for well over 7 hours now without triggering the issue, so this series improves the situation a lot. I keep trying but I think this is more than enough to convince me that this here is good. Tested-by: Paul Holzinger -- Paul Holzinger

Stefano Brivio

11:02 p.m.

On Mon, 1 Sep 2025 19:36:18 +0200 Paul Holzinger wrote:

...

On 01/09/2025 12:02, Paul Holzinger wrote:

...
Hi,

On 29/08/2025 22:11, Stefano Brivio wrote:

...
Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks"), window limits are enforced more aggressively with a bigger amount of zero-window updates compared to what happened with e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone, and occasional duplicate ACKs can now be seen also for local transfers with default (208 KiB) socket buffer sizes.

Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the pasta integration occasionally fail on the "TCP/IPv4 large transfer, tap" case.

While playing with a reproducer that seems to be matching those failures:

   while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done

and a kernel including that commit, I hit a few different failures, that should be fixed by this series.

Paul tested v1 of this series and found an additional failure (transfer timeout), which I could reproduce with a slightly different command:

   while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 EXEC:./write.sh & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done

where write.sh is simply:

   #!/bin/sh       cat > /tmp/large.rcv

so that the connection is not half-closed starting from the beginning, because socat can't make assumptions about the unidirectional nature of the traffic. This should now be fixed as well by the new version of patch 3/7.

v3:    - add patch 6/7    - in 7/7, check dlen <= 1 for keep-alive segments, instead of len <= 1

v2: in 3/6, rewind sequence also if the zero-window update comes in      the middle of a batch with non-zero window updates

Stefano Brivio (7):    tcp: FIN flags have to be retransmitted as well    tcp: Factor sequence rewind for retransmissions into a new function    tcp: Rewind sequence when guest shrinks window to zero    tcp: Fix closing logic for half-closed connections    tcp: Don't try to transmit right after the peer shrank the window to      zero    tcp: Cast operands of sequence comparison macros to uint32_t before      using them    tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path      consistent

tcp.c          | 181 ++++++++++++++++++++++++++++++++++--------------- tcp_internal.h | 12 ++-- 2 files changed, 136 insertions(+), 57 deletions(-) I am afraid I have to give bad news that it is still broken. My reproducer failed after 70 mins (without logs) which means it took longer this time but I only have one run so far so hard to tell. I can enable logs again and see how long it takes then.

Ok, my logs reproducer is running for well over 7 hours now without triggering the issue, so this series improves the situation a lot. I keep trying but I think this is more than enough to convince me that this here is good.

Tested-by: Paul Holzinger

Thanks for testing and re-testing. Just one question before I go ahead and merge this: how did the original failure from earlier on Tuesday look like? Was that again a timeout? Another thing worth trying: captures without logs, which should be much less overhead (hence difference in timing). I should be able to figure out issues of this sort with captures and no logs (it's much harder the other way around). -- Stefano

Paul Holzinger

2 Sep 2 Sep

11:27 a.m.

On 01/09/2025 23:02, Stefano Brivio wrote:

...

On Mon, 1 Sep 2025 19:36:18 +0200 Paul Holzinger wrote:

...
...
Hi,

On 29/08/2025 22:11, Stefano Brivio wrote:

...
Starting from Linux kernel commit 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks"), window limits are enforced more aggressively with a bigger amount of zero-window updates compared to what happened with e2142825c120 ("net: tcp: send zero-window ACK when no memory") alone, and occasional duplicate ACKs can now be seen also for local transfers with default (208 KiB) socket buffer sizes.

Paul reports that, with 6.17-rc1-ish kernels, Podman tests for the pasta integration occasionally fail on the "TCP/IPv4 large transfer, tap" case.

While playing with a reproducer that seems to be matching those failures:

   while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 OPEN:/tmp/large.rcv,trunc & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done

and a kernel including that commit, I hit a few different failures, that should be fixed by this series.

Paul tested v1 of this series and found an additional failure (transfer timeout), which I could reproduce with a slightly different command:

   while true; do ./pasta --trace -l /tmp/pasta.log -p /tmp/pasta.pcap --config-net -t 5555 -- socat TCP-LISTEN:5555 EXEC:./write.sh & (sleep 0.3; socat -T2 OPEN:large.bin TCP:88.198.0.164:5555; ); wait; diff large.bin /tmp/large.rcv || break; done

where write.sh is simply:

   #!/bin/sh       cat > /tmp/large.rcv

so that the connection is not half-closed starting from the beginning, because socat can't make assumptions about the unidirectional nature of the traffic. This should now be fixed as well by the new version of patch 3/7.

v3:    - add patch 6/7    - in 7/7, check dlen <= 1 for keep-alive segments, instead of len <= 1

v2: in 3/6, rewind sequence also if the zero-window update comes in      the middle of a batch with non-zero window updates

Stefano Brivio (7):    tcp: FIN flags have to be retransmitted as well    tcp: Factor sequence rewind for retransmissions into a new function    tcp: Rewind sequence when guest shrinks window to zero    tcp: Fix closing logic for half-closed connections    tcp: Don't try to transmit right after the peer shrank the window to      zero    tcp: Cast operands of sequence comparison macros to uint32_t before      using them    tcp: Fast re-transmit if half-closed, make TAP_FIN_RCVD path      consistent

tcp.c          | 181 ++++++++++++++++++++++++++++++++++--------------- tcp_internal.h | 12 ++-- 2 files changed, 136 insertions(+), 57 deletions(-) I am afraid I have to give bad news that it is still broken. My reproducer failed after 70 mins (without logs) which means it took longer this time but I only have one run so far so hard to tell. I can enable logs again and see how long it takes then. Ok, my logs reproducer is running for well over 7 hours now without

On 01/09/2025 12:02, Paul Holzinger wrote: triggering the issue, so this series improves the situation a lot. I keep trying but I think this is more than enough to convince me that this here is good.

Tested-by: Paul Holzinger Thanks for testing and re-testing.

Just one question before I go ahead and merge this: how did the original failure from earlier on Tuesday look like? Was that again a timeout? Yes from the podman test all failures looked the same so far, the podman logs --follow command times out because the container did not exit. Which happens because socat in the container did not exit as the tcp stream seems to be hanging/stay open.

Another thing worth trying: captures without logs, which should be much less overhead (hence difference in timing). I will try that then.

I should be able to figure out issues of this sort with captures and no logs (it's much harder the other way around).

-- Paul Holzinger

181

Age (days ago)

185

Last active (days ago)

List overview

Download

15 comments

3 participants

participants (3)

David Gibson
Paul Holzinger
Stefano Brivio

[PATCH v3 0/7] tcp: Fixes for issues uncovered by tests with 6.17-rc1 kernels

tags

participants (3)