On Thu, 23 Apr 2026 21:06:15 -0400
Jon Maloy
The TCP window advertised to the guest/container must balance two competing needs: large enough to trigger kernel socket buffer auto-tuning, but not so large that sendmsg() partially fails, causing retransmissions.
The current approach uses the difference (SNDBUF_GET() - SIOCOUTQ), but SNDBUF_GET() returns a scaled value that only roughly accounts for per-skb overhead. The clamped_scale approximation doesn't accurately track the actual per-segment overhead, which can lead to both excessive retransmissions and reduced throughput.
We now use the SO_MEMINFO socket option to obtain SK_MEMINFO_SNDBUF and SK_MEMINFO_WMEM_QUEUED from the kernel. The latter is reported in the kernel's own accounting units, i.e. including the sk_buff overhead, and matches exactly what the kernel's own sk_stream_memory_free() function uses.
When data is queued and the overhead ratio is observable, we calculate the per-segment overhead as (wmem_queued - sendq) / num_segments, then determine how many additional segments should fit in the remaining buffer space, considering the calculated per-mss overhead. This approach treats segments as discrete quantities, and produces a more accurate estimate of available buffer space than a linear scaling factor does.
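The per-segment estimate described above can be sketched as a standalone helper. This is a hypothetical illustration, not the patch code itself; the function name and the example numbers are mine:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the per-segment estimate: sb is SK_MEMINFO_SNDBUF, wq is
 * SK_MEMINFO_WMEM_QUEUED (payload plus sk_buff overhead), sendq the
 * SIOCOUTQ payload byte count, mss the maximum segment size.
 */
static uint32_t wnd_estimate(uint32_t sb, uint32_t wq, uint32_t sendq,
			     uint32_t mss)
{
	uint32_t nsegs, overhead, remaining;

	if (wq > sb)
		return 0;		/* queue already over budget */

	/* Observed per-segment sk_buff cost on the data queued so far */
	nsegs = (sendq / mss) ? sendq / mss : 1;
	overhead = (wq - sendq) / nsegs;

	/* How many whole extra segments, overhead included, still fit? */
	remaining = sb - wq;
	return remaining / (mss + overhead) * mss;
}
```

For example, with sb = 212992, mss = 1460, sendq = 14600 (ten queued segments) and wq = 20000, the observed overhead is 540 bytes per segment, so 96 more segments fit and the estimate is 140160 bytes.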
When the ratio cannot be observed, e.g. because the queue is empty or we are in a transient state, we fall back to the existing clamped_scale calculation (scaling between 100% and 75% of buffer capacity).
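For reference, the fallback can be pictured as a clamped linear interpolation. The sketch below is a hypothetical reimplementation inferred only from the description above; the real clamped_scale() in passt may differ in detail:

```c
#include <stdint.h>

/* Hypothetical reimplementation of the fallback scaling: scale v by a
 * factor moving linearly from 100% when ref <= small down to pct% when
 * ref >= big. Inferred from the cover letter, not taken from passt.
 */
static uint32_t clamped_scale_sketch(uint32_t v, uint32_t ref,
				     uint32_t small, uint32_t big,
				     unsigned pct)
{
	if (ref <= small)
		return v;
	if (ref >= big)
		return v / 100 * pct;

	/* Linear interpolation between the two clamps */
	return v - (v * (100 - pct) / 100) * (ref - small) / (big - small);
}
```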
When SO_MEMINFO succeeds, we also use SK_MEMINFO_SNDBUF directly to set SNDBUF, avoiding a separate SO_SNDBUF getsockopt() call.
If SO_MEMINFO is unavailable, we fall back to the pre-existing SNDBUF_GET() - SIOCOUTQ calculation.
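The two-level lookup can be sketched as follows. This is a hypothetical, Linux-only helper with names of my choosing, not the patch code; note the fallback path reports payload bytes only, without sk_buff overhead:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/sockios.h>	/* SIOCOUTQ */
#include <linux/sock_diag.h>	/* SK_MEMINFO_* */

#ifndef SO_MEMINFO
#define SO_MEMINFO 55		/* older libc headers may lack it */
#endif

/* Fetch send buffer size and queued bytes via SO_MEMINFO in a single
 * getsockopt(), falling back to SO_SNDBUF plus SIOCOUTQ on old kernels.
 */
static int get_sndbuf_space(int s, uint32_t *sndbuf, uint32_t *queued)
{
	uint32_t mem[SK_MEMINFO_VARS];
	socklen_t sl = sizeof(mem);
	int v = 0, outq = 0;
	socklen_t vl = sizeof(v);

	if (!getsockopt(s, SOL_SOCKET, SO_MEMINFO, mem, &sl)) {
		*sndbuf = mem[SK_MEMINFO_SNDBUF];
		*queued = mem[SK_MEMINFO_WMEM_QUEUED];
		return 0;
	}

	/* Pre-SO_MEMINFO fallback: two separate calls */
	if (getsockopt(s, SOL_SOCKET, SO_SNDBUF, &v, &vl))
		return -1;
	if (ioctl(s, SIOCOUTQ, &outq))
		outq = 0;

	*sndbuf = (uint32_t)v;
	*queued = (uint32_t)outq;
	return 0;
}
```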
We should also add: Link: https://github.com/containers/podman/issues/28219
Signed-off-by: Jon Maloy
---
v2: Updated according to feedback from Stefano. My own measurements indicate
    that this approach largely solves both the retransmission and throughput
    issues observed with the previous version.
I ran quite extensive tests (not yet with sub-millisecond RTTs, though, and we definitely should) and this looks like magic. Not a single retransmission, throughput converges very quickly, and in general (especially with higher RTTs) the throughput is better than without pasta (!). But I can't understand why. :) I think the approach I suggested, treating segments as discrete quantities, should be slightly more accurate, but I can't see why it would make such a big difference. Is it perhaps about the "early" (!sendq) phase, where you now went back to the previous approach instead of keeping a flat x / 3 * 4 proportion? Any clue? A few preliminary comments below:
---
 tcp.c      | 42 ++++++++++++++++++++++++++++++++++--------
 tcp_conn.h |  2 +-
 2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/tcp.c b/tcp.c
index 43b8fdb..2ba08fd 100644
--- a/tcp.c
+++ b/tcp.c
@@ -295,6 +295,7 @@
 #include
 #include
+#include

 #include "checksum.h"
 #include "util.h"
@@ -1128,19 +1129,44 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
 	} else {
 		unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
+		uint32_t mem[SK_MEMINFO_VARS];
+		socklen_t mem_sl = sizeof(mem);
+		int mss = MSS_GET(conn);
 		uint32_t sendq;
-		int limit;
+		uint32_t limit;

 		if (ioctl(s, SIOCOUTQ, &sendq)) {
 			debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
 			sendq = 0;
 		}

-		tcp_get_sndbuf(conn);

-		if ((int)sendq > SNDBUF_GET(conn))	/* Due to memory pressure? */
I think it would be good to preserve this comment: I found it quite surprising that this case actually occurred in my tests, and that we even had to check for it.
-			limit = 0;
-		else
-			limit = SNDBUF_GET(conn) - (int)sendq;
+		if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {
+			tcp_get_sndbuf(conn);
+			if (sendq > SNDBUF_GET(conn))
+				limit = 0;
+			else
+				limit = SNDBUF_GET(conn) - sendq;
This functionally makes sense but I wonder if we could structure things differently because it's becoming a bit hard to follow:

- (long overdue even before your patch) turn this into a new function
- handle the exceptional case (SO_MEMINFO failing) more clearly as an exception
- decouple checks from calculations

Something like (I didn't finish or really think it through):

---
/**
 * tcp_wnd_from_sndbuf() - Calculate window value from available sending buffer
 * @tinfo:	tcp_info from kernel
 *
 * Return: window value to advertise, not scaled
 *
 * #syscalls ioctl
 */
uint32_t tcp_wnd_from_sndbuf(struct tcp_info *tinfo)
{
	unsigned rtt_ms_ceiling = DIV_ROUND_UP(tinfo->tcpi_rtt, 1000);
	uint32_t mem[SK_MEMINFO_VARS];
	socklen_t mem_sl = sizeof(mem);
	int mss = MSS_GET(conn);
	uint32_t sendq, sndbuf;
	uint32_t limit;

	if (getsockopt(s, SOL_SOCKET, SO_MEMINFO, &mem, &mem_sl)) {
		if (ioctl(s, SIOCOUTQ, &sendq)) {
			debug_perror("SIOCOUTQ on socket %i, assuming 0", s);
			sendq = 0;
		}

		tcp_get_sndbuf(conn);

		if (sendq > SNDBUF_GET(conn))	/* Due to memory pressure? */
			limit = 0;
		else
			limit = SNDBUF_GET(conn) - sendq;

		goto clamp;
	}

	sndbuf = mem[SK_MEMINFO_SNDBUF];
	sendq = mem[SK_MEMINFO_WMEM_QUEUED];

	if (sendq > sndbuf)
		limit = 0;

	...

clamp:
	...
}

...

int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
			  bool force_seq, struct tcp_info_linux *tinfo)
{
	...
	if ((conn->flags & LOCAL) || tcp_rtt_dst_low(conn))
		new_wnd_to_tap = tinfo->tcpi_snd_wnd;
	else
		new_wnd_to_tap = tcp_wnd_from_sndbuf(tinfo);
	...
}
---
+		} else {
+			uint32_t sb = mem[SK_MEMINFO_SNDBUF];
+			uint32_t wq = mem[SK_MEMINFO_WMEM_QUEUED];
+			uint32_t cs = clamped_scale(sb, sb, SNDBUF_SMALL,
+						    SNDBUF_BIG, 75);
+
+			SNDBUF_SET(conn, MIN(INT_MAX, cs));
+
+			if (wq > sb) {
+				limit = 0;
+			} else if (!sendq || wq <= sendq || !mss) {
+				limit = SNDBUF_GET(conn) - sendq;
Isn't the new 'wq' value (which I would call 'sendq' if doable, while not fetching SIOCOUTQ) more accurate than sendq anyway? That is, shouldn't we rather use SNDBUF_GET(conn) - wq here?
+			} else {
+				uint32_t nsegs = MAX(sendq / mss, 1);
+				uint32_t overhead = (wq - sendq) / nsegs;
+				uint32_t remaining = sb - wq;
+
+				nsegs = remaining / (mss + overhead);
+				limit = nsegs * mss;
...magic. Maybe it's really this.
+			}
+		}
 		/* If the sender uses mechanisms to prevent Silly Window
 		 * Syndrome (SWS, described in RFC 813 Section 3) it's critical
@@ -1168,11 +1194,11 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 		 * but we won't send enough to fill one because we're stuck
 		 * with pending data in the outbound queue
 		 */
-		if (limit < MSS_GET(conn) && sendq &&
+		if (limit < (uint32_t)MSS_GET(conn) && sendq &&
 		    tinfo->tcpi_last_data_sent < rtt_ms_ceiling * 10)
 			limit = 0;

-		new_wnd_to_tap = MIN((int)tinfo->tcpi_snd_wnd, limit);
+		new_wnd_to_tap = MIN(tinfo->tcpi_snd_wnd, limit);
 	}

 	new_wnd_to_tap = MIN(new_wnd_to_tap, MAX_WINDOW);

diff --git a/tcp_conn.h b/tcp_conn.h
index 6985426..9f5bee0 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -98,7 +98,7 @@ struct tcp_tap_conn {
 #define SNDBUF_BITS		24
 	unsigned int	sndbuf		:SNDBUF_BITS;
 #define SNDBUF_SET(conn, bytes)	(conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS)))
-#define SNDBUF_GET(conn)	(conn->sndbuf << (32 - SNDBUF_BITS))
+#define SNDBUF_GET(conn)	((uint32_t)(conn->sndbuf << (32 - SNDBUF_BITS)))

 	uint8_t		seq_dup_ack_approx;
--
Stefano