On Fri, 5 Dec 2025 13:34:07 +1100
David Gibson
On Thu, Dec 04, 2025 at 08:45:39AM +0100, Stefano Brivio wrote:
...under two conditions:
- the remote peer is advertising a window bigger than our current sending buffer, meaning that a bigger sending buffer is likely to benefit throughput, AND
I think this condition is redundant: if the remote peer is advertising less, we'll clamp new_wnd_to_tap to that value anyway.
I almost fell for this. We have a subtractive term in the expression, so it's not actually the case: if the remote peer is advertising a smaller window, we just take the buffer size *minus pending bytes* as the limit, which can be smaller than the window advertised by the peer. If it's advertising a bigger window, we take an increased buffer size minus pending bytes as the limit, which can be bigger than the peer's window, so we'll use the peer's window as the limit instead. I added an example in v2 (now 7/9).
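To spell it out with made-up numbers (not the example from 7/9): say SNDBUF_GET(conn) is 256 KiB, sendq is 64 KiB, and we're already past SHORT_CONN_BYTES. Then:

- peer advertises 224 KiB (smaller than the buffer, no boost):
      limit = 256 KiB - 64 KiB = 192 KiB
      window to tap = MIN(224 KiB, 192 KiB) = 192 KiB
  that is, the limit, not the peer's window, is what caps us

- peer advertises 288 KiB (bigger than the buffer, boost applies):
      limit = SNDBUF_BOOST(256 KiB) - 64 KiB = 384 KiB - 64 KiB = 320 KiB
      window to tap = MIN(288 KiB, 320 KiB) = 288 KiB
  and here it's the peer's window that caps us instead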
- this is not a short-lived connection, where the latency cost of retransmissions would be otherwise unacceptable.
By doing this, we can reliably trigger TCP buffer size auto-tuning (as long as it's available) on bulk data transfers.
Signed-off-by: Stefano Brivio
---
 tcp.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tcp.c b/tcp.c
index 2220059..454df69 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,13 @@ enum {
 #define LOW_RTT_TABLE_SIZE	8
 #define LOW_RTT_THRESHOLD	10 /* us */
 
+/* Try to avoid retransmissions to improve latency on short-lived connections */
+#define SHORT_CONN_BYTES	(16ULL * 1024 * 1024)
+
+/* Temporarily exceed available sending buffer to force TCP auto-tuning */
+#define SNDBUF_BOOST_FACTOR	150 /* % */
+#define SNDBUF_BOOST(x)		((x) * SNDBUF_BOOST_FACTOR / 100)
+
For the short term, the fact this works empirically is enough. For the longer term, it would be nice to have a better understanding of what this "overcommit" amount is actually estimating.
I think what we're looking for is an estimate of the number of bytes that will have left the buffer by the time the guest gets back to us. So: <connection throughput> * <guest-side RTT>
I don't think we want the bandwidth-delay product here (which I'm now using earlier in the series), because the purpose here is to grow the buffer at the beginning of a connection, if it looks like bulk traffic. So we want to progressively exploit auto-tuning as long as we're limited by a small buffer, but not later: at some point we want to switch to the window advertised by the peer. Well, I tried with the bandwidth-delay product in any case, but it doesn't really help with auto-tuning. It turns out that auto-tuning behaves fundamentally differently at the beginning of a connection anyway.
Alas, I don't see a way to estimate either of those from the information we already track - we'd need additional bookkeeping.
It's all in struct tcp_info; the field is called tcpi_delivery_rate. There are other interesting bits there, by the way, that could be used in a further refinement.
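Just to sketch what I mean, not something I'm proposing as-is (the name is a placeholder, and the newer tcp_info fields need linux/tcp.h and a recent enough kernel):

#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>		/* IPPROTO_TCP */
#include <linux/tcp.h>		/* TCP_INFO, struct tcp_info with newer fields */

/* sock_drain_estimate() - Bytes the peer might absorb over one round-trip
 * @s:		Connected TCP socket
 *
 * Estimate <delivery rate> * <RTT> from what the kernel already tracks.
 * Note that tcpi_rtt is the socket-side RTT, not the guest-side one, so
 * this is only a rough proxy for the estimate discussed above.
 *
 * Return: estimate in bytes, negative error code on failure
 */
static int64_t sock_drain_estimate(int s)
{
	struct tcp_info ti = { 0 };
	socklen_t sl = sizeof(ti);

	if (getsockopt(s, IPPROTO_TCP, TCP_INFO, &ti, &sl))
		return -errno;

	/* Older kernels return a shorter struct: make sure the fields we
	 * read below were actually filled in.
	 */
	if (sl < sizeof(ti))
		return -ENOSYS;

	/* tcpi_delivery_rate: bytes per second, tcpi_rtt: microseconds */
	return (int64_t)(ti.tcpi_delivery_rate * ti.tcpi_rtt / 1000000);
}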
 #define ACK_IF_NEEDED	0	/* See tcp_send_flag() */
 
 #define CONN_IS_CLOSING(conn)						\
@@ -1137,6 +1144,9 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 
 		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
 			limit = 0;
+		else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn) &&
+			 tinfo->tcpi_bytes_acked > SHORT_CONN_BYTES)
This is pretty subtle; I think it would be worth having some rationale in a comment, not just the commit message.
I turned the macro into a new function and added comments there, in v2.
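Roughly along these lines (just a sketch here, not the actual v2 hunk, and with a placeholder name):

/* tcp_sndbuf_boost() - Overcommit the sending buffer to trigger auto-tuning
 * @sndbuf:	Sending buffer size currently reported by the kernel
 *
 * Report somewhat more sending space than we actually have, so that a bulk
 * sender keeps the socket buffer full: the kernel takes a full buffer as a
 * hint to grow it (auto-tuning).  The cost is a risk of retransmissions if
 * we can't absorb the extra data in time, which is why the caller only does
 * this once the peer advertises more than the current buffer and enough
 * bytes were acknowledged that this isn't a short-lived connection.
 *
 * Return: boosted buffer size, in bytes
 */
static int tcp_sndbuf_boost(int sndbuf)
{
	return sndbuf * SNDBUF_BOOST_FACTOR / 100;
}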
+			limit = SNDBUF_BOOST(SNDBUF_GET(conn)) - (int)sendq;
 		else
 			limit = SNDBUF_GET(conn) - (int)sendq;
--
2.43.0

--
Stefano