On Fri, 5 Dec 2025 13:34:07 +1100
David Gibson
On Thu, Dec 04, 2025 at 08:45:39AM +0100, Stefano Brivio wrote:
...under two conditions:
- the remote peer is advertising a window bigger than our current sending buffer, meaning that a bigger sending buffer is likely to benefit throughput, AND
I think this condition is redundant: if the remote peer is advertising less, we'll clamp new_wnd_to_tap to that value anyway.
I almost fell for this. We have a subtractive term in the expression, so it's not actually the case: if the remote peer is advertising a smaller window, we just take the buffer size *minus pending bytes* as the limit, which can be smaller than the window advertised by the peer. If it's advertising a bigger window, we take an increased buffer size minus pending bytes as the limit, which can be bigger than the peer's window, so we'll use the peer's window as the limit instead. I added an example in v2 (now 7/9).
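To spell it out with made-up numbers (not the example from 7/9): say SNDBUF_GET(conn) is 256 KiB, sendq is 64 KiB, and we're already past SHORT_CONN_BYTES. Then:

- peer advertises 224 KiB (smaller than the buffer, no boost):
      limit = 256 KiB - 64 KiB = 192 KiB
      window to tap = MIN(224 KiB, 192 KiB) = 192 KiB
  that is, the limit, not the peer's window, is what caps us

- peer advertises 288 KiB (bigger than the buffer, boost applies):
      limit = SNDBUF_BOOST(256 KiB) - 64 KiB = 384 KiB - 64 KiB = 320 KiB
      window to tap = MIN(288 KiB, 320 KiB) = 288 KiB
  and here it's the peer's window that caps us instead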
- this is not a short-lived connection, where the latency cost of retransmissions would be otherwise unacceptable.
By doing this, we can reliably trigger TCP buffer size auto-tuning (as long as it's available) on bulk data transfers.
Signed-off-by: Stefano Brivio
---
 tcp.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tcp.c b/tcp.c
index 2220059..454df69 100644
--- a/tcp.c
+++ b/tcp.c
@@ -353,6 +353,13 @@ enum {
 #define LOW_RTT_TABLE_SIZE	8
 #define LOW_RTT_THRESHOLD	10 /* us */
 
+/* Try to avoid retransmissions to improve latency on short-lived connections */
+#define SHORT_CONN_BYTES	(16ULL * 1024 * 1024)
+
+/* Temporarily exceed available sending buffer to force TCP auto-tuning */
+#define SNDBUF_BOOST_FACTOR	150 /* % */
+#define SNDBUF_BOOST(x)		((x) * SNDBUF_BOOST_FACTOR / 100)
+
For the short term, the fact this works empirically is enough. For the longer term, it would be nice to have a better understanding of what this "overcommit" amount is actually estimating.
I think what we're looking for is an estimate of the number of bytes that will have left the buffer by the time the guest gets back to us. So: <connection throughput> * <guest-side RTT>
I don't think we want the bandwidth-delay product here (which I'm now using earlier in the series), because the purpose here is to grow the buffer at the beginning of a connection, if it looks like bulk traffic. So we want to progressively exploit auto-tuning as long as we're limited by a small buffer, but not later: at some point we want to switch to the window advertised by the peer. Well, I tried with the bandwidth-delay product in any case, but it doesn't really help with auto-tuning. It turns out that auto-tuning behaves fundamentally differently at the beginning of a connection anyway.
Alas, I don't see a way to estimate either of those from the information we already track - we'd need additional bookkeeping.
It's all in struct tcp_info; the field is called tcpi_delivery_rate. There are other interesting bits there, by the way, that could be used in a further refinement.
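Just to sketch what I mean, not something I'm proposing as-is (the name is a placeholder, and the newer tcp_info fields need linux/tcp.h and a recent enough kernel):

#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>		/* IPPROTO_TCP */
#include <linux/tcp.h>		/* TCP_INFO, struct tcp_info with newer fields */

/* sock_drain_estimate() - Bytes the peer might absorb over one round-trip
 * @s:		Connected TCP socket
 *
 * Estimate <delivery rate> * <RTT> from what the kernel already tracks.
 * Note that tcpi_rtt is the socket-side RTT, not the guest-side one, so
 * this is only a rough proxy for the estimate discussed above.
 *
 * Return: estimate in bytes, negative error code on failure
 */
static int64_t sock_drain_estimate(int s)
{
	struct tcp_info ti = { 0 };
	socklen_t sl = sizeof(ti);

	if (getsockopt(s, IPPROTO_TCP, TCP_INFO, &ti, &sl))
		return -errno;

	/* Older kernels return a shorter struct: make sure the fields we
	 * read below were actually filled in.
	 */
	if (sl < sizeof(ti))
		return -ENOSYS;

	/* tcpi_delivery_rate: bytes per second, tcpi_rtt: microseconds */
	return (int64_t)(ti.tcpi_delivery_rate * ti.tcpi_rtt / 1000000);
}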
 #define ACK_IF_NEEDED	0	/* See tcp_send_flag() */
 
 #define CONN_IS_CLOSING(conn)						\
@@ -1137,6 +1144,9 @@ int tcp_update_seqack_wnd(const struct ctx *c, struct tcp_tap_conn *conn,
 
 		if ((int)sendq > SNDBUF_GET(conn)) /* Due to memory pressure? */
 			limit = 0;
+		else if ((int)tinfo->tcpi_snd_wnd > SNDBUF_GET(conn) &&
+			 tinfo->tcpi_bytes_acked > SHORT_CONN_BYTES)
This is pretty subtle; I think it would be worth having some rationale in a comment, not just the commit message.
I turned the macro into a new function and added comments there, in v2.
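Roughly along these lines (just a sketch here, not the actual v2 hunk, and with a placeholder name):

/* tcp_sndbuf_boost() - Overcommit the sending buffer to trigger auto-tuning
 * @sndbuf:	Sending buffer size currently reported by the kernel
 *
 * Report somewhat more sending space than we actually have, so that a bulk
 * sender keeps the socket buffer full: the kernel takes a full buffer as a
 * hint to grow it (auto-tuning).  The cost is a risk of retransmissions if
 * we can't absorb the extra data in time, which is why the caller only does
 * this once the peer advertises more than the current buffer and enough
 * bytes were acknowledged that this isn't a short-lived connection.
 *
 * Return: boosted buffer size, in bytes
 */
static int tcp_sndbuf_boost(int sndbuf)
{
	return sndbuf * SNDBUF_BOOST_FACTOR / 100;
}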
+			limit = SNDBUF_BOOST(SNDBUF_GET(conn)) - (int)sendq;
 		else
 			limit = SNDBUF_GET(conn) - (int)sendq;
--
2.43.0

--
Stefano