On Wed, 29 Oct 2025 16:59:51 +0800
Yumei Huang
On Wed, Oct 29, 2025 at 3:39 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 15:32:33 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 13:11:48 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 11:06:44 +0800 Yumei Huang
wrote: > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
wrote: > > > > On Tue, 28 Oct 2025 16:09:03 +0800 > > Yumei Huang wrote: > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio wrote: > > > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800 > > > > Yumei Huang wrote: > > > > > > > > > Use an exponential backoff timeout for data retransmission according > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed > > > > > in Appendix A of RFC 6298. > > > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK. > > > > > > > > > > Signed-off-by: Yumei Huang > > > > > Reviewed-by: David Gibson > > > > > --- > > > > > tcp.c | 27 ++++++++++++--------------- > > > > > 1 file changed, 12 insertions(+), 15 deletions(-) > > > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > > index 9385132..dc0ec6c 100644 > > > > > --- a/tcp.c > > > > > +++ b/tcp.c > > > > > @@ -179,16 +179,14 @@ > > > > > * > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags: > > > > > * > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection > > > > > - * > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the > > > > > - * socket and reset sequence to what was acknowledged. If this persists for > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data > > > > > + * from the socket and reset sequence to what was acknowledged. This is the > > > > > + * timeout for the first retry, in seconds. If this persists too many times > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the > > > > > + * handshake. > > > > > * > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset > > > > > @@ -342,8 +340,7 @@ enum { > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > > > > > > > > > #define ACK_INTERVAL 10 /* ms */ > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */ > > > > > -#define ACK_TIMEOUT 2 > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */ > > > > > #define FIN_TIMEOUT 60 > > > > > #define ACT_TIMEOUT 7200 > > > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > > > > if (!(conn->events & ESTABLISHED)) { > > > > > if (conn->retries < c->tcp.syn_linear_timeouts) > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT; > > > > > + it.it_value.tv_sec = RTO_INIT; > > > > > else > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << > > > > > + it.it_value.tv_sec = RTO_INIT << > > > > > (conn->retries - c->tcp.syn_linear_timeouts); > > > > > } > > > > > else > > > > > - it.it_value.tv_sec = ACK_TIMEOUT; > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries; > > > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign > > > > RTO_INIT, and multiply as needed in the if / else clauses. > > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1. > > > > Why can't you do that? Say: > > > > it.it_value.tv_sec = RTO_INIT > > if (!(conn->events & ESTABLISHED)) { > > if (conn->retries >= c->tcp.syn_linear_timeouts) > > it.it_value.tv_sec <<= (conn->retries - > > c->tcp.syn_linear_timeouts); > > > > but anyway, see below. > > > > > But it seems more complicated. What do you think? > > > > Or maybe, building on my latest comment to 3/4: > > > > int factor = conn->retries; > > > > if (!(conn->events & ESTABLISHED)) > > factor -= c->tcp.syn_linear_timeouts; > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0); > > > > ? > > Yeah, I understand this part now. > > > > > > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > > > > > it.it_value.tv_sec = FIN_TIMEOUT; > > > > > } else { > > > > > > > > The rest of the series looks good to me. > > > > > > > > It might be slightly more practical to factor in directly the RTO > > > > clamp, and I don't think it's complicated now that you have the helper > > > > from 2/4, but it's not a strong preference from my side, as the series > > > > makes sense in any case. > > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is > > > about the way we get the total time for retries. > > > > > > I used to do it like this in v2, > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...: > > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \ > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2)) > > > > > > Though the formula is not quite right, we could refine it as below: > > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1)) > > > > > > Does it make sense to get the time this way? > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right? > > Not really, it's only used for data retransmission, so > syn_linear_timeouts is not relevant. Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one stating:
I meant RETRY_ELAPSED was only used for data retransmission, which uses exponential backoff timeout directly, so syn_linear_timeouts was not relevant.
Ah, okay.
(2.5) A maximum value MAY be placed on RTO provided it is at least 60 seconds.
For SYN retries, as we used linear backoff + exponential backoff, and also limited by TCP_MAX_RETRIES, the possible max RTO is far less than 60s. So we didn't clamp it. Do you think we need to clamp it as well?
If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or just up to ... + 2^6, that is, 119 seconds?
In any case, what is the difference compared to data retransmissions?
You are right. I assumed wrongly that syn_linear_timeouts is always 4. When it's 0, it's the same as data retransmissions.
Don't we have 3 bits to store the retry count as well, so we're limited by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same counter.
Yes, it's the same counter. I guess you mean we should clamp it as well.
I didn't check this part, I thought we already did, but if we don't, we should do that, yes.
...the only thing that I don't see implemented in this version of the patch is paragraph 5.7:
(5.7) If the timer expires awaiting the ACK of a SYN segment and the TCP implementation is using an RTO less than 3 seconds, the RTO MUST be re-initialized to 3 seconds when data transmission begins (i.e., after the three-way handshake completes).
I missed that while reviewing earlier versions. I guess we need to use a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's simpler than re-introducing separate starting values (one second and three seconds).
Sorry I'd like to confirm one more thing. By "use MAX(x, 3) clamp if (c->conn->events & ESTABLISHED)", I guess you mean:
it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
Right.
Then the code would be:
it.it_value.tv_sec = RTO_INIT; if (!(conn->events & ESTABLISHED)) { int exp = conn->retries - c->tcp.syn_linear_timeouts; it.it_value.tv_sec <<= MAX(exp, 0); } else { it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3); it.it_value.tv_sec <<= conn->retries;
Hmm what's wrong with my previous suggestion of initialising 'exp' with conn->retries? Then it becomes (equivalent to your code snippet, but shorter):
} it.it_value.tv_sec = MIN( it.it_value.tv_sec, c->tcp.tcp_rto_max);
something like: int exp = conn->retries, min = 0, max = c->tcp.tcp_rto_max; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; else min = /* add a constant for this */ 3; it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); it.it_value.tv_sec = CLAMP(it.it_value.tv_sec, min, max); ...we don't have a CLAMP() macro at the moment, just MIN() and MAX() in util.h, but now that we need one, I would add it, similar to the one from the Linux kernel but without type checking (as it's not really practical here). Inspired from include/linux/kernel.h (current Linux kernel tree): #define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi) we can do something similar with two differences: 1. all our macros are uppercase (we don't have as many as the kernel, and it's nice to know that they are macros from the name) and 2. as I mentioned we don't need/want type checking, so, say: #define CLAMP(x, min, max) MIN(MAX((x), (min)), (max)) ...or keep val / lo / hi if it's clearer, I don't really have a preference. I added parentheses around the arguments by the way because I think it's good practice, even though not needed in this case. It's the PRE02-C recommendation in CERT C (we generally stick to those recommendations whenever practical): https://wiki.sei.cmu.edu/confluence/display/c/PRE02-C.+Macro+replacement+lis... https://clang.llvm.org/extra/clang-tidy/checks/bugprone/macro-parentheses.ht... Note that the Linux kernel has a compatible license (it's the same, actually, GPLv2+), and, regardless of that, the implementation is trivial and the idea is obvious, so I don't think we need to give explicit attribution in this case. But in other cases we do, or I guess it's fair to the author anyway, see for example siphash.h.
we basically set the starting value to 3 for data retransmissions no matter if we have retried SYN. And the max total timeout would be 3+6+12+24+48+96+120+120 = 429s. Is it what we want?
Ah, yes. In the discussion I previously assumed that the default clamp value was 60 seconds, but it's actually 120 seconds. It looks correct to me.
Besides, I guess I need to define a macro for "3" as well, like "ACK_TIMEOUT_INIT"?
Or RTO_INIT_ACK? Maybe RTO_INIT_DATA? I'm thinking that you already have a RTO_INIT at this point, and this constant is very closely related to that one.
I'm not sure I understand here. If the timer expires, didn't we reset the connection directly? We would never get to the data transmission phase?
In the RFC, "the timer expires" indicates one iteration of the timeout algorithm, that is, it's the same as our timer (tcp_timer_handler()) triggering.
I see.
If you agree, I can add a new patch in this series to address the clamping.
I don't really have a preference, I guess it could be directly in 4/4 or in a separate patch, neither option should complicate things too much.
The paragraph says "after the three-way handshake completes", so it looks like the connection wasn't reset.
> Probably I should name it more clearly. > > > > > But in any case, do you really need to calculate that explicitly? I was > > thinking that you can just clamp the value when you use it in > > tcp_timer_ctl(). > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value > > you read from the kernel, then I guess it's just: > > > > --- > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max); > > After reading the comments in v3 when tcp_rto_max_ms was first > mentioned again, I realized I got something wrong again. I thought it > was for the total timeout for all retries, so I need to calculate that > and decide to reset the connection as in v2.
I think it actually applies to all the retries.
> Anyway, you are right. We don't need to do that. Thanks for your patience. > > > } ... > > > > if (timerfd_settime(conn->timer, 0, &it, NULL)) > > flow_perror(conn, "failed to set timer"); > > ---
-- Stefano