[PATCH v7 0/5] Retry SYNs for inbound connections
When a client connects, SYN would be sent to guest only once. If the guest is not connected or ready at that time, the connection will be reset in 10s. These patches introduce the SYN retry mechanism using the similar backoff timeout as linux kernel. Also update the data retransmission timeout using the backoff timeout. v7: - Update read_file() and read_file_integer() - Rename tcp_syn_params_init() to tcp_get_rto_params() - Modify the implementation of the timeout assignment - Add a patch to clamp the retry timeout Yumei Huang (5): tcp: Rename "retrans" to "retries" util: Introduce read_file() and read_file_integer() function tcp: Resend SYN for inbound connections tcp: Update data retransmission timeout tcp: Clamp the retry timeout tcp.c | 86 ++++++++++++++++++++++++++++++++++++++++++------------ tcp.h | 6 ++++ tcp_conn.h | 12 ++++---- util.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 2 ++ 5 files changed, 168 insertions(+), 24 deletions(-) -- 2.49.0
Rename "retrans" to "retries" so it can be used for SYN retries.
Signed-off-by: Yumei Huang
Signed-off-by: Yumei Huang
If a client connects while guest is not connected or ready yet,
resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as Linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153
Signed-off-by: Yumei Huang
Use an exponential backoff timeout for data retransmission according
to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Clamp the TCP retry timeout as Linux kernel does. If RTO is less
than 3 seconds, re-initialize it to 3 seconds for data retransmissions
according to RFC 6298.
Suggested-by: Stefano Brivio
On Fri, Oct 31, 2025 at 1:43 PM Yumei Huang
Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
I probably should have removed the reviewed-by since there are some code changes.
--- tcp.c | 30 ++++++++++++------------------ 1 file changed, 12 insertions(+), 18 deletions(-)
diff --git a/tcp.c b/tcp.c index bada88a..96ee56a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,13 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. Retry for - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, - * reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times + * for established connections, or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +339,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) { - int exp = conn->retries - c->tcp.syn_linear_timeouts; - it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0); - } - else - it.it_value.tv_sec = ACK_TIMEOUT; + int exp = conn->retries; + if (!(conn->events & ESTABLISHED)) + exp -= c->tcp.syn_linear_timeouts; + it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { -- 2.49.0
-- Thanks, Yumei Huang
On Fri, 31 Oct 2025 13:56:41 +0800
Yumei Huang
On Fri, Oct 31, 2025 at 1:43 PM Yumei Huang
wrote: Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson I probably should have removed the reviewed-by since there are some code changes.
Thanks for reporting. I eventually notice during review and drop tags as needed but if you notice you carried a tag by mistake, it's very appreciated that you report it as it saves me some time and probably avoid confusion for other reviewers. -- Stefano
On Fri, 31 Oct 2025 13:42:42 +0800
Yumei Huang
Clamp the TCP retry timeout as Linux kernel does. If RTO is less than 3 seconds, re-initialize it to 3 seconds for data retransmissions according to RFC 6298.
Suggested-by: Stefano Brivio
Signed-off-by: Yumei Huang --- tcp.c | 25 ++++++++++++++++++++----- tcp.h | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 96ee56a..84a6700 100644 --- a/tcp.c +++ b/tcp.c @@ -187,6 +187,9 @@ * for established connections, or (tcp_syn_retries + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
Nit: we started with BrE (British English) spelling and later tried to keep that consistent for the sake of grep, so we'll usually use "initialise" (for consistency, at this point).
+ * data retransmissions. + * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset * the connection @@ -340,6 +343,7 @@ enum {
#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
#define TCP_SYN_RETRIES_DEFAULT 6 #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 +#define TCP_RTO_MAX_MS_DEFAULT 120000
/* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - int exp = conn->retries; + int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); + else + timeout = MAX(timeout, RTO_INIT_ACK); + timeout <<= MAX(exp, 0); + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void) */ void tcp_get_rto_params(struct ctx *c) { - intmax_t tcp_syn_retries, syn_linear_timeouts; + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
tcp_syn_retries = read_file_integer( TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer( TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); + tcp_rto_max_ms = read_file_integer( + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + c->tcp.tcp_rto_max = MIN( + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
size_t looks like a rather weird choice for tcp_rto_max: size_t is used to represent the size of objects, and nothing else (not a timeout in milliseconds). It's also an int in ipv4_net_table[], net/ipv4/sysctl_net_ipv4.c (Linux kernel). Any reason for picking size_t here? I mean, it will work, it just looks weird (and wastes a tiny bit of space, even though I guess we don't care about 4 bytes there).
- debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); + debug("Read sysctl values tcp_syn_retries: %"PRIu8 + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu", + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts, + c->tcp.tcp_rto_max); }
/** diff --git a/tcp.h b/tcp.h index befedde..a238bb7 100644 --- a/tcp.h +++ b/tcp.h @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_rto_max: Maximal retry timeout (in s)
Nit: "maximal" has a slightly different meaning compared to "maximum". The highest value allowed for a field would typically be called "maximum", while "maximal" is more commonly used to indicate a value / element that's the biggest of all values. Yes, I know, it's complicated.
* @tcp_syn_retries: SYN retries using exponential backoff timeout * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ @@ -67,6 +68,7 @@ struct tcp_ctx { struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + size_t tcp_rto_max; uint8_t tcp_syn_retries; uint8_t syn_linear_timeouts; };
The rest of the series looks good to me at a *very* quick glance, but I can't claim I really reviewed it yet. -- Stefano
On Fri, Oct 31, 2025 at 01:42:40PM +0800, Yumei Huang wrote:
If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as Linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
--- tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 4 ++++ 2 files changed, 53 insertions(+), 8 deletions(-)
diff --git a/tcp.c b/tcp.c index 2ec4b0c..bada88a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
Nit: I'd maybe say SYN,ACK here to distinguish it from other ACKs.
+ * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. Retry for + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, + * reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ #define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" + +#define TCP_SYN_RETRIES_DEFAULT 6 +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 + /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + int exp = conn->retries - c->tcp.syn_linear_timeouts; + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0); + } else
A non-obvious detail of the style we use is that if one branch of an if uses { }, then the other branch should as well, even if it's one line. So above should be "} else {" and the bit below changed to match.
it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) {
Here's the theoretical bug. We only clamp tcp_syn_retries and syn_linear_timeouts to a uint8_t, so if the host has these set to strangely large values, this addition could overflow.
+ flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++; + tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,26 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_get_rto_params() - Get host kernel RTO parameters + * @c: Execution context +*/ +void tcp_get_rto_params(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer( + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); + syn_linear_timeouts = read_file_integer( + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); + + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_get_rto_params(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..befedde 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: SYN retries using exponential backoff timeout + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
#endif /* TCP_H */ -- 2.49.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Oct 31, 2025 at 01:42:39PM +0800, Yumei Huang wrote:
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
--- util.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 2 ++ 2 files changed, 88 insertions(+)
diff --git a/util.c b/util.c index c492f90..b83de0d 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,92 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a NULL-terminated buffer + * @path: Path to file to read + * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on error, -ENOBUFS on truncation +*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc; + } + + close(fd); + + if (total_read == buf_size) { + warn("File %s contents exceed buffer size %zu", path, + buf_size); + buf[buf_size - 1] = '\0'; + return -ENOBUFS; + } + + buf[total_read] = '\0'; + + return total_read; +} + +/** + * read_file_integer() - Read an integer value from a file + * @path: Path to file to read + * @fallback: Default value if file can't be read + * + * Return: integer value, @fallback on failure +*/ +intmax_t read_file_integer(const char *path, intmax_t fallback) +{ + ssize_t bytes_read; + char buf[BUFSIZ]; + intmax_t value; + char *end; + + bytes_read = read_file(path, buf, sizeof(buf)); + + if (bytes_read < 0) + return fallback; + + if (bytes_read == 0) { + debug("Empty file %s", path); + return fallback; + } + + errno = 0; + value = strtoimax(buf, &end, 10); + if (*end && *end != '\n') { + debug("Non-numeric content in %s", path); + return fallback; + } + if (errno) { + debug("Out of range value in %s: %s", path, buf); + return fallback; + } + + return value; +} + #ifdef __ia64__ /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(), * use the description from clone(2). diff --git a/util.h b/util.h index 22eaac5..595a17d 100644 --- a/util.h +++ b/util.h @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid); int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); +ssize_t read_file(const char *path, char *buf, size_t buf_size); +intmax_t read_file_integer(const char *path, intmax_t fallback); int write_all_buf(int fd, const void *buf, size_t len); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); int read_all_buf(int fd, void *buf, size_t len); -- 2.49.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote:
Clamp the TCP retry timeout as Linux kernel does. If RTO is less than 3 seconds, re-initialize it to 3 seconds for data retransmissions according to RFC 6298.
Suggested-by: Stefano Brivio
Signed-off-by: Yumei Huang --- tcp.c | 25 ++++++++++++++++++++----- tcp.h | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 96ee56a..84a6700 100644 --- a/tcp.c +++ b/tcp.c @@ -187,6 +187,9 @@ * for established connections, or (tcp_syn_retries + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for + * data retransmissions. + * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset * the connection @@ -340,6 +343,7 @@ enum {
#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
#define TCP_SYN_RETRIES_DEFAULT 6 #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 +#define TCP_RTO_MAX_MS_DEFAULT 120000
/* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - int exp = conn->retries; + int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); + else + timeout = MAX(timeout, RTO_INIT_ACK);
Possibly I missed something, since I only skimmed your discussion of this behaviour with Stefano. But I'm not convinced this is a correct interpretation of the RFC. (5.7) says "If the timer expires awaiting the ACK of a SYN segment ...". That is, I think it's only suggesting increasing the RTO to 3 at the data stage *if* we had at least one retry during the handshake. That is, unfortunately, much fiddlier to implement, since we need to remember what happened during the handshake to apply it here. Additionally, if I'm reading the RFC correctly, it's treating this as a one-time adjustment of the RTO, which won't necessarily remain the case for the entire data phase. Here this minimum will apply for the entire data phase. Even though it's a "MUST" in the RFC, I kind of think we could just skip this for two reasons: 1) We already don't bother with RTT measurements, which the RFC assumes the implementation is doing to adjust the RTO. 2) We expect to be talking to a guest. The chance of a high RTT is vanishingly small compared to a path to potentially any host on the 2011 internet.
+ timeout <<= MAX(exp, 0); + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void) */ void tcp_get_rto_params(struct ctx *c) { - intmax_t tcp_syn_retries, syn_linear_timeouts; + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
tcp_syn_retries = read_file_integer( TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer( TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); + tcp_rto_max_ms = read_file_integer( + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + c->tcp.tcp_rto_max = MIN( + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
- debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); + debug("Read sysctl values tcp_syn_retries: %"PRIu8 + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu", + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts, + c->tcp.tcp_rto_max); }
/** diff --git a/tcp.h b/tcp.h index befedde..a238bb7 100644 --- a/tcp.h +++ b/tcp.h @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_rto_max: Maximal retry timeout (in s) * @tcp_syn_retries: SYN retries using exponential backoff timeout * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ @@ -67,6 +68,7 @@ struct tcp_ctx { struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + size_t tcp_rto_max; uint8_t tcp_syn_retries; uint8_t syn_linear_timeouts; }; -- 2.49.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
As reported, the carried over R-b was a minor mistake, since the code
has changed, but here's a new one:
Reviewed-by: David Gibson
--- tcp.c | 30 ++++++++++++------------------ 1 file changed, 12 insertions(+), 18 deletions(-)
diff --git a/tcp.c b/tcp.c index bada88a..96ee56a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,13 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. Retry for - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, - * reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times + * for established connections, or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +339,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) { - int exp = conn->retries - c->tcp.syn_linear_timeouts;
I didn't spot it in the previous patch, but this is (theoretically) buggy. conn->retries is unsigned, so the subtraction will be performed unsigned and only then cast to signed. I think that will probably do the right thing in practice, but I don't think that's guaranteed by the C standard (and might even be UB).
- it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0); - } - else - it.it_value.tv_sec = ACK_TIMEOUT; + int exp = conn->retries;
This change fixes it, by forcing the cast to a signed int before the subtraction. It also removes the minor style error I noted in the previous patch. Given that, I don't think we need to worry about either of them.
+ if (!(conn->events & ESTABLISHED)) + exp -= c->tcp.syn_linear_timeouts; + it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { -- 2.49.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, Nov 3, 2025 at 9:10 AM David Gibson
On Fri, Oct 31, 2025 at 01:42:40PM +0800, Yumei Huang wrote:
If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as Linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
There are a couple of tiny style nits and a mostly theoretical bug noted below.
--- tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 4 ++++ 2 files changed, 53 insertions(+), 8 deletions(-)
diff --git a/tcp.c b/tcp.c index 2ec4b0c..bada88a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
Nit: I'd maybe say SYN,ACK here to distinguish it from other ACKs.
+ * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. Retry for + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, + * reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ #define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" + +#define TCP_SYN_RETRIES_DEFAULT 6 +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 + /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + int exp = conn->retries - c->tcp.syn_linear_timeouts; + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0); + } else
A non-obvious detail of the style we use is that if one branch of an if uses { }, then the other branch should as well, even if it's one line. So above should be "} else {" and the bit below changed to match.
Oh, I didn't know about this style. Will update it later.
it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) {
Here's the theoretical bug. We only clamp tcp_syn_retries and syn_linear_timeouts to a uint8_t, so if the host has these set to strangely large values, this addition could overflow.
Just checked net/ipv4/sysctl_net_ipv4.c, static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT; And MAX_TCP_SYNCNT is defined as 127: #define MAX_TCP_SYNCNT 127 they have a max value as 127, so there won't be overflows.
+ flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++; + tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,26 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_get_rto_params() - Get host kernel RTO parameters + * @c: Execution context +*/ +void tcp_get_rto_params(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer( + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); + syn_linear_timeouts = read_file_integer( + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); + + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_get_rto_params(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..befedde 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: SYN retries using exponential backoff timeout + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
#endif /* TCP_H */ -- 2.49.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks, Yumei Huang
On Mon, Nov 3, 2025 at 9:38 AM David Gibson
On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson As reported, the carried over R-b was a minor mistake, since the code has changed, but here's a new one:
Reviewed-by: David Gibson
Small comment below, though.
--- tcp.c | 30 ++++++++++++------------------ 1 file changed, 12 insertions(+), 18 deletions(-)
diff --git a/tcp.c b/tcp.c index bada88a..96ee56a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,13 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. Retry for - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, - * reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times + * for established connections, or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +339,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) { - int exp = conn->retries - c->tcp.syn_linear_timeouts;
I didn't spot it in the previous patch, but this is (theoretically) buggy. conn->retries is unsigned, so the subtraction will be performed unsigned and only then cast to signed. I think that will probably do the right thing in practice, but I don't think that's guaranteed by the C standard (and might even be UB).
I'm not sure, but I just googled it. IIUC, the uint8_t (conn->retries and c->tcp.syn_linear_timeouts) will go through integer promotion before subtraction. So the line is like: int exp = (int) conn->retries - (int) c->tcp.syn_linear_timeouts; Please correct me if I'm wrong.
- it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0); - } - else - it.it_value.tv_sec = ACK_TIMEOUT; + int exp = conn->retries;
This change fixes it, by forcing the cast to a signed int before the subtraction. It also removes the minor style error I noted in the previous patch. Given that, I don't think we need to worry about either of them.
+ if (!(conn->events & ESTABLISHED)) + exp -= c->tcp.syn_linear_timeouts; + it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { -- 2.49.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks, Yumei Huang
On Fri, Oct 31, 2025 at 4:38 PM Stefano Brivio
On Fri, 31 Oct 2025 13:42:42 +0800 Yumei Huang
wrote: Clamp the TCP retry timeout as Linux kernel does. If RTO is less than 3 seconds, re-initialize it to 3 seconds for data retransmissions according to RFC 6298.
Suggested-by: Stefano Brivio
Signed-off-by: Yumei Huang --- tcp.c | 25 ++++++++++++++++++++----- tcp.h | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 96ee56a..84a6700 100644 --- a/tcp.c +++ b/tcp.c @@ -187,6 +187,9 @@ * for established connections, or (tcp_syn_retries + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
Nit: we started with BrE (British English) spelling and later tried to keep that consistent for the sake of grep, so we'll usually use "initialise" (for consistency, at this point).
Got it.
+ * data retransmissions. + * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset * the connection @@ -340,6 +343,7 @@ enum {
#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
#define TCP_SYN_RETRIES_DEFAULT 6 #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 +#define TCP_RTO_MAX_MS_DEFAULT 120000
/* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - int exp = conn->retries; + int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); + else + timeout = MAX(timeout, RTO_INIT_ACK); + timeout <<= MAX(exp, 0); + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void) */ void tcp_get_rto_params(struct ctx *c) { - intmax_t tcp_syn_retries, syn_linear_timeouts; + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
tcp_syn_retries = read_file_integer( TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer( TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); + tcp_rto_max_ms = read_file_integer( + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + c->tcp.tcp_rto_max = MIN( + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
size_t looks like a rather weird choice for tcp_rto_max: size_t is used to represent the size of objects, and nothing else (not a timeout in milliseconds).
It's also an int in ipv4_net_table[], net/ipv4/sysctl_net_ipv4.c (Linux kernel). Any reason for picking size_t here?
No particular reason. Just thought it can be used for counting as well. I can change it to int in v8.
I mean, it will work, it just looks weird (and wastes a tiny bit of space, even though I guess we don't care about 4 bytes there).
- debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); + debug("Read sysctl values tcp_syn_retries: %"PRIu8 + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu", + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts, + c->tcp.tcp_rto_max); }
/** diff --git a/tcp.h b/tcp.h index befedde..a238bb7 100644 --- a/tcp.h +++ b/tcp.h @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_rto_max: Maximal retry timeout (in s)
Nit: "maximal" has a slightly different meaning compared to "maximum".
The highest value allowed for a field would typically be called "maximum", while "maximal" is more commonly used to indicate a value / element that's the biggest of all values. Yes, I know, it's complicated.
Yeah, it's complicated. Actually I get the word from networking/ip-sysctl.rst. tcp_rto_max_ms - INTEGER Maximal TCP retransmission timeout (in ms). "maximum" might be more appropriate as you explained.
* @tcp_syn_retries: SYN retries using exponential backoff timeout * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ @@ -67,6 +68,7 @@ struct tcp_ctx { struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + size_t tcp_rto_max; uint8_t tcp_syn_retries; uint8_t syn_linear_timeouts; };
The rest of the series looks good to me at a *very* quick glance, but I can't claim I really reviewed it yet.
Thank you. Please take your time if you'd like to review them further.
-- Stefano
-- Thanks, Yumei Huang
On Mon, Nov 3, 2025 at 9:38 AM David Gibson
On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote:
Clamp the TCP retry timeout as Linux kernel does. If RTO is less than 3 seconds, re-initialize it to 3 seconds for data retransmissions according to RFC 6298.
Suggested-by: Stefano Brivio
Signed-off-by: Yumei Huang --- tcp.c | 25 ++++++++++++++++++++----- tcp.h | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 96ee56a..84a6700 100644 --- a/tcp.c +++ b/tcp.c @@ -187,6 +187,9 @@ * for established connections, or (tcp_syn_retries + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for + * data retransmissions. + * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset * the connection @@ -340,6 +343,7 @@ enum {
#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
#define TCP_SYN_RETRIES_DEFAULT 6 #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 +#define TCP_RTO_MAX_MS_DEFAULT 120000
/* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - int exp = conn->retries; + int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); + else + timeout = MAX(timeout, RTO_INIT_ACK);
Possibly I missed something, since I only skimmed your discussion of this behaviour with Stefano. But I'm not convinced this is a correct interpretation of the RFC. (5.7) says "If the timer expires awaiting the ACK of a SYN segment ...". That is, I think it's only suggesting increasing the RTO to 3 at the data stage *if* we had at least one retry during the handshake. That is, unfortunately, much fiddlier to implement, since we need to remember what happened during the handshake to apply it here.
Yes, you are right. Stefano thought it would be simpler than re-introducing separate starting values.
Additionally, if I'm reading the RFC correctly, it's treating this as a one-time adjustment of the RTO, which won't necessarily remain the case for the entire data phase. Here this minimum will apply for the entire data phase.
Even though it's a "MUST" in the RFC, I kind of think we could just skip this for two reasons:
I will leave the discussion to the two of you :)
1) We already don't bother with RTT measurements, which the RFC assumes the implementation is doing to adjust the RTO.
2) We expect to be talking to a guest. The chance of a high RTT is vanishingly small compared to a path to potentially any host on the 2011 internet.
+ timeout <<= MAX(exp, 0); + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void) */ void tcp_get_rto_params(struct ctx *c) { - intmax_t tcp_syn_retries, syn_linear_timeouts; + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
tcp_syn_retries = read_file_integer( TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer( TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); + tcp_rto_max_ms = read_file_integer( + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + c->tcp.tcp_rto_max = MIN( + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
- debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); + debug("Read sysctl values tcp_syn_retries: %"PRIu8 + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu", + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts, + c->tcp.tcp_rto_max); }
/** diff --git a/tcp.h b/tcp.h index befedde..a238bb7 100644 --- a/tcp.h +++ b/tcp.h @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_rto_max: Maximal retry timeout (in s) * @tcp_syn_retries: SYN retries using exponential backoff timeout * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ @@ -67,6 +68,7 @@ struct tcp_ctx { struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + size_t tcp_rto_max; uint8_t tcp_syn_retries; uint8_t syn_linear_timeouts; }; -- 2.49.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks, Yumei Huang
On Mon, Nov 03, 2025 at 10:31:21AM +0800, Yumei Huang wrote:
On Mon, Nov 3, 2025 at 9:10 AM David Gibson
wrote: On Fri, Oct 31, 2025 at 01:42:40PM +0800, Yumei Huang wrote:
If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as Linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
There are a couple of tiny style nits and a mostly theoretical bug noted below.
--- tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 4 ++++ 2 files changed, 53 insertions(+), 8 deletions(-)
diff --git a/tcp.c b/tcp.c index 2ec4b0c..bada88a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
Nit: I'd maybe say SYN,ACK here to distinguish it from other ACKs.
+ * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. Retry for + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, + * reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ #define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" + +#define TCP_SYN_RETRIES_DEFAULT 6 +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 + /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + int exp = conn->retries - c->tcp.syn_linear_timeouts; + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0); + } else
A non-obvious detail of the style we use is that if one branch of an if uses { }, then the other branch should as well, even if it's one line. So above should be "} else {" and the bit below changed to match.
Oh, I didn't know about this style. Will update it later.
Since it's gone in the next patch, it doesn't really matter.
it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) {
Here's the theoretical bug. We only clamp tcp_syn_retries and syn_linear_timeouts to a uint8_t, so if the host has these set to strangely large values, this addition could overflow.
Just checked net/ipv4/sysctl_net_ipv4.c,
static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
And MAX_TCP_SYNCNT is defined as 127:
#define MAX_TCP_SYNCNT 127
they have a max value as 127, so there won't be overflows.
Ok. That's not at all obvious from within the passt code, though. Again, this is really only a theoretical problem, but I think ideally we'd clamp to 127 where we read the values out of the kernel, with a comment saying that limit is derived from the kernel's own limit.
+ flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++; + tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,26 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_get_rto_params() - Get host kernel RTO parameters + * @c: Execution context +*/ +void tcp_get_rto_params(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer( + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); + syn_linear_timeouts = read_file_integer( + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); + + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_get_rto_params(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..befedde 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: SYN retries using exponential backoff timeout + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
#endif /* TCP_H */ -- 2.49.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks,
Yumei Huang
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, Nov 03, 2025 at 10:57:27AM +0800, Yumei Huang wrote:
On Mon, Nov 3, 2025 at 9:38 AM David Gibson
wrote: On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson As reported, the carried over R-b was a minor mistake, since the code has changed, but here's a new one:
Reviewed-by: David Gibson
Small comment below, though.
--- tcp.c | 30 ++++++++++++------------------ 1 file changed, 12 insertions(+), 18 deletions(-)
diff --git a/tcp.c b/tcp.c index bada88a..96ee56a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,13 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. Retry for - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, - * reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times + * for established connections, or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +339,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) { - int exp = conn->retries - c->tcp.syn_linear_timeouts;
I didn't spot it in the previous patch, but this is (theoretically) buggy. conn->retries is unsigned, so the subtraction will be performed unsigned and only then cast to signed. I think that will probably do the right thing in practice, but I don't think that's guaranteed by the C standard (and might even be UB).
I'm not sure, but I just googled it. IIUC, the uint8_t (conn->retries and c->tcp.syn_linear_timeouts) will go through integer promotion before subtraction. So the line is like:
int exp = (int) conn->retries - (int) c->tcp.syn_linear_timeouts;
Please correct me if I'm wrong.
Huh, I thought it would only be promoted if one of the operands was an int. But C promotion rules are really confusing, so I could well be wrong.
- it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0); - } - else - it.it_value.tv_sec = ACK_TIMEOUT; + int exp = conn->retries;
This change fixes it, by forcing the cast to a signed int before the subtraction. It also removes the minor style error I noted in the previous patch. Given that, I don't think we need to worry about either of them.
+ if (!(conn->events & ESTABLISHED)) + exp -= c->tcp.syn_linear_timeouts; + it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { -- 2.49.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks,
Yumei Huang
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, 3 Nov 2025 12:37:52 +1100
David Gibson
On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote:
Clamp the TCP retry timeout as Linux kernel does. If RTO is less than 3 seconds, re-initialize it to 3 seconds for data retransmissions according to RFC 6298.
Suggested-by: Stefano Brivio
Signed-off-by: Yumei Huang --- tcp.c | 25 ++++++++++++++++++++----- tcp.h | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 96ee56a..84a6700 100644 --- a/tcp.c +++ b/tcp.c @@ -187,6 +187,9 @@ * for established connections, or (tcp_syn_retries + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for + * data retransmissions. + * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset * the connection @@ -340,6 +343,7 @@ enum {
#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
#define TCP_SYN_RETRIES_DEFAULT 6 #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 +#define TCP_RTO_MAX_MS_DEFAULT 120000
/* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - int exp = conn->retries; + int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); + else + timeout = MAX(timeout, RTO_INIT_ACK);
Possibly I missed something, since I only skimmed your discussion of this behaviour with Stefano. But I'm not convinced this is a correct interpretation of the RFC. (5.7) says "If the timer expires awaiting the ACK of a SYN segment ...". That is, I think it's only suggesting increasing the RTO to 3 at the data stage *if* we had at least one retry during the handshake.
Oops, true, my bad. I guess I suggested the interpretation as of v7 because I just skimmed Appendix A of RFC 6298 whose main function is to justify the reasons behind lowering the initial timeout to one second, and I thought these reason simply don't apply to the established phase, so we use three seconds there. But that's clearly not the case, hence the "When this happens" clause in the middle of Appendix A.
That is, unfortunately, much fiddlier to implement, since we need to remember what happened during the handshake to apply it here.
Hmm, --- diff --git a/tcp.c b/tcp.c index c1eb5de..90c3ca1 100644 --- a/tcp.c +++ b/tcp.c @@ -397,7 +397,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = { static const char *tcp_flag_str[] __attribute((__unused__)) = { "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", - "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", + "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", "SYN_RETRIED", }; /* Listening sockets, used for automatic port forwarding in pasta mode only */ @@ -597,7 +597,7 @@ static void tcp_timer_ctl(struct tcp_tap_conn *conn) int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - else + else if (conn->flags & SYN_RETRIED) timeout = MAX(timeout, RTO_INIT_ACK); timeout <<= MAX(exp, 0); it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max); @@ -2446,6 +2446,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) flow_trace(conn, "SYN timeout, retry"); tcp_send_flag(c, conn, SYN); conn->retries++; + conn_flag(c, conn, SYN_RETRIED); tcp_timer_ctl(c, conn); } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { diff --git a/tcp_conn.h b/tcp_conn.h index c006d56..87f4a2d 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -77,6 +77,7 @@ struct tcp_tap_conn { #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4) #define ACK_FROM_TAP_BLOCKS BIT(5) +#define SYN_RETRIED BIT(6) #define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; --- ?
Additionally, if I'm reading the RFC correctly, it's treating this as a one-time adjustment of the RTO, which won't necessarily remain the case for the entire data phase. Here this minimum will apply for the entire data phase.
But it's the initial RTO (see Appendix A, which states it clearly), and any exponentiation is based on the initial value, so this should fit the requirement.
Even though it's a "MUST" in the RFC, I kind of think we could just skip this for two reasons:
1) We already don't bother with RTT measurements, which the RFC assumes the implementation is doing to adjust the RTO.
Kind of: (2.1) Until a round-trip time (RTT) measurement has been made for a segment sent between the sender and receiver, the sender SHOULD set RTO <- 1 second and given that this condition (no round-trip time measurement done yet) is explicitly considered, I guess we can reasonably expect TCP stacks we might be talking to to be fully compatible with what we're doing, as long as we stick to the RFC.
2) We expect to be talking to a guest. The chance of a high RTT is vanishingly small compared to a path to potentially any host on the 2011 internet.
...until we move the guest / container and we implement a TCP transport (somewhat overdue) in place of the existing tap / UNIX domain / vhost-user connection. The chance is still small and the consequences of ignoring this part of the RFC are, I'm fairly sure, negligible, but if it's that easy to implement, should we really depart from it? -- Stefano
On Mon, Nov 03, 2025 at 11:11:10AM +0800, Yumei Huang wrote:
On Fri, Oct 31, 2025 at 4:38 PM Stefano Brivio
wrote: On Fri, 31 Oct 2025 13:42:42 +0800 Yumei Huang
wrote: [snip]
diff --git a/tcp.h b/tcp.h index befedde..a238bb7 100644 --- a/tcp.h +++ b/tcp.h @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_rto_max: Maximal retry timeout (in s)
Nit: "maximal" has a slightly different meaning compared to "maximum".
The highest value allowed for a field would typically be called "maximum", while "maximal" is more commonly used to indicate a value / element that's the biggest of all values. Yes, I know, it's complicated.
Yeah, it's complicated. Actually I get the word from networking/ip-sysctl.rst.
tcp_rto_max_ms - INTEGER Maximal TCP retransmission timeout (in ms).
"maximum" might be more appropriate as you explained.
In mathematics, the difference is well defined. "maximal" means nothing else is bigger than it, "maximum" means everything else is smaller than it. Those are the same thing for a total order, but not for a partial order. Since integers have a total order, either would be correct here, according to mathematician's English. Of course, that only partly overlaps with everyday English usage. I think either would be fine, but "maximum" is probably slightly better. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, 3 Nov 2025 11:11:10 +0800
Yumei Huang
On Fri, Oct 31, 2025 at 4:38 PM Stefano Brivio
wrote: On Fri, 31 Oct 2025 13:42:42 +0800 Yumei Huang
wrote: Clamp the TCP retry timeout as Linux kernel does. If RTO is less than 3 seconds, re-initialize it to 3 seconds for data retransmissions according to RFC 6298.
Suggested-by: Stefano Brivio
Signed-off-by: Yumei Huang --- tcp.c | 25 ++++++++++++++++++++----- tcp.h | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 96ee56a..84a6700 100644 --- a/tcp.c +++ b/tcp.c @@ -187,6 +187,9 @@ * for established connections, or (tcp_syn_retries + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for
Nit: we started with BrE (British English) spelling and later tried to keep that consistent for the sake of grep, so we'll usually use "initialise" (for consistency, at this point).
Got it.
+ * data retransmissions. + * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset * the connection @@ -340,6 +343,7 @@ enum {
#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
#define TCP_SYN_RETRIES_DEFAULT 6 #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 +#define TCP_RTO_MAX_MS_DEFAULT 120000
/* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - int exp = conn->retries; + int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); + else + timeout = MAX(timeout, RTO_INIT_ACK); + timeout <<= MAX(exp, 0); + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void) */ void tcp_get_rto_params(struct ctx *c) { - intmax_t tcp_syn_retries, syn_linear_timeouts; + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
tcp_syn_retries = read_file_integer( TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer( TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); + tcp_rto_max_ms = read_file_integer( + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + c->tcp.tcp_rto_max = MIN( + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
size_t looks like a rather weird choice for tcp_rto_max: size_t is used to represent the size of objects, and nothing else (not a timeout in milliseconds).
It's also an int in ipv4_net_table[], net/ipv4/sysctl_net_ipv4.c (Linux kernel). Any reason for picking size_t here?
No particular reason. Just thought it can be used for counting as well. I can change it to int in v8.
I mean, it will work, it just looks weird (and wastes a tiny bit of space, even though I guess we don't care about 4 bytes there).
- debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); + debug("Read sysctl values tcp_syn_retries: %"PRIu8 + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu", + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts, + c->tcp.tcp_rto_max); }
/** diff --git a/tcp.h b/tcp.h index befedde..a238bb7 100644 --- a/tcp.h +++ b/tcp.h @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_rto_max: Maximal retry timeout (in s)
Nit: "maximal" has a slightly different meaning compared to "maximum".
The highest value allowed for a field would typically be called "maximum", while "maximal" is more commonly used to indicate a value / element that's the biggest of all values. Yes, I know, it's complicated.
Yeah, it's complicated. Actually I get the word from networking/ip-sysctl.rst.
tcp_rto_max_ms - INTEGER Maximal TCP retransmission timeout (in ms).
Oh, that's from 1280c26228bd ("tcp: add tcp_rto_max_ms sysctl), so probably from the French adjective "maximal" (in French, "maximum" is only a noun).
"maximum" might be more appropriate as you explained.
* @tcp_syn_retries: SYN retries using exponential backoff timeout * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ @@ -67,6 +68,7 @@ struct tcp_ctx { struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + size_t tcp_rto_max; uint8_t tcp_syn_retries; uint8_t syn_linear_timeouts; };
The rest of the series looks good to me at a *very* quick glance, but I can't claim I really reviewed it yet.
Thank you. Please take your time if you'd like to review them further.
Yes, I already started, didn't quite finish yet. -- Stefano
On Mon, 3 Nov 2025 20:32:14 +1100
David Gibson
On Mon, Nov 03, 2025 at 10:57:27AM +0800, Yumei Huang wrote:
On Mon, Nov 3, 2025 at 9:38 AM David Gibson
wrote: On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson As reported, the carried over R-b was a minor mistake, since the code has changed, but here's a new one:
Reviewed-by: David Gibson
Small comment below, though.
--- tcp.c | 30 ++++++++++++------------------ 1 file changed, 12 insertions(+), 18 deletions(-)
diff --git a/tcp.c b/tcp.c index bada88a..96ee56a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,13 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. Retry for - * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, - * reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. Retry for TCP_MAX_RETRIES times + * for established connections, or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +339,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) { - int exp = conn->retries - c->tcp.syn_linear_timeouts;
I didn't spot it in the previous patch, but this is (theoretically) buggy. conn->retries is unsigned, so the subtraction will be performed unsigned and only then cast to signed. I think that will probably do the right thing in practice, but I don't think that's guaranteed by the C standard (and might even be UB).
I'm not sure, but I just googled it. IIUC, the uint8_t (conn->retries and c->tcp.syn_linear_timeouts) will go through integer promotion before subtraction. So the line is like:
int exp = (int) conn->retries - (int) c->tcp.syn_linear_timeouts;
Please correct me if I'm wrong.
Huh, I thought it would only be promoted if one of the operands was an int.
I thought and I think so too, yes.
But C promotion rules are really confusing, so I could well be wrong.
Some references linked at:
https://archives.passt.top/passt-dev/20240103080834.24fa0a7a@elisabeth/
and here, it's about paragraph 6.3.1.8, "Usual arithmetic conversions"
of C11 (that's what passt uses right now).
No double or float here, so this is the relevant text:
Otherwise, the integer promotions are performed on both operands.
Then the following rules are applied to the promoted operands:
If both operands have the same type, then no further conversion
is needed
now, are they really the same type? The standard doesn't define uint8_t.
If we see it as an unsigned int, with storage limited to 8 bits, then I
would call them the same type.
If we see c->tcp.syn_linear_timeouts as a character type instead
(6.3.1.1, "Boolean, characters, and integers"), then the integer
conversion rank of unsigned int (conn->retries) is greater than the rank
of char, and the same text applies (promotion of char to unsigned int).
But I'm fairly sure that's not the case. We used uint8_t, not short,
and not char.
All in all I don't think there's any possibility of promotion to a
signed int: for that, you would need one of the operands to be signed.
Or two unsigned chars/shorts (see examples below).
If the operation is performed unsigned, and that's what gcc appears to
do here, promoting both operands to unsigned int (32-bit):
---
$ cat pro_uh_oh_motion.c
#include
On Fri, 31 Oct 2025 13:42:40 +0800
Yumei Huang
If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as Linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
--- tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 4 ++++ 2 files changed, 53 insertions(+), 8 deletions(-) diff --git a/tcp.c b/tcp.c index 2ec4b0c..bada88a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. Retry for + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, + * reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ #define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" + +#define TCP_SYN_RETRIES_DEFAULT 6 +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 + /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + int exp = conn->retries - c->tcp.syn_linear_timeouts; + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0); + } else it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) {
This: int max; max = c->tcp.syn_retries + c->tcp.syn_linear_timeouts; max = MIN(TCP_MAX_RETRIES, max); if (conn->retries >= max) { is one more line but a bit easier to understand. Not a strong preference on my side though.
+ flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++; + tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,26 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_get_rto_params() - Get host kernel RTO parameters + * @c: Execution context +*/ +void tcp_get_rto_params(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer( + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); + syn_linear_timeouts = read_file_integer( + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
I think this is a bit hard to read. Now: tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); would be a bit closer to the coding style we're adopting, but I wonder: - does read_file_integer() really need to have "integer" in the name? In a language where integers are called 'int', perhaps read_file_int() is clear enough? - the constants are defined here, in tcp.c, so they obviously refer to TCP, so maybe SYN_LINEAR_TIMEOUTS is clear enough - you don't really need to store both values in two different variables, one is enough as you're assigning them right away. And: v = read_file_int(SYN_RETRIES, SYN_RETRIES_DEFAULT); c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX); v = read_file_int(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX); is four lines instead of six, and more readable if you ask me.
+ + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_get_rto_params(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..befedde 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: SYN retries using exponential backoff timeout + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries;
Why 'tcp' again?
+ uint8_t syn_linear_timeouts; };
#endif /* TCP_H */
-- Stefano
On Mon, 3 Nov 2025 20:01:46 +1100
David Gibson
On Mon, Nov 03, 2025 at 10:31:21AM +0800, Yumei Huang wrote:
On Mon, Nov 3, 2025 at 9:10 AM David Gibson
wrote: On Fri, Oct 31, 2025 at 01:42:40PM +0800, Yumei Huang wrote:
If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as Linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
There are a couple of tiny style nits and a mostly theoretical bug noted below.
--- tcp.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 4 ++++ 2 files changed, 53 insertions(+), 8 deletions(-)
diff --git a/tcp.c b/tcp.c index 2ec4b0c..bada88a 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake
Nit: I'd maybe say SYN,ACK here to distinguish it from other ACKs.
+ * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. Retry for + * TCP_MAX_RETRIES or (tcp_syn_retries + tcp_syn_linear_timeouts) times, + * reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s, RFC 6928 */ #define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,12 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" + +#define TCP_SYN_RETRIES_DEFAULT 6 +#define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 + /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +589,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + int exp = conn->retries - c->tcp.syn_linear_timeouts; + it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(exp, 0); + } else
A non-obvious detail of the style we use is that if one branch of an if uses { }, then the other branch should as well, even if it's one line. So above should be "} else {" and the bit below changed to match.
Oh, I didn't know about this style. Will update it later.
Since it's gone in the next patch, it doesn't really matter.
it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) {
Here's the theoretical bug. We only clamp tcp_syn_retries and syn_linear_timeouts to a uint8_t, so if the host has these set to strangely large values, this addition could overflow.
Actually, I think it won't, because it's performed as 'unsigned int', see my other email on the subject. But in any case:
Just checked net/ipv4/sysctl_net_ipv4.c,
static int tcp_syn_retries_min = 1; static int tcp_syn_retries_max = MAX_TCP_SYNCNT; static int tcp_syn_linear_timeouts_max = MAX_TCP_SYNCNT;
And MAX_TCP_SYNCNT is defined as 127:
#define MAX_TCP_SYNCNT 127
...these limits come and go, so...
they have a max value as 127, so there won't be overflows.
Ok. That's not at all obvious from within the passt code, though. Again, this is really only a theoretical problem, but I think ideally we'd clamp to 127 where we read the values out of the kernel, with a comment saying that limit is derived from the kernel's own limit.
that would sound like a good idea anyway, instead of clamping to 255 as done later in this patch. -- Stefano
On Fri, 31 Oct 2025 13:42:42 +0800
Yumei Huang
Clamp the TCP retry timeout as Linux kernel does. If RTO is less than 3 seconds, re-initialize it to 3 seconds for data retransmissions according to RFC 6298.
Suggested-by: Stefano Brivio
Signed-off-by: Yumei Huang --- tcp.c | 25 ++++++++++++++++++++----- tcp.h | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 96ee56a..84a6700 100644 --- a/tcp.c +++ b/tcp.c @@ -187,6 +187,9 @@ * for established connections, or (tcp_syn_retries + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for + * data retransmissions. + * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset * the connection @@ -340,6 +343,7 @@ enum {
#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
Same as my previous comment on 3/5: you could skip "TCP_" in this name, and:
#define TCP_SYN_RETRIES_DEFAULT 6 #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 +#define TCP_RTO_MAX_MS_DEFAULT 120000
here too, and:
/* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - int exp = conn->retries; + int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); + else + timeout = MAX(timeout, RTO_INIT_ACK); + timeout <<= MAX(exp, 0); + it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max); } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { @@ -2785,18 +2794,24 @@ static socklen_t tcp_probe_tcp_info(void) */ void tcp_get_rto_params(struct ctx *c) { - intmax_t tcp_syn_retries, syn_linear_timeouts; + intmax_t tcp_syn_retries, syn_linear_timeouts, tcp_rto_max_ms;
tcp_syn_retries = read_file_integer( TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer( TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); + tcp_rto_max_ms = read_file_integer( + TCP_RTO_MAX_MS, TCP_RTO_MAX_MS_DEFAULT);
c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + c->tcp.tcp_rto_max = MIN( + DIV_ROUND_CLOSEST(tcp_rto_max_ms, 1000), SIZE_MAX);
- debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, - c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); + debug("Read sysctl values tcp_syn_retries: %"PRIu8 + ", linear_timeouts: %"PRIu8", tcp_rto_max: %zu", + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts, + c->tcp.tcp_rto_max); }
/** diff --git a/tcp.h b/tcp.h index befedde..a238bb7 100644 --- a/tcp.h +++ b/tcp.h @@ -59,6 +59,7 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_rto_max: Maximal retry timeout (in s) * @tcp_syn_retries: SYN retries using exponential backoff timeout * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ @@ -67,6 +68,7 @@ struct tcp_ctx { struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + size_t tcp_rto_max;
here too.
uint8_t tcp_syn_retries; uint8_t syn_linear_timeouts; };
I finally finished reviewing, minus pending comments the series looks good to me. -- Stefano
On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
On Fri, 31 Oct 2025 13:42:40 +0800 Yumei Huang
wrote: [snip] +/** + * tcp_get_rto_params() - Get host kernel RTO parameters + * @c: Execution context +*/ +void tcp_get_rto_params(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer( + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); + syn_linear_timeouts = read_file_integer( + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
I think this is a bit hard to read. Now:
tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
would be a bit closer to the coding style we're adopting, but I wonder:
- does read_file_integer() really need to have "integer" in the name?
In a language where integers are called 'int', perhaps read_file_int() is clear enough?
I think the idea is that read_file_integer() can be used for any (signed) integer type (with range checking performed after the call). read_file_int() might suggest it reads exactly an 'int', not anything bigger or smaller.
- the constants are defined here, in tcp.c, so they obviously refer to TCP, so maybe SYN_LINEAR_TIMEOUTS is clear enough
- you don't really need to store both values in two different variables, one is enough as you're assigning them right away. And:
v = read_file_int(SYN_RETRIES, SYN_RETRIES_DEFAULT); c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
v = read_file_int(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
is four lines instead of six, and more readable if you ask me.
+ + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_get_rto_params(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..befedde 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: SYN retries using exponential backoff timeout + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries;
Why 'tcp' again?
+ uint8_t syn_linear_timeouts; };
#endif /* TCP_H */
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Mon, Nov 03, 2025 at 11:38:57AM +0100, Stefano Brivio wrote:
On Mon, 3 Nov 2025 12:37:52 +1100 David Gibson
wrote: On Fri, Oct 31, 2025 at 01:42:42PM +0800, Yumei Huang wrote:
Clamp the TCP retry timeout as Linux kernel does. If RTO is less than 3 seconds, re-initialize it to 3 seconds for data retransmissions according to RFC 6298.
Suggested-by: Stefano Brivio
Signed-off-by: Yumei Huang --- tcp.c | 25 ++++++++++++++++++++----- tcp.h | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 96ee56a..84a6700 100644 --- a/tcp.c +++ b/tcp.c @@ -187,6 +187,9 @@ * for established connections, or (tcp_syn_retries + * tcp_syn_linear_timeouts) times during the handshake, reset the connection * + * - RTO_INIT_ACK: if the RTO is less than this, re-initialize RTO to this for + * data retransmissions. + * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset * the connection @@ -340,6 +343,7 @@ enum {
#define ACK_INTERVAL 10 /* ms */ #define RTO_INIT 1 /* s, RFC 6298 */ +#define RTO_INIT_ACK 3 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -365,9 +369,11 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" #define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" +#define TCP_RTO_MAX_MS "/proc/sys/net/ipv4/tcp_rto_max_ms"
#define TCP_SYN_RETRIES_DEFAULT 6 #define TCP_SYN_LINEAR_TIMEOUTS_DEFAULT 4 +#define TCP_RTO_MAX_MS_DEFAULT 120000
/* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX]; @@ -585,10 +591,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - int exp = conn->retries; + int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); + else + timeout = MAX(timeout, RTO_INIT_ACK);
Possibly I missed something, since I only skimmed your discussion of this behaviour with Stefano. But I'm not convinced this is a correct interpretation of the RFC. (5.7) says "If the timer expires awaiting the ACK of a SYN segment ...". That is, I think it's only suggesting increasing the RTO to 3 at the data stage *if* we had at least one retry during the handshake.
Oops, true, my bad.
I guess I suggested the interpretation as of v7 because I just skimmed Appendix A of RFC 6298 whose main function is to justify the reasons behind lowering the initial timeout to one second, and I thought these reason simply don't apply to the established phase, so we use three seconds there.
But that's clearly not the case, hence the "When this happens" clause in the middle of Appendix A.
That is, unfortunately, much fiddlier to implement, since we need to remember what happened during the handshake to apply it here.
Hmm,
--- diff --git a/tcp.c b/tcp.c index c1eb5de..90c3ca1 100644 --- a/tcp.c +++ b/tcp.c @@ -397,7 +397,7 @@ static const char *tcp_state_str[] __attribute((__unused__)) = {
static const char *tcp_flag_str[] __attribute((__unused__)) = { "STALLED", "LOCAL", "ACTIVE_CLOSE", "ACK_TO_TAP_DUE", - "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", + "ACK_FROM_TAP_DUE", "ACK_FROM_TAP_BLOCKS", "SYN_RETRIED", };
/* Listening sockets, used for automatic port forwarding in pasta mode only */ @@ -597,7 +597,7 @@ static void tcp_timer_ctl(struct tcp_tap_conn *conn) int exp = conn->retries, timeout = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; - else + else if (conn->flags & SYN_RETRIED) timeout = MAX(timeout, RTO_INIT_ACK); timeout <<= MAX(exp, 0); it.it_value.tv_sec = MIN(timeout, c->tcp.tcp_rto_max); @@ -2446,6 +2446,7 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) flow_trace(conn, "SYN timeout, retry"); tcp_send_flag(c, conn, SYN); conn->retries++; + conn_flag(c, conn, SYN_RETRIED); tcp_timer_ctl(c, conn); } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { diff --git a/tcp_conn.h b/tcp_conn.h index c006d56..87f4a2d 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -77,6 +77,7 @@ struct tcp_tap_conn { #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4) #define ACK_FROM_TAP_BLOCKS BIT(5) +#define SYN_RETRIED BIT(6)
#define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; ---
?
Ok, not as fiddly as I feared. \o/
Additionally, if I'm reading the RFC correctly, it's treating this as a one-time adjustment of the RTO, which won't necessarily remain the case for the entire data phase. Here this minimum will apply for the entire data phase.
But it's the initial RTO (see Appendix A, which states it clearly), and any exponentiation is based on the initial value, so this should fit the requirement.
Even though it's a "MUST" in the RFC, I kind of think we could just skip this for two reasons:
1) We already don't bother with RTT measurements, which the RFC assumes the implementation is doing to adjust the RTO.
Kind of:
(2.1) Until a round-trip time (RTT) measurement has been made for a segment sent between the sender and receiver, the sender SHOULD set RTO <- 1 second
and given that this condition (no round-trip time measurement done yet) is explicitly considered, I guess we can reasonably expect TCP stacks we might be talking to to be fully compatible with what we're doing, as long as we stick to the RFC.
Good points, you convinced me.
2) We expect to be talking to a guest. The chance of a high RTT is vanishingly small compared to a path to potentially any host on the 2011 internet.
...until we move the guest / container and we implement a TCP transport (somewhat overdue) in place of the existing tap / UNIX domain / vhost-user connection.
Right. Or even right now, if the guest then NATs the connection onto a slow VPN link.
The chance is still small and the consequences of ignoring this part of the RFC are, I'm fairly sure, negligible, but if it's that easy to implement, should we really depart from it?
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Tue, Nov 04, 2025 at 05:42:25AM +0100, Stefano Brivio wrote:
On Mon, 3 Nov 2025 20:32:14 +1100 David Gibson
wrote: On Mon, Nov 03, 2025 at 10:57:27AM +0800, Yumei Huang wrote:
On Mon, Nov 3, 2025 at 9:38 AM David Gibson
wrote: On Fri, Oct 31, 2025 at 01:42:41PM +0800, Yumei Huang wrote:
[snip]
@@ -589,12 +585,10 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) { - int exp = conn->retries - c->tcp.syn_linear_timeouts;
I didn't spot it in the previous patch, but this is (theoretically) buggy. conn->retries is unsigned, so the subtraction will be performed unsigned and only then cast to signed. I think that will probably do the right thing in practice, but I don't think that's guaranteed by the C standard (and might even be UB).
I'm not sure, but I just googled it. IIUC, the uint8_t (conn->retries and c->tcp.syn_linear_timeouts) will go through integer promotion before subtraction. So the line is like:
int exp = (int) conn->retries - (int) c->tcp.syn_linear_timeouts;
Please correct me if I'm wrong.
Huh, I thought it would only be promoted if one of the operands was an int.
I thought and I think so too, yes.
But C promotion rules are really confusing, so I could well be wrong.
Some references linked at:
https://archives.passt.top/passt-dev/20240103080834.24fa0a7a@elisabeth/
and here, it's about paragraph 6.3.1.8, "Usual arithmetic conversions" of C11 (that's what passt uses right now).
No double or float here, so this is the relevant text:
Otherwise, the integer promotions are performed on both operands. Then the following rules are applied to the promoted operands:
If both operands have the same type, then no further conversion is needed
now, are they really the same type? The standard doesn't define uint8_t. If we see it as an unsigned int, with storage limited to 8 bits, then I would call them the same type.
So, I was a bit confused in the first place - I had thought both operands were literally uint8_t typed. But that's not the case, 'retries' is an unsigned int that's bit limited.
If we see c->tcp.syn_linear_timeouts as a character type instead (6.3.1.1, "Boolean, characters, and integers"), then the integer conversion rank of unsigned int (conn->retries) is greater than the rank of char, and the same text applies (promotion of char to unsigned int).
But I'm fairly sure that's not the case. We used uint8_t, not short, and not char.
All in all I don't think there's any possibility of promotion to a signed int: for that, you would need one of the operands to be signed. Or two unsigned chars/shorts (see examples below).
Right. I'm fairly confident the arguments will be promoted to unsigned int, not signed.
If the operation is performed unsigned, and that's what gcc appears to do here, promoting both operands to unsigned int (32-bit):
--- $ cat pro_uh_oh_motion.c #include
#include int main() { uint8_t a = 0; struct { unsigned b:3; } x = { 7 };
printf("%u %i\n", a - x.b, a - x.b); } $ gcc -o pro_uh_oh_motion{,.c} $ ./pro_uh_oh_motion 4294967289 -7 ---
...do we really have a problem with that? To me it looks like the behaviour is defined, and it's what we want.
So, this is a second step, where we cast the result of the subtraction into a signed int of the same size. I'm not sure if that behaviour is defined for values outside the positive range the signed type can represent. In practice the obvious result on a two's complement machine will do what we need, but I'm not sure that C guarantees that.
These pages:
https://wiki.sei.cmu.edu/confluence/display/c/INT02-C.+Understand+integer+co... https://wiki.sei.cmu.edu/confluence/display/c/EXP14-C.+Beware+of+integer+pro...
show some problematic examples. This one is close to our case:
unsigned short x = 45000, y = 50000; unsigned int z = x * y;
but there we have implicit promotion of the unsigned shorts to int, which doesn't happen in our case, because we already have an unsigned int.
Also multiplication which is a whole other thing.
See also the examples with two unsigned shorts from:
https://cryptoservices.github.io/fde/2018/11/30/undefined-behavior.html
and there the operation would be performed as signed int.
On top of that, you have the fact that the original types can't store the result of the operation. Here it's not the case.
They can't though - the result is negative which can't be stored in an unsigned type. Now, wraparound / mod 2^n behaviour *is* defined for unsigned (but not signed) integer types. The second part where we assign an unsigned "negative" value into a signed variable might not be. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, 5 Nov 2025 15:24:44 +1100
David Gibson
On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
On Fri, 31 Oct 2025 13:42:40 +0800 Yumei Huang
wrote: [snip] +/** + * tcp_get_rto_params() - Get host kernel RTO parameters + * @c: Execution context +*/ +void tcp_get_rto_params(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer( + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); + syn_linear_timeouts = read_file_integer( + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
I think this is a bit hard to read. Now:
tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
would be a bit closer to the coding style we're adopting, but I wonder:
- does read_file_integer() really need to have "integer" in the name?
In a language where integers are called 'int', perhaps read_file_int() is clear enough?
I think the idea is that read_file_integer() can be used for any (signed) integer type (with range checking performed after the call). read_file_int() might suggest it reads exactly an 'int', not anything bigger or smaller.
Oh, I see. It could be read_file_num() then, even if it's slightly less accurate, or it can even remain read_file_integer(). If something like this: v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT); c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX); v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX); fits 80 columns, I'm taking it as a sign that the function name isn't exceedingly long. No particular preference from me.
- the constants are defined here, in tcp.c, so they obviously refer to TCP, so maybe SYN_LINEAR_TIMEOUTS is clear enough
- you don't really need to store both values in two different variables, one is enough as you're assigning them right away. And:
v = read_file_int(SYN_RETRIES, SYN_RETRIES_DEFAULT); c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
v = read_file_int(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
is four lines instead of six, and more readable if you ask me.
+ + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("Read sysctl values tcp_syn_retries: %"PRIu8", linear_timeouts: %"PRIu8, + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2815,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_get_rto_params(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..befedde 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,16 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: SYN retries using exponential backoff timeout + * @syn_linear_timeouts: SYN retries before using exponential backoff timeout */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries;
Why 'tcp' again?
+ uint8_t syn_linear_timeouts; };
#endif /* TCP_H */
-- Stefano
On Wed, Nov 5, 2025 at 3:01 PM Stefano Brivio
On Wed, 5 Nov 2025 15:24:44 +1100 David Gibson
wrote: On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
On Fri, 31 Oct 2025 13:42:40 +0800 Yumei Huang
wrote: [snip] +/** + * tcp_get_rto_params() - Get host kernel RTO parameters + * @c: Execution context +*/ +void tcp_get_rto_params(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer( + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); + syn_linear_timeouts = read_file_integer( + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
I think this is a bit hard to read. Now:
tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
would be a bit closer to the coding style we're adopting, but I wonder:
- does read_file_integer() really need to have "integer" in the name?
In a language where integers are called 'int', perhaps read_file_int() is clear enough?
I think the idea is that read_file_integer() can be used for any (signed) integer type (with range checking performed after the call). read_file_int() might suggest it reads exactly an 'int', not anything bigger or smaller.
Oh, I see. It could be read_file_num() then, even if it's slightly less accurate, or it can even remain read_file_integer(). If something like this:
v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT); c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
fits 80 columns, I'm taking it as a sign that the function name isn't exceedingly long. No particular preference from me.
The longest line just exceeds 80 columns. I can rename the function to read_file_num() if there is no objection from David. Another thing, while I was revising the patches, I noticed the parameter "const struct ctx *c" was removed from tcp_timer_ctl() by commit dd5302dd7bf51 ("tcp, flow: Replace per-connection in_epoll flag with an epollid in flow_common") from Laurent, but we need it to access c->tcp.syn_retries and c->tcp.syn_linear_timeouts. Should we add it back? (Adding Laurent in the loop.) -- Thanks, Yumei Huang
On Fri, 7 Nov 2025 17:56:54 +0800
Yumei Huang
On Wed, Nov 5, 2025 at 3:01 PM Stefano Brivio
wrote: On Wed, 5 Nov 2025 15:24:44 +1100 David Gibson
wrote: On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
On Fri, 31 Oct 2025 13:42:40 +0800 Yumei Huang
wrote: [snip] +/** + * tcp_get_rto_params() - Get host kernel RTO parameters + * @c: Execution context +*/ +void tcp_get_rto_params(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer( + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); + syn_linear_timeouts = read_file_integer( + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
I think this is a bit hard to read. Now:
tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
would be a bit closer to the coding style we're adopting, but I wonder:
- does read_file_integer() really need to have "integer" in the name?
In a language where integers are called 'int', perhaps read_file_int() is clear enough?
I think the idea is that read_file_integer() can be used for any (signed) integer type (with range checking performed after the call). read_file_int() might suggest it reads exactly an 'int', not anything bigger or smaller.
Oh, I see. It could be read_file_num() then, even if it's slightly less accurate, or it can even remain read_file_integer(). If something like this:
v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT); c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
fits 80 columns, I'm taking it as a sign that the function name isn't exceedingly long. No particular preference from me.
The longest line just exceeds 80 columns.
No, why? It's 80 columns, look: v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); 01234567890123456789012345678901234567890123456789012345678901234567890123456789
I can rename the function to read_file_num() if there is no objection from David.
Sure, up to you and David.
Another thing, while I was revising the patches, I noticed the parameter "const struct ctx *c" was removed from tcp_timer_ctl() by commit dd5302dd7bf51 ("tcp, flow: Replace per-connection in_epoll flag with an epollid in flow_common") from Laurent, but we need it to access c->tcp.syn_retries and c->tcp.syn_linear_timeouts. Should we add it back?
Yes, I don't see any other way around it, and I think I'll need it back anyway for some fixes I'm working on. It's one type of trivial conflict I would typically fix up on merge, by the way, but given that you will send another version, you could/should rebase on latest HEAD anyway.
(Adding Laurent in the loop.)
-- Stefano
On Fri, Nov 7, 2025 at 6:05 PM Stefano Brivio
On Fri, 7 Nov 2025 17:56:54 +0800 Yumei Huang
wrote: On Wed, Nov 5, 2025 at 3:01 PM Stefano Brivio
wrote: On Wed, 5 Nov 2025 15:24:44 +1100 David Gibson
wrote: On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
On Fri, 31 Oct 2025 13:42:40 +0800 Yumei Huang
wrote: [snip] +/** + * tcp_get_rto_params() - Get host kernel RTO parameters + * @c: Execution context +*/ +void tcp_get_rto_params(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer( + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); + syn_linear_timeouts = read_file_integer( + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
I think this is a bit hard to read. Now:
tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
would be a bit closer to the coding style we're adopting, but I wonder:
- does read_file_integer() really need to have "integer" in the name?
In a language where integers are called 'int', perhaps read_file_int() is clear enough?
I think the idea is that read_file_integer() can be used for any (signed) integer type (with range checking performed after the call). read_file_int() might suggest it reads exactly an 'int', not anything bigger or smaller.
Oh, I see. It could be read_file_num() then, even if it's slightly less accurate, or it can even remain read_file_integer(). If something like this:
v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT); c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
fits 80 columns, I'm taking it as a sign that the function name isn't exceedingly long. No particular preference from me.
The longest line just exceeds 80 columns.
No, why? It's 80 columns, look:
v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); 01234567890123456789012345678901234567890123456789012345678901234567890123456789
You are right. Somehow I counted 81 columns. Then I will keep the name as it is. (To me, read_file_num() sounds like it can read any kind of number.)
I can rename the function to read_file_num() if there is no objection from David.
Sure, up to you and David.
Another thing, while I was revising the patches, I noticed the parameter "const struct ctx *c" was removed from tcp_timer_ctl() by commit dd5302dd7bf51 ("tcp, flow: Replace per-connection in_epoll flag with an epollid in flow_common") from Laurent, but we need it to access c->tcp.syn_retries and c->tcp.syn_linear_timeouts. Should we add it back?
Yes, I don't see any other way around it, and I think I'll need it back anyway for some fixes I'm working on.
It's one type of trivial conflict I would typically fix up on merge, by the way, but given that you will send another version, you could/should rebase on latest HEAD anyway.
Sure.
(Adding Laurent in the loop.)
-- Stefano
-- Thanks, Yumei Huang
On Mon, Nov 10, 2025 at 10:52:05AM +0800, Yumei Huang wrote:
On Fri, Nov 7, 2025 at 6:05 PM Stefano Brivio
wrote: On Fri, 7 Nov 2025 17:56:54 +0800 Yumei Huang
wrote: On Wed, Nov 5, 2025 at 3:01 PM Stefano Brivio
wrote: On Wed, 5 Nov 2025 15:24:44 +1100 David Gibson
wrote: On Tue, Nov 04, 2025 at 05:42:33AM +0100, Stefano Brivio wrote:
On Fri, 31 Oct 2025 13:42:40 +0800 Yumei Huang
wrote: [snip] > +/** > + * tcp_get_rto_params() - Get host kernel RTO parameters > + * @c: Execution context > +*/ > +void tcp_get_rto_params(struct ctx *c) > +{ > + intmax_t tcp_syn_retries, syn_linear_timeouts; > + > + tcp_syn_retries = read_file_integer( > + TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); > + syn_linear_timeouts = read_file_integer( > + TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT); I think this is a bit hard to read. Now:
tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, TCP_SYN_RETRIES_DEFAULT); syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, TCP_SYN_LINEAR_TIMEOUTS_DEFAULT);
would be a bit closer to the coding style we're adopting, but I wonder:
- does read_file_integer() really need to have "integer" in the name?
In a language where integers are called 'int', perhaps read_file_int() is clear enough?
I think the idea is that read_file_integer() can be used for any (signed) integer type (with range checking performed after the call). read_file_int() might suggest it reads exactly an 'int', not anything bigger or smaller.
Oh, I see. It could be read_file_num() then, even if it's slightly less accurate, or it can even remain read_file_integer(). If something like this:
v = read_file_integer(SYN_RETRIES, SYN_RETRIES_DEFAULT); c->tcp.tcp_syn_retries = MIN(v, UINT8_MAX);
v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); c->tcp.syn_linear_timeouts = MIN(v, UINT8_MAX);
fits 80 columns, I'm taking it as a sign that the function name isn't exceedingly long. No particular preference from me.
The longest line just exceeds 80 columns.
No, why? It's 80 columns, look:
v = read_file_integer(SYN_LINEAR_TIMEOUTS, SYN_LINEAR_TIMEOUTS_DEFAULT); 01234567890123456789012345678901234567890123456789012345678901234567890123456789
You are right. Somehow I counted 81 columns. Then I will keep the name as it is. (To me, read_file_num() sounds like it can read any kind of number.)
I also slightly prefer read_file_integer(). -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (3)
-
David Gibson
-
Stefano Brivio
-
Yumei Huang