[PATCH v6 0/4] Retry SYNs for inbound connections
When a client connects, SYN would be sent to guest only once. If the guest is not connected or ready at that time, the connection will be reset in 10s. These patches introduce the SYN retry mechanism using the similar backoff timeout as linux kernel. Also update the data retransmission timeout using the backoff timeout. Yumei Huang (4): tcp: Rename "retrans" to "retries" util: Introduce read_file() and read_file_integer() function tcp: Resend SYN for inbound connections tcp: Update data retransmission timeout tcp.c | 76 +++++++++++++++++++++++++++++++++++------------- tcp.h | 5 ++++ tcp_conn.h | 12 ++++---- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 5 files changed, 159 insertions(+), 26 deletions(-) -- 2.47.0
Rename "retrans" to "retries" so it can be used for SYN retries.
Signed-off-by: Yumei Huang
Signed-off-by: Yumei Huang
If a client connects while guest is not connected or ready yet,
resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153
Signed-off-by: Yumei Huang
Use an exponential backoff timeout for data retransmission according
to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed
in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
On Fri, 17 Oct 2025 14:28:36 +0800
Yumei Huang
Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read + * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation +*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc; + } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path); + return -2; + } + + buf[total_read] = '\0'; + + return total_read; +} + +/** + * read_file_integer() - Read an integer value from a file + * @path: File to read + * @fallback: Default value if file can't be read + * + * Return: Integer value, fallback on failure +*/ +intmax_t read_file_integer(const char *path, intmax_t fallback) +{ + char buf[INTMAX_STRLEN]; + ssize_t bytes_read; + intmax_t value; + char *end; + + bytes_read = read_file(path, buf, sizeof(buf)); + + if (bytes_read < 0) + return fallback; + + if (bytes_read == 0) { + debug("Empty file %s", path); + return fallback; + } + + errno = 0; + value = strtoimax(buf, &end, 10); + if (*end && *end != '\n') { + debug("Invalid format in %s", path); + return fallback; + } + if (errno) { + debug("Invalid value in %s: %s", path, buf); + return fallback; + } + + return value; +} + #ifdef __ia64__ /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(), * use the description from clone(2). diff --git a/util.h b/util.h index 22eaac5..3f9f296 100644 --- a/util.h +++ b/util.h @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid); int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); +ssize_t read_file(const char *path, char *buf, size_t buf_size); +intmax_t read_file_integer(const char *path, intmax_t fallback); int write_all_buf(int fd, const void *buf, size_t len); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); int read_all_buf(int fd, void *buf, size_t len); @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
#define UINT16_STRLEN (sizeof("65535"))
+/* Each byte expands to at most 3 decimal digits since 0xff == 255. + * Plus 2 extra bytes for the sign and null terminator. + * See https://stackoverflow.com/a/10536254.
This is not an acceptable form of attribution according to the CC BY-SA 3.0 terms. See: https://stackoverflow.com/help/licensing https://creativecommons.org/licenses/by-sa/3.0/ and checksum.h in this tree for some examples of how to combine different licensing terms in a single file, in a way that's human-readable but still machine-friendly (for license / compliance scanners such as REUSE). As I commented on a previous version, anyway, I don't think we need this at all. I guess my comment was ignored though. Note that I didn't review this patch or the rest of this series, yet. -- Stefano
On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio
On Fri, 17 Oct 2025 14:28:36 +0800 Yumei Huang
wrote: Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read + * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation +*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc; + } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path); + return -2; + } + + buf[total_read] = '\0'; + + return total_read; +} + +/** + * read_file_integer() - Read an integer value from a file + * @path: File to read + * @fallback: Default value if file can't be read + * + * Return: Integer value, fallback on failure +*/ +intmax_t read_file_integer(const char *path, intmax_t fallback) +{ + char buf[INTMAX_STRLEN]; + ssize_t bytes_read; + intmax_t value; + char *end; + + bytes_read = read_file(path, buf, sizeof(buf)); + + if (bytes_read < 0) + return fallback; + + if (bytes_read == 0) { + debug("Empty file %s", path); + return fallback; + } + + errno = 0; + value = strtoimax(buf, &end, 10); + if (*end && *end != '\n') { + debug("Invalid format in %s", path); + return fallback; + } + if (errno) { + debug("Invalid value in %s: %s", path, buf); + return fallback; + } + + return value; +} + #ifdef __ia64__ /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(), * use the description from clone(2). diff --git a/util.h b/util.h index 22eaac5..3f9f296 100644 --- a/util.h +++ b/util.h @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid); int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); +ssize_t read_file(const char *path, char *buf, size_t buf_size); +intmax_t read_file_integer(const char *path, intmax_t fallback); int write_all_buf(int fd, const void *buf, size_t len); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); int read_all_buf(int fd, void *buf, size_t len); @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
#define UINT16_STRLEN (sizeof("65535"))
+/* Each byte expands to at most 3 decimal digits since 0xff == 255. + * Plus 2 extra bytes for the sign and null terminator. + * See https://stackoverflow.com/a/10536254.
This is not an acceptable form of attribution according to the CC BY-SA 3.0 terms. See:
https://stackoverflow.com/help/licensing https://creativecommons.org/licenses/by-sa/3.0/
and checksum.h in this tree for some examples of how to combine different licensing terms in a single file, in a way that's human-readable but still machine-friendly (for license / compliance scanners such as REUSE).
As I commented on a previous version, anyway, I don't think we need this at all. I guess my comment was ignored though.
I guess you meant the comment of suggesting using BUFSIZ in V2? David replied as quote: "We could use BUFSIZ, but it's massive overkill for reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if intmax_t were 128-bit)." It's my fault. I should have waited to see if you have more comments before sending the new version.
Note that I didn't review this patch or the rest of this series, yet.
-- Stefano
-- Thanks, Yumei Huang
On Tue, 21 Oct 2025 17:32:58 +0800
Yumei Huang
On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:36 +0800 Yumei Huang
wrote: Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read + * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation +*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc; + } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path); + return -2; + } + + buf[total_read] = '\0'; + + return total_read; +} + +/** + * read_file_integer() - Read an integer value from a file + * @path: File to read + * @fallback: Default value if file can't be read + * + * Return: Integer value, fallback on failure +*/ +intmax_t read_file_integer(const char *path, intmax_t fallback) +{ + char buf[INTMAX_STRLEN]; + ssize_t bytes_read; + intmax_t value; + char *end; + + bytes_read = read_file(path, buf, sizeof(buf)); + + if (bytes_read < 0) + return fallback; + + if (bytes_read == 0) { + debug("Empty file %s", path); + return fallback; + } + + errno = 0; + value = strtoimax(buf, &end, 10); + if (*end && *end != '\n') { + debug("Invalid format in %s", path); + return fallback; + } + if (errno) { + debug("Invalid value in %s: %s", path, buf); + return fallback; + } + + return value; +} + #ifdef __ia64__ /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(), * use the description from clone(2). diff --git a/util.h b/util.h index 22eaac5..3f9f296 100644 --- a/util.h +++ b/util.h @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid); int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); +ssize_t read_file(const char *path, char *buf, size_t buf_size); +intmax_t read_file_integer(const char *path, intmax_t fallback); int write_all_buf(int fd, const void *buf, size_t len); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); int read_all_buf(int fd, void *buf, size_t len); @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
#define UINT16_STRLEN (sizeof("65535"))
+/* Each byte expands to at most 3 decimal digits since 0xff == 255. + * Plus 2 extra bytes for the sign and null terminator. + * See https://stackoverflow.com/a/10536254.
This is not an acceptable form of attribution according to the CC BY-SA 3.0 terms. See:
https://stackoverflow.com/help/licensing https://creativecommons.org/licenses/by-sa/3.0/
and checksum.h in this tree for some examples of how to combine different licensing terms in a single file, in a way that's human-readable but still machine-friendly (for license / compliance scanners such as REUSE).
As I commented on a previous version, anyway, I don't think we need this at all. I guess my comment was ignored though.
I guess you meant the comment of suggesting using BUFSIZ in V2?
Right, but on v4, and that was just Friday for everybody involved...
David replied as quote:
"We could use BUFSIZ, but it's massive overkill for reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if intmax_t were 128-bit)."
I wanted to reply to that because sure, BUFSIZ is typically 8192 bytes on glibc and 1024 with musl, but adding 10 or 8192 to the stack pointer doesn't really make a difference. It's not like we allocate that memory anyway, and I don't think any of that memory (or unused holes on the stack we create) is prefetched. And regardless of all that... we don't use these functions on any data path, it's just during configuration. It can be (relatively) slow.
It's my fault. I should have waited to see if you have more comments before sending the new version.
Thanks for understanding. My biggest problem with this is: where do I comment now? On v4 or v6? Should I even read v5? Well, anyway, I'll directly review v6 (in a bit) to keep things sane. -- Stefano
On Wed, Oct 22, 2025 at 5:51 AM Stefano Brivio
On Tue, 21 Oct 2025 17:32:58 +0800 Yumei Huang
wrote: On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:36 +0800 Yumei Huang
wrote: Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read + * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation +*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc; + } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path); + return -2; + } + + buf[total_read] = '\0'; + + return total_read; +} + +/** + * read_file_integer() - Read an integer value from a file + * @path: File to read + * @fallback: Default value if file can't be read + * + * Return: Integer value, fallback on failure +*/ +intmax_t read_file_integer(const char *path, intmax_t fallback) +{ + char buf[INTMAX_STRLEN]; + ssize_t bytes_read; + intmax_t value; + char *end; + + bytes_read = read_file(path, buf, sizeof(buf)); + + if (bytes_read < 0) + return fallback; + + if (bytes_read == 0) { + debug("Empty file %s", path); + return fallback; + } + + errno = 0; + value = strtoimax(buf, &end, 10); + if (*end && *end != '\n') { + debug("Invalid format in %s", path); + return fallback; + } + if (errno) { + debug("Invalid value in %s: %s", path, buf); + return fallback; + } + + return value; +} + #ifdef __ia64__ /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(), * use the description from clone(2). diff --git a/util.h b/util.h index 22eaac5..3f9f296 100644 --- a/util.h +++ b/util.h @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid); int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); +ssize_t read_file(const char *path, char *buf, size_t buf_size); +intmax_t read_file_integer(const char *path, intmax_t fallback); int write_all_buf(int fd, const void *buf, size_t len); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); int read_all_buf(int fd, void *buf, size_t len); @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
#define UINT16_STRLEN (sizeof("65535"))
+/* Each byte expands to at most 3 decimal digits since 0xff == 255. + * Plus 2 extra bytes for the sign and null terminator. + * See https://stackoverflow.com/a/10536254.
This is not an acceptable form of attribution according to the CC BY-SA 3.0 terms. See:
https://stackoverflow.com/help/licensing https://creativecommons.org/licenses/by-sa/3.0/
and checksum.h in this tree for some examples of how to combine different licensing terms in a single file, in a way that's human-readable but still machine-friendly (for license / compliance scanners such as REUSE).
As I commented on a previous version, anyway, I don't think we need this at all. I guess my comment was ignored though.
I guess you meant the comment of suggesting using BUFSIZ in V2?
Right, but on v4, and that was just Friday for everybody involved...
David replied as quote:
"We could use BUFSIZ, but it's massive overkill for reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if intmax_t were 128-bit)."
I wanted to reply to that because sure, BUFSIZ is typically 8192 bytes on glibc and 1024 with musl, but adding 10 or 8192 to the stack pointer doesn't really make a difference.
It's not like we allocate that memory anyway, and I don't think any of that memory (or unused holes on the stack we create) is prefetched. And regardless of all that... we don't use these functions on any data path, it's just during configuration. It can be (relatively) slow.
It's my fault. I should have waited to see if you have more comments before sending the new version.
Thanks for understanding. My biggest problem with this is: where do I comment now? On v4 or v6? Should I even read v5?
Well, anyway, I'll directly review v6 (in a bit) to keep things sane.
Sure, please go with v6.
-- Stefano
-- Thanks, Yumei Huang
On Tue, Oct 21, 2025 at 11:50:59PM +0200, Stefano Brivio wrote:
On Tue, 21 Oct 2025 17:32:58 +0800 Yumei Huang
wrote: On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:36 +0800 Yumei Huang
wrote: Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read + * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation +*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc; + } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path); + return -2; + } + + buf[total_read] = '\0'; + + return total_read; +} + +/** + * read_file_integer() - Read an integer value from a file + * @path: File to read + * @fallback: Default value if file can't be read + * + * Return: Integer value, fallback on failure +*/ +intmax_t read_file_integer(const char *path, intmax_t fallback) +{ + char buf[INTMAX_STRLEN]; + ssize_t bytes_read; + intmax_t value; + char *end; + + bytes_read = read_file(path, buf, sizeof(buf)); + + if (bytes_read < 0) + return fallback; + + if (bytes_read == 0) { + debug("Empty file %s", path); + return fallback; + } + + errno = 0; + value = strtoimax(buf, &end, 10); + if (*end && *end != '\n') { + debug("Invalid format in %s", path); + return fallback; + } + if (errno) { + debug("Invalid value in %s: %s", path, buf); + return fallback; + } + + return value; +} + #ifdef __ia64__ /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(), * use the description from clone(2). diff --git a/util.h b/util.h index 22eaac5..3f9f296 100644 --- a/util.h +++ b/util.h @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid); int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); +ssize_t read_file(const char *path, char *buf, size_t buf_size); +intmax_t read_file_integer(const char *path, intmax_t fallback); int write_all_buf(int fd, const void *buf, size_t len); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); int read_all_buf(int fd, void *buf, size_t len); @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
#define UINT16_STRLEN (sizeof("65535"))
+/* Each byte expands to at most 3 decimal digits since 0xff == 255. + * Plus 2 extra bytes for the sign and null terminator. + * See https://stackoverflow.com/a/10536254.
This is not an acceptable form of attribution according to the CC BY-SA 3.0 terms. See:
https://stackoverflow.com/help/licensing https://creativecommons.org/licenses/by-sa/3.0/
and checksum.h in this tree for some examples of how to combine different licensing terms in a single file, in a way that's human-readable but still machine-friendly (for license / compliance scanners such as REUSE).
As I commented on a previous version, anyway, I don't think we need this at all. I guess my comment was ignored though.
I guess you meant the comment of suggesting using BUFSIZ in V2?
Right, but on v4, and that was just Friday for everybody involved...
David replied as quote:
"We could use BUFSIZ, but it's massive overkill for reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if intmax_t were 128-bit)."
I wanted to reply to that because sure, BUFSIZ is typically 8192 bytes on glibc and 1024 with musl, but adding 10 or 8192 to the stack pointer doesn't really make a difference.
It's not like we allocate that memory anyway, and I don't think any of that memory (or unused holes on the stack we create) is prefetched. And regardless of all that... we don't use these functions on any data path, it's just during configuration. It can be (relatively) slow.
Right. Sorry, I didn't follow up yet, but after I wrote that, Stefano convinced me that BUFSIZ is fine. I originally suggested avoiding the arbitrary buffer largely because I thought having an INTMAX_STRLEN constant might be useful for other reasons too, but honestly, not very. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Oct 17, 2025 at 02:28:37PM +0800, Yumei Huang wrote:
If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
--- tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 5 +++++ 2 files changed, 52 insertions(+), 8 deletions(-)
diff --git a/tcp.c b/tcp.c index 2ec4b0c..9385132 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. If this persists + * for more than TCP_MAX_RETRIES or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times in a row, reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s */ #define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \ + /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + if (conn->retries < c->tcp.syn_linear_timeouts) + it.it_value.tv_sec = SYN_TIMEOUT_INIT; + else + it.it_value.tv_sec = SYN_TIMEOUT_INIT << + (conn->retries - c->tcp.syn_linear_timeouts); + } else it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++; + tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection + * @c: Execution context +*/ +void tcp_syn_params_init(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8); + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1); + + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8, + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_syn_params_init(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..4369b52 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: Number of SYN retries during handshake + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout + * before switching to exponential backoff timeout */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
#endif /* TCP_H */ -- 2.47.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Oct 17, 2025 at 02:28:38PM +0800, Yumei Huang wrote:
Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
Still LGTM. Stefano has more or less convinced me that clamping the maximum backoff time is worthwhile, but that can easily be a later patch rather than respinning.
--- tcp.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-)
diff --git a/tcp.c b/tcp.c index 9385132..dc0ec6c 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,14 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. If this persists - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + - * tcp_syn_linear_timeouts) times in a row, reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. If this persists too many times + * in a row, reset the connection: TCP_MAX_RETRIES for established + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the + * handshake. * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +340,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { if (conn->retries < c->tcp.syn_linear_timeouts) - it.it_value.tv_sec = SYN_TIMEOUT_INIT; + it.it_value.tv_sec = RTO_INIT; else - it.it_value.tv_sec = SYN_TIMEOUT_INIT << + it.it_value.tv_sec = RTO_INIT << (conn->retries - c->tcp.syn_linear_timeouts); } else - it.it_value.tv_sec = ACK_TIMEOUT; + it.it_value.tv_sec = RTO_INIT << conn->retries; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { -- 2.47.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, Oct 22, 2025 at 9:17 AM David Gibson
On Fri, Oct 17, 2025 at 02:28:37PM +0800, Yumei Huang wrote:
If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
I guess I need to update the doc string for tcp_syn_retries of the struct tcp_ctx, as discussed in v5. Sorry for the inconvenience brought by the multiple versions in a short time.
--- tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 5 +++++ 2 files changed, 52 insertions(+), 8 deletions(-)
diff --git a/tcp.c b/tcp.c index 2ec4b0c..9385132 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. If this persists + * for more than TCP_MAX_RETRIES or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times in a row, reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s */ #define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \ + /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + if (conn->retries < c->tcp.syn_linear_timeouts) + it.it_value.tv_sec = SYN_TIMEOUT_INIT; + else + it.it_value.tv_sec = SYN_TIMEOUT_INIT << + (conn->retries - c->tcp.syn_linear_timeouts); + } else it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++; + tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection + * @c: Execution context +*/ +void tcp_syn_params_init(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8); + syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1); + + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8, + c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_syn_params_init(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..4369b52 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: Number of SYN retries during handshake + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout + * before switching to exponential backoff timeout */ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
#endif /* TCP_H */ -- 2.47.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks, Yumei Huang
On Wed, Oct 22, 2025 at 09:30:55AM +0800, Yumei Huang wrote:
On Wed, Oct 22, 2025 at 9:17 AM David Gibson
wrote: On Fri, Oct 17, 2025 at 02:28:37PM +0800, Yumei Huang wrote:
If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
Reviewed-by: David Gibson
I guess I need to update the doc string for tcp_syn_retries of the struct tcp_ctx, as discussed in v5.
True
Sorry for the inconvenience brought by the multiple versions in a short time.
It's ok, taking a while to figure out the culture of a project is expected. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, Oct 22, 2025 at 9:19 AM David Gibson
On Fri, Oct 17, 2025 at 02:28:38PM +0800, Yumei Huang wrote:
Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson Still LGTM. Stefano has more or less convinced me that clamping the maximum backoff time is worthwhile, but that can easily be a later patch rather than respinning.
Thank you. Either way, I need to respin for the doc string thing. Once Stefano finishes the review for v6, I can send v7 with clamping the maximum timeout.
--- tcp.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-)
diff --git a/tcp.c b/tcp.c index 9385132..dc0ec6c 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,14 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. If this persists - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + - * tcp_syn_linear_timeouts) times in a row, reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. If this persists too many times + * in a row, reset the connection: TCP_MAX_RETRIES for established + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the + * handshake. * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +340,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { if (conn->retries < c->tcp.syn_linear_timeouts) - it.it_value.tv_sec = SYN_TIMEOUT_INIT; + it.it_value.tv_sec = RTO_INIT; else - it.it_value.tv_sec = SYN_TIMEOUT_INIT << + it.it_value.tv_sec = RTO_INIT << (conn->retries - c->tcp.syn_linear_timeouts); } else - it.it_value.tv_sec = ACK_TIMEOUT; + it.it_value.tv_sec = RTO_INIT << conn->retries; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else { -- 2.47.0
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks, Yumei Huang
On Wed, Oct 22, 2025 at 9:01 AM David Gibson
On Tue, Oct 21, 2025 at 11:50:59PM +0200, Stefano Brivio wrote:
On Tue, 21 Oct 2025 17:32:58 +0800 Yumei Huang
wrote: On Sun, Oct 19, 2025 at 6:07 PM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:36 +0800 Yumei Huang
wrote: Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read + * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation +*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc; + } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path); + return -2; + } + + buf[total_read] = '\0'; + + return total_read; +} + +/** + * read_file_integer() - Read an integer value from a file + * @path: File to read + * @fallback: Default value if file can't be read + * + * Return: Integer value, fallback on failure +*/ +intmax_t read_file_integer(const char *path, intmax_t fallback) +{ + char buf[INTMAX_STRLEN]; + ssize_t bytes_read; + intmax_t value; + char *end; + + bytes_read = read_file(path, buf, sizeof(buf)); + + if (bytes_read < 0) + return fallback; + + if (bytes_read == 0) { + debug("Empty file %s", path); + return fallback; + } + + errno = 0; + value = strtoimax(buf, &end, 10); + if (*end && *end != '\n') { + debug("Invalid format in %s", path); + return fallback; + } + if (errno) { + debug("Invalid value in %s: %s", path, buf); + return fallback; + } + + return value; +} + #ifdef __ia64__ /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(), * use the description from clone(2). diff --git a/util.h b/util.h index 22eaac5..3f9f296 100644 --- a/util.h +++ b/util.h @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid); int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); +ssize_t read_file(const char *path, char *buf, size_t buf_size); +intmax_t read_file_integer(const char *path, intmax_t fallback); int write_all_buf(int fd, const void *buf, size_t len); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); int read_all_buf(int fd, void *buf, size_t len); @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
#define UINT16_STRLEN (sizeof("65535"))
+/* Each byte expands to at most 3 decimal digits since 0xff == 255. + * Plus 2 extra bytes for the sign and null terminator. + * See https://stackoverflow.com/a/10536254.
This is not an acceptable form of attribution according to the CC BY-SA 3.0 terms. See:
https://stackoverflow.com/help/licensing https://creativecommons.org/licenses/by-sa/3.0/
and checksum.h in this tree for some examples of how to combine different licensing terms in a single file, in a way that's human-readable but still machine-friendly (for license / compliance scanners such as REUSE).
As I commented on a previous version, anyway, I don't think we need this at all. I guess my comment was ignored though.
I guess you meant the comment of suggesting using BUFSIZ in V2?
Right, but on v4, and that was just Friday for everybody involved...
David replied as quote:
"We could use BUFSIZ, but it's massive overkill for reading a single integer: 8192 versus ~21 bytes (or ~42 bytes if intmax_t were 128-bit)."
I wanted to reply to that because sure, BUFSIZ is typically 8192 bytes on glibc and 1024 with musl, but adding 10 or 8192 to the stack pointer doesn't really make a difference.
It's not like we allocate that memory anyway, and I don't think any of that memory (or unused holes on the stack we create) is prefetched. And regardless of all that... we don't use these functions on any data path, it's just during configuration. It can be (relatively) slow.
Right. Sorry, I didn't follow up yet, but after I wrote that, Stefano convinced me that BUFSIZ is fine. I originally suggested avoiding the arbitrary buffer largely because I thought having an INTMAX_STRLEN constant might be useful for other reasons too, but honestly, not very.
I will update that in v7.
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks, Yumei Huang
Sorry for the delay, mostly nits but a couple of substantial comments:
On Fri, 17 Oct 2025 14:28:36 +0800
Yumei Huang
Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read
I see this is the same as write_file(), so in some sense it's pre-existing, but @path isn't really a "file" in the sense that it's not a file descriptor as one might expect from the description alone. I'd rather say "Path to file" or "Path to file to read" or something like that. On the other hand, if you want to keep this consistent with write_file(), never mind. Not a strong preference from me.
+ * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation
Similar comment here: this is partially symmetric to read_file, but it's yet another convention we are introducing, because of the -2 special value. Other somewhat related functions in util.c return with a meaningful errno value set, this one doesn't. The majority of helpers in passt, though, return with a negative errno-like value, and truncation can be very well represented by returning -ENOBUFS, see snprintf_check(). I think that's preferable. Again, if the intention is to make this consistent to write_file(), it can be left as it is.
+*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc;
Coverity Scan (I can provide instructions separately if desired) reports one issue below, but I'll mention it here for clarity: you are adding 'rc', of type ssize_t, to total_read, of type size_t, and buf_size is also of type size_t, so you could overflow total_read by adding for example the maximum value for ssize_t twice, to it. We can't run into the (theoretical) issue fixed by d836d9e34586 ("util: Remove possible quadratic behaviour from write_remainder()") but the solution here might be similar. In general we should make sure that rc is less than whatever value we might sum to total_read to make it overflow at any point in time. I didn't really check this in detail, I can do that if needed, and perhaps David remembers more clearly what we did in a similar situation. It might also be a false positive, by the way. --- /home/sbrivio/passt/util.c:625:2: Type: Overflowed return value (INTEGER_OVERFLOW) /home/sbrivio/passt/util.c:596:2: 1. path: Condition "fd < 0", taking false branch. /home/sbrivio/passt/util.c:601:2: 2. path: Condition "total_read < buf_size", taking true branch. /home/sbrivio/passt/util.c:604:3: 3. path: Condition "rc < 0", taking false branch. /home/sbrivio/passt/util.c:610:3: 4. path: Condition "rc == 0", taking false branch. /home/sbrivio/passt/util.c:614:2: 5. path: Jumping back to the beginning of the loop. /home/sbrivio/passt/util.c:601:2: 6. path: Condition "total_read < buf_size", taking true branch. /home/sbrivio/passt/util.c:602:3: 7. tainted_data_return: Called function "read(fd, buf + total_read, buf_size - total_read)", and a possible return value may be less than zero. /home/sbrivio/passt/util.c:602:3: 8. assign: Assigning: "rc" = "read(fd, buf + total_read, buf_size - total_read)". /home/sbrivio/passt/util.c:604:3: 9. path: Condition "rc < 0", taking false branch. /home/sbrivio/passt/util.c:610:3: 10. path: Condition "rc == 0", taking false branch. /home/sbrivio/passt/util.c:613:3: 11. overflow: The expression "total_read" is considered to have possibly overflowed. /home/sbrivio/passt/util.c:614:2: 12. path: Jumping back to the beginning of the loop. /home/sbrivio/passt/util.c:601:2: 13. path: Condition "total_read < buf_size", taking true branch. /home/sbrivio/passt/util.c:604:3: 14. path: Condition "rc < 0", taking false branch. /home/sbrivio/passt/util.c:610:3: 15. path: Condition "rc == 0", taking true branch. /home/sbrivio/passt/util.c:611:4: 16. path: Breaking from loop. /home/sbrivio/passt/util.c:618:2: 17. path: Condition "total_read == buf_size", taking false branch. /home/sbrivio/passt/util.c:625:2: 18. return_overflow: "total_read", which might have overflowed, is returned from the function. ---
+ } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path);
The file wasn't truncated (on disk) as this comment might seem to indicate. I'd rather say "File contents exceed buffer size", or "Partial file read", something like that. While at it, you could print the size we read (it's %zu, see similar examples where we print size_t types).
+ return -2;
Safer to NULL-terminate also in this case, perhaps? A future caller might handle -2 (or equivalent) as a "partial" failure and use the buffer anyway, so not NULL-terminating it is rather subtle.
+ } + + buf[total_read] = '\0';
This is not in the comment to the function ("Read contents of file into NULL-terminated buffer"?).
+ + return total_read; +} + +/** + * read_file_integer() - Read an integer value from a file + * @path: File to read + * @fallback: Default value if file can't be read
These need to be aligned in the usual way (tabs).
+ * + * Return: Integer value, fallback on failure
That would be @fallback if you mean 'fallback' the argument. See also: 9e0423e13541 ("style: Fix 'Return' comment style").
+*/ +intmax_t read_file_integer(const char *path, intmax_t fallback) +{ + char buf[INTMAX_STRLEN]; + ssize_t bytes_read; + intmax_t value; + char *end; + + bytes_read = read_file(path, buf, sizeof(buf)); + + if (bytes_read < 0) + return fallback; + + if (bytes_read == 0) { + debug("Empty file %s", path); + return fallback; + } + + errno = 0; + value = strtoimax(buf, &end, 10); + if (*end && *end != '\n') { + debug("Invalid format in %s", path);
It took me a bit to understand that you mean that the file contains non-numbers except for '-'. Perhaps mention that explicitly, otherwise it looks like the file needs to have a somewhat special format? Say, "Non-numeric content in %s".
+ return fallback; + } + if (errno) { + debug("Invalid value in %s: %s", path, buf);
I just learnt from strtoimax(3) that this can only be an underflow or overflow. Maybe mention that ("Out of range value in..."), instead of "invalid", which is kind of arbitrary?
+ return fallback; + } + + return value; +} + #ifdef __ia64__ /* Needed by do_clone() below: glibc doesn't export the prototype of __clone2(), * use the description from clone(2). diff --git a/util.h b/util.h index 22eaac5..3f9f296 100644 --- a/util.h +++ b/util.h @@ -222,6 +222,8 @@ void pidfile_write(int fd, pid_t pid); int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); +ssize_t read_file(const char *path, char *buf, size_t buf_size); +intmax_t read_file_integer(const char *path, intmax_t fallback); int write_all_buf(int fd, const void *buf, size_t len); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); int read_all_buf(int fd, void *buf, size_t len); @@ -250,6 +252,12 @@ static inline const char *af_name(sa_family_t af)
#define UINT16_STRLEN (sizeof("65535"))
+/* Each byte expands to at most 3 decimal digits since 0xff == 255. + * Plus 2 extra bytes for the sign and null terminator. + * See https://stackoverflow.com/a/10536254. + */ +#define INTMAX_STRLEN (sizeof(intmax_t) * 3 + 2) +
As discussed, we can probably use BUFSIZ instead, and keep this simple.
/* inet address (- '\0') + port (u16) (- '\0') + ':' + '\0' */ #define SOCKADDR_INET_STRLEN \ (INET_ADDRSTRLEN-1 + UINT16_STRLEN-1 + sizeof(":"))
-- Stefano
On Fri, 17 Oct 2025 14:28:37 +0800
Yumei Huang
If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Linux.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
--- tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 5 +++++ 2 files changed, 52 insertions(+), 8 deletions(-) diff --git a/tcp.c b/tcp.c index 2ec4b0c..9385132 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. If this persists
"If this persists" makes sense for the existing ACK_TIMEOUT description but not here, because it looks like it refers to "starting timeout". Coupled with the next patch, it becomes increasingly difficult to understand what "this" persisting thing is. Maybe directly say "Retry for ..., then reset the connection"? It's shorter and clearer.
+ * for more than TCP_MAX_RETRIES or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times in a row, reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s */
Maybe mention RFC 6928 as done above? That's where this value comes from. I just noticed you do that in 4/4, so it's slightly nicer if you do that right away here for ease of future reference, but not really needed.
#define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
This uses 121 columns. I'm not sure where all those tabs and \ come from.
+ /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + if (conn->retries < c->tcp.syn_linear_timeouts) + it.it_value.tv_sec = SYN_TIMEOUT_INIT; + else + it.it_value.tv_sec = SYN_TIMEOUT_INIT << + (conn->retries - c->tcp.syn_linear_timeouts);
Probably more readable, but I haven't tried: always start from SYN_TIMEOUT_INIT, then multiply/shift if conn->retries >= c->tcp.syn_linear_timeouts.
+ } else it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++;
I think I already raised this point on a previous revision: this needs to be zeroed as the connection is established, but I don't see that in the current version.
+ tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
They're not initial, they'll be used for all the connections if I understand correctly. Maybe "Get SYN retries sysctl values"? I think the _init() in the function name is also somewhat misleading.
+ * @c: Execution context +*/ +void tcp_syn_params_init(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
Why 8? Perhaps a #define would help?
+ syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1); + + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
Similar to the comment above: these are not parameters of SYN segments (which would seem to imply TCP options, such as the MSS). We typically don't print C assignments, rather human-readable messages, so that could be "Read sysctl values tcp_syn_retries: ..., syn_linear_timeouts: ...".
+ c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_syn_params_init(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..4369b52 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: Number of SYN retries during handshake + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout + * before switching to exponential backoff timeout
Maybe more compact: * @syn_linear_timeouts: SYN retries before using exponential timeout
*/ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
#endif /* TCP_H */
-- Stefano
On Fri, 17 Oct 2025 14:28:38 +0800
Yumei Huang
Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson --- tcp.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/tcp.c b/tcp.c index 9385132..dc0ec6c 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,14 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. If this persists - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + - * tcp_syn_linear_timeouts) times in a row, reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. If this persists too many times + * in a row, reset the connection: TCP_MAX_RETRIES for established + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the + * handshake. * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +340,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { if (conn->retries < c->tcp.syn_linear_timeouts) - it.it_value.tv_sec = SYN_TIMEOUT_INIT; + it.it_value.tv_sec = RTO_INIT; else - it.it_value.tv_sec = SYN_TIMEOUT_INIT << + it.it_value.tv_sec = RTO_INIT << (conn->retries - c->tcp.syn_linear_timeouts); } else - it.it_value.tv_sec = ACK_TIMEOUT; + it.it_value.tv_sec = RTO_INIT << conn->retries;
Same as on 3/4, but here it's clearly more convenient: just assign RTO_INIT, and multiply as needed in the if / else clauses.
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else {
The rest of the series looks good to me. It might be slightly more practical to factor in directly the RTO clamp, and I don't think it's complicated now that you have the helper from 2/4, but it's not a strong preference from my side, as the series makes sense in any case. -- Stefano
On Fri, Oct 24, 2025 at 01:04:27AM +0200, Stefano Brivio wrote:
Sorry for the delay, mostly nits but a couple of substantial comments:
On Fri, 17 Oct 2025 14:28:36 +0800 Yumei Huang
wrote: Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read
I see this is the same as write_file(), so in some sense it's pre-existing, but @path isn't really a "file" in the sense that it's not a file descriptor as one might expect from the description alone.
I'd rather say "Path to file" or "Path to file to read" or something like that. On the other hand, if you want to keep this consistent with write_file(), never mind. Not a strong preference from me.
That's a good idea, but it's not crucial to the aim of this series, so I'd suggest doing it as a later patch.
+ * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation
Similar comment here: this is partially symmetric to read_file, but it's yet another convention we are introducing, because of the -2 special value.
Other somewhat related functions in util.c return with a meaningful errno value set, this one doesn't.
The majority of helpers in passt, though, return with a negative errno-like value, and truncation can be very well represented by returning -ENOBUFS, see snprintf_check(). I think that's preferable.
Again, if the intention is to make this consistent to write_file(), it can be left as it is.
Similarly. I considered commenting earlier on the -2 or truncation - we don't actually use this, and it's a bit ugly. On the other hand it doesn't hurt anything, so again, I think it can wait.
+*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc;
Coverity Scan (I can provide instructions separately if desired) reports one issue below, but I'll mention it here for clarity: you are adding 'rc', of type ssize_t, to total_read, of type size_t, and buf_size is also of type size_t, so you could overflow total_read by adding for example the maximum value for ssize_t twice, to it.
We can't run into the (theoretical) issue fixed by d836d9e34586 ("util: Remove possible quadratic behaviour from write_remainder()") but the solution here might be similar.
In general we should make sure that rc is less than whatever value we might sum to total_read to make it overflow at any point in time.
I didn't really check this in detail, I can do that if needed, and perhaps David remembers more clearly what we did in a similar situation. It might also be a false positive, by the way.
I think there are two slightly overlapping issues here. 1) I'm not sure Coverity knows/trusts that read() will never return more than its third argument. That's what stops total_read from ever exceeding buf_size. I'd need to think a bit harder about how to convince it that's the case. 2) buf_size is size_t, but we're returning ssize_t. If we passed a buf_size greater than ssize_t can hold, it would make a mess (UB, I think). I don't think there are any perfectly elegant solutions in C, so I'd suggest: ASSERT(buf_size <= SSIZE_MAX); at the top of the function. I'd try (2) first because it's a real (if unlikely to be triggered) bug. Then we can see if Coverity still complains (Yumei, I can walk you through how to install and run Coverity locally using Red Hat's subscription). [snip]
+ } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path);
The file wasn't truncated (on disk) as this comment might seem to indicate. I'd rather say "File contents exceed buffer size", or "Partial file read", something like that.
While at it, you could print the size we read (it's %zu, see similar examples where we print size_t types).
+ return -2;
Safer to NULL-terminate also in this case, perhaps? A future caller might handle -2 (or equivalent) as a "partial" failure and use the buffer anyway, so not NULL-terminating it is rather subtle.
That's a good idea. Given the purpose of the function, I think a caller _should_ ignore the buffer if it gets an error, but it's worthwhile to limit the damage if a caller forgets to check. That applies for other error cases too. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
On Fri, 17 Oct 2025 14:28:37 +0800 Yumei Huang
wrote: If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Linux.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
--- tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 5 +++++ 2 files changed, 52 insertions(+), 8 deletions(-) diff --git a/tcp.c b/tcp.c index 2ec4b0c..9385132 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. If this persists
"If this persists" makes sense for the existing ACK_TIMEOUT description but not here, because it looks like it refers to "starting timeout".
Coupled with the next patch, it becomes increasingly difficult to understand what "this" persisting thing is.
Yeah. This was my suggested wording, based on the existing wording for ACK_TIMEOUT. It's not great, but I struggled a bit to find better wording.
Maybe directly say "Retry for ..., then reset the connection"? It's shorter and clearer.
There it is :). "Retry (NNN) times, then reset the connection". [snip]
@@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++;
I think I already raised this point on a previous revision: this needs to be zeroed as the connection is established, but I don't see that in the current version.
Yes, you raised that, but then I realised it's already handled. I think I put that in the thread, not just direct to Yumei, but maybe not? Or it just got lost in the minutiae. When we receive a SYN-ACK, it will have th->ack_seq advanced a byte acknowledging the SYN. tcp_tap_handler() calls tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see the new ack_seq and clear retries (retrans before this series).
+ tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
They're not initial, they'll be used for all the connections if I understand correctly.
Maybe "Get SYN retries sysctl values"? I think the _init() in the function name is also somewhat misleading.
"Get host kernel RTO parameters"? Since we're thinking of extending this to cover the RTO upper bound as well as the SYN specific parameters.
+ * @c: Execution context +*/ +void tcp_syn_params_init(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
Why 8? Perhaps a #define would help?
+ syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1); + + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
Similar to the comment above: these are not parameters of SYN segments (which would seem to imply TCP options, such as the MSS).
We typically don't print C assignments, rather human-readable messages, so that could be "Read sysctl values tcp_syn_retries: ..., syn_linear_timeouts: ...".
+ c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_syn_params_init(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..4369b52 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: Number of SYN retries during handshake + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout + * before switching to exponential backoff timeout
Maybe more compact:
* @syn_linear_timeouts: SYN retries before using exponential timeout
*/ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
#endif /* TCP_H */
-- Stefano
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Oct 24, 2025 at 11:30 AM David Gibson
On Fri, Oct 24, 2025 at 01:04:27AM +0200, Stefano Brivio wrote:
Sorry for the delay, mostly nits but a couple of substantial comments:
On Fri, 17 Oct 2025 14:28:36 +0800 Yumei Huang
wrote: Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read
I see this is the same as write_file(), so in some sense it's pre-existing, but @path isn't really a "file" in the sense that it's not a file descriptor as one might expect from the description alone.
I'd rather say "Path to file" or "Path to file to read" or something like that. On the other hand, if you want to keep this consistent with write_file(), never mind. Not a strong preference from me.
That's a good idea, but it's not crucial to the aim of this series, so I'd suggest doing it as a later patch.
+ * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation
Similar comment here: this is partially symmetric to read_file, but it's yet another convention we are introducing, because of the -2 special value.
Other somewhat related functions in util.c return with a meaningful errno value set, this one doesn't.
The majority of helpers in passt, though, return with a negative errno-like value, and truncation can be very well represented by returning -ENOBUFS, see snprintf_check(). I think that's preferable.
Again, if the intention is to make this consistent to write_file(), it can be left as it is.
Similarly. I considered commenting earlier on the -2 or truncation - we don't actually use this, and it's a bit ugly. On the other hand it doesn't hurt anything, so again, I think it can wait.
+*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc;
Coverity Scan (I can provide instructions separately if desired) reports one issue below, but I'll mention it here for clarity: you are adding 'rc', of type ssize_t, to total_read, of type size_t, and buf_size is also of type size_t, so you could overflow total_read by adding for example the maximum value for ssize_t twice, to it.
We can't run into the (theoretical) issue fixed by d836d9e34586 ("util: Remove possible quadratic behaviour from write_remainder()") but the solution here might be similar.
In general we should make sure that rc is less than whatever value we might sum to total_read to make it overflow at any point in time.
I didn't really check this in detail, I can do that if needed, and perhaps David remembers more clearly what we did in a similar situation. It might also be a false positive, by the way.
I think there are two slightly overlapping issues here.
1) I'm not sure Coverity knows/trusts that read() will never return more than its third argument. That's what stops total_read from ever exceeding buf_size. I'd need to think a bit harder about how to convince it that's the case.
2) buf_size is size_t, but we're returning ssize_t. If we passed a buf_size greater than ssize_t can hold, it would make a mess (UB, I think). I don't think there are any perfectly elegant solutions in C, so I'd suggest: ASSERT(buf_size <= SSIZE_MAX);
at the top of the function.
I'd try (2) first because it's a real (if unlikely to be triggered) Yumei, I can walk you through how to install and run Coverity locally using Red Hat's subscription.
That would be great!
[snip]
+ } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path);
The file wasn't truncated (on disk) as this comment might seem to indicate. I'd rather say "File contents exceed buffer size", or "Partial file read", something like that.
While at it, you could print the size we read (it's %zu, see similar examples where we print size_t types).
+ return -2;
Safer to NULL-terminate also in this case, perhaps? A future caller might handle -2 (or equivalent) as a "partial" failure and use the buffer anyway, so not NULL-terminating it is rather subtle.
That's a good idea. Given the purpose of the function, I think a caller _should_ ignore the buffer if it gets an error, but it's worthwhile to limit the damage if a caller forgets to check. That applies for other error cases too.
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks, Yumei Huang
On Fri, 24 Oct 2025 14:30:09 +1100
David Gibson
On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
On Fri, 17 Oct 2025 14:28:37 +0800 Yumei Huang
wrote: If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Linux.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
--- tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 5 +++++ 2 files changed, 52 insertions(+), 8 deletions(-) diff --git a/tcp.c b/tcp.c index 2ec4b0c..9385132 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. If this persists
"If this persists" makes sense for the existing ACK_TIMEOUT description but not here, because it looks like it refers to "starting timeout".
Coupled with the next patch, it becomes increasingly difficult to understand what "this" persisting thing is.
Yeah. This was my suggested wording, based on the existing wording for ACK_TIMEOUT. It's not great, but I struggled a bit to find better wording.
Maybe directly say "Retry for ..., then reset the connection"? It's shorter and clearer.
There it is :). "Retry (NNN) times, then reset the connection".
[snip]
@@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++;
I think I already raised this point on a previous revision: this needs to be zeroed as the connection is established, but I don't see that in the current version.
Yes, you raised that, but then I realised it's already handled. I think I put that in the thread, not just direct to Yumei, but maybe not? Or it just got lost in the minutiae.
Yes, here: https://archives.passt.top/passt-dev/aOxFRfJjPWy0ZW0M@zatzit this is another example of what I meant about (potential) advantages of a fully threaded (email) workflow. In this case, I didn't review v2, which came before you could post this to my comment on v1, but in a normal case, we could have settled this earlier, once for all.
When we receive a SYN-ACK, it will have th->ack_seq advanced a byte acknowledging the SYN. tcp_tap_handler() calls tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see the new ack_seq and clear retries (retrans before this series).
It doesn't look obvious at all to me. We're unlikely to break it in the future, so I don't think it's fragile in the long term, but... can one of you double check that it's actually the case with a manual one-off test?
+ tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
They're not initial, they'll be used for all the connections if I understand correctly.
Maybe "Get SYN retries sysctl values"? I think the _init() in the function name is also somewhat misleading.
"Get host kernel RTO parameters"? Since we're thinking of extending this to cover the RTO upper bound as well as the SYN specific parameters.
Ah, maybe yes, better.
+ * @c: Execution context +*/ +void tcp_syn_params_init(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
Why 8? Perhaps a #define would help?
+ syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1); + + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
Similar to the comment above: these are not parameters of SYN segments (which would seem to imply TCP options, such as the MSS).
We typically don't print C assignments, rather human-readable messages, so that could be "Read sysctl values tcp_syn_retries: ..., syn_linear_timeouts: ...".
+ c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_syn_params_init(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..4369b52 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: Number of SYN retries during handshake + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout + * before switching to exponential backoff timeout
Maybe more compact:
* @syn_linear_timeouts: SYN retries before using exponential timeout
*/ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
#endif /* TCP_H */
-- Stefano
On Fri, Oct 24, 2025 at 10:37:17AM +0200, Stefano Brivio wrote:
On Fri, 24 Oct 2025 14:30:09 +1100 David Gibson
wrote: On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
On Fri, 17 Oct 2025 14:28:37 +0800 Yumei Huang
wrote: [snip] @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++;
I think I already raised this point on a previous revision: this needs to be zeroed as the connection is established, but I don't see that in the current version.
Yes, you raised that, but then I realised it's already handled. I think I put that in the thread, not just direct to Yumei, but maybe not? Or it just got lost in the minutiae.
Yes, here:
https://archives.passt.top/passt-dev/aOxFRfJjPWy0ZW0M@zatzit
this is another example of what I meant about (potential) advantages of a fully threaded (email) workflow.
In this case, I didn't review v2, which came before you could post this to my comment on v1, but in a normal case, we could have settled this earlier, once for all.
Ah, right, that'd do it.
When we receive a SYN-ACK, it will have th->ack_seq advanced a byte acknowledging the SYN. tcp_tap_handler() calls tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see the new ack_seq and clear retries (retrans before this series).
It doesn't look obvious at all to me.
Oh, it's definitely not obvious, but I'm pretty confident it's correct. Fwiw, I spotted this because I thought the explicit handling in v2 wasn't at quite the right point logically (though close enough to be fine in practice). I went looking for the precise right point - when we receive the SYN-ACK - and there it was, already handled. It does make a kind of logical sense. The RFCs don't generally treat SYN (or SYN-ACK, or FIN) retransmits any differently from data retransmits. We do treat them differently, but less so after this series, which is a good thing, I think.
We're unlikely to break it in the future, so I don't think it's fragile in the long term, but... can one of you double check that it's actually the case with a manual one-off test?
Yeah, I guess that's wise. Easiest way is probably to add a temporary debug message here, and try it against a qemu guest that's temporarily suspended. Yumei, I can walk you through this, too. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Oct 24, 2025 at 6:56 PM David Gibson
On Fri, Oct 24, 2025 at 10:37:17AM +0200, Stefano Brivio wrote:
On Fri, 24 Oct 2025 14:30:09 +1100 David Gibson
wrote: On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
On Fri, 17 Oct 2025 14:28:37 +0800 Yumei Huang
wrote: [snip] @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++;
I think I already raised this point on a previous revision: this needs to be zeroed as the connection is established, but I don't see that in the current version.
Yes, you raised that, but then I realised it's already handled. I think I put that in the thread, not just direct to Yumei, but maybe not? Or it just got lost in the minutiae.
Yes, here:
https://archives.passt.top/passt-dev/aOxFRfJjPWy0ZW0M@zatzit
this is another example of what I meant about (potential) advantages of a fully threaded (email) workflow.
In this case, I didn't review v2, which came before you could post this to my comment on v1, but in a normal case, we could have settled this earlier, once for all.
Ah, right, that'd do it.
When we receive a SYN-ACK, it will have th->ack_seq advanced a byte acknowledging the SYN. tcp_tap_handler() calls tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see the new ack_seq and clear retries (retrans before this series).
It doesn't look obvious at all to me.
Oh, it's definitely not obvious, but I'm pretty confident it's correct. Fwiw, I spotted this because I thought the explicit handling in v2 wasn't at quite the right point logically (though close enough to be fine in practice). I went looking for the precise right point - when we receive the SYN-ACK - and there it was, already handled.
It does make a kind of logical sense. The RFCs don't generally treat SYN (or SYN-ACK, or FIN) retransmits any differently from data retransmits. We do treat them differently, but less so after this series, which is a good thing, I think.
We're unlikely to break it in the future, so I don't think it's fragile in the long term, but... can one of you double check that it's actually the case with a manual one-off test?
Yeah, I guess that's wise. Easiest way is probably to add a temporary debug message here, and try it against a qemu guest that's temporarily suspended. Yumei, I can walk you through this, too.
Thank you, I've verified this by adding debug messages in the if block: if (th->ack && !(conn->events & ESTABLISHED)) tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq)); conn->retries gets reset in tcp_update_seqack_from_tap().
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks, Yumei Huang
On Mon, 27 Oct 2025 11:37:20 +0800
Yumei Huang
On Fri, Oct 24, 2025 at 6:56 PM David Gibson
wrote: On Fri, Oct 24, 2025 at 10:37:17AM +0200, Stefano Brivio wrote:
On Fri, 24 Oct 2025 14:30:09 +1100 David Gibson
wrote: On Fri, Oct 24, 2025 at 01:04:31AM +0200, Stefano Brivio wrote:
On Fri, 17 Oct 2025 14:28:37 +0800 Yumei Huang
wrote: [snip] @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++;
I think I already raised this point on a previous revision: this needs to be zeroed as the connection is established, but I don't see that in the current version.
Yes, you raised that, but then I realised it's already handled. I think I put that in the thread, not just direct to Yumei, but maybe not? Or it just got lost in the minutiae.
Yes, here:
https://archives.passt.top/passt-dev/aOxFRfJjPWy0ZW0M@zatzit
this is another example of what I meant about (potential) advantages of a fully threaded (email) workflow.
In this case, I didn't review v2, which came before you could post this to my comment on v1, but in a normal case, we could have settled this earlier, once for all.
Ah, right, that'd do it.
When we receive a SYN-ACK, it will have th->ack_seq advanced a byte acknowledging the SYN. tcp_tap_handler() calls tcp_update_seqack_from_tap() in the !ESTABLISHED case which will see the new ack_seq and clear retries (retrans before this series).
It doesn't look obvious at all to me.
Oh, it's definitely not obvious, but I'm pretty confident it's correct. Fwiw, I spotted this because I thought the explicit handling in v2 wasn't at quite the right point logically (though close enough to be fine in practice). I went looking for the precise right point - when we receive the SYN-ACK - and there it was, already handled.
It does make a kind of logical sense. The RFCs don't generally treat SYN (or SYN-ACK, or FIN) retransmits any differently from data retransmits. We do treat them differently, but less so after this series, which is a good thing, I think.
We're unlikely to break it in the future, so I don't think it's fragile in the long term, but... can one of you double check that it's actually the case with a manual one-off test?
Yeah, I guess that's wise. Easiest way is probably to add a temporary debug message here, and try it against a qemu guest that's temporarily suspended. Yumei, I can walk you through this, too.
Thank you, I've verified this by adding debug messages in the if block:
if (th->ack && !(conn->events & ESTABLISHED)) tcp_update_seqack_from_tap(c, conn, ntohl(th->ack_seq));
conn->retries gets reset in tcp_update_seqack_from_tap().
Great, thanks for checking. Maybe we could mention this in a comment, but I'm not exactly sure where we would add this comment, so unless somebody has a good idea we can also skip that. I don't have further comments on v6 by the way, the rest of the series looks good to me. -- Stefano
On Fri, Oct 24, 2025 at 11:30 AM David Gibson
On Fri, Oct 24, 2025 at 01:04:27AM +0200, Stefano Brivio wrote:
Sorry for the delay, mostly nits but a couple of substantial comments:
On Fri, 17 Oct 2025 14:28:36 +0800 Yumei Huang
wrote: Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read
I see this is the same as write_file(), so in some sense it's pre-existing, but @path isn't really a "file" in the sense that it's not a file descriptor as one might expect from the description alone.
I'd rather say "Path to file" or "Path to file to read" or something like that. On the other hand, if you want to keep this consistent with write_file(), never mind. Not a strong preference from me.
That's a good idea, but it's not crucial to the aim of this series, so I'd suggest doing it as a later patch.
Thank you. As I have to respin and this a minor change, I will update that in v7.
+ * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation
Similar comment here: this is partially symmetric to read_file, but it's yet another convention we are introducing, because of the -2 special value.
Other somewhat related functions in util.c return with a meaningful errno value set, this one doesn't.
The majority of helpers in passt, though, return with a negative errno-like value, and truncation can be very well represented by returning -ENOBUFS, see snprintf_check(). I think that's preferable.
Again, if the intention is to make this consistent to write_file(), it can be left as it is.
Similarly. I considered commenting earlier on the -2 or truncation - we don't actually use this, and it's a bit ugly. On the other hand it doesn't hurt anything, so again, I think it can wait.
Same here.
+*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc;
Coverity Scan (I can provide instructions separately if desired) reports one issue below, but I'll mention it here for clarity: you are adding 'rc', of type ssize_t, to total_read, of type size_t, and buf_size is also of type size_t, so you could overflow total_read by adding for example the maximum value for ssize_t twice, to it.
We can't run into the (theoretical) issue fixed by d836d9e34586 ("util: Remove possible quadratic behaviour from write_remainder()") but the solution here might be similar.
In general we should make sure that rc is less than whatever value we might sum to total_read to make it overflow at any point in time.
I didn't really check this in detail, I can do that if needed, and perhaps David remembers more clearly what we did in a similar situation. It might also be a false positive, by the way.
I think there are two slightly overlapping issues here.
1) I'm not sure Coverity knows/trusts that read() will never return more than its third argument. That's what stops total_read from ever exceeding buf_size. I'd need to think a bit harder about how to convince it that's the case.
2) buf_size is size_t, but we're returning ssize_t. If we passed a buf_size greater than ssize_t can hold, it would make a mess (UB, I think). I don't think there are any perfectly elegant solutions in C, so I'd suggest: ASSERT(buf_size <= SSIZE_MAX);
at the top of the function.
I'd try (2) first because it's a real (if unlikely to be triggered) bug. Then we can see if Coverity still complains (Yumei, I can walk you through how to install and run Coverity locally using Red Hat's subscription).
Coverity doesn't complain about it in my setup. Stefano may give more info on that. In a word, it's a false positive.
[snip]
+ } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path);
The file wasn't truncated (on disk) as this comment might seem to indicate. I'd rather say "File contents exceed buffer size", or "Partial file read", something like that.
While at it, you could print the size we read (it's %zu, see similar examples where we print size_t types).
+ return -2;
Safer to NULL-terminate also in this case, perhaps? A future caller might handle -2 (or equivalent) as a "partial" failure and use the buffer anyway, so not NULL-terminating it is rather subtle.
That's a good idea. Given the purpose of the function, I think a caller _should_ ignore the buffer if it gets an error, but it's worthwhile to limit the damage if a caller forgets to check. That applies for other error cases too.
The rest will update in v7 as well.
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
-- Thanks, Yumei Huang
On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
On Fri, 17 Oct 2025 14:28:37 +0800 Yumei Huang
wrote: If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Linux.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
--- tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 5 +++++ 2 files changed, 52 insertions(+), 8 deletions(-) diff --git a/tcp.c b/tcp.c index 2ec4b0c..9385132 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. If this persists
"If this persists" makes sense for the existing ACK_TIMEOUT description but not here, because it looks like it refers to "starting timeout".
Coupled with the next patch, it becomes increasingly difficult to understand what "this" persisting thing is.
Maybe directly say "Retry for ..., then reset the connection"? It's shorter and clearer.
+ * for more than TCP_MAX_RETRIES or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times in a row, reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s */
Maybe mention RFC 6928 as done above? That's where this value comes from.
I just noticed you do that in 4/4, so it's slightly nicer if you do that right away here for ease of future reference, but not really needed.
#define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
This uses 121 columns. I'm not sure where all those tabs and \ come from.
My bad. I renamed the macro and it fits in 80 columns but I forgot to remove the '\' and the spaces before that.
+ /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + if (conn->retries < c->tcp.syn_linear_timeouts) + it.it_value.tv_sec = SYN_TIMEOUT_INIT; + else + it.it_value.tv_sec = SYN_TIMEOUT_INIT << + (conn->retries - c->tcp.syn_linear_timeouts);
Probably more readable, but I haven't tried: always start from SYN_TIMEOUT_INIT, then multiply/shift if conn->retries >= c->tcp.syn_linear_timeouts.
I guess you meant: if (!(conn->events & ESTABLISHED)) { it.it_value.tv_sec = SYN_TIMEOUT_INIT; if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec = SYN_TIMEOUT_INIT << (conn->retries - c->tcp.syn_linear_timeouts); other than something like: it.it_value.tv_sec <<= 1 in above second if block, since the latter would cause it.it_value.tv_sec=2 always when retries>=syn_linear_timeouts, as it.it_value.tv_sec is set to SYN_TIMEOUT_INIT before the if condition. If I'm correct, I'm not sure if it's more readable to change to that. I feel the if/else clause quite matches the way the kernel handles it. What do you think?
+ } else it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++;
I think I already raised this point on a previous revision: this needs to be zeroed as the connection is established, but I don't see that in the current version.
+ tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
They're not initial, they'll be used for all the connections if I understand correctly.
Maybe "Get SYN retries sysctl values"? I think the _init() in the function name is also somewhat misleading.
Do you have a better idea for the function name? I named it just like other functions called in tcp_init(), such as tcp_sock_iov_init(), tcp_splice_init().
+ * @c: Execution context +*/ +void tcp_syn_params_init(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
Why 8? Perhaps a #define would help?
The 8 actually comes from your suggestion in v1: " (optional, and might be in another series, but keeping it together with the rest might be more convenient): read tcp_syn_retries, limit to 8, and also read tcp_syn_linear_timeouts " I guess it can be replaced with 6 as it's the default value according to Documentation/networking/ip-sysctl.rst. We already have TCP_SYN_RETRIES for the sysctl file, maybe have another TCP_SYN_RETRIES_DEFAULT for the number?
+ syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
The default value is 4 as Documentation/networking/ip-sysctl.rst. I don't quite remember where the 1 is from. Should we update it as well and add the similar macro as above?
+ + c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
Similar to the comment above: these are not parameters of SYN segments (which would seem to imply TCP options, such as the MSS).
We typically don't print C assignments, rather human-readable messages, so that could be "Read sysctl values tcp_syn_retries: ..., syn_linear_timeouts: ...".
Got it.
+ c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_syn_params_init(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..4369b52 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: Number of SYN retries during handshake + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout + * before switching to exponential backoff timeout
Maybe more compact:
* @syn_linear_timeouts: SYN retries before using exponential timeout
So we remove the '"Number of" as well? I guess we should make it consistent for both tcp_syn_retries and syn_linear_timeouts.
*/ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
#endif /* TCP_H */
-- Stefano
-- Thanks, Yumei Huang
On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
On Fri, 17 Oct 2025 14:28:38 +0800 Yumei Huang
wrote: Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson --- tcp.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/tcp.c b/tcp.c index 9385132..dc0ec6c 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,14 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. If this persists - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + - * tcp_syn_linear_timeouts) times in a row, reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. If this persists too many times + * in a row, reset the connection: TCP_MAX_RETRIES for established + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the + * handshake. * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +340,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { if (conn->retries < c->tcp.syn_linear_timeouts) - it.it_value.tv_sec = SYN_TIMEOUT_INIT; + it.it_value.tv_sec = RTO_INIT; else - it.it_value.tv_sec = SYN_TIMEOUT_INIT << + it.it_value.tv_sec = RTO_INIT << (conn->retries - c->tcp.syn_linear_timeouts); } else - it.it_value.tv_sec = ACK_TIMEOUT; + it.it_value.tv_sec = RTO_INIT << conn->retries;
Same as on 3/4, but here it's clearly more convenient: just assign RTO_INIT, and multiply as needed in the if / else clauses.
I guess we can't just assign RTO_INIT. Maybe assign it only when retries==0, otherwise multiply as it.it_value.tv_sec <<=1. But it seems more complicated. What do you think?
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else {
The rest of the series looks good to me.
It might be slightly more practical to factor in directly the RTO clamp, and I don't think it's complicated now that you have the helper from 2/4, but it's not a strong preference from my side, as the series makes sense in any case.
Reading tcp_rto_max_ms can be easy with the helper. My concern is about the way we get the total time for retries. I used to do it like this in v2, https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...: +#define RETRY_ELAPSED(timeout_init, retries) \ + ((timeout_init) * ((1 << ((retries) + 1)) - 2)) Though the formula is not quite right, we could refine it as below: #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1)) Does it make sense to get the time this way?
-- Stefano
-- Thanks, Yumei Huang
On Tue, 28 Oct 2025 15:11:32 +0800
Yumei Huang
On Fri, Oct 24, 2025 at 11:30 AM David Gibson
wrote: On Fri, Oct 24, 2025 at 01:04:27AM +0200, Stefano Brivio wrote:
Sorry for the delay, mostly nits but a couple of substantial comments:
On Fri, 17 Oct 2025 14:28:36 +0800 Yumei Huang
wrote: Signed-off-by: Yumei Huang
--- util.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ util.h | 8 ++++++ 2 files changed, 92 insertions(+) diff --git a/util.c b/util.c index c492f90..5c8c4bc 100644 --- a/util.c +++ b/util.c @@ -579,6 +579,90 @@ int write_file(const char *path, const char *buf) return len == 0 ? 0 : -1; }
+/** + * read_file() - Read contents of file into a buffer + * @path: File to read
I see this is the same as write_file(), so in some sense it's pre-existing, but @path isn't really a "file" in the sense that it's not a file descriptor as one might expect from the description alone.
I'd rather say "Path to file" or "Path to file to read" or something like that. On the other hand, if you want to keep this consistent with write_file(), never mind. Not a strong preference from me.
That's a good idea, but it's not crucial to the aim of this series, so I'd suggest doing it as a later patch.
Thank you. As I have to respin and this a minor change, I will update that in v7.
Just for clarity: I think David meant that it *should* be done as a separate patch (maybe in the same series, if you have a moment, but it's not that critical). I don't have a particular preference, but David's point makes sense to me.
+ * @buf: Buffer to store file contents + * @buf_size: Size of buffer + * + * Return: number of bytes read on success, -1 on any error, -2 on truncation
Similar comment here: this is partially symmetric to read_file, but it's yet another convention we are introducing, because of the -2 special value.
Other somewhat related functions in util.c return with a meaningful errno value set, this one doesn't.
The majority of helpers in passt, though, return with a negative errno-like value, and truncation can be very well represented by returning -ENOBUFS, see snprintf_check(). I think that's preferable.
Again, if the intention is to make this consistent to write_file(), it can be left as it is.
Similarly. I considered commenting earlier on the -2 or truncation - we don't actually use this, and it's a bit ugly. On the other hand it doesn't hurt anything, so again, I think it can wait.
Same here.
Same as above ("I think it can wait"). Well, up to you, really.
+*/ +ssize_t read_file(const char *path, char *buf, size_t buf_size) +{ + int fd = open(path, O_RDONLY | O_CLOEXEC); + size_t total_read = 0; + ssize_t rc; + + if (fd < 0) { + warn_perror("Could not open %s", path); + return -1; + } + + while (total_read < buf_size) { + rc = read(fd, buf + total_read, buf_size - total_read); + + if (rc < 0) { + warn_perror("Couldn't read from %s", path); + close(fd); + return -1; + } + + if (rc == 0) + break; + + total_read += rc;
Coverity Scan (I can provide instructions separately if desired) reports one issue below, but I'll mention it here for clarity: you are adding 'rc', of type ssize_t, to total_read, of type size_t, and buf_size is also of type size_t, so you could overflow total_read by adding for example the maximum value for ssize_t twice, to it.
We can't run into the (theoretical) issue fixed by d836d9e34586 ("util: Remove possible quadratic behaviour from write_remainder()") but the solution here might be similar.
In general we should make sure that rc is less than whatever value we might sum to total_read to make it overflow at any point in time.
I didn't really check this in detail, I can do that if needed, and perhaps David remembers more clearly what we did in a similar situation. It might also be a false positive, by the way.
I think there are two slightly overlapping issues here.
1) I'm not sure Coverity knows/trusts that read() will never return more than its third argument. That's what stops total_read from ever exceeding buf_size. I'd need to think a bit harder about how to convince it that's the case.
2) buf_size is size_t, but we're returning ssize_t. If we passed a buf_size greater than ssize_t can hold, it would make a mess (UB, I think). I don't think there are any perfectly elegant solutions in C, so I'd suggest: ASSERT(buf_size <= SSIZE_MAX);
at the top of the function.
I'd try (2) first because it's a real (if unlikely to be triggered) bug. Then we can see if Coverity still complains (Yumei, I can walk you through how to install and run Coverity locally using Red Hat's subscription).
Coverity doesn't complain about it in my setup. Stefano may give more info on that. In a word, it's a false positive.
Right, my bad, it turns out I was using an outdated version and this false positive doesn't appear anymore.
[snip]
+ } + + close(fd); + + if (total_read == buf_size) { + warn("File %s truncated, buffer too small", path);
The file wasn't truncated (on disk) as this comment might seem to indicate. I'd rather say "File contents exceed buffer size", or "Partial file read", something like that.
While at it, you could print the size we read (it's %zu, see similar examples where we print size_t types).
+ return -2;
Safer to NULL-terminate also in this case, perhaps? A future caller might handle -2 (or equivalent) as a "partial" failure and use the buffer anyway, so not NULL-terminating it is rather subtle.
That's a good idea. Given the purpose of the function, I think a caller _should_ ignore the buffer if it gets an error, but it's worthwhile to limit the damage if a caller forgets to check. That applies for other error cases too.
The rest will update in v7 as well.
Thanks. -- Stefano
On Tue, 28 Oct 2025 15:43:25 +0800
Yumei Huang
On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:37 +0800 Yumei Huang
wrote: If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Linux.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
--- tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 5 +++++ 2 files changed, 52 insertions(+), 8 deletions(-) diff --git a/tcp.c b/tcp.c index 2ec4b0c..9385132 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. If this persists
"If this persists" makes sense for the existing ACK_TIMEOUT description but not here, because it looks like it refers to "starting timeout".
Coupled with the next patch, it becomes increasingly difficult to understand what "this" persisting thing is.
Maybe directly say "Retry for ..., then reset the connection"? It's shorter and clearer.
+ * for more than TCP_MAX_RETRIES or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times in a row, reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s */
Maybe mention RFC 6928 as done above? That's where this value comes from.
I just noticed you do that in 4/4, so it's slightly nicer if you do that right away here for ease of future reference, but not really needed.
#define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
This uses 121 columns. I'm not sure where all those tabs and \ come from.
My bad. I renamed the macro and it fits in 80 columns but I forgot to remove the '\' and the spaces before that.
+ /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + if (conn->retries < c->tcp.syn_linear_timeouts) + it.it_value.tv_sec = SYN_TIMEOUT_INIT; + else + it.it_value.tv_sec = SYN_TIMEOUT_INIT << + (conn->retries - c->tcp.syn_linear_timeouts);
Probably more readable, but I haven't tried: always start from SYN_TIMEOUT_INIT, then multiply/shift if conn->retries >= c->tcp.syn_linear_timeouts.
I guess you meant:
if (!(conn->events & ESTABLISHED)) { it.it_value.tv_sec = SYN_TIMEOUT_INIT; if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec = SYN_TIMEOUT_INIT << (conn->retries - c->tcp.syn_linear_timeouts);
Well, yes, this, but without repeating the assignment, so <<= (conn->retries - c->tcp.syn_linear_timeouts).
other than something like:
it.it_value.tv_sec <<= 1
Hmm, no, why 1?
in above second if block, since the latter would cause it.it_value.tv_sec=2 always when retries>=syn_linear_timeouts, as it.it_value.tv_sec is set to SYN_TIMEOUT_INIT before the if condition.
If I'm correct, I'm not sure if it's more readable to change to that. I feel the if/else clause quite matches the way the kernel handles it. What do you think?
Sorry, I took this for granted, but I rather meant: it.it_value.tv_sec = SYN_TIMEOUT_INIT; if (conn->retries >= c->tcp.syn_linear_timeouts) { it.it_value.tv_sec <<= (conn->retries - c->tcp.syn_linear_timeouts); } because in both cases we're starting from SYN_TIMEOUT_INIT. And now that I wrote that, I'll pretend I actually meant this: if (!(conn->events & ESTABLISHED)) { int factor = conn->retries - c->tcp.syn_linear_timeouts; it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(factor, 0); } else { ... } ...where we make the conditional implicit (but I think intuitive) by adding 0 as a lower bound. Am I missing something...?
+ } else it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++;
I think I already raised this point on a previous revision: this needs to be zeroed as the connection is established, but I don't see that in the current version.
+ tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
They're not initial, they'll be used for all the connections if I understand correctly.
Maybe "Get SYN retries sysctl values"? I think the _init() in the function name is also somewhat misleading.
Do you have a better idea for the function name? I named it just like other functions called in tcp_init(), such as tcp_sock_iov_init(), tcp_splice_init().
Given that David was suggesting "Get host kernel RTO parameters" as a comment to this function, maybe tcp_get_rto_params()? The only similar function we have is tcp_get_sndbuf(), so I would make it consistent with that. The most peculiar task of this function is that it gets parameters ("get"), I think, rather than assigning them to variables ("init").
+ * @c: Execution context +*/ +void tcp_syn_params_init(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
Why 8? Perhaps a #define would help?
The 8 actually comes from your suggestion in v1: " (optional, and might be in another series, but keeping it together with the rest might be more convenient): read tcp_syn_retries, limit to 8, and also read tcp_syn_linear_timeouts "
That was a *limit* of 8, coming from the fact that we have 3 bits to store the retry count. This is a default instead.
I guess it can be replaced with 6 as it's the default value according to Documentation/networking/ip-sysctl.rst.
I'd rather check the code because sometimes people forget to update that documentation but yes, it matches: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/incl...
We already have TCP_SYN_RETRIES for the sysctl file, maybe have another TCP_SYN_RETRIES_DEFAULT for the number?
It sounds reasonable to me.
+ syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
The default value is 4 as Documentation/networking/ip-sysctl.rst. I don't quite remember where the 1 is from. Should we update it as well and add the similar macro as above?
I would, yes. By the way, it's hardcoded here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/... but defined here for thin streams (connections where observed throughput is low: https://docs.kernel.org/networking/tcp-thin.html): https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/incl... I'd suggest to define a constant with value '4' and use it.
+ c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
Similar to the comment above: these are not parameters of SYN segments (which would seem to imply TCP options, such as the MSS).
We typically don't print C assignments, rather human-readable messages, so that could be "Read sysctl values tcp_syn_retries: ..., syn_linear_timeouts: ...".
Got it.
+ c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_syn_params_init(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..4369b52 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: Number of SYN retries during handshake + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout + * before switching to exponential backoff timeout
Maybe more compact:
* @syn_linear_timeouts: SYN retries before using exponential timeout
So we remove the '"Number of" as well? I guess we should make it consistent for both tcp_syn_retries and syn_linear_timeouts.
Ah, yes, we should. I mean, I guess it's clear that we're talking about the count / number of those and not their colour or taste...
*/ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
-- Stefano
On Tue, 28 Oct 2025 16:09:03 +0800
Yumei Huang
On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:38 +0800 Yumei Huang
wrote: Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson --- tcp.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/tcp.c b/tcp.c index 9385132..dc0ec6c 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,14 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. If this persists - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + - * tcp_syn_linear_timeouts) times in a row, reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. If this persists too many times + * in a row, reset the connection: TCP_MAX_RETRIES for established + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the + * handshake. * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +340,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { if (conn->retries < c->tcp.syn_linear_timeouts) - it.it_value.tv_sec = SYN_TIMEOUT_INIT; + it.it_value.tv_sec = RTO_INIT; else - it.it_value.tv_sec = SYN_TIMEOUT_INIT << + it.it_value.tv_sec = RTO_INIT << (conn->retries - c->tcp.syn_linear_timeouts); } else - it.it_value.tv_sec = ACK_TIMEOUT; + it.it_value.tv_sec = RTO_INIT << conn->retries;
Same as on 3/4, but here it's clearly more convenient: just assign RTO_INIT, and multiply as needed in the if / else clauses.
I guess we can't just assign RTO_INIT. Maybe assign it only when retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
Why can't you do that? Say: it.it_value.tv_sec = RTO_INIT if (!(conn->events & ESTABLISHED)) { if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec <<= (conn->retries - c->tcp.syn_linear_timeouts); but anyway, see below.
But it seems more complicated. What do you think?
Or maybe, building on my latest comment to 3/4: int factor = conn->retries; if (!(conn->events & ESTABLISHED)) factor -= c->tcp.syn_linear_timeouts; it.it_value.tv_sec = RTO_INIT << MAX(factor, 0); ?
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else {
The rest of the series looks good to me.
It might be slightly more practical to factor in directly the RTO clamp, and I don't think it's complicated now that you have the helper from 2/4, but it's not a strong preference from my side, as the series makes sense in any case.
Reading tcp_rto_max_ms can be easy with the helper. My concern is about the way we get the total time for retries.
I used to do it like this in v2, https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...:
+#define RETRY_ELAPSED(timeout_init, retries) \ + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
Though the formula is not quite right, we could refine it as below:
#define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
Does it make sense to get the time this way?
Well, it also depends on c->tcp.syn_linear_timeouts, right? But in any case, do you really need to calculate that explicitly? I was thinking that you can just clamp the value when you use it in tcp_timer_ctl(). If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value you read from the kernel, then I guess it's just: --- it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max); } ... if (timerfd_settime(conn->timer, 0, &it, NULL)) flow_perror(conn, "failed to set timer"); --- -- Stefano
On Tue, 28 Oct 2025 12:44:26 +0100
Stefano Brivio
Or maybe, building on my latest comment to 3/4:
int factor = conn->retries;
if (!(conn->events & ESTABLISHED)) factor -= c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
It just occurred to me that 'factor' might be misleading as this is rather an exponent, so maybe 'exp' instead. -- Stefano
On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
On Tue, 28 Oct 2025 15:43:25 +0800 Yumei Huang
wrote: On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:37 +0800 Yumei Huang
wrote: If a client connects while guest is not connected or ready yet, resend SYN instead of just resetting connection after 10 seconds.
Use the same backoff calculation for the timeout as linux kernel.
Linux.
Link: https://bugs.passt.top/show_bug.cgi?id=153 Signed-off-by: Yumei Huang
--- tcp.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++-------- tcp.h | 5 +++++ 2 files changed, 52 insertions(+), 8 deletions(-) diff --git a/tcp.c b/tcp.c index 2ec4b0c..9385132 100644 --- a/tcp.c +++ b/tcp.c @@ -179,9 +179,11 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT: if no ACK is received from tap/guest during handshake (flag - * ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, reset the - * connection + * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake + * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend + * SYN. It's the starting timeout for the first SYN retry. If this persists
"If this persists" makes sense for the existing ACK_TIMEOUT description but not here, because it looks like it refers to "starting timeout".
Coupled with the next patch, it becomes increasingly difficult to understand what "this" persisting thing is.
Maybe directly say "Retry for ..., then reset the connection"? It's shorter and clearer.
+ * for more than TCP_MAX_RETRIES or (tcp_syn_retries + + * tcp_syn_linear_timeouts) times in a row, reset the connection * * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the @@ -340,7 +342,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT 10 /* s */ +#define SYN_TIMEOUT_INIT 1 /* s */
Maybe mention RFC 6928 as done above? That's where this value comes from.
I just noticed you do that in 4/4, so it's slightly nicer if you do that right away here for ease of future reference, but not really needed.
#define ACK_TIMEOUT 2 #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200 @@ -365,6 +367,9 @@ uint8_t tcp_migrate_rcv_queue [TCP_MIGRATE_RCV_QUEUE_MAX];
#define TCP_MIGRATE_RESTORE_CHUNK_MIN 1024 /* Try smaller when above this */
+#define TCP_SYN_RETRIES "/proc/sys/net/ipv4/tcp_syn_retries" +#define TCP_SYN_LINEAR_TIMEOUTS "/proc/sys/net/ipv4/tcp_syn_linear_timeouts" \
This uses 121 columns. I'm not sure where all those tabs and \ come from.
My bad. I renamed the macro and it fits in 80 columns but I forgot to remove the '\' and the spaces before that.
+ /* "Extended" data (not stored in the flow table) for TCP flow migration */ static struct tcp_tap_transfer_ext migrate_ext[FLOW_MAX];
@@ -581,8 +586,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) if (conn->flags & ACK_TO_TAP_DUE) { it.it_value.tv_nsec = (long)ACK_INTERVAL * 1000 * 1000; } else if (conn->flags & ACK_FROM_TAP_DUE) { - if (!(conn->events & ESTABLISHED)) - it.it_value.tv_sec = SYN_TIMEOUT; + if (!(conn->events & ESTABLISHED)) { + if (conn->retries < c->tcp.syn_linear_timeouts) + it.it_value.tv_sec = SYN_TIMEOUT_INIT; + else + it.it_value.tv_sec = SYN_TIMEOUT_INIT << + (conn->retries - c->tcp.syn_linear_timeouts);
Probably more readable, but I haven't tried: always start from SYN_TIMEOUT_INIT, then multiply/shift if conn->retries >= c->tcp.syn_linear_timeouts.
I guess you meant:
if (!(conn->events & ESTABLISHED)) { it.it_value.tv_sec = SYN_TIMEOUT_INIT; if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec = SYN_TIMEOUT_INIT << (conn->retries - c->tcp.syn_linear_timeouts);
Well, yes, this, but without repeating the assignment, so <<= (conn->retries - c->tcp.syn_linear_timeouts).
other than something like:
it.it_value.tv_sec <<= 1
Hmm, no, why 1?
I thought you meant: when retries >= syn_linear_timeouts, we don't use SYN_TIMEOUT_INIT anymore, but to multiply it.it_value.tv_sec itself, like if it's 2 this time, next time it's 4 (2<<=1). Now I get your point, and it makes sense to me.
in above second if block, since the latter would cause it.it_value.tv_sec=2 always when retries>=syn_linear_timeouts, as it.it_value.tv_sec is set to SYN_TIMEOUT_INIT before the if condition.
If I'm correct, I'm not sure if it's more readable to change to that. I feel the if/else clause quite matches the way the kernel handles it. What do you think?
Sorry, I took this for granted, but I rather meant:
it.it_value.tv_sec = SYN_TIMEOUT_INIT; if (conn->retries >= c->tcp.syn_linear_timeouts) { it.it_value.tv_sec <<= (conn->retries - c->tcp.syn_linear_timeouts); }
because in both cases we're starting from SYN_TIMEOUT_INIT.
And now that I wrote that, I'll pretend I actually meant this:
if (!(conn->events & ESTABLISHED)) { int factor = conn->retries - c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = SYN_TIMEOUT_INIT << MAX(factor, 0); } else { ... }
...where we make the conditional implicit (but I think intuitive) by adding 0 as a lower bound. Am I missing something...?
No, it was me misunderstanding your point. I will update accordingly.
+ } else it.it_value.tv_sec = ACK_TIMEOUT; } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { @@ -2409,8 +2419,17 @@ void tcp_timer_handler(const struct ctx *c, union epoll_ref ref) tcp_timer_ctl(c, conn); } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { - flow_dbg(conn, "handshake timeout"); - tcp_rst(c, conn); + if (conn->retries >= TCP_MAX_RETRIES || + conn->retries >= (c->tcp.tcp_syn_retries + + c->tcp.syn_linear_timeouts)) { + flow_dbg(conn, "handshake timeout"); + tcp_rst(c, conn); + } else { + flow_trace(conn, "SYN timeout, retry"); + tcp_send_flag(c, conn, SYN); + conn->retries++;
I think I already raised this point on a previous revision: this needs to be zeroed as the connection is established, but I don't see that in the current version.
+ tcp_timer_ctl(c, conn); + } } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { flow_dbg(conn, "FIN timeout"); tcp_rst(c, conn); @@ -2766,6 +2785,24 @@ static socklen_t tcp_probe_tcp_info(void) return sl; }
+/** + * tcp_syn_params_init() - Get initial SYN parameters for inbound connection
They're not initial, they'll be used for all the connections if I understand correctly.
Maybe "Get SYN retries sysctl values"? I think the _init() in the function name is also somewhat misleading.
Do you have a better idea for the function name? I named it just like other functions called in tcp_init(), such as tcp_sock_iov_init(), tcp_splice_init().
Given that David was suggesting "Get host kernel RTO parameters" as a comment to this function, maybe tcp_get_rto_params()?
The only similar function we have is tcp_get_sndbuf(), so I would make it consistent with that. The most peculiar task of this function is that it gets parameters ("get"), I think, rather than assigning them to variables ("init").
Make sense.
+ * @c: Execution context +*/ +void tcp_syn_params_init(struct ctx *c) +{ + intmax_t tcp_syn_retries, syn_linear_timeouts; + + tcp_syn_retries = read_file_integer(TCP_SYN_RETRIES, 8);
Why 8? Perhaps a #define would help?
The 8 actually comes from your suggestion in v1: " (optional, and might be in another series, but keeping it together with the rest might be more convenient): read tcp_syn_retries, limit to 8, and also read tcp_syn_linear_timeouts "
That was a *limit* of 8, coming from the fact that we have 3 bits to store the retry count. This is a default instead.
I guess it can be replaced with 6 as it's the default value according to Documentation/networking/ip-sysctl.rst.
I'd rather check the code because sometimes people forget to update that documentation but yes, it matches:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/incl...
We already have TCP_SYN_RETRIES for the sysctl file, maybe have another TCP_SYN_RETRIES_DEFAULT for the number?
It sounds reasonable to me.
+ syn_linear_timeouts = read_file_integer(TCP_SYN_LINEAR_TIMEOUTS, 1);
The default value is 4 as Documentation/networking/ip-sysctl.rst. I don't quite remember where the 1 is from. Should we update it as well and add the similar macro as above?
I would, yes. By the way, it's hardcoded here:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/...
but defined here for thin streams (connections where observed throughput is low: https://docs.kernel.org/networking/tcp-thin.html):
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/incl...
I'd suggest to define a constant with value '4' and use it.
Got it.
+ c->tcp.tcp_syn_retries = MIN(tcp_syn_retries, UINT8_MAX); + c->tcp.syn_linear_timeouts = MIN(syn_linear_timeouts, UINT8_MAX); + + debug("TCP SYN parameters: retries=%"PRIu8", linear_timeouts=%"PRIu8,
Similar to the comment above: these are not parameters of SYN segments (which would seem to imply TCP options, such as the MSS).
We typically don't print C assignments, rather human-readable messages, so that could be "Read sysctl values tcp_syn_retries: ..., syn_linear_timeouts: ...".
Got it.
+ c->tcp.tcp_syn_retries, c->tcp.syn_linear_timeouts); +} + /** * tcp_init() - Get initial sequence, hash secret, initialise per-socket data * @c: Execution context @@ -2776,6 +2813,8 @@ int tcp_init(struct ctx *c) { ASSERT(!c->no_tcp);
+ tcp_syn_params_init(c); + tcp_sock_iov_init(c);
memset(init_sock_pool4, 0xff, sizeof(init_sock_pool4)); diff --git a/tcp.h b/tcp.h index 234a803..4369b52 100644 --- a/tcp.h +++ b/tcp.h @@ -59,12 +59,17 @@ union tcp_listen_epoll_ref { * @fwd_out: Port forwarding configuration for outbound packets * @timer_run: Timestamp of most recent timer run * @pipe_size: Size of pipes for spliced connections + * @tcp_syn_retries: Number of SYN retries during handshake + * @syn_linear_timeouts: Number of SYN retries using linear backoff timeout + * before switching to exponential backoff timeout
Maybe more compact:
* @syn_linear_timeouts: SYN retries before using exponential timeout
So we remove the '"Number of" as well? I guess we should make it consistent for both tcp_syn_retries and syn_linear_timeouts.
Ah, yes, we should.
I mean, I guess it's clear that we're talking about the count / number of those and not their colour or taste...
True.
*/ struct tcp_ctx { struct fwd_ports fwd_in; struct fwd_ports fwd_out; struct timespec timer_run; size_t pipe_size; + uint8_t tcp_syn_retries; + uint8_t syn_linear_timeouts; };
-- Stefano
-- Thanks, Yumei Huang
On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
On Tue, 28 Oct 2025 16:09:03 +0800 Yumei Huang
wrote: On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:38 +0800 Yumei Huang
wrote: Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson --- tcp.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/tcp.c b/tcp.c index 9385132..dc0ec6c 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,14 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. If this persists - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + - * tcp_syn_linear_timeouts) times in a row, reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. If this persists too many times + * in a row, reset the connection: TCP_MAX_RETRIES for established + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the + * handshake. * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +340,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { if (conn->retries < c->tcp.syn_linear_timeouts) - it.it_value.tv_sec = SYN_TIMEOUT_INIT; + it.it_value.tv_sec = RTO_INIT; else - it.it_value.tv_sec = SYN_TIMEOUT_INIT << + it.it_value.tv_sec = RTO_INIT << (conn->retries - c->tcp.syn_linear_timeouts); } else - it.it_value.tv_sec = ACK_TIMEOUT; + it.it_value.tv_sec = RTO_INIT << conn->retries;
Same as on 3/4, but here it's clearly more convenient: just assign RTO_INIT, and multiply as needed in the if / else clauses.
I guess we can't just assign RTO_INIT. Maybe assign it only when retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
Why can't you do that? Say:
it.it_value.tv_sec = RTO_INIT if (!(conn->events & ESTABLISHED)) { if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec <<= (conn->retries - c->tcp.syn_linear_timeouts);
but anyway, see below.
But it seems more complicated. What do you think?
Or maybe, building on my latest comment to 3/4:
int factor = conn->retries;
if (!(conn->events & ESTABLISHED)) factor -= c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
?
Yeah, I understand this part now.
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else {
The rest of the series looks good to me.
It might be slightly more practical to factor in directly the RTO clamp, and I don't think it's complicated now that you have the helper from 2/4, but it's not a strong preference from my side, as the series makes sense in any case.
Reading tcp_rto_max_ms can be easy with the helper. My concern is about the way we get the total time for retries.
I used to do it like this in v2, https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...:
+#define RETRY_ELAPSED(timeout_init, retries) \ + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
Though the formula is not quite right, we could refine it as below:
#define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
Does it make sense to get the time this way?
Well, it also depends on c->tcp.syn_linear_timeouts, right?
Not really, it's only used for data retransmission, so syn_linear_timeouts is not relevant. Probably I should name it more clearly.
But in any case, do you really need to calculate that explicitly? I was thinking that you can just clamp the value when you use it in tcp_timer_ctl().
If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value you read from the kernel, then I guess it's just:
--- it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
After reading the comments in v3 when tcp_rto_max_ms was first mentioned again, I realized I got something wrong again. I thought it was for the total timeout for all retries, so I need to calculate that and decide to reset the connection as in v2. Anyway, you are right. We don't need to do that. Thanks for your patience.
} ...
if (timerfd_settime(conn->timer, 0, &it, NULL)) flow_perror(conn, "failed to set timer"); ---
-- Stefano
-- Thanks, Yumei Huang
On Wed, 29 Oct 2025 11:06:44 +0800
Yumei Huang
On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
wrote: On Tue, 28 Oct 2025 16:09:03 +0800 Yumei Huang
wrote: On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:38 +0800 Yumei Huang
wrote: Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson --- tcp.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/tcp.c b/tcp.c index 9385132..dc0ec6c 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,14 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. If this persists - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + - * tcp_syn_linear_timeouts) times in a row, reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. If this persists too many times + * in a row, reset the connection: TCP_MAX_RETRIES for established + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the + * handshake. * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +340,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { if (conn->retries < c->tcp.syn_linear_timeouts) - it.it_value.tv_sec = SYN_TIMEOUT_INIT; + it.it_value.tv_sec = RTO_INIT; else - it.it_value.tv_sec = SYN_TIMEOUT_INIT << + it.it_value.tv_sec = RTO_INIT << (conn->retries - c->tcp.syn_linear_timeouts); } else - it.it_value.tv_sec = ACK_TIMEOUT; + it.it_value.tv_sec = RTO_INIT << conn->retries;
Same as on 3/4, but here it's clearly more convenient: just assign RTO_INIT, and multiply as needed in the if / else clauses.
I guess we can't just assign RTO_INIT. Maybe assign it only when retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
Why can't you do that? Say:
it.it_value.tv_sec = RTO_INIT if (!(conn->events & ESTABLISHED)) { if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec <<= (conn->retries - c->tcp.syn_linear_timeouts);
but anyway, see below.
But it seems more complicated. What do you think?
Or maybe, building on my latest comment to 3/4:
int factor = conn->retries;
if (!(conn->events & ESTABLISHED)) factor -= c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
?
Yeah, I understand this part now.
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else {
The rest of the series looks good to me.
It might be slightly more practical to factor in directly the RTO clamp, and I don't think it's complicated now that you have the helper from 2/4, but it's not a strong preference from my side, as the series makes sense in any case.
Reading tcp_rto_max_ms can be easy with the helper. My concern is about the way we get the total time for retries.
I used to do it like this in v2, https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...:
+#define RETRY_ELAPSED(timeout_init, retries) \ + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
Though the formula is not quite right, we could refine it as below:
#define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
Does it make sense to get the time this way?
Well, it also depends on c->tcp.syn_linear_timeouts, right?
Not really, it's only used for data retransmission, so syn_linear_timeouts is not relevant.
Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one stating: (2.5) A maximum value MAY be placed on RTO provided it is at least 60 seconds. ...the only thing that I don't see implemented in this version of the patch is paragraph 5.7: (5.7) If the timer expires awaiting the ACK of a SYN segment and the TCP implementation is using an RTO less than 3 seconds, the RTO MUST be re-initialized to 3 seconds when data transmission begins (i.e., after the three-way handshake completes). I missed that while reviewing earlier versions. I guess we need to use a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's simpler than re-introducing separate starting values (one second and three seconds).
Probably I should name it more clearly.
But in any case, do you really need to calculate that explicitly? I was thinking that you can just clamp the value when you use it in tcp_timer_ctl().
If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value you read from the kernel, then I guess it's just:
--- it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
After reading the comments in v3 when tcp_rto_max_ms was first mentioned again, I realized I got something wrong again. I thought it was for the total timeout for all retries, so I need to calculate that and decide to reset the connection as in v2.
I think it actually applies to all the retries.
Anyway, you are right. We don't need to do that. Thanks for your patience.
} ...
if (timerfd_settime(conn->timer, 0, &it, NULL)) flow_perror(conn, "failed to set timer"); ---
-- Stefano
On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio
On Wed, 29 Oct 2025 11:06:44 +0800 Yumei Huang
wrote: On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
wrote: On Tue, 28 Oct 2025 16:09:03 +0800 Yumei Huang
wrote: On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:38 +0800 Yumei Huang
wrote: Use an exponential backoff timeout for data retransmission according to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed in Appendix A of RFC 6298.
Also combine the macros defining the initial RTO for both SYN and ACK.
Signed-off-by: Yumei Huang
Reviewed-by: David Gibson --- tcp.c | 27 ++++++++++++--------------- 1 file changed, 12 insertions(+), 15 deletions(-) diff --git a/tcp.c b/tcp.c index 9385132..dc0ec6c 100644 --- a/tcp.c +++ b/tcp.c @@ -179,16 +179,14 @@ * * Timeouts are implemented by means of timerfd timers, set based on flags: * - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend - * SYN. It's the starting timeout for the first SYN retry. If this persists - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + - * tcp_syn_linear_timeouts) times in a row, reset the connection - * - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the - * socket and reset sequence to what was acknowledged. If this persists for - * more than TCP_MAX_RETRIES times in a row, reset the connection + * - RTO_INIT: if no ACK segment was received from tap/guest, either during + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data + * from the socket and reset sequence to what was acknowledged. This is the + * timeout for the first retry, in seconds. If this persists too many times + * in a row, reset the connection: TCP_MAX_RETRIES for established + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the + * handshake. * * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE * with TAP_FIN_SENT event), and no ACK is received within this time, reset @@ -342,8 +340,7 @@ enum { #define WINDOW_DEFAULT 14600 /* RFC 6928 */
#define ACK_INTERVAL 10 /* ms */ -#define SYN_TIMEOUT_INIT 1 /* s */ -#define ACK_TIMEOUT 2 +#define RTO_INIT 1 /* s, RFC 6298 */ #define FIN_TIMEOUT 60 #define ACT_TIMEOUT 7200
@@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) } else if (conn->flags & ACK_FROM_TAP_DUE) { if (!(conn->events & ESTABLISHED)) { if (conn->retries < c->tcp.syn_linear_timeouts) - it.it_value.tv_sec = SYN_TIMEOUT_INIT; + it.it_value.tv_sec = RTO_INIT; else - it.it_value.tv_sec = SYN_TIMEOUT_INIT << + it.it_value.tv_sec = RTO_INIT << (conn->retries - c->tcp.syn_linear_timeouts); } else - it.it_value.tv_sec = ACK_TIMEOUT; + it.it_value.tv_sec = RTO_INIT << conn->retries;
Same as on 3/4, but here it's clearly more convenient: just assign RTO_INIT, and multiply as needed in the if / else clauses.
I guess we can't just assign RTO_INIT. Maybe assign it only when retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
Why can't you do that? Say:
it.it_value.tv_sec = RTO_INIT if (!(conn->events & ESTABLISHED)) { if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec <<= (conn->retries - c->tcp.syn_linear_timeouts);
but anyway, see below.
But it seems more complicated. What do you think?
Or maybe, building on my latest comment to 3/4:
int factor = conn->retries;
if (!(conn->events & ESTABLISHED)) factor -= c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
?
Yeah, I understand this part now.
} else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { it.it_value.tv_sec = FIN_TIMEOUT; } else {
The rest of the series looks good to me.
It might be slightly more practical to factor in directly the RTO clamp, and I don't think it's complicated now that you have the helper from 2/4, but it's not a strong preference from my side, as the series makes sense in any case.
Reading tcp_rto_max_ms can be easy with the helper. My concern is about the way we get the total time for retries.
I used to do it like this in v2, https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...:
+#define RETRY_ELAPSED(timeout_init, retries) \ + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
Though the formula is not quite right, we could refine it as below:
#define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
Does it make sense to get the time this way?
Well, it also depends on c->tcp.syn_linear_timeouts, right?
Not really, it's only used for data retransmission, so syn_linear_timeouts is not relevant.
Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one stating:
I meant RETRY_ELAPSED was only used for data retransmission, which uses exponential backoff timeout directly, so syn_linear_timeouts was not relevant.
(2.5) A maximum value MAY be placed on RTO provided it is at least 60 seconds.
For SYN retries, as we used linear backoff + exponential backoff, and also limited by TCP_MAX_RETRIES, the possible max RTO is far less than 60s. So we didn't clamp it. Do you think we need to clamp it as well?
...the only thing that I don't see implemented in this version of the patch is paragraph 5.7:
(5.7) If the timer expires awaiting the ACK of a SYN segment and the TCP implementation is using an RTO less than 3 seconds, the RTO MUST be re-initialized to 3 seconds when data transmission begins (i.e., after the three-way handshake completes).
I missed that while reviewing earlier versions. I guess we need to use a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's simpler than re-introducing separate starting values (one second and three seconds).
I'm not sure I understand here. If the timer expires, didn't we reset the connection directly? We would never get to the data transmission phase?
Probably I should name it more clearly.
But in any case, do you really need to calculate that explicitly? I was thinking that you can just clamp the value when you use it in tcp_timer_ctl().
If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value you read from the kernel, then I guess it's just:
--- it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
After reading the comments in v3 when tcp_rto_max_ms was first mentioned again, I realized I got something wrong again. I thought it was for the total timeout for all retries, so I need to calculate that and decide to reset the connection as in v2.
I think it actually applies to all the retries.
Anyway, you are right. We don't need to do that. Thanks for your patience.
} ...
if (timerfd_settime(conn->timer, 0, &it, NULL)) flow_perror(conn, "failed to set timer"); ---
-- Stefano
-- Thanks, Yumei Huang
On Wed, 29 Oct 2025 13:11:48 +0800
Yumei Huang
On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 11:06:44 +0800 Yumei Huang
wrote: On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
wrote: On Tue, 28 Oct 2025 16:09:03 +0800 Yumei Huang
wrote: On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
wrote: On Fri, 17 Oct 2025 14:28:38 +0800 Yumei Huang
wrote: > Use an exponential backoff timeout for data retransmission according > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed > in Appendix A of RFC 6298. > > Also combine the macros defining the initial RTO for both SYN and ACK. > > Signed-off-by: Yumei Huang
> Reviewed-by: David Gibson > --- > tcp.c | 27 ++++++++++++--------------- > 1 file changed, 12 insertions(+), 15 deletions(-) > > diff --git a/tcp.c b/tcp.c > index 9385132..dc0ec6c 100644 > --- a/tcp.c > +++ b/tcp.c > @@ -179,16 +179,14 @@ > * > * Timeouts are implemented by means of timerfd timers, set based on flags: > * > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend > - * SYN. It's the starting timeout for the first SYN retry. If this persists > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + > - * tcp_syn_linear_timeouts) times in a row, reset the connection > - * > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the > - * socket and reset sequence to what was acknowledged. If this persists for > - * more than TCP_MAX_RETRIES times in a row, reset the connection > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data > + * from the socket and reset sequence to what was acknowledged. This is the > + * timeout for the first retry, in seconds. If this persists too many times > + * in a row, reset the connection: TCP_MAX_RETRIES for established > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the > + * handshake. > * > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE > * with TAP_FIN_SENT event), and no ACK is received within this time, reset > @@ -342,8 +340,7 @@ enum { > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > #define ACK_INTERVAL 10 /* ms */ > -#define SYN_TIMEOUT_INIT 1 /* s */ > -#define ACK_TIMEOUT 2 > +#define RTO_INIT 1 /* s, RFC 6298 */ > #define FIN_TIMEOUT 60 > #define ACT_TIMEOUT 7200 > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > } else if (conn->flags & ACK_FROM_TAP_DUE) { > if (!(conn->events & ESTABLISHED)) { > if (conn->retries < c->tcp.syn_linear_timeouts) > - it.it_value.tv_sec = SYN_TIMEOUT_INIT; > + it.it_value.tv_sec = RTO_INIT; > else > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << > + it.it_value.tv_sec = RTO_INIT << > (conn->retries - c->tcp.syn_linear_timeouts); > } > else > - it.it_value.tv_sec = ACK_TIMEOUT; > + it.it_value.tv_sec = RTO_INIT << conn->retries; Same as on 3/4, but here it's clearly more convenient: just assign RTO_INIT, and multiply as needed in the if / else clauses.
I guess we can't just assign RTO_INIT. Maybe assign it only when retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
Why can't you do that? Say:
it.it_value.tv_sec = RTO_INIT if (!(conn->events & ESTABLISHED)) { if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec <<= (conn->retries - c->tcp.syn_linear_timeouts);
but anyway, see below.
But it seems more complicated. What do you think?
Or maybe, building on my latest comment to 3/4:
int factor = conn->retries;
if (!(conn->events & ESTABLISHED)) factor -= c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
?
Yeah, I understand this part now.
> } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > it.it_value.tv_sec = FIN_TIMEOUT; > } else {
The rest of the series looks good to me.
It might be slightly more practical to factor in directly the RTO clamp, and I don't think it's complicated now that you have the helper from 2/4, but it's not a strong preference from my side, as the series makes sense in any case.
Reading tcp_rto_max_ms can be easy with the helper. My concern is about the way we get the total time for retries.
I used to do it like this in v2, https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...:
+#define RETRY_ELAPSED(timeout_init, retries) \ + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
Though the formula is not quite right, we could refine it as below:
#define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
Does it make sense to get the time this way?
Well, it also depends on c->tcp.syn_linear_timeouts, right?
Not really, it's only used for data retransmission, so syn_linear_timeouts is not relevant.
Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one stating:
I meant RETRY_ELAPSED was only used for data retransmission, which uses exponential backoff timeout directly, so syn_linear_timeouts was not relevant.
Ah, okay.
(2.5) A maximum value MAY be placed on RTO provided it is at least 60 seconds.
For SYN retries, as we used linear backoff + exponential backoff, and also limited by TCP_MAX_RETRIES, the possible max RTO is far less than 60s. So we didn't clamp it. Do you think we need to clamp it as well?
If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or just up to ... + 2^6, that is, 119 seconds? In any case, what is the difference compared to data retransmissions? Don't we have 3 bits to store the retry count as well, so we're limited by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same counter.
...the only thing that I don't see implemented in this version of the patch is paragraph 5.7:
(5.7) If the timer expires awaiting the ACK of a SYN segment and the TCP implementation is using an RTO less than 3 seconds, the RTO MUST be re-initialized to 3 seconds when data transmission begins (i.e., after the three-way handshake completes).
I missed that while reviewing earlier versions. I guess we need to use a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's simpler than re-introducing separate starting values (one second and three seconds).
I'm not sure I understand here. If the timer expires, didn't we reset the connection directly? We would never get to the data transmission phase?
In the RFC, "the timer expires" indicates one iteration of the timeout algorithm, that is, it's the same as our timer (tcp_timer_handler()) triggering. The paragraph says "after the three-way handshake completes", so it looks like the connection wasn't reset.
Probably I should name it more clearly.
But in any case, do you really need to calculate that explicitly? I was thinking that you can just clamp the value when you use it in tcp_timer_ctl().
If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value you read from the kernel, then I guess it's just:
--- it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
After reading the comments in v3 when tcp_rto_max_ms was first mentioned again, I realized I got something wrong again. I thought it was for the total timeout for all retries, so I need to calculate that and decide to reset the connection as in v2.
I think it actually applies to all the retries.
Anyway, you are right. We don't need to do that. Thanks for your patience.
} ...
if (timerfd_settime(conn->timer, 0, &it, NULL)) flow_perror(conn, "failed to set timer"); ---
-- Stefano
On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio
On Wed, 29 Oct 2025 13:11:48 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 11:06:44 +0800 Yumei Huang
wrote: On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
wrote: On Tue, 28 Oct 2025 16:09:03 +0800 Yumei Huang
wrote: On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
wrote: > > On Fri, 17 Oct 2025 14:28:38 +0800 > Yumei Huang wrote: > > > Use an exponential backoff timeout for data retransmission according > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed > > in Appendix A of RFC 6298. > > > > Also combine the macros defining the initial RTO for both SYN and ACK. > > > > Signed-off-by: Yumei Huang > > Reviewed-by: David Gibson > > --- > > tcp.c | 27 ++++++++++++--------------- > > 1 file changed, 12 insertions(+), 15 deletions(-) > > > > diff --git a/tcp.c b/tcp.c > > index 9385132..dc0ec6c 100644 > > --- a/tcp.c > > +++ b/tcp.c > > @@ -179,16 +179,14 @@ > > * > > * Timeouts are implemented by means of timerfd timers, set based on flags: > > * > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend > > - * SYN. It's the starting timeout for the first SYN retry. If this persists > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + > > - * tcp_syn_linear_timeouts) times in a row, reset the connection > > - * > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the > > - * socket and reset sequence to what was acknowledged. If this persists for > > - * more than TCP_MAX_RETRIES times in a row, reset the connection > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data > > + * from the socket and reset sequence to what was acknowledged. This is the > > + * timeout for the first retry, in seconds. If this persists too many times > > + * in a row, reset the connection: TCP_MAX_RETRIES for established > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the > > + * handshake. > > * > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset > > @@ -342,8 +340,7 @@ enum { > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > > > #define ACK_INTERVAL 10 /* ms */ > > -#define SYN_TIMEOUT_INIT 1 /* s */ > > -#define ACK_TIMEOUT 2 > > +#define RTO_INIT 1 /* s, RFC 6298 */ > > #define FIN_TIMEOUT 60 > > #define ACT_TIMEOUT 7200 > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > if (!(conn->events & ESTABLISHED)) { > > if (conn->retries < c->tcp.syn_linear_timeouts) > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT; > > + it.it_value.tv_sec = RTO_INIT; > > else > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << > > + it.it_value.tv_sec = RTO_INIT << > > (conn->retries - c->tcp.syn_linear_timeouts); > > } > > else > > - it.it_value.tv_sec = ACK_TIMEOUT; > > + it.it_value.tv_sec = RTO_INIT << conn->retries; > > Same as on 3/4, but here it's clearly more convenient: just assign > RTO_INIT, and multiply as needed in the if / else clauses. I guess we can't just assign RTO_INIT. Maybe assign it only when retries==0, otherwise multiply as it.it_value.tv_sec <<=1.
Why can't you do that? Say:
it.it_value.tv_sec = RTO_INIT if (!(conn->events & ESTABLISHED)) { if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec <<= (conn->retries - c->tcp.syn_linear_timeouts);
but anyway, see below.
But it seems more complicated. What do you think?
Or maybe, building on my latest comment to 3/4:
int factor = conn->retries;
if (!(conn->events & ESTABLISHED)) factor -= c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
?
Yeah, I understand this part now.
> > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > > it.it_value.tv_sec = FIN_TIMEOUT; > > } else { > > The rest of the series looks good to me. > > It might be slightly more practical to factor in directly the RTO > clamp, and I don't think it's complicated now that you have the helper > from 2/4, but it's not a strong preference from my side, as the series > makes sense in any case.
Reading tcp_rto_max_ms can be easy with the helper. My concern is about the way we get the total time for retries.
I used to do it like this in v2, https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...:
+#define RETRY_ELAPSED(timeout_init, retries) \ + ((timeout_init) * ((1 << ((retries) + 1)) - 2))
Though the formula is not quite right, we could refine it as below:
#define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1))
Does it make sense to get the time this way?
Well, it also depends on c->tcp.syn_linear_timeouts, right?
Not really, it's only used for data retransmission, so syn_linear_timeouts is not relevant.
Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one stating:
I meant RETRY_ELAPSED was only used for data retransmission, which uses exponential backoff timeout directly, so syn_linear_timeouts was not relevant.
Ah, okay.
(2.5) A maximum value MAY be placed on RTO provided it is at least 60 seconds.
For SYN retries, as we used linear backoff + exponential backoff, and also limited by TCP_MAX_RETRIES, the possible max RTO is far less than 60s. So we didn't clamp it. Do you think we need to clamp it as well?
If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or just up to ... + 2^6, that is, 119 seconds?
In any case, what is the difference compared to data retransmissions?
You are right. I assumed wrongly that syn_linear_timeouts is always 4. When it's 0, it's the same as data retransmissions.
Don't we have 3 bits to store the retry count as well, so we're limited by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same counter.
Yes, it's the same counter. I guess you mean we should clamp it as well.
...the only thing that I don't see implemented in this version of the patch is paragraph 5.7:
(5.7) If the timer expires awaiting the ACK of a SYN segment and the TCP implementation is using an RTO less than 3 seconds, the RTO MUST be re-initialized to 3 seconds when data transmission begins (i.e., after the three-way handshake completes).
I missed that while reviewing earlier versions. I guess we need to use a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's simpler than re-introducing separate starting values (one second and three seconds).
I'm not sure I understand here. If the timer expires, didn't we reset the connection directly? We would never get to the data transmission phase?
In the RFC, "the timer expires" indicates one iteration of the timeout algorithm, that is, it's the same as our timer (tcp_timer_handler()) triggering.
I see. If you agree, I can add a new patch in this series to address the clamping.
The paragraph says "after the three-way handshake completes", so it looks like the connection wasn't reset.
Probably I should name it more clearly.
But in any case, do you really need to calculate that explicitly? I was thinking that you can just clamp the value when you use it in tcp_timer_ctl().
If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value you read from the kernel, then I guess it's just:
--- it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
After reading the comments in v3 when tcp_rto_max_ms was first mentioned again, I realized I got something wrong again. I thought it was for the total timeout for all retries, so I need to calculate that and decide to reset the connection as in v2.
I think it actually applies to all the retries.
Anyway, you are right. We don't need to do that. Thanks for your patience.
} ...
if (timerfd_settime(conn->timer, 0, &it, NULL)) flow_perror(conn, "failed to set timer"); ---
-- Stefano
-- Thanks, Yumei Huang
On Wed, 29 Oct 2025 15:32:33 +0800
Yumei Huang
On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 13:11:48 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 11:06:44 +0800 Yumei Huang
wrote: On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
wrote: On Tue, 28 Oct 2025 16:09:03 +0800 Yumei Huang
wrote: > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio
wrote: > > > > On Fri, 17 Oct 2025 14:28:38 +0800 > > Yumei Huang wrote: > > > > > Use an exponential backoff timeout for data retransmission according > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed > > > in Appendix A of RFC 6298. > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK. > > > > > > Signed-off-by: Yumei Huang > > > Reviewed-by: David Gibson > > > --- > > > tcp.c | 27 ++++++++++++--------------- > > > 1 file changed, 12 insertions(+), 15 deletions(-) > > > > > > diff --git a/tcp.c b/tcp.c > > > index 9385132..dc0ec6c 100644 > > > --- a/tcp.c > > > +++ b/tcp.c > > > @@ -179,16 +179,14 @@ > > > * > > > * Timeouts are implemented by means of timerfd timers, set based on flags: > > > * > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection > > > - * > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the > > > - * socket and reset sequence to what was acknowledged. If this persists for > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data > > > + * from the socket and reset sequence to what was acknowledged. This is the > > > + * timeout for the first retry, in seconds. If this persists too many times > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the > > > + * handshake. > > > * > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset > > > @@ -342,8 +340,7 @@ enum { > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > > > > > #define ACK_INTERVAL 10 /* ms */ > > > -#define SYN_TIMEOUT_INIT 1 /* s */ > > > -#define ACK_TIMEOUT 2 > > > +#define RTO_INIT 1 /* s, RFC 6298 */ > > > #define FIN_TIMEOUT 60 > > > #define ACT_TIMEOUT 7200 > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > > if (!(conn->events & ESTABLISHED)) { > > > if (conn->retries < c->tcp.syn_linear_timeouts) > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT; > > > + it.it_value.tv_sec = RTO_INIT; > > > else > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << > > > + it.it_value.tv_sec = RTO_INIT << > > > (conn->retries - c->tcp.syn_linear_timeouts); > > > } > > > else > > > - it.it_value.tv_sec = ACK_TIMEOUT; > > > + it.it_value.tv_sec = RTO_INIT << conn->retries; > > > > Same as on 3/4, but here it's clearly more convenient: just assign > > RTO_INIT, and multiply as needed in the if / else clauses. > > I guess we can't just assign RTO_INIT. Maybe assign it only when > retries==0, otherwise multiply as it.it_value.tv_sec <<=1. Why can't you do that? Say:
it.it_value.tv_sec = RTO_INIT if (!(conn->events & ESTABLISHED)) { if (conn->retries >= c->tcp.syn_linear_timeouts) it.it_value.tv_sec <<= (conn->retries - c->tcp.syn_linear_timeouts);
but anyway, see below.
> But it seems more complicated. What do you think?
Or maybe, building on my latest comment to 3/4:
int factor = conn->retries;
if (!(conn->events & ESTABLISHED)) factor -= c->tcp.syn_linear_timeouts;
it.it_value.tv_sec = RTO_INIT << MAX(factor, 0);
?
Yeah, I understand this part now.
> > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > > > it.it_value.tv_sec = FIN_TIMEOUT; > > > } else { > > > > The rest of the series looks good to me. > > > > It might be slightly more practical to factor in directly the RTO > > clamp, and I don't think it's complicated now that you have the helper > > from 2/4, but it's not a strong preference from my side, as the series > > makes sense in any case. > > Reading tcp_rto_max_ms can be easy with the helper. My concern is > about the way we get the total time for retries. > > I used to do it like this in v2, > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...: > > +#define RETRY_ELAPSED(timeout_init, retries) \ > + ((timeout_init) * ((1 << ((retries) + 1)) - 2)) > > Though the formula is not quite right, we could refine it as below: > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1)) > > Does it make sense to get the time this way?
Well, it also depends on c->tcp.syn_linear_timeouts, right?
Not really, it's only used for data retransmission, so syn_linear_timeouts is not relevant.
Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one stating:
I meant RETRY_ELAPSED was only used for data retransmission, which uses exponential backoff timeout directly, so syn_linear_timeouts was not relevant.
Ah, okay.
(2.5) A maximum value MAY be placed on RTO provided it is at least 60 seconds.
For SYN retries, as we used linear backoff + exponential backoff, and also limited by TCP_MAX_RETRIES, the possible max RTO is far less than 60s. So we didn't clamp it. Do you think we need to clamp it as well?
If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or just up to ... + 2^6, that is, 119 seconds?
In any case, what is the difference compared to data retransmissions?
You are right. I assumed wrongly that syn_linear_timeouts is always 4. When it's 0, it's the same as data retransmissions.
Don't we have 3 bits to store the retry count as well, so we're limited by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same counter.
Yes, it's the same counter. I guess you mean we should clamp it as well.
I didn't check this part, I thought we already did, but if we don't, we should do that, yes.
...the only thing that I don't see implemented in this version of the patch is paragraph 5.7:
(5.7) If the timer expires awaiting the ACK of a SYN segment and the TCP implementation is using an RTO less than 3 seconds, the RTO MUST be re-initialized to 3 seconds when data transmission begins (i.e., after the three-way handshake completes).
I missed that while reviewing earlier versions. I guess we need to use a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's simpler than re-introducing separate starting values (one second and three seconds).
I'm not sure I understand here. If the timer expires, didn't we reset the connection directly? We would never get to the data transmission phase?
In the RFC, "the timer expires" indicates one iteration of the timeout algorithm, that is, it's the same as our timer (tcp_timer_handler()) triggering.
I see.
If you agree, I can add a new patch in this series to address the clamping.
I don't really have a preference, I guess it could be directly in 4/4 or in a separate patch, neither option should complicate things too much.
The paragraph says "after the three-way handshake completes", so it looks like the connection wasn't reset.
Probably I should name it more clearly.
But in any case, do you really need to calculate that explicitly? I was thinking that you can just clamp the value when you use it in tcp_timer_ctl().
If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value you read from the kernel, then I guess it's just:
--- it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
After reading the comments in v3 when tcp_rto_max_ms was first mentioned again, I realized I got something wrong again. I thought it was for the total timeout for all retries, so I need to calculate that and decide to reset the connection as in v2.
I think it actually applies to all the retries.
Anyway, you are right. We don't need to do that. Thanks for your patience.
} ...
if (timerfd_settime(conn->timer, 0, &it, NULL)) flow_perror(conn, "failed to set timer"); ---
-- Stefano
On Wed, Oct 29, 2025 at 3:39 PM Stefano Brivio
On Wed, 29 Oct 2025 15:32:33 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 13:11:48 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 11:06:44 +0800 Yumei Huang
wrote: On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
wrote: > > On Tue, 28 Oct 2025 16:09:03 +0800 > Yumei Huang wrote: > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio wrote: > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800 > > > Yumei Huang wrote: > > > > > > > Use an exponential backoff timeout for data retransmission according > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed > > > > in Appendix A of RFC 6298. > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK. > > > > > > > > Signed-off-by: Yumei Huang > > > > Reviewed-by: David Gibson > > > > --- > > > > tcp.c | 27 ++++++++++++--------------- > > > > 1 file changed, 12 insertions(+), 15 deletions(-) > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > index 9385132..dc0ec6c 100644 > > > > --- a/tcp.c > > > > +++ b/tcp.c > > > > @@ -179,16 +179,14 @@ > > > > * > > > > * Timeouts are implemented by means of timerfd timers, set based on flags: > > > > * > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection > > > > - * > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the > > > > - * socket and reset sequence to what was acknowledged. If this persists for > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data > > > > + * from the socket and reset sequence to what was acknowledged. This is the > > > > + * timeout for the first retry, in seconds. If this persists too many times > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the > > > > + * handshake. > > > > * > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset > > > > @@ -342,8 +340,7 @@ enum { > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > > > > > > > #define ACK_INTERVAL 10 /* ms */ > > > > -#define SYN_TIMEOUT_INIT 1 /* s */ > > > > -#define ACK_TIMEOUT 2 > > > > +#define RTO_INIT 1 /* s, RFC 6298 */ > > > > #define FIN_TIMEOUT 60 > > > > #define ACT_TIMEOUT 7200 > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > > > if (!(conn->events & ESTABLISHED)) { > > > > if (conn->retries < c->tcp.syn_linear_timeouts) > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT; > > > > + it.it_value.tv_sec = RTO_INIT; > > > > else > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << > > > > + it.it_value.tv_sec = RTO_INIT << > > > > (conn->retries - c->tcp.syn_linear_timeouts); > > > > } > > > > else > > > > - it.it_value.tv_sec = ACK_TIMEOUT; > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries; > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign > > > RTO_INIT, and multiply as needed in the if / else clauses. > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1. > > Why can't you do that? Say: > > it.it_value.tv_sec = RTO_INIT > if (!(conn->events & ESTABLISHED)) { > if (conn->retries >= c->tcp.syn_linear_timeouts) > it.it_value.tv_sec <<= (conn->retries - > c->tcp.syn_linear_timeouts); > > but anyway, see below. > > > But it seems more complicated. What do you think? > > Or maybe, building on my latest comment to 3/4: > > int factor = conn->retries; > > if (!(conn->events & ESTABLISHED)) > factor -= c->tcp.syn_linear_timeouts; > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0); > > ? Yeah, I understand this part now.
> > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > > > > it.it_value.tv_sec = FIN_TIMEOUT; > > > > } else { > > > > > > The rest of the series looks good to me. > > > > > > It might be slightly more practical to factor in directly the RTO > > > clamp, and I don't think it's complicated now that you have the helper > > > from 2/4, but it's not a strong preference from my side, as the series > > > makes sense in any case. > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is > > about the way we get the total time for retries. > > > > I used to do it like this in v2, > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...: > > > > +#define RETRY_ELAPSED(timeout_init, retries) \ > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2)) > > > > Though the formula is not quite right, we could refine it as below: > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1)) > > > > Does it make sense to get the time this way? > > Well, it also depends on c->tcp.syn_linear_timeouts, right?
Not really, it's only used for data retransmission, so syn_linear_timeouts is not relevant.
Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one stating:
I meant RETRY_ELAPSED was only used for data retransmission, which uses exponential backoff timeout directly, so syn_linear_timeouts was not relevant.
Ah, okay.
(2.5) A maximum value MAY be placed on RTO provided it is at least 60 seconds.
For SYN retries, as we used linear backoff + exponential backoff, and also limited by TCP_MAX_RETRIES, the possible max RTO is far less than 60s. So we didn't clamp it. Do you think we need to clamp it as well?
If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or just up to ... + 2^6, that is, 119 seconds?
In any case, what is the difference compared to data retransmissions?
You are right. I assumed wrongly that syn_linear_timeouts is always 4. When it's 0, it's the same as data retransmissions.
Don't we have 3 bits to store the retry count as well, so we're limited by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same counter.
Yes, it's the same counter. I guess you mean we should clamp it as well.
I didn't check this part, I thought we already did, but if we don't, we should do that, yes.
...the only thing that I don't see implemented in this version of the patch is paragraph 5.7:
(5.7) If the timer expires awaiting the ACK of a SYN segment and the TCP implementation is using an RTO less than 3 seconds, the RTO MUST be re-initialized to 3 seconds when data transmission begins (i.e., after the three-way handshake completes).
I missed that while reviewing earlier versions. I guess we need to use a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's simpler than re-introducing separate starting values (one second and three seconds).
Sorry I'd like to confirm one more thing. By "use MAX(x, 3) clamp if (c->conn->events & ESTABLISHED)", I guess you mean: it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3); Then the code would be: it.it_value.tv_sec = RTO_INIT; if (!(conn->events & ESTABLISHED)) { int exp = conn->retries - c->tcp.syn_linear_timeouts; it.it_value.tv_sec <<= MAX(exp, 0); } else { it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3); it.it_value.tv_sec <<= conn->retries; } it.it_value.tv_sec = MIN( it.it_value.tv_sec, c->tcp.tcp_rto_max); we basically set the starting value to 3 for data retransmissions no matter if we have retried SYN. And the max total timeout would be 3+6+12+24+48+96+120+120 = 429s. Is it what we want? Besides, I guess I need to define a macro for "3" as well, like "ACK_TIMEOUT_INIT"?
I'm not sure I understand here. If the timer expires, didn't we reset the connection directly? We would never get to the data transmission phase?
In the RFC, "the timer expires" indicates one iteration of the timeout algorithm, that is, it's the same as our timer (tcp_timer_handler()) triggering.
I see.
If you agree, I can add a new patch in this series to address the clamping.
I don't really have a preference, I guess it could be directly in 4/4 or in a separate patch, neither option should complicate things too much.
The paragraph says "after the three-way handshake completes", so it looks like the connection wasn't reset.
Probably I should name it more clearly.
> > But in any case, do you really need to calculate that explicitly? I was > thinking that you can just clamp the value when you use it in > tcp_timer_ctl(). > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value > you read from the kernel, then I guess it's just: > > --- > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max);
After reading the comments in v3 when tcp_rto_max_ms was first mentioned again, I realized I got something wrong again. I thought it was for the total timeout for all retries, so I need to calculate that and decide to reset the connection as in v2.
I think it actually applies to all the retries.
Anyway, you are right. We don't need to do that. Thanks for your patience.
> } ... > > if (timerfd_settime(conn->timer, 0, &it, NULL)) > flow_perror(conn, "failed to set timer"); > ---
-- Stefano
-- Thanks, Yumei Huang
On Wed, 29 Oct 2025 16:59:51 +0800
Yumei Huang
On Wed, Oct 29, 2025 at 3:39 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 15:32:33 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 13:11:48 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 11:06:44 +0800 Yumei Huang
wrote: > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio
wrote: > > > > On Tue, 28 Oct 2025 16:09:03 +0800 > > Yumei Huang wrote: > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio wrote: > > > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800 > > > > Yumei Huang wrote: > > > > > > > > > Use an exponential backoff timeout for data retransmission according > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed > > > > > in Appendix A of RFC 6298. > > > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK. > > > > > > > > > > Signed-off-by: Yumei Huang > > > > > Reviewed-by: David Gibson > > > > > --- > > > > > tcp.c | 27 ++++++++++++--------------- > > > > > 1 file changed, 12 insertions(+), 15 deletions(-) > > > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > > index 9385132..dc0ec6c 100644 > > > > > --- a/tcp.c > > > > > +++ b/tcp.c > > > > > @@ -179,16 +179,14 @@ > > > > > * > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags: > > > > > * > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection > > > > > - * > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the > > > > > - * socket and reset sequence to what was acknowledged. If this persists for > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data > > > > > + * from the socket and reset sequence to what was acknowledged. This is the > > > > > + * timeout for the first retry, in seconds. If this persists too many times > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the > > > > > + * handshake. > > > > > * > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset > > > > > @@ -342,8 +340,7 @@ enum { > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > > > > > > > > > #define ACK_INTERVAL 10 /* ms */ > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */ > > > > > -#define ACK_TIMEOUT 2 > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */ > > > > > #define FIN_TIMEOUT 60 > > > > > #define ACT_TIMEOUT 7200 > > > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > > > > if (!(conn->events & ESTABLISHED)) { > > > > > if (conn->retries < c->tcp.syn_linear_timeouts) > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT; > > > > > + it.it_value.tv_sec = RTO_INIT; > > > > > else > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << > > > > > + it.it_value.tv_sec = RTO_INIT << > > > > > (conn->retries - c->tcp.syn_linear_timeouts); > > > > > } > > > > > else > > > > > - it.it_value.tv_sec = ACK_TIMEOUT; > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries; > > > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign > > > > RTO_INIT, and multiply as needed in the if / else clauses. > > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1. > > > > Why can't you do that? Say: > > > > it.it_value.tv_sec = RTO_INIT > > if (!(conn->events & ESTABLISHED)) { > > if (conn->retries >= c->tcp.syn_linear_timeouts) > > it.it_value.tv_sec <<= (conn->retries - > > c->tcp.syn_linear_timeouts); > > > > but anyway, see below. > > > > > But it seems more complicated. What do you think? > > > > Or maybe, building on my latest comment to 3/4: > > > > int factor = conn->retries; > > > > if (!(conn->events & ESTABLISHED)) > > factor -= c->tcp.syn_linear_timeouts; > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0); > > > > ? > > Yeah, I understand this part now. > > > > > > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > > > > > it.it_value.tv_sec = FIN_TIMEOUT; > > > > > } else { > > > > > > > > The rest of the series looks good to me. > > > > > > > > It might be slightly more practical to factor in directly the RTO > > > > clamp, and I don't think it's complicated now that you have the helper > > > > from 2/4, but it's not a strong preference from my side, as the series > > > > makes sense in any case. > > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is > > > about the way we get the total time for retries. > > > > > > I used to do it like this in v2, > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...: > > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \ > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2)) > > > > > > Though the formula is not quite right, we could refine it as below: > > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1)) > > > > > > Does it make sense to get the time this way? > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right? > > Not really, it's only used for data retransmission, so > syn_linear_timeouts is not relevant. Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one stating:
I meant RETRY_ELAPSED was only used for data retransmission, which uses exponential backoff timeout directly, so syn_linear_timeouts was not relevant.
Ah, okay.
(2.5) A maximum value MAY be placed on RTO provided it is at least 60 seconds.
For SYN retries, as we used linear backoff + exponential backoff, and also limited by TCP_MAX_RETRIES, the possible max RTO is far less than 60s. So we didn't clamp it. Do you think we need to clamp it as well?
If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or just up to ... + 2^6, that is, 119 seconds?
In any case, what is the difference compared to data retransmissions?
You are right. I assumed wrongly that syn_linear_timeouts is always 4. When it's 0, it's the same as data retransmissions.
Don't we have 3 bits to store the retry count as well, so we're limited by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same counter.
Yes, it's the same counter. I guess you mean we should clamp it as well.
I didn't check this part, I thought we already did, but if we don't, we should do that, yes.
...the only thing that I don't see implemented in this version of the patch is paragraph 5.7:
(5.7) If the timer expires awaiting the ACK of a SYN segment and the TCP implementation is using an RTO less than 3 seconds, the RTO MUST be re-initialized to 3 seconds when data transmission begins (i.e., after the three-way handshake completes).
I missed that while reviewing earlier versions. I guess we need to use a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's simpler than re-introducing separate starting values (one second and three seconds).
Sorry I'd like to confirm one more thing. By "use MAX(x, 3) clamp if (c->conn->events & ESTABLISHED)", I guess you mean:
it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
Right.
Then the code would be:
it.it_value.tv_sec = RTO_INIT; if (!(conn->events & ESTABLISHED)) { int exp = conn->retries - c->tcp.syn_linear_timeouts; it.it_value.tv_sec <<= MAX(exp, 0); } else { it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3); it.it_value.tv_sec <<= conn->retries;
Hmm what's wrong with my previous suggestion of initialising 'exp' with conn->retries? Then it becomes (equivalent to your code snippet, but shorter):
} it.it_value.tv_sec = MIN( it.it_value.tv_sec, c->tcp.tcp_rto_max);
something like: int exp = conn->retries, min = 0, max = c->tcp.tcp_rto_max; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; else min = /* add a constant for this */ 3; it.it_value.tv_sec = RTO_INIT << MAX(exp, 0); it.it_value.tv_sec = CLAMP(it.it_value.tv_sec, min, max); ...we don't have a CLAMP() macro at the moment, just MIN() and MAX() in util.h, but now that we need one, I would add it, similar to the one from the Linux kernel but without type checking (as it's not really practical here). Inspired from include/linux/kernel.h (current Linux kernel tree): #define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi) we can do something similar with two differences: 1. all our macros are uppercase (we don't have as many as the kernel, and it's nice to know that they are macros from the name) and 2. as I mentioned we don't need/want type checking, so, say: #define CLAMP(x, min, max) MIN(MAX((x), (min)), (max)) ...or keep val / lo / hi if it's clearer, I don't really have a preference. I added parentheses around the arguments by the way because I think it's good practice, even though not needed in this case. It's the PRE02-C recommendation in CERT C (we generally stick to those recommendations whenever practical): https://wiki.sei.cmu.edu/confluence/display/c/PRE02-C.+Macro+replacement+lis... https://clang.llvm.org/extra/clang-tidy/checks/bugprone/macro-parentheses.ht... Note that the Linux kernel has a compatible license (it's the same, actually, GPLv2+), and, regardless of that, the implementation is trivial and the idea is obvious, so I don't think we need to give explicit attribution in this case. But in other cases we do, or I guess it's fair to the author anyway, see for example siphash.h.
we basically set the starting value to 3 for data retransmissions no matter if we have retried SYN. And the max total timeout would be 3+6+12+24+48+96+120+120 = 429s. Is it what we want?
Ah, yes. In the discussion I previously assumed that the default clamp value was 60 seconds, but it's actually 120 seconds. It looks correct to me.
Besides, I guess I need to define a macro for "3" as well, like "ACK_TIMEOUT_INIT"?
Or RTO_INIT_ACK? Maybe RTO_INIT_DATA? I'm thinking that you already have a RTO_INIT at this point, and this constant is very closely related to that one.
I'm not sure I understand here. If the timer expires, didn't we reset the connection directly? We would never get to the data transmission phase?
In the RFC, "the timer expires" indicates one iteration of the timeout algorithm, that is, it's the same as our timer (tcp_timer_handler()) triggering.
I see.
If you agree, I can add a new patch in this series to address the clamping.
I don't really have a preference, I guess it could be directly in 4/4 or in a separate patch, neither option should complicate things too much.
The paragraph says "after the three-way handshake completes", so it looks like the connection wasn't reset.
> Probably I should name it more clearly. > > > > > But in any case, do you really need to calculate that explicitly? I was > > thinking that you can just clamp the value when you use it in > > tcp_timer_ctl(). > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value > > you read from the kernel, then I guess it's just: > > > > --- > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max); > > After reading the comments in v3 when tcp_rto_max_ms was first > mentioned again, I realized I got something wrong again. I thought it > was for the total timeout for all retries, so I need to calculate that > and decide to reset the connection as in v2.
I think it actually applies to all the retries.
> Anyway, you are right. We don't need to do that. Thanks for your patience. > > > } ... > > > > if (timerfd_settime(conn->timer, 0, &it, NULL)) > > flow_perror(conn, "failed to set timer"); > > ---
-- Stefano
On Wed, Oct 29, 2025 at 8:18 PM Stefano Brivio
On Wed, 29 Oct 2025 16:59:51 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 3:39 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 15:32:33 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 13:11:48 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio
wrote: > > On Wed, 29 Oct 2025 11:06:44 +0800 > Yumei Huang wrote: > > > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio wrote: > > > > > > On Tue, 28 Oct 2025 16:09:03 +0800 > > > Yumei Huang wrote: > > > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio wrote: > > > > > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800 > > > > > Yumei Huang wrote: > > > > > > > > > > > Use an exponential backoff timeout for data retransmission according > > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed > > > > > > in Appendix A of RFC 6298. > > > > > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK. > > > > > > > > > > > > Signed-off-by: Yumei Huang > > > > > > Reviewed-by: David Gibson > > > > > > --- > > > > > > tcp.c | 27 ++++++++++++--------------- > > > > > > 1 file changed, 12 insertions(+), 15 deletions(-) > > > > > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > > > index 9385132..dc0ec6c 100644 > > > > > > --- a/tcp.c > > > > > > +++ b/tcp.c > > > > > > @@ -179,16 +179,14 @@ > > > > > > * > > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags: > > > > > > * > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend > > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection > > > > > > - * > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the > > > > > > - * socket and reset sequence to what was acknowledged. If this persists for > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection > > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during > > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data > > > > > > + * from the socket and reset sequence to what was acknowledged. This is the > > > > > > + * timeout for the first retry, in seconds. If this persists too many times > > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established > > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the > > > > > > + * handshake. > > > > > > * > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE > > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset > > > > > > @@ -342,8 +340,7 @@ enum { > > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > > > > > > > > > > > #define ACK_INTERVAL 10 /* ms */ > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */ > > > > > > -#define ACK_TIMEOUT 2 > > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */ > > > > > > #define FIN_TIMEOUT 60 > > > > > > #define ACT_TIMEOUT 7200 > > > > > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > > > > > if (!(conn->events & ESTABLISHED)) { > > > > > > if (conn->retries < c->tcp.syn_linear_timeouts) > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT; > > > > > > + it.it_value.tv_sec = RTO_INIT; > > > > > > else > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << > > > > > > + it.it_value.tv_sec = RTO_INIT << > > > > > > (conn->retries - c->tcp.syn_linear_timeouts); > > > > > > } > > > > > > else > > > > > > - it.it_value.tv_sec = ACK_TIMEOUT; > > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries; > > > > > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign > > > > > RTO_INIT, and multiply as needed in the if / else clauses. > > > > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when > > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1. > > > > > > Why can't you do that? Say: > > > > > > it.it_value.tv_sec = RTO_INIT > > > if (!(conn->events & ESTABLISHED)) { > > > if (conn->retries >= c->tcp.syn_linear_timeouts) > > > it.it_value.tv_sec <<= (conn->retries - > > > c->tcp.syn_linear_timeouts); > > > > > > but anyway, see below. > > > > > > > But it seems more complicated. What do you think? > > > > > > Or maybe, building on my latest comment to 3/4: > > > > > > int factor = conn->retries; > > > > > > if (!(conn->events & ESTABLISHED)) > > > factor -= c->tcp.syn_linear_timeouts; > > > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0); > > > > > > ? > > > > Yeah, I understand this part now. > > > > > > > > > > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > > > > > > it.it_value.tv_sec = FIN_TIMEOUT; > > > > > > } else { > > > > > > > > > > The rest of the series looks good to me. > > > > > > > > > > It might be slightly more practical to factor in directly the RTO > > > > > clamp, and I don't think it's complicated now that you have the helper > > > > > from 2/4, but it's not a strong preference from my side, as the series > > > > > makes sense in any case. > > > > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is > > > > about the way we get the total time for retries. > > > > > > > > I used to do it like this in v2, > > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...: > > > > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \ > > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2)) > > > > > > > > Though the formula is not quite right, we could refine it as below: > > > > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1)) > > > > > > > > Does it make sense to get the time this way? > > > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right? > > > > Not really, it's only used for data retransmission, so > > syn_linear_timeouts is not relevant. > > Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one > stating: I meant RETRY_ELAPSED was only used for data retransmission, which uses exponential backoff timeout directly, so syn_linear_timeouts was not relevant.
Ah, okay.
> > (2.5) A maximum value MAY be placed on RTO provided it is at least 60 > seconds.
For SYN retries, as we used linear backoff + exponential backoff, and also limited by TCP_MAX_RETRIES, the possible max RTO is far less than 60s. So we didn't clamp it. Do you think we need to clamp it as well?
If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or just up to ... + 2^6, that is, 119 seconds?
In any case, what is the difference compared to data retransmissions?
You are right. I assumed wrongly that syn_linear_timeouts is always 4. When it's 0, it's the same as data retransmissions.
Don't we have 3 bits to store the retry count as well, so we're limited by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same counter.
Yes, it's the same counter. I guess you mean we should clamp it as well.
I didn't check this part, I thought we already did, but if we don't, we should do that, yes.
> ...the only thing that I don't see implemented in this version of the > patch is paragraph 5.7: > > (5.7) If the timer expires awaiting the ACK of a SYN segment and the > TCP implementation is using an RTO less than 3 seconds, the RTO > MUST be re-initialized to 3 seconds when data transmission > begins (i.e., after the three-way handshake completes). > > I missed that while reviewing earlier versions. I guess we need to use > a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's > simpler than re-introducing separate starting values (one second and > three seconds).
Sorry I'd like to confirm one more thing. By "use MAX(x, 3) clamp if (c->conn->events & ESTABLISHED)", I guess you mean:
it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
Right.
Then the code would be:
it.it_value.tv_sec = RTO_INIT; if (!(conn->events & ESTABLISHED)) { int exp = conn->retries - c->tcp.syn_linear_timeouts; it.it_value.tv_sec <<= MAX(exp, 0); } else { it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3); it.it_value.tv_sec <<= conn->retries;
Hmm what's wrong with my previous suggestion of initialising 'exp' with conn->retries? Then it becomes (equivalent to your code snippet, but shorter):
Sorry, I read the comments several times but still missed that. And yes, it's shorter.
} it.it_value.tv_sec = MIN( it.it_value.tv_sec, c->tcp.tcp_rto_max);
something like:
int exp = conn->retries, min = 0, max = c->tcp.tcp_rto_max;
if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; else min = /* add a constant for this */ 3;
it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
it.it_value.tv_sec = CLAMP(it.it_value.tv_sec, min, max);
But this seems not right. For data retransmission, only the first timeout will be changed from 1 to 3s, then the timeout becomes 4, 8, 16 ..., 120. I feel we don't need CLAMP, the code would be: #define RTO_INIT_ACK 3 ... int exp = conn->retries; it.it_value.tv_sec = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; else { it.it_value.tv_sec = MAX(it.it_value.tv_sec, RTO_INIT_ACK); } it.it_value.tv_sec <<= MAX(exp, 0); it.it_value.tv_sec = MIN( it.it_value.tv_sec, c->tcp.tcp_rto_max); Please correct me if I got anything wrong. Thanks.
...we don't have a CLAMP() macro at the moment, just MIN() and MAX() in util.h, but now that we need one, I would add it, similar to the one from the Linux kernel but without type checking (as it's not really practical here).
Inspired from include/linux/kernel.h (current Linux kernel tree):
#define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi)
we can do something similar with two differences: 1. all our macros are uppercase (we don't have as many as the kernel, and it's nice to know that they are macros from the name) and 2. as I mentioned we don't need/want type checking, so, say:
#define CLAMP(x, min, max) MIN(MAX((x), (min)), (max))
...or keep val / lo / hi if it's clearer, I don't really have a preference.
I added parentheses around the arguments by the way because I think it's good practice, even though not needed in this case. It's the PRE02-C recommendation in CERT C (we generally stick to those recommendations whenever practical):
https://wiki.sei.cmu.edu/confluence/display/c/PRE02-C.+Macro+replacement+lis...
https://clang.llvm.org/extra/clang-tidy/checks/bugprone/macro-parentheses.ht...
Note that the Linux kernel has a compatible license (it's the same, actually, GPLv2+), and, regardless of that, the implementation is trivial and the idea is obvious, so I don't think we need to give explicit attribution in this case.
But in other cases we do, or I guess it's fair to the author anyway, see for example siphash.h.
we basically set the starting value to 3 for data retransmissions no matter if we have retried SYN. And the max total timeout would be 3+6+12+24+48+96+120+120 = 429s. Is it what we want?
Ah, yes. In the discussion I previously assumed that the default clamp value was 60 seconds, but it's actually 120 seconds. It looks correct to me.
Besides, I guess I need to define a macro for "3" as well, like "ACK_TIMEOUT_INIT"?
Or RTO_INIT_ACK? Maybe RTO_INIT_DATA? I'm thinking that you already have a RTO_INIT at this point, and this constant is very closely related to that one.
I'm not sure I understand here. If the timer expires, didn't we reset the connection directly? We would never get to the data transmission phase?
In the RFC, "the timer expires" indicates one iteration of the timeout algorithm, that is, it's the same as our timer (tcp_timer_handler()) triggering.
I see.
If you agree, I can add a new patch in this series to address the clamping.
I don't really have a preference, I guess it could be directly in 4/4 or in a separate patch, neither option should complicate things too much.
The paragraph says "after the three-way handshake completes", so it looks like the connection wasn't reset.
> > Probably I should name it more clearly. > > > > > > > > But in any case, do you really need to calculate that explicitly? I was > > > thinking that you can just clamp the value when you use it in > > > tcp_timer_ctl(). > > > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value > > > you read from the kernel, then I guess it's just: > > > > > > --- > > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max); > > > > After reading the comments in v3 when tcp_rto_max_ms was first > > mentioned again, I realized I got something wrong again. I thought it > > was for the total timeout for all retries, so I need to calculate that > > and decide to reset the connection as in v2. > > I think it actually applies to all the retries. > > > Anyway, you are right. We don't need to do that. Thanks for your patience. > > > > > } ... > > > > > > if (timerfd_settime(conn->timer, 0, &it, NULL)) > > > flow_perror(conn, "failed to set timer"); > > > ---
-- Stefano
-- Thanks, Yumei Huang
On Thu, 30 Oct 2025 16:25:53 +0800
Yumei Huang
On Wed, Oct 29, 2025 at 8:18 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 16:59:51 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 3:39 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 15:32:33 +0800 Yumei Huang
wrote: On Wed, Oct 29, 2025 at 3:10 PM Stefano Brivio
wrote: On Wed, 29 Oct 2025 13:11:48 +0800 Yumei Huang
wrote: > On Wed, Oct 29, 2025 at 12:38 PM Stefano Brivio
wrote: > > > > On Wed, 29 Oct 2025 11:06:44 +0800 > > Yumei Huang wrote: > > > > > On Tue, Oct 28, 2025 at 7:44 PM Stefano Brivio wrote: > > > > > > > > On Tue, 28 Oct 2025 16:09:03 +0800 > > > > Yumei Huang wrote: > > > > > > > > > On Fri, Oct 24, 2025 at 7:04 AM Stefano Brivio wrote: > > > > > > > > > > > > On Fri, 17 Oct 2025 14:28:38 +0800 > > > > > > Yumei Huang wrote: > > > > > > > > > > > > > Use an exponential backoff timeout for data retransmission according > > > > > > > to RFC 2988 and RFC 6298. Set the initial RTO to one second as discussed > > > > > > > in Appendix A of RFC 6298. > > > > > > > > > > > > > > Also combine the macros defining the initial RTO for both SYN and ACK. > > > > > > > > > > > > > > Signed-off-by: Yumei Huang > > > > > > > Reviewed-by: David Gibson > > > > > > > --- > > > > > > > tcp.c | 27 ++++++++++++--------------- > > > > > > > 1 file changed, 12 insertions(+), 15 deletions(-) > > > > > > > > > > > > > > diff --git a/tcp.c b/tcp.c > > > > > > > index 9385132..dc0ec6c 100644 > > > > > > > --- a/tcp.c > > > > > > > +++ b/tcp.c > > > > > > > @@ -179,16 +179,14 @@ > > > > > > > * > > > > > > > * Timeouts are implemented by means of timerfd timers, set based on flags: > > > > > > > * > > > > > > > - * - SYN_TIMEOUT_INIT: if no ACK is received from tap/guest during handshake > > > > > > > - * (flag ACK_FROM_TAP_DUE without ESTABLISHED event) within this time, resend > > > > > > > - * SYN. It's the starting timeout for the first SYN retry. If this persists > > > > > > > - * for more than TCP_MAX_RETRIES or (tcp_syn_retries + > > > > > > > - * tcp_syn_linear_timeouts) times in a row, reset the connection > > > > > > > - * > > > > > > > - * - ACK_TIMEOUT: if no ACK segment was received from tap/guest, after sending > > > > > > > - * data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data from the > > > > > > > - * socket and reset sequence to what was acknowledged. If this persists for > > > > > > > - * more than TCP_MAX_RETRIES times in a row, reset the connection > > > > > > > + * - RTO_INIT: if no ACK segment was received from tap/guest, either during > > > > > > > + * handshake (flag ACK_FROM_TAP_DUE without ESTABLISHED event) or after > > > > > > > + * sending data (flag ACK_FROM_TAP_DUE with ESTABLISHED event), re-send data > > > > > > > + * from the socket and reset sequence to what was acknowledged. This is the > > > > > > > + * timeout for the first retry, in seconds. If this persists too many times > > > > > > > + * in a row, reset the connection: TCP_MAX_RETRIES for established > > > > > > > + * connections, or (tcp_syn_retries + tcp_syn_linear_timeouts) during the > > > > > > > + * handshake. > > > > > > > * > > > > > > > * - FIN_TIMEOUT: if a FIN segment was sent to tap/guest (flag ACK_FROM_TAP_DUE > > > > > > > * with TAP_FIN_SENT event), and no ACK is received within this time, reset > > > > > > > @@ -342,8 +340,7 @@ enum { > > > > > > > #define WINDOW_DEFAULT 14600 /* RFC 6928 */ > > > > > > > > > > > > > > #define ACK_INTERVAL 10 /* ms */ > > > > > > > -#define SYN_TIMEOUT_INIT 1 /* s */ > > > > > > > -#define ACK_TIMEOUT 2 > > > > > > > +#define RTO_INIT 1 /* s, RFC 6298 */ > > > > > > > #define FIN_TIMEOUT 60 > > > > > > > #define ACT_TIMEOUT 7200 > > > > > > > > > > > > > > @@ -588,13 +585,13 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) > > > > > > > } else if (conn->flags & ACK_FROM_TAP_DUE) { > > > > > > > if (!(conn->events & ESTABLISHED)) { > > > > > > > if (conn->retries < c->tcp.syn_linear_timeouts) > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT; > > > > > > > + it.it_value.tv_sec = RTO_INIT; > > > > > > > else > > > > > > > - it.it_value.tv_sec = SYN_TIMEOUT_INIT << > > > > > > > + it.it_value.tv_sec = RTO_INIT << > > > > > > > (conn->retries - c->tcp.syn_linear_timeouts); > > > > > > > } > > > > > > > else > > > > > > > - it.it_value.tv_sec = ACK_TIMEOUT; > > > > > > > + it.it_value.tv_sec = RTO_INIT << conn->retries; > > > > > > > > > > > > Same as on 3/4, but here it's clearly more convenient: just assign > > > > > > RTO_INIT, and multiply as needed in the if / else clauses. > > > > > > > > > > I guess we can't just assign RTO_INIT. Maybe assign it only when > > > > > retries==0, otherwise multiply as it.it_value.tv_sec <<=1. > > > > > > > > Why can't you do that? Say: > > > > > > > > it.it_value.tv_sec = RTO_INIT > > > > if (!(conn->events & ESTABLISHED)) { > > > > if (conn->retries >= c->tcp.syn_linear_timeouts) > > > > it.it_value.tv_sec <<= (conn->retries - > > > > c->tcp.syn_linear_timeouts); > > > > > > > > but anyway, see below. > > > > > > > > > But it seems more complicated. What do you think? > > > > > > > > Or maybe, building on my latest comment to 3/4: > > > > > > > > int factor = conn->retries; > > > > > > > > if (!(conn->events & ESTABLISHED)) > > > > factor -= c->tcp.syn_linear_timeouts; > > > > > > > > it.it_value.tv_sec = RTO_INIT << MAX(factor, 0); > > > > > > > > ? > > > > > > Yeah, I understand this part now. > > > > > > > > > > > > > > > > > > > > } else if (CONN_HAS(conn, SOCK_FIN_SENT | TAP_FIN_ACKED)) { > > > > > > > it.it_value.tv_sec = FIN_TIMEOUT; > > > > > > > } else { > > > > > > > > > > > > The rest of the series looks good to me. > > > > > > > > > > > > It might be slightly more practical to factor in directly the RTO > > > > > > clamp, and I don't think it's complicated now that you have the helper > > > > > > from 2/4, but it's not a strong preference from my side, as the series > > > > > > makes sense in any case. > > > > > > > > > > Reading tcp_rto_max_ms can be easy with the helper. My concern is > > > > > about the way we get the total time for retries. > > > > > > > > > > I used to do it like this in v2, > > > > > https://archives.passt.top/passt-dev/20251010074700.22177-4-yuhuang@redhat.c...: > > > > > > > > > > +#define RETRY_ELAPSED(timeout_init, retries) \ > > > > > + ((timeout_init) * ((1 << ((retries) + 1)) - 2)) > > > > > > > > > > Though the formula is not quite right, we could refine it as below: > > > > > > > > > > #define RETRY_ELAPSED(retries) ((RTO_INIT) * ((1 << ((retries) + 1)) - 1)) > > > > > > > > > > Does it make sense to get the time this way? > > > > > > > > Well, it also depends on c->tcp.syn_linear_timeouts, right? > > > > > > Not really, it's only used for data retransmission, so > > > syn_linear_timeouts is not relevant. > > > > Hmm, no, why? RFC 6298 covers SYN retries as well, and that's the one > > stating: > > I meant RETRY_ELAPSED was only used for data retransmission, which > uses exponential backoff timeout directly, so syn_linear_timeouts was > not relevant. Ah, okay.
> > > > (2.5) A maximum value MAY be placed on RTO provided it is at least 60 > > seconds. > > For SYN retries, as we used linear backoff + exponential backoff, and > also limited by TCP_MAX_RETRIES, the possible max RTO is far less than > 60s. So we didn't clamp it. Do you think we need to clamp it as well?
If syn_linear_timeouts is 0 and tcp_syn_retries is 7, I guess we'll reach 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 = 247 seconds? Or just up to ... + 2^6, that is, 119 seconds?
In any case, what is the difference compared to data retransmissions?
You are right. I assumed wrongly that syn_linear_timeouts is always 4. When it's 0, it's the same as data retransmissions.
Don't we have 3 bits to store the retry count as well, so we're limited by TCP_MAX_RETRIES anyway? Looking at patch 1/4 I'd say it's the same counter.
Yes, it's the same counter. I guess you mean we should clamp it as well.
I didn't check this part, I thought we already did, but if we don't, we should do that, yes.
> > ...the only thing that I don't see implemented in this version of the > > patch is paragraph 5.7: > > > > (5.7) If the timer expires awaiting the ACK of a SYN segment and the > > TCP implementation is using an RTO less than 3 seconds, the RTO > > MUST be re-initialized to 3 seconds when data transmission > > begins (i.e., after the three-way handshake completes). > > > > I missed that while reviewing earlier versions. I guess we need to use > > a MAX(x, 3) clamp if (c->conn->events & ESTABLISHED). I think it's > > simpler than re-introducing separate starting values (one second and > > three seconds).
Sorry I'd like to confirm one more thing. By "use MAX(x, 3) clamp if (c->conn->events & ESTABLISHED)", I guess you mean:
it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3);
Right.
Then the code would be:
it.it_value.tv_sec = RTO_INIT; if (!(conn->events & ESTABLISHED)) { int exp = conn->retries - c->tcp.syn_linear_timeouts; it.it_value.tv_sec <<= MAX(exp, 0); } else { it.it_value.tv_sec = MAX(it.it_value.tv_sec, 3); it.it_value.tv_sec <<= conn->retries;
Hmm what's wrong with my previous suggestion of initialising 'exp' with conn->retries? Then it becomes (equivalent to your code snippet, but shorter):
Sorry, I read the comments several times but still missed that. And yes, it's shorter.
} it.it_value.tv_sec = MIN( it.it_value.tv_sec, c->tcp.tcp_rto_max);
something like:
int exp = conn->retries, min = 0, max = c->tcp.tcp_rto_max;
if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; else min = /* add a constant for this */ 3;
it.it_value.tv_sec = RTO_INIT << MAX(exp, 0);
it.it_value.tv_sec = CLAMP(it.it_value.tv_sec, min, max);
But this seems not right. For data retransmission, only the first timeout will be changed from 1 to 3s, then the timeout becomes 4, 8, 16 ..., 120.
Oops, sorry, you're right. I proposed that because I had for a moment the thought that it.it_value.tv_sec would be just multiplied from the current value in the next iterations, but it's of course not the case.
I feel we don't need CLAMP, the code would be:
#define RTO_INIT_ACK 3 ... int exp = conn->retries; it.it_value.tv_sec = RTO_INIT; if (!(conn->events & ESTABLISHED)) exp -= c->tcp.syn_linear_timeouts; else { it.it_value.tv_sec = MAX(it.it_value.tv_sec, RTO_INIT_ACK); } it.it_value.tv_sec <<= MAX(exp, 0); it.it_value.tv_sec = MIN( it.it_value.tv_sec, c->tcp.tcp_rto_max);
Please correct me if I got anything wrong. Thanks.
It looks good to me now. Maybe try keeping 'min' or 'max' values assigned separately as I did and see if it helps readability, though. *Or* use a temporary 'timeout' variable and assign it.it_value.tv_sec only at the end, so that lines are shorter and more readable. Probably better than any 'min' or 'max' which force the reader to look at yet another place. But conceptually I think it should be like that, yes.
...we don't have a CLAMP() macro at the moment, just MIN() and MAX() in util.h, but now that we need one, I would add it, similar to the one from the Linux kernel but without type checking (as it's not really practical here).
Inspired from include/linux/kernel.h (current Linux kernel tree):
#define clamp(val, lo, hi) min((typeof(val))max(val, lo), hi)
we can do something similar with two differences: 1. all our macros are uppercase (we don't have as many as the kernel, and it's nice to know that they are macros from the name) and 2. as I mentioned we don't need/want type checking, so, say:
#define CLAMP(x, min, max) MIN(MAX((x), (min)), (max))
...or keep val / lo / hi if it's clearer, I don't really have a preference.
I added parentheses around the arguments by the way because I think it's good practice, even though not needed in this case. It's the PRE02-C recommendation in CERT C (we generally stick to those recommendations whenever practical):
https://wiki.sei.cmu.edu/confluence/display/c/PRE02-C.+Macro+replacement+lis...
https://clang.llvm.org/extra/clang-tidy/checks/bugprone/macro-parentheses.ht...
Note that the Linux kernel has a compatible license (it's the same, actually, GPLv2+), and, regardless of that, the implementation is trivial and the idea is obvious, so I don't think we need to give explicit attribution in this case.
But in other cases we do, or I guess it's fair to the author anyway, see for example siphash.h.
we basically set the starting value to 3 for data retransmissions no matter if we have retried SYN. And the max total timeout would be 3+6+12+24+48+96+120+120 = 429s. Is it what we want?
Ah, yes. In the discussion I previously assumed that the default clamp value was 60 seconds, but it's actually 120 seconds. It looks correct to me.
Besides, I guess I need to define a macro for "3" as well, like "ACK_TIMEOUT_INIT"?
Or RTO_INIT_ACK? Maybe RTO_INIT_DATA? I'm thinking that you already have a RTO_INIT at this point, and this constant is very closely related to that one.
> > I'm not sure I understand here. If the timer expires, didn't we reset > the connection directly? We would never get to the data transmission > phase?
In the RFC, "the timer expires" indicates one iteration of the timeout algorithm, that is, it's the same as our timer (tcp_timer_handler()) triggering.
I see.
If you agree, I can add a new patch in this series to address the clamping.
I don't really have a preference, I guess it could be directly in 4/4 or in a separate patch, neither option should complicate things too much.
The paragraph says "after the three-way handshake completes", so it looks like the connection wasn't reset.
> > > Probably I should name it more clearly. > > > > > > > > > > > But in any case, do you really need to calculate that explicitly? I was > > > > thinking that you can just clamp the value when you use it in > > > > tcp_timer_ctl(). > > > > > > > > If c->tcp.rto_max is DIV_ROUND_CLOSEST(x, 1000), where 'x' is the value > > > > you read from the kernel, then I guess it's just: > > > > > > > > --- > > > > it.it_value.tv_sec = MIN(it.it_value.tv_sec, c->tcp.rto_max); > > > > > > After reading the comments in v3 when tcp_rto_max_ms was first > > > mentioned again, I realized I got something wrong again. I thought it > > > was for the total timeout for all retries, so I need to calculate that > > > and decide to reset the connection as in v2. > > > > I think it actually applies to all the retries. > > > > > Anyway, you are right. We don't need to do that. Thanks for your patience. > > > > > > > } ... > > > > > > > > if (timerfd_settime(conn->timer, 0, &it, NULL)) > > > > flow_perror(conn, "failed to set timer"); > > > > ---
-- Stefano
participants (3)
-
David Gibson
-
Stefano Brivio
-
Yumei Huang