[PATCH v4 0/3] Support for SO_PEEK_OFF socket option


Jon Maloy

15 May 2024

5:34 p.m.

Latest changes based on feedback. Notably, removed fast retransmit when
peer advertises zero-window.

Jon Maloy (3):
  tcp: move seq_to_tap update to when frame is queued
  tcp: leverage support of SO_PEEK_OFF socket option when available
  tcp: allow retransmit when peer receive window is zero

 tcp.c      | 149 ++++++++++++++++++++++++++++++++++++++++-------------
 tcp_conn.h |   2 +
 2 files changed, 116 insertions(+), 35 deletions(-)

-- 
2.42.0


Jon Maloy

15 May

5:34 p.m.

New subject: [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued

commit a469fc393fa1 ("tcp, tap: Don't increase tap-side sequence counter
for dropped frames") delayed update of conn->seq_to_tap until the moment
the corresponding frame has been successfully pushed out. This has the
advantage that we immediately can make a new attempt to transmit a frame
after a failed transmit, rather than waiting for the peer to later
discover a gap and trigger the fast retransmit mechanism to solve the
problem.

This approach has turned out to cause a problem with spurious sequence
number updates during peer-initiated retransmits, and we have realized
it may not be the best way to solve the above issue.

We now restore the previous method, by updating the said field at the
moment a frame is added to the outqueue. To retain the advantage of
having a quick re-attempt based on local failure detection, we now scan
through the part of the outqueue that had to be dropped, and restore the
sequence counter for each affected connection to the most appropriate
value.

Signed-off-by: Jon Maloy

---
v2: - Re-spun loop in tcp_revert_seq() and some other changes based on
      feedback from Stefano Brivio.
    - Added paranoid test to avoid that seq_to_tap becomes lower than
      seq_ack_from_tap.

v3: - Identical to v2. Called v3 because it was embedded in a series
      with that version.

v4: - In tcp_revert_seq(), we read the sequence number from the TCP
      header instead of keeping a copy in struct tcp_buf_seq_update.
    - Since the only remaining field in struct tcp_buf_seq_update is
      a pointer to struct tcp_tap_conn, we eliminate the struct
      altogether, and make the tcp6/tcp4_buf_seq_update arrays into
      arrays of said pointer.
    - Removed 'paranoid' test in tcp_revert_seq. If it happens, it
      is not fatal, and will be caught by other code anyway.
    - Separated from the series again.
---
 tcp.c | 59 +++++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/tcp.c b/tcp.c
index 21d0af0..976dba8 100644
--- a/tcp.c
+++ b/tcp.c
@@ -410,16 +410,6 @@ static int tcp_sock_ns		[NUM_PORTS][IP_VERSIONS];
  */
 static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
 
-/**
- * tcp_buf_seq_update - Sequences to update with length of frames once sent
- * @seq:	Pointer to sequence number sent to tap-side, to be updated
- * @len:	TCP payload length
- */
-struct tcp_buf_seq_update {
-	uint32_t *seq;
-	uint16_t len;
-};
-
 /* Static buffers */
 /**
  * struct tcp_payload_t - TCP header and data to send segments with payload
@@ -461,7 +451,8 @@ static struct tcp_payload_t	tcp4_payload[TCP_FRAMES_MEM];
 
 static_assert(MSS4 <= sizeof(tcp4_payload[0].data), "MSS4 is greater than 65516");
 
-static struct tcp_buf_seq_update tcp4_seq_update[TCP_FRAMES_MEM];
+/* References tracking the owner connection of frames in the tap outqueue */
+static struct tcp_tap_conn *tcp4_frame_conns[TCP_FRAMES_MEM];
 static unsigned int tcp4_payload_used;
 
 static struct tap_hdr		tcp4_flags_tap_hdr[TCP_FRAMES_MEM];
@@ -483,7 +474,8 @@ static struct tcp_payload_t	tcp6_payload[TCP_FRAMES_MEM];
 
 static_assert(MSS6 <= sizeof(tcp6_payload[0].data), "MSS6 is greater than 65516");
 
-static struct tcp_buf_seq_update tcp6_seq_update[TCP_FRAMES_MEM];
+/* References tracking the owner connection of frames in the tap outqueue */
+static struct tcp_tap_conn *tcp6_frame_conns[TCP_FRAMES_MEM];
 static unsigned int tcp6_payload_used;
 
 static struct tap_hdr		tcp6_flags_tap_hdr[TCP_FRAMES_MEM];
@@ -1261,25 +1253,49 @@ static void tcp_flags_flush(const struct ctx *c)
 	tcp4_flags_used = 0;
 }
 
+/**
+ * tcp_revert_seq() - Revert affected conn->seq_to_tap after failed transmission
+ * @conns:	Array of connection pointers corresponding to queued frames
+ * @frames:	Two-dimensional array containing queued frames with sub-iovs
+ * @num_frames:	Number of entries in the two arrays to be compared
+ */
+static void tcp_revert_seq(struct tcp_tap_conn **conns, struct iovec *frames,
+			   int num_frames)
+{
+	int c, f;
+
+	for (c = 0, f = 0; c < num_frames; c++, f += TCP_NUM_IOVS) {
+		struct tcp_tap_conn *conn = conns[c];
+		struct tcphdr *th = frames[f + TCP_IOV_PAYLOAD].iov_base;
+		uint32_t seq = ntohl(th->seq);
+
+		if (SEQ_LE(conn->seq_to_tap, seq))
+			continue;
+
+		conn->seq_to_tap = seq;
+	}
+}
+
 /**
  * tcp_payload_flush() - Send out buffers for segments with data
  * @c:		Execution context
  */
 static void tcp_payload_flush(const struct ctx *c)
 {
-	unsigned i;
 	size_t m;
 
 	m = tap_send_frames(c, &tcp6_l2_iov[0][0], TCP_NUM_IOVS,
 			    tcp6_payload_used);
-	for (i = 0; i < m; i++)
-		*tcp6_seq_update[i].seq += tcp6_seq_update[i].len;
+	if (m != tcp6_payload_used)
+		tcp_revert_seq(tcp6_frame_conns, &tcp6_l2_iov[m][0],
+			       tcp6_payload_used - m);
 	tcp6_payload_used = 0;
 
 	m = tap_send_frames(c, &tcp4_l2_iov[0][0], TCP_NUM_IOVS,
 			    tcp4_payload_used);
-	for (i = 0; i < m; i++)
-		*tcp4_seq_update[i].seq += tcp4_seq_update[i].len;
+	if (m != tcp4_payload_used)
+		tcp_revert_seq(tcp4_frame_conns, &tcp4_l2_iov[m][0],
+			       tcp4_payload_used - m);
 	tcp4_payload_used = 0;
 }
 
@@ -2129,10 +2145,11 @@ static int tcp_sock_consume(const struct tcp_tap_conn *conn, uint32_t ack_seq)
 static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			    ssize_t dlen, int no_csum, uint32_t seq)
 {
-	uint32_t *seq_update = &conn->seq_to_tap;
 	struct iovec *iov;
 	size_t l4len;
 
+	conn->seq_to_tap = seq + dlen;
+
 	if (CONN_V4(conn)) {
 		struct iovec *iov_prev = tcp4_l2_iov[tcp4_payload_used - 1];
 		const uint16_t *check = NULL;
@@ -2142,8 +2159,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 			check = &iph->check;
 		}
 
-		tcp4_seq_update[tcp4_payload_used].seq = seq_update;
-		tcp4_seq_update[tcp4_payload_used].len = dlen;
+		tcp4_frame_conns[tcp4_payload_used] = conn;
 
 		iov = tcp4_l2_iov[tcp4_payload_used++];
 		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, check, seq);
@@ -2151,8 +2167,7 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
 		if (tcp4_payload_used > TCP_FRAMES_MEM - 1)
 			tcp_payload_flush(c);
 	} else if (CONN_V6(conn)) {
-		tcp6_seq_update[tcp6_payload_used].seq = seq_update;
-		tcp6_seq_update[tcp6_payload_used].len = dlen;
+		tcp6_frame_conns[tcp6_payload_used] = conn;
 
 		iov = tcp6_l2_iov[tcp6_payload_used++];
 		l4len = tcp_l2_buf_fill_headers(c, conn, iov, dlen, NULL, seq);
-- 
2.42.0
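
The approach described in the commit message (advance the counter when a frame is queued, then rewind after a partial send) can be tried in isolation with the sketch below. It is not the passt code: struct conn, struct frame, queue_frame(), flush_queue() and seq_le() are hypothetical simplifications made up for this illustration, and seq_le() assumes that sequence numbers less than 2^31 apart compare in wraparound order, which may differ in detail from passt's SEQ_LE().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for passt's connection and frame types */
struct conn {
	uint32_t seq_to_tap;	/* next sequence number to send to the tap */
};

struct frame {
	struct conn *owner;	/* connection the queued frame belongs to */
	uint32_t seq;		/* sequence number of its first payload byte */
};

#define QUEUE_MAX 8

static struct frame queue[QUEUE_MAX];
static size_t queue_used;

/* Wraparound-safe "a <= b" on 32-bit sequence numbers (assumed semantics) */
static int seq_le(uint32_t a, uint32_t b)
{
	return (uint32_t)(b - a) < 0x80000000u;
}

/* Queue a frame and advance the counter right away */
static void queue_frame(struct conn *conn, uint32_t seq, uint32_t dlen)
{
	assert(queue_used < QUEUE_MAX);

	conn->seq_to_tap = seq + dlen;
	queue[queue_used].owner = conn;
	queue[queue_used].seq = seq;
	queue_used++;
}

/* Flush the queue, pretending only the first @sent frames went out, and
 * rewind each affected connection to its earliest dropped frame
 */
static void flush_queue(size_t sent)
{
	size_t i;

	for (i = sent; i < queue_used; i++) {
		struct conn *conn = queue[i].owner;

		if (seq_le(conn->seq_to_tap, queue[i].seq))
			continue;	/* counter is not ahead of this frame */

		conn->seq_to_tap = queue[i].seq;
	}
	queue_used = 0;
}
```

Because dropped frames are visited in queue order, the first dropped frame of a connection sets the counter and later frames of the same connection leave it alone. Connections whose frames were all sent are never touched.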

Stefano Brivio

10:20 p.m.

New subject: [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued

On Wed, 15 May 2024 11:34:27 -0400 Jon Maloy wrote:

...


> +/**
> + * tcp_revert_seq() - Revert affected conn->seq_to_tap after failed transmission
> + * @conns:	Array of connection pointers corresponding to queued frames
> + * @frames:	Two-dimensional array containing queued frames with sub-iovs
> + * @num_frames:	Number of entries in the two arrays to be compared
> + */
> +static void tcp_revert_seq(struct tcp_tap_conn **conns, struct iovec *frames,
> +			   int num_frames)
> +{
> +	int c, f;
> +
> +	for (c = 0, f = 0; c < num_frames; c++, f += TCP_NUM_IOVS) {
> +		struct tcp_tap_conn *conn = conns[c];
> +		struct tcphdr *th = frames[f + TCP_IOV_PAYLOAD].iov_base;
> +		uint32_t seq = ntohl(th->seq);
> +
> +		if (SEQ_LE(conn->seq_to_tap, seq))
> +			continue;
> +
> +		conn->seq_to_tap = seq;
> +	}
> +}
> +
>  /**
>   * tcp_payload_flush() - Send out buffers for segments with data
>   * @c:		Execution context
>   */
>  static void tcp_payload_flush(const struct ctx *c)
>  {
> -	unsigned i;
>  	size_t m;
>
>  	m = tap_send_frames(c, &tcp6_l2_iov[0][0], TCP_NUM_IOVS,
>  			    tcp6_payload_used);
> -	for (i = 0; i < m; i++)
> -		*tcp6_seq_update[i].seq += tcp6_seq_update[i].len;
> +	if (m != tcp6_payload_used)
> +		tcp_revert_seq(tcp6_frame_conns, &tcp6_l2_iov[m][0],
> +			       tcp6_payload_used - m);

Nit, not worth respinning, and I can fix this up on merge: we always use curly brackets around multiple lines, even if it's a single statement, consistently with the current Linux kernel coding style.

...

>  	tcp6_payload_used = 0;
>
>  	m = tap_send_frames(c, &tcp4_l2_iov[0][0], TCP_NUM_IOVS,
>  			    tcp4_payload_used);
> -	for (i = 0; i < m; i++)
> -		*tcp4_seq_update[i].seq += tcp4_seq_update[i].len;
> +	if (m != tcp4_payload_used)
> +		tcp_revert_seq(tcp4_frame_conns, &tcp4_l2_iov[m][0],
> +			       tcp4_payload_used - m);

Same here.

-- 
Stefano

David Gibson

16 May

4:24 a.m.

New subject: [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued

On Wed, May 15, 2024 at 11:34:27AM -0400, Jon Maloy wrote:

...


> +/**
> + * tcp_revert_seq() - Revert affected conn->seq_to_tap after failed transmission
> + * @conns:	Array of connection pointers corresponding to queued frames
> + * @frames:	Two-dimensional array containing queued frames with sub-iovs

You can make the 2d array explicit in the type as:

	struct iovec (*frames)[TCP_NUM_IOVS];

See, for example, the 'tap_iov' local in udp_tap_send(). (I recommend the command line tool 'cdecl', also available online at cdecl.org, for working out confusing pointer-to-array types.)
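
The pointer-to-array parameter suggested here can be tried out on its own. The following is only a sketch under assumptions: NUM_IOVS and IOV_PAYLOAD are hypothetical stand-ins for TCP_NUM_IOVS and TCP_IOV_PAYLOAD, and payload_bytes() is an invented helper, not a passt function. It shows that with this parameter type frames[i] names a whole frame and frames[i][k] one of its iovecs.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define NUM_IOVS	4	/* hypothetical stand-in for TCP_NUM_IOVS */
#define IOV_PAYLOAD	3	/* hypothetical stand-in for TCP_IOV_PAYLOAD */

/* @frames points to rows of NUM_IOVS iovecs, so frames[i] is frame i and
 * frames[i][IOV_PAYLOAD] is the payload iovec of that frame
 */
static size_t payload_bytes(struct iovec (*frames)[NUM_IOVS], size_t num_frames)
{
	size_t i, total = 0;

	assert(frames != NULL || num_frames == 0);

	for (i = 0; i < num_frames; i++)
		total += frames[i][IOV_PAYLOAD].iov_len;

	return total;
}
```

Given an array declared as struct iovec l2_iov[N][NUM_IOVS], a caller can pass l2_iov for the whole queue or &l2_iov[m] for the part starting at frame m. Both expressions already have the parameter's type, so no [0] indexing is needed at the call site.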

...

> + * @num_frames:	Number of entries in the two arrays to be compared
> + */
> +static void tcp_revert_seq(struct tcp_tap_conn **conns, struct iovec *frames,
> +			   int num_frames)
> +{
> +	int c, f;
> +
> +	for (c = 0, f = 0; c < num_frames; c++, f += TCP_NUM_IOVS) {

Nit: I find having the two parallel counters kind of confusing. It naturally goes away with the type change suggested above, but even without that I'd prefer an explicit multiply in the body. I strongly suspect the compiler will be better at working out if the strength reduction is worth it.
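
For comparison, here are the two loop shapes side by side on a flat array. This is a sketch with invented names: NUM_IOVS and IOV_PAYLOAD stand in for TCP_NUM_IOVS and TCP_IOV_PAYLOAD, and the array holds plain integers rather than iovecs so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_IOVS	4	/* hypothetical stand-in for TCP_NUM_IOVS */
#define IOV_PAYLOAD	3	/* hypothetical stand-in for TCP_IOV_PAYLOAD */

/* Two counters advanced in lockstep, as in the posted loop */
static uint32_t sum_parallel(const uint32_t *flat, size_t num_frames)
{
	uint32_t sum = 0;
	size_t c, f;

	assert(flat != NULL);

	for (c = 0, f = 0; c < num_frames; c++, f += NUM_IOVS)
		sum += flat[f + IOV_PAYLOAD];

	return sum;
}

/* One counter, with the flat offset computed where it is used */
static uint32_t sum_multiply(const uint32_t *flat, size_t num_frames)
{
	uint32_t sum = 0;
	size_t i;

	assert(flat != NULL);

	for (i = 0; i < num_frames; i++)
		sum += flat[i * NUM_IOVS + IOV_PAYLOAD];

	return sum;
}
```

Both functions visit the same elements. The second keeps the relation between the frame index and the flat offset in one expression, which is the readability point of the nit.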

...

> +		struct tcp_tap_conn *conn = conns[c];
> +		struct tcphdr *th = frames[f + TCP_IOV_PAYLOAD].iov_base;
> +		uint32_t seq = ntohl(th->seq);
> +
> +		if (SEQ_LE(conn->seq_to_tap, seq))

Isn't this test inverted? We want to rewind seq_to_tap if seq is less than it, rather than the other way around.

...

> +			continue;
> +
> +		conn->seq_to_tap = seq;
> +	}
> +}
> +
>  /**
>   * tcp_payload_flush() - Send out buffers for segments with data
>   * @c:		Execution context
>   */
>  static void tcp_payload_flush(const struct ctx *c)
>  {
> -	unsigned i;
>  	size_t m;
>
>  	m = tap_send_frames(c, &tcp6_l2_iov[0][0], TCP_NUM_IOVS,
>  			    tcp6_payload_used);
> -	for (i = 0; i < m; i++)
> -		*tcp6_seq_update[i].seq += tcp6_seq_update[i].len;
> +	if (m != tcp6_payload_used)
> +		tcp_revert_seq(tcp6_frame_conns, &tcp6_l2_iov[m][0],

With the type change above this would become just &tcp_l2_iov[m].

...


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

Jon Maloy

4:57 a.m.

New subject: [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued

...


On 2024-05-15 22:24, David Gibson wrote:
> On Wed, May 15, 2024 at 11:34:27AM -0400, Jon Maloy wrote:
>> +/**
>> + * tcp_revert_seq() - Revert affected conn->seq_to_tap after failed transmission
>> + * @conns:	Array of connection pointers corresponding to queued frames
>> + * @frames:	Two-dimensional array containing queued frames with sub-iovs
>
> You can make the 2d array explicit in the type as:
> 	struct iovec (*frames)[TCP_NUM_IOVS];
>
> See, for example the 'tap_iov' local in udp_tap_send(). (I recommend
> the command line tool 'cdecl', also available online at cdecl.org for
> working out confusing pointer-to-array types).

Nice. I wasn't quite happy with this.

>> + * @num_frames:	Number of entries in the two arrays to be compared
>> + */
>> +static void tcp_revert_seq(struct tcp_tap_conn **conns, struct iovec *frames,
>> +			   int num_frames)
>> +{
>> +	int c, f;
>> +
>> +	for (c = 0, f = 0; c < num_frames; c++, f += TCP_NUM_IOVS) {
>
> Nit: I find having the two parallel counters kind of confusing. It
> naturally goes away with the type change suggested above, but even
> without that I'd prefer an explicit multiply in the body. I strongly
> suspect the compiler will be better at working out if the strength
> reduction is worth it.
>
>> +		struct tcp_tap_conn *conn = conns[c];
>> +		struct tcphdr *th = frames[f + TCP_IOV_PAYLOAD].iov_base;
>> +		uint32_t seq = ntohl(th->seq);
>> +
>> +		if (SEQ_LE(conn->seq_to_tap, seq))
>
> Isn't this test inverted? We want to rewind seq_to_tap if seq is less
> than it, rather than the other way around.

No. We do 'continue', i.e., nothing, if this condition is fulfilled.
This may look a little non-intuitive here, but makes sense when I add
the next patch.

>> +			continue;
>> +
>> +		conn->seq_to_tap = seq;
>> +	}
>> +}
>> +
>>  /**
>>   * tcp_payload_flush() - Send out buffers for segments with data
>>   * @c:		Execution context
>>   */
>>  static void tcp_payload_flush(const struct ctx *c)
>>  {
>> -	unsigned i;
>>  	size_t m;
>>
>>  	m = tap_send_frames(c, &tcp6_l2_iov[0][0], TCP_NUM_IOVS,
>>  			    tcp6_payload_used);
>> -	for (i = 0; i < m; i++)
>> -		*tcp6_seq_update[i].seq += tcp6_seq_update[i].len;
>> +	if (m != tcp6_payload_used)
>> +		tcp_revert_seq(tcp6_frame_conns, &tcp6_l2_iov[m][0],
>
> With the type change above this would become just &tcp_l2_iov[m].

ok

///jon
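
The polarity of the test discussed in this exchange can be checked with a small model. This is a sketch only: seq_le() is an assumed wraparound-safe definition written for this example, not a copy of passt's SEQ_LE() macro, and revert_seq() is a hypothetical pure-function version of the loop body in tcp_revert_seq().

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe "a <= b" for 32-bit sequence numbers. Assumed definition
 * for illustration: b is at or after a if it is less than 2^31 ahead of it.
 */
static int seq_le(uint32_t a, uint32_t b)
{
	return (uint32_t)(b - a) < 0x80000000u;
}

/* Value seq_to_tap takes after a frame starting at @frame_seq was dropped */
static uint32_t revert_seq(uint32_t seq_to_tap, uint32_t frame_seq)
{
	if (seq_le(seq_to_tap, frame_seq))
		return seq_to_tap;	/* counter not ahead of the frame: keep */

	return frame_seq;		/* counter ahead: rewind to the frame */
}
```

With these definitions, revert_seq(1300, 1100) rewinds to 1100, while revert_seq(1100, 1200) leaves the counter at 1100. The same holds across the 32-bit wrap: a counter of 0x00000010 is treated as ahead of a dropped frame at 0xfffffff0 and is rewound to it.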


David Gibson

6:16 a.m.

New subject: [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued

On Wed, May 15, 2024 at 10:57:06PM -0400, Jon Maloy wrote:

...

On 2024-05-15 22:24, David Gibson wrote:


On Wed, May 15, 2024 at 11:34:27AM -0400, Jon Maloy wrote:

diff --git a/tcp.c b/tcp.c
index 21d0af0..976dba8 100644
--- a/tcp.c
+++ b/tcp.c
@@ -410,16 +410,6 @@ static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS];
  */
 static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE];
 
-/**
- * tcp_buf_seq_update - Sequences to update with length of frames once sent
- * @seq:	Pointer to sequence number sent to tap-side, to be updated
- * @len:	TCP payload length
- */
-struct tcp_buf_seq_update {
-	uint32_t *seq;
-	uint16_t len;
-};
-
 /* Static buffers */
 /**
  * struct tcp_payload_t - TCP header and data to send segments with payload
@@ -461,7 +451,8 @@ static struct tcp_payload_t tcp4_payload[TCP_FRAMES_MEM];
 
 static_assert(MSS4 <= sizeof(tcp4_payload[0].data), "MSS4 is greater than 65516");
 
-static struct tcp_buf_seq_update tcp4_seq_update[TCP_FRAMES_MEM];
+/* References tracking the owner connection of frames in the tap outqueue */
+static struct tcp_tap_conn *tcp4_frame_conns[TCP_FRAMES_MEM];
 static unsigned int tcp4_payload_used;
 
 static struct tap_hdr tcp4_flags_tap_hdr[TCP_FRAMES_MEM];
@@ -483,7 +474,8 @@ static struct tcp_payload_t tcp6_payload[TCP_FRAMES_MEM];
 
 static_assert(MSS6 <= sizeof(tcp6_payload[0].data), "MSS6 is greater than 65516");
 
-static struct tcp_buf_seq_update tcp6_seq_update[TCP_FRAMES_MEM];
+/* References tracking the owner connection of frames in the tap outqueue */
+static struct tcp_tap_conn *tcp6_frame_conns[TCP_FRAMES_MEM];
 static unsigned int tcp6_payload_used;
 
 static struct tap_hdr tcp6_flags_tap_hdr[TCP_FRAMES_MEM];
@@ -1261,25 +1253,49 @@ static void tcp_flags_flush(const struct ctx *c)
 	tcp4_flags_used = 0;
 }
 
+/**
+ * tcp_revert_seq() - Revert affected conn->seq_to_tap after failed transmission
+ * @conns:		Array of connection pointers corresponding to queued frames
+ * @frames:		Two-dimensional array containing queued frames with sub-iovs

You can make the 2d array explicit in the type as:

	struct iovec (*frames)[TCP_NUM_IOVS];

See, for example, the 'tap_iov' local in udp_tap_send(). (I recommend the command line tool 'cdecl', also available online at cdecl.org, for working out confusing pointer-to-array types.)

Nice. I wasn't quite happy with this.

+ * @num_frames:	Number of entries in the two arrays to be compared
+ */
+static void tcp_revert_seq(struct tcp_tap_conn **conns, struct iovec *frames,
+			   int num_frames)
+{
+	int c, f;
+
+	for (c = 0, f = 0; c < num_frames; c++, f += TCP_NUM_IOVS) {

Nit: I find having the two parallel counters kind of confusing. It naturally goes away with the type change suggested above, but even without that I'd prefer an explicit multiply in the body. I strongly suspect the compiler will be better at working out if the strength reduction is worth it.

+		struct tcp_tap_conn *conn = conns[c];
+		struct tcphdr *th = frames[f + TCP_IOV_PAYLOAD].iov_base;
+		uint32_t seq = ntohl(th->seq);
+
+		if (SEQ_LE(conn->seq_to_tap, seq))

Isn't this test inverted? We want to rewind seq_to_tap if seq is less than it, rather than the other way around.

No. We do 'continue', i.e., nothing, if this condition is fulfilled. This may look a little non-intuitive here, but makes sense when I add the next patch.

Oh, of course, my mistake.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
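As an illustration of the two options raised in the review above (a pointer-to-array parameter, or a single counter with an explicit multiply), here is a minimal sketch. It is not code from the patch: NUM_IOVS and IOV_PAYLOAD are hypothetical stand-ins for TCP_NUM_IOVS and TCP_IOV_PAYLOAD, and the helper names are invented for the example. cdecl would read the second parameter type as "pointer to array 4 of struct iovec".

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define NUM_IOVS	4	/* hypothetical stand-in for TCP_NUM_IOVS */
#define IOV_PAYLOAD	3	/* hypothetical stand-in for TCP_IOV_PAYLOAD */
#define NUM_FRAMES	3

/* Flat view: one frame counter, explicit multiply in the body */
static void *payload_flat(struct iovec *frames, int i)
{
	assert(frames != NULL);
	return frames[i * NUM_IOVS + IOV_PAYLOAD].iov_base;
}

/* Pointer-to-array view: frames[i] is a whole row of NUM_IOVS iovecs */
static void *payload_2d(struct iovec (*frames)[NUM_IOVS], int i)
{
	assert(frames != NULL);
	return frames[i][IOV_PAYLOAD].iov_base;
}

/* Build a small table and check that both views select the same element.
 * Returns 1 if they agree for every frame, 0 otherwise. */
static int views_agree(void)
{
	static char bufs[NUM_FRAMES][NUM_IOVS];
	struct iovec table[NUM_FRAMES][NUM_IOVS];
	int i, j;

	for (i = 0; i < NUM_FRAMES; i++) {
		for (j = 0; j < NUM_IOVS; j++) {
			table[i][j].iov_base = &bufs[i][j];
			table[i][j].iov_len = 1;
		}
	}

	for (i = 0; i < NUM_FRAMES; i++) {
		if (payload_flat(&table[0][0], i) != payload_2d(table, i))
			return 0;
		if (payload_2d(table, i) != &bufs[i][IOV_PAYLOAD])
			return 0;
	}
	return 1;
}
```

With the pointer-to-array type, a caller can pass `table` (or `&table[m]` to start at frame m) directly, and the second loop counter is no longer needed.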

Jon Maloy

4 Jun

7:36 p.m.

New subject: [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued

Hi David, This is the last comment I received from you regarding this patch. See below for further comment. On 2024-05-16 00:16, David Gibson wrote:

...

On Wed, May 15, 2024 at 10:57:06PM -0400, Jon Maloy wrote:

...
On 2024-05-15 22:24, David Gibson wrote:

...
...

The code now (v7) looks as follows:

/**
 * tcp_revert_seq() - Revert affected conn->seq_to_tap after failed transmission
 * @conns:		Array of connection pointers corresponding to queued frames
 * @frames:		Two-dimensional array containing queued frames with sub-iovs
 * @num_frames:	Number of entries in the two arrays to be compared
 */
static void tcp_revert_seq(struct tcp_tap_conn **conns,
			   struct iovec (*frames)[TCP_NUM_IOVS], int num_frames)
{
	int i;

	for (i = 0; i < num_frames; i++) {
		struct tcp_tap_conn *conn = conns[i];
		struct tcphdr *th = frames[i][TCP_IOV_PAYLOAD].iov_base;
		uint32_t seq = ntohl(th->seq);

		if (SEQ_LE(conn->seq_to_tap, seq))
			continue;

		conn->seq_to_tap = seq;
		tcp_set_peek_offset(conn->sock, seq - conn->seq_ack_from_tap);
	}
}

/**
 * tcp_payload_flush() - Send out buffers for segments with data
 * @c:		Execution context
 */
static void tcp_payload_flush(const struct ctx *c)
{
	size_t m;

	m = tap_send_frames(c, &tcp6_l2_iov[0][0], TCP_NUM_IOVS,
			    tcp6_payload_used);
	if (m != tcp6_payload_used) {
		tcp_revert_seq(tcp6_frame_conns, &tcp6_l2_iov[m],
			       tcp6_payload_used - m);
	}
	tcp6_payload_used = 0;

	m = tap_send_frames(c, &tcp4_l2_iov[0][0], TCP_NUM_IOVS,
			    tcp4_payload_used);
	if (m != tcp4_payload_used) {
		tcp_revert_seq(tcp4_frame_conns, &tcp4_l2_iov[m],
			       tcp4_payload_used - m);
	}
	tcp4_payload_used = 0;
}

Was this the version you were talking about on Monday morning? Did you spot some bug here which I am missing?

Thanks
///jon
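To make the revert logic easier to follow in isolation, here is a reduced sketch under stated assumptions. It is not the passt code: struct conn and struct frame are hypothetical, much simplified types, and seq_le() is my own guess at what a SEQ_LE-style comparison does (serial-number arithmetic modulo 2^32). The sketch advances both parallel arrays by the same m, so that conns[i] and frames[i] always describe the same queued frame; that alignment is an assumption of the sketch, not a statement about the code quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* True if a is at or before b in sequence space, modulo 2^32
 * (assumed behaviour of a SEQ_LE-like macro, not copied from tcp.c) */
static int seq_le(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) <= 0;
}

struct conn {			/* hypothetical, reduced connection state */
	uint32_t seq_to_tap;	/* next sequence number to queue */
};

struct frame {			/* hypothetical queued frame */
	uint32_t seq;		/* sequence number of its first byte */
};

/* For each unsent frame, rewind the owning connection to the lowest
 * sequence number that was not sent. Later frames of the same connection
 * hit the 'continue' because seq_to_tap is already at or before them. */
static void revert_seq(struct conn **conns, const struct frame *frames, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		if (seq_le(conns[i]->seq_to_tap, frames[i].seq))
			continue;

		conns[i]->seq_to_tap = frames[i].seq;
	}
}

/* Queue three frames (one for a, two for b), pretend that only the first
 * m were sent, and revert the remaining ones. */
static void flush_demo(struct conn *a, struct conn *b, int m)
{
	struct conn *conns[3] = { a, b, b };
	struct frame frames[3] = { { 1000 }, { 5000 }, { 6460 } };

	a->seq_to_tap = 2460;	/* 1000 + 1460 bytes queued */
	b->seq_to_tap = 7920;	/* 5000 + 2 * 1460 bytes queued */

	assert(m >= 0 && m <= 3);
	revert_seq(&conns[m], &frames[m], 3 - m);
}
```

Calling flush_demo(&a, &b, 1) leaves a untouched and rewinds b to 5000: the second frame of b is skipped by the 'continue' branch, which is the behaviour discussed in the "Isn't this test inverted?" exchange.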

Jon Maloy

8:04 p.m.

New subject: [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued

Hi David, Found it (not your missing comment, but the bug), and it fixed the problem. I'll post this patch separately shortly. ///jon On 2024-06-04 13:36, Jon Maloy wrote:

...


Stefano Brivio

8:10 p.m.

New subject: [PATCH v4 1/3] tcp: move seq_to_tap update to when frame is queued

Jon, On Tue, 4 Jun 2024 14:04:20 -0400 Jon Maloy wrote:

...

Hi David, Found it (not your missing comment, but the bug), and it fixed the problem. I'll post this patch separately shortly.

See:

  https://archives.passt.top/passt-dev/Zkr_4LkjDImgFqSi@zatzit/

for David's comment on the subject.

-- 
Stefano

Jon Maloy

15 May

5:34 p.m.

New subject: [PATCH v4 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available

From linux-6.9.0 the kernel will contain commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").

This new feature makes it possible to call recvmsg(MSG_PEEK) and make it start reading data from a given offset set by the SO_PEEK_OFF socket option. This way, we can avoid repeated reading of already read bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->namespace direction.

In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not.

Measurements with iperf3 show that throughput increases by 15-20 percent in the host->namespace direction when this feature is used.

Signed-off-by: Jon Maloy
---
v2: - Some smaller changes as suggested by David Gibson and Stefano Brivio.
    - Moved initial set_peek_offset(0) to only the locations where the socket is set to ESTABLISHED.
    - Removed the per-packet synchronization between sk_peek_off and already_sent. Instead only doing it in retransmit situations.
    - The problem I found when troubleshooting the occasionally occurring out-of-sync values between 'already_sent' and 'sk_peek_offset' may have deeper implications that we may need to investigate.

v3: - Rebased to most recent version of tcp.c, plus the previous patch in this series.
    - Some changes based on feedback from PASST team

v4: - Some small changes based on feedback from Stefano/David.
---
 tcp.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 50 insertions(+), 8 deletions(-)

diff --git a/tcp.c b/tcp.c
index 976dba8..4163bf9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -511,6 +511,9 @@ static struct iovec tcp6_l2_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
 static struct iovec tcp4_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
 static struct iovec tcp6_l2_flags_iov [TCP_FRAMES_MEM][TCP_NUM_IOVS];
 
+/* Does the kernel support TCP_PEEK_OFF? */
+static bool peek_offset_cap;
+
 /* sendmsg() to socket */
 static struct iovec tcp_iov [UIO_MAXIOV];
 
@@ -526,6 +529,20 @@ static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
 int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
 int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
 
+/**
+ * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported
+ * @s:		Socket to update
+ * @offset:	Offset in bytes
+ */
+static void tcp_set_peek_offset(int s, int offset)
+{
+	if (!peek_offset_cap)
+		return;
+
+	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
+		err("Failed to set SO_PEEK_OFF to %u in socket %i", offset, s);
+}
+
 /**
  * tcp_conn_epoll_events() - epoll events mask for given connection state
  * @events:	Current connection events
@@ -2197,14 +2214,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 	uint32_t already_sent, seq;
 	struct iovec *iov;
 
+	/* How much have we read/sent since last received ack ? */
 	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
-
 	if (SEQ_LT(already_sent, 0)) {
 		/* RFC 761, section 2.1. */
 		flow_trace(conn, "ACK sequence gap: ACK for %u, sent: %u",
 			   conn->seq_ack_from_tap, conn->seq_to_tap);
 		conn->seq_to_tap = conn->seq_ack_from_tap;
 		already_sent = 0;
+		tcp_set_peek_offset(s, 0);
 	}
 
 	if (!wnd_scaled || already_sent >= wnd_scaled) {
@@ -2222,11 +2240,16 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 		iov_rem = (wnd_scaled - already_sent) % mss;
 	}
 
-	mh_sock.msg_iov = iov_sock;
-	mh_sock.msg_iovlen = fill_bufs + 1;
-
-	iov_sock[0].iov_base = tcp_buf_discard;
-	iov_sock[0].iov_len = already_sent;
+	/* Prepare iov according to kernel capability */
+	if (!peek_offset_cap) {
+		mh_sock.msg_iov = iov_sock;
+		iov_sock[0].iov_base = tcp_buf_discard;
+		iov_sock[0].iov_len = already_sent;
+		mh_sock.msg_iovlen = fill_bufs + 1;
+	} else {
+		mh_sock.msg_iov = &iov_sock[1];
+		mh_sock.msg_iovlen = fill_bufs;
+	}
 
 	if (( v4 && tcp4_payload_used + fill_bufs > TCP_FRAMES_MEM) ||
 	    (!v4 && tcp6_payload_used + fill_bufs > TCP_FRAMES_MEM)) {
@@ -2267,7 +2290,10 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 		return 0;
 	}
 
-	sendlen = len - already_sent;
+	sendlen = len;
+	if (!peek_offset_cap)
+		sendlen -= already_sent;
+
 	if (sendlen <= 0) {
 		conn_flag(c, conn, STALLED);
 		return 0;
@@ -2438,6 +2464,7 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 			   "fast re-transmit, ACK: %u, previous sequence: %u",
 			   max_ack_seq, conn->seq_to_tap);
 		conn->seq_to_tap = max_ack_seq;
+		tcp_set_peek_offset(conn->sock, 0);
 		tcp_data_from_sock(c, conn);
 	}
 
@@ -2530,6 +2557,7 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn,
 	conn->seq_ack_to_tap = conn->seq_from_tap;
 
 	conn_event(c, conn, ESTABLISHED);
+	tcp_set_peek_offset(conn->sock, 0);
 
 	/* The client might have sent data already, which we didn't
 	 * dequeue waiting for SYN,ACK from tap -- check now.
@@ -2610,6 +2638,7 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, sa_family_t af,
 			goto reset;
 
 		conn_event(c, conn, ESTABLISHED);
+		tcp_set_peek_offset(conn->sock, 0);
 
 		if (th->fin) {
 			conn->seq_from_tap++;
@@ -2863,6 +2892,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 			flow_dbg(conn, "ACK timeout, retry");
 			conn->retrans++;
 			conn->seq_to_tap = conn->seq_ack_from_tap;
+			tcp_set_peek_offset(conn->sock, 0);
 			tcp_data_from_sock(c, conn);
 			tcp_timer_ctl(c, conn);
 		}
@@ -3154,7 +3184,8 @@ static void tcp_sock_refill_init(const struct ctx *c)
  */
 int tcp_init(struct ctx *c)
 {
-	unsigned b;
+	unsigned int b, optv = 0;
+	int s;
 
 	for (b = 0; b < TCP_HASH_TABLE_SIZE; b++)
 		tc_hash[b] = FLOW_SIDX_NONE;
@@ -3178,6 +3209,17 @@ int tcp_init(struct ctx *c)
 		NS_CALL(tcp_ns_socks_init, c);
 	}
 
+	/* Probe for SO_PEEK_OFF support */
+	s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP);
+	if (s < 0) {
+		warn("Temporary TCP socket creation failed");
+	} else {
+		if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int)))
+			peek_offset_cap = true;
+		close(s);
+	}
+	info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
+
 	return 0;
 }
-- 
2.42.0
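The semantics the patch relies on (probe for SO_PEEK_OFF, set an offset, then peek from that offset without consuming data) can be tried out with a small stand-alone sketch. This is only an illustration under assumptions: it uses an AF_UNIX stream socketpair instead of a TCP socket, because SO_PEEK_OFF on UNIX sockets has been available on Linux for much longer than the TCP support referenced in the commit message. The function names are hypothetical, and the code reports -1 rather than failing where the option is missing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if SO_PEEK_OFF can be set on socket s, 0 otherwise */
static int peek_off_supported(int s)
{
#ifdef SO_PEEK_OFF
	int off = 0;

	return setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off)) == 0;
#else
	(void)s;
	return 0;
#endif
}

/* Peek up to len bytes starting at offset, without consuming anything.
 * Returns the number of bytes peeked, or -1 if unsupported or on error. */
static ssize_t peek_at(int s, int offset, char *buf, size_t len)
{
#ifdef SO_PEEK_OFF
	assert(offset >= 0);
	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
		return -1;

	return recv(s, buf, len, MSG_PEEK | MSG_DONTWAIT);
#else
	(void)s; (void)offset; (void)buf; (void)len;
	return -1;
#endif
}

/* Send "HELLOWORLD" over a socketpair and peek at it from offset 5.
 * Returns what peek_at() returned, or -1 if the demo could not run. */
static ssize_t peek_second_half(char *out, size_t len)
{
	static const char msg[] = "HELLOWORLD";
	size_t mlen = strlen(msg);
	ssize_t n = -1;
	int sv[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return -1;

	if (write(sv[0], msg, mlen) == (ssize_t)mlen &&
	    peek_off_supported(sv[1]))
		n = peek_at(sv[1], 5, out, len);

	close(sv[0]);
	close(sv[1]);
	return n;
}
```

Where the option is supported, the peek starts after the first five bytes, which corresponds to skipping the already_sent bytes without reading them into a discard buffer first.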

Stefano Brivio

10:22 p.m.

New subject: [PATCH v4 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available

Just two nits, I would be fine applying this as it is: On Wed, 15 May 2024 11:34:28 -0400 Jon Maloy wrote:

...


@@ -526,6 +529,20 @@ static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX,
 int init_sock_pool4 [TCP_SOCK_POOL_SIZE];
 int init_sock_pool6 [TCP_SOCK_POOL_SIZE];
 
+/**
+ * tcp_set_peek_offset() - Set SO_PEEK_OFF offset on a socket if supported
+ * @s:		Socket to update
+ * @offset:	Offset in bytes
+ */
+static void tcp_set_peek_offset(int s, int offset)
+{
+	if (!peek_offset_cap)
+		return;
+
+	if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &offset, sizeof(offset)))
+		err("Failed to set SO_PEEK_OFF to %u in socket %i", offset, s);

I thought we'd get a format warning if you use %u to print a signed value, but no, gcc seems to be happy with it.

...

+}
+
 /**
  * tcp_conn_epoll_events() - epoll events mask for given connection state
  * @events:	Current connection events
@@ -2197,14 +2214,15 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 	uint32_t already_sent, seq;
 	struct iovec *iov;
 
+	/* How much have we read/sent since last received ack ? */
 	already_sent = conn->seq_to_tap - conn->seq_ack_from_tap;
-

I still maintain that dropping this newline is a spurious change, but if you really dislike it, I don't have a strong preference to keep it, either.

...


-- Stefano
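On the %u remark above: as far as I know, gcc's default -Wformat does not flag a signedness mismatch between a conversion specifier and an int argument, while -Wformat-signedness does. The following sketch, with hypothetical helper names and assuming a 32-bit unsigned int, shows what each specifier would display for a negative offset.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format an int offset with the matching signed specifier */
static int fmt_offset(char *buf, size_t len, int offset)
{
	return snprintf(buf, len, "SO_PEEK_OFF %i", offset);
}

/* Format the same value converted to unsigned, which is what a %u
 * conversion would show for it */
static int fmt_offset_unsigned(char *buf, size_t len, int offset)
{
	return snprintf(buf, len, "SO_PEEK_OFF %u", (unsigned int)offset);
}
```

For non-negative offsets both variants print the same text, so the difference only shows up in a log line if the offset ever goes negative.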

David Gibson

16 May

4:29 a.m.

New subject: [PATCH v4 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available

On Wed, May 15, 2024 at 11:34:28AM -0400, Jon Maloy wrote:

...

...
From linux-6.9.0 the kernel will contain commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").

This new feature makes it possible to call recvmsg(MSG_PEEK) and make it start reading data from a given offset set by the SO_PEEK_OFF socket option. This way, we can avoid repeated reading of already read bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->namespace direction.

In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not.

Measurements with iperf3 show that throughput increases by 15-20 percent in the host->namespace direction when this feature is used.

Signed-off-by: Jon Maloy

I'm pretty sure this needs one more call to tcp_set_peek_offset() inside the revert function introduced in the previous patch. Otherwise looks good.
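To spell out why the revert path would need to touch the peek offset as well: if the offset is meant to track the number of bytes sent but not yet acknowledged, then rewinding seq_to_tap changes that number. A minimal sketch of the arithmetic, using a hypothetical reduced struct rather than struct tcp_tap_conn:

```c
#include <assert.h>
#include <stdint.h>

struct conn_seq {			/* hypothetical reduced connection state */
	uint32_t seq_to_tap;		/* next sequence to send towards tap */
	uint32_t seq_ack_from_tap;	/* last sequence acknowledged by tap */
};

/* Bytes sent but not yet acknowledged, modulo 2^32. Under the assumption
 * above, this is the value the peek offset has to match. */
static uint32_t peek_offset_for(const struct conn_seq *c)
{
	return c->seq_to_tap - c->seq_ack_from_tap;
}

/* Rewind seq_to_tap to seq and return the peek offset that goes with it */
static uint32_t rewind_to(struct conn_seq *c, uint32_t seq)
{
	c->seq_to_tap = seq;
	return peek_offset_for(c);
}
```

The unsigned subtraction keeps working across sequence number wraparound, e.g. seq_to_tap = 0x00000100 and seq_ack_from_tap = 0xffffff00 give an offset of 0x200.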

...

@@ -3154,7 +3184,8 @@ static void tcp_sock_refill_init(const struct ctx *c)
  */
 int tcp_init(struct ctx *c)
 {
-	unsigned b;
+	unsigned int b, optv = 0;
+	int s;

for (b = 0; b < TCP_HASH_TABLE_SIZE; b++) tc_hash[b] = FLOW_SIDX_NONE; @@ -3178,6 +3209,17 @@ int tcp_init(struct ctx *c) NS_CALL(tcp_ns_socks_init, c); }

+ /* Probe for SO_PEEK_OFF support */ + s = socket(AF_INET, SOCK_STREAM | SOCK_CLOEXEC, IPPROTO_TCP); + if (s < 0) { + warn("Temporary TCP socket creation failed"); + } else { + if (!setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &optv, sizeof(int))) + peek_offset_cap = true; + close(s); + } + info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); + return 0; }

-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

Jon Maloy

5:03 a.m.

New subject: [PATCH v4 2/3] tcp: leverage support of SO_PEEK_OFF socket option when available

On 2024-05-15 22:29, David Gibson wrote:

...

On Wed, May 15, 2024 at 11:34:28AM -0400, Jon Maloy wrote:

...
...
From linux-6.9.0 the kernel will contain commit 05ea491641d3 ("tcp: add support for SO_PEEK_OFF socket option").

This new feature makes it possible to call recvmsg(MSG_PEEK) and make it start reading data from a given offset set by the SO_PEEK_OFF socket option. This way, we can avoid repeated reading of already read bytes of a received message, hence saving read cycles when forwarding TCP messages in the host->namespace direction.

In this commit, we add functionality to leverage this feature when available, while we fall back to the previous behavior when not.

Measurements with iperf3 show that throughput increases by 15-20 percent in the host->namespace direction when this feature is used.

Signed-off-by: Jon Maloy

I'm pretty sure this needs one more call to tcp_set_peek_offset() inside the revert function introduced in the previous patch. Otherwise looks good.

Of course. I must have deleted it by accident when I removed the 'paranoid' test SEQ_GE(conn->seq_to_tap, conn->seq_ack_from_tap). It worked in my test simply because this never happens. I actually had to test the logic in a separate program. Thank you for spotting this.

///jon

...

...
--- v2: - Some smaller changes as suggested by David Gibson and Stefano Brivio. - Moved initial set_peek_offset(0) to only the locations where the socket is set to ESTABLISHED. - Removed the per-packet synchronization between sk_peek_off and already_sent. Instead only doing it in retransmit situations. - The problem I found when troubleshooting the occasionally occurring out-of-sync values between 'already_sent' and 'sk_peek_offset' may have deeper implications that we may need to investigate.


Jon Maloy

15 May 15 May

5:34 p.m.

New subject: [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero

A bug in kernel TCP may lead to a deadlock where a zero window is sent from the peer, while it is unable to send out window updates even after reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisements from the peer can only be triggered by packets arriving from this side.

However, such packets are never sent, because the zero-window condition currently prevents this side from sending out any packets whatsoever to the peer.

We notice that the above bug is triggered *only* after the peer has dropped an arriving packet because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that it goes against the RFC 9293 recommendation that a previously advertised window never should shrink.

RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find the following statement: "A TCP receiver SHOULD NOT shrink the window, i.e., move the right window edge to the left (SHLD-14). However, a sending TCP peer MUST be robust against window shrinking, which may cause the "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).

If this happens, the sender SHOULD NOT send new data (SHLD-15), but SHOULD retransmit normally the old unacknowledged data between SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data beyond SND.UNA+SND.WND (MAY-7)"

We never see the window become negative, but we interpret this as a recommendation to use the previously available window during retransmission even when the currently advertised window is zero.

We use the above mechanism only at timer-induced retransmits. In the case we receive a duplicate ack and a zero window, but still know we have outstanding data acks waiting, we send out an empty "fast probe" instead of doing fast retransmit. This averts the risk of overwhelming a memory squeezed peer with retransmits, while still forcing it to send out a new window update when the probe is received. This entails a theoretical risk of redundant retransmits from the peer, but that is a risk worth taking.

In case of a zero-window non-retransmission situation where there is no new data to be sent, we also add a simple zero-window probing feature. By sending an empty packet at regular timeout events we resolve the situation described above, since the peer receives the necessary trigger to advertise its window once it becomes non-zero again.

It should be noted that although this solves the problem we have at hand, it is not a genuine solution to the kernel bug. There may well be TCP stacks around in other OS-es which don't do this, nor have keep-alive probing as an alternative way to solve the situation.

Signed-off-by: Jon Maloy

--- v2: - Using previously advertised window during retransmission, instead of highest send sequence number in the cycle. v3: - Rebased to newest code - Changes based on feedback from PASST team - Sending out empty probe message at timer expiration when we are not in retransmit situation. v4: - Some small changes based on feedback from PASST team. - Replaced fast retransmit with a one-time 'fast probe' when window is zero.
--- tcp.c | 32 +++++++++++++++++++++++++++----- tcp_conn.h | 2 ++ 2 files changed, 29 insertions(+), 5 deletions(-) diff --git a/tcp.c b/tcp.c index 4163bf9..a33f494 100644 --- a/tcp.c +++ b/tcp.c @@ -1761,9 +1761,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn, */ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) { + uint32_t wnd_edge; + wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX); + wnd_edge = conn->seq_ack_from_tap + wnd; + if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap)) + conn->seq_wnd_edge_from_tap = wnd_edge; + /* FIXME: reflect the tap-side receiver's window back to the sock-side * sender by adjusting SO_RCVBUF? */ } @@ -1796,6 +1802,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; + conn->seq_wnd_edge_from_tap = conn->seq_to_tap; } /** @@ -2205,13 +2212,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) { - uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap; int fill_bufs, send_bufs = 0, last_len, iov_rem = 0; int sendlen, len, dlen, v4 = CONN_V4(conn); + uint32_t already_sent, max_send, seq; int s = conn->sock, i, ret = 0; struct msghdr mh_sock = { 0 }; uint16_t mss = MSS_GET(conn); - uint32_t already_sent, seq; struct iovec *iov; /* How much have we read/sent since last received ack ? */ @@ -2225,19 +2231,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) tcp_set_peek_offset(s, 0); } - if (!wnd_scaled || already_sent >= wnd_scaled) { + /* How much are we still allowed to send within current window ? 
*/ + max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap; + if (SEQ_LE(max_send, 0)) { + flow_trace(conn, "Empty window: win_upper: %u, sent: %u", + conn->seq_wnd_edge_from_tap, conn->seq_to_tap); + conn->seq_wnd_edge_from_tap = conn->seq_to_tap; conn_flag(c, conn, STALLED); conn_flag(c, conn, ACK_FROM_TAP_DUE); return 0; } /* Set up buffer descriptors we'll fill completely and partially. */ - fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss); + fill_bufs = DIV_ROUND_UP(max_send, mss); if (fill_bufs > TCP_FRAMES) { fill_bufs = TCP_FRAMES; iov_rem = 0; } else { - iov_rem = (wnd_scaled - already_sent) % mss; + iov_rem = max_send % mss; } /* Prepare iov according to kernel capability */ @@ -2466,6 +2477,13 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, conn->seq_to_tap = max_ack_seq; tcp_set_peek_offset(conn->sock, 0); tcp_data_from_sock(c, conn); + } else if (!max_ack_seq_wnd && SEQ_GT(conn->seq_to_tap, max_ack_seq)) { + /* Force peer to send new advertisement now, but only once */ + flow_trace(conn, "fast probe, ACK: %u, previous sequence: %u", + max_ack_seq, conn->seq_to_tap); + tcp_send_flag(c, conn, ACK); + conn->seq_to_tap = max_ack_seq; + tcp_set_peek_offset(conn->sock, 0); } if (!iov_i) @@ -2911,6 +2929,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) flow_dbg(conn, "activity timeout"); tcp_rst(c, conn); } + /* No data exchanged recently? Keep connection alive. 
*/ + if (conn->seq_to_tap == conn->seq_ack_from_tap && + conn->seq_from_tap == conn->seq_ack_to_tap) + tcp_send_flag(c, conn, ACK); } } diff --git a/tcp_conn.h b/tcp_conn.h index d280b22..5cbad2a 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -30,6 +30,7 @@ * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap * @seq_ack_from_tap: Last ACK number received from tap + * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap * @seq_from_tap: Next sequence for packets from tap (not actually sent) * @seq_ack_to_tap: Last ACK number sent to tap * @seq_init_from_tap: Initial sequence number from tap @@ -101,6 +102,7 @@ struct tcp_tap_conn { uint32_t seq_to_tap; uint32_t seq_ack_from_tap; + uint32_t seq_wnd_edge_from_tap; uint32_t seq_from_tap; uint32_t seq_ack_to_tap; uint32_t seq_init_from_tap; -- 2.42.0
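The right-edge bookkeeping introduced by the patch above can be sketched on its own. This is a simplified model under stated assumptions: `struct win_state`, `win_update()` and `win_max_send()` are hypothetical names, the ACK number is passed in explicitly rather than read from the connection, and `seq_gt()` is a plain signed-difference comparison standing in for passt's SEQ_GT() macro, whose definition is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced view of the per-connection sequence state */
struct win_state {
	uint32_t seq_to_peer;		/* Next sequence we would send */
	uint32_t seq_ack_from_peer;	/* Last ACK number received */
	uint32_t seq_wnd_edge;		/* Right edge of last non-zero window */
};

/* Wraparound-safe "a is after b" for 32-bit sequence numbers */
static int seq_gt(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

/* Record an ACK and window: the edge only ever moves to the right,
 * and a zero window leaves the previously advertised edge in place.
 */
static void win_update(struct win_state *w, uint32_t ack, uint32_t wnd)
{
	uint32_t edge = ack + wnd;

	assert(w);
	w->seq_ack_from_peer = ack;
	if (wnd && seq_gt(edge, w->seq_wnd_edge))
		w->seq_wnd_edge = edge;
}

/* Bytes we may still send before reaching the remembered right edge */
static uint32_t win_max_send(const struct win_state *w)
{
	int32_t room = (int32_t)(w->seq_wnd_edge - w->seq_to_peer);

	return room > 0 ? (uint32_t)room : 0;
}
```

With this model, a peer that acknowledges 40 of 100 bytes and then advertises a zero window still leaves 60 bytes of room once the sender rewinds `seq_to_peer` to the ACK number, which is the retransmission allowance the commit message derives from RFC 9293.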

Stefano Brivio

10:24 p.m.

New subject: [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero

On Wed, 15 May 2024 11:34:29 -0400 Jon Maloy wrote:

...

A bug in kernel TCP may lead to a deadlock where a zero window is sent from the peer, while it is unable to send out window updates even after reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisements from the peer can only be triggered by packets arriving from this side.

However, such packets are never sent, because the zero-window condition currently prevents this side from sending out any packets whatsoever to the peer.

We notice that the above bug is triggered *only* after the peer has dropped an arriving packet because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that it goes against the RFC 9293 recommendation that a previously advertised window never should shrink.

RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find the following statement: "A TCP receiver SHOULD NOT shrink the window, i.e., move the right window edge to the left (SHLD-14). However, a sending TCP peer MUST be robust against window shrinking, which may cause the "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).

If this happens, the sender SHOULD NOT send new data (SHLD-15), but SHOULD retransmit normally the old unacknowledged data between SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data beyond SND.UNA+SND.WND (MAY-7)"

We never see the window become negative, but we interpret this as a recommendation to use the previously available window during retransmission even when the currently advertised window is zero.

We use the above mechanism only at timer-induced retransmits. In the case we receive duplicate ack and a zero window, but still know we have outstanding data acks waiting, we send out an empty "fast probe" instead of doing fast retransmit. This averts the risk of overwhelming a memory squeezed peer with retransmits, while still forcing it to send out a new window update when the probe is received. This entails a theoretical risk of redundant retransmits from the peer, but that is a risk worth taking.

In case of a zero-window non-retransmission situation where there is no new data to be sent, we also add a simple zero-window probing feature. By sending an empty packet at regular timeout events we resolve the situation described above, since the peer receives the necessary trigger to advertise its window once it becomes non-zero again.

It should be noted that although this solves the problem we have at hand, it is not a genuine solution to the kernel bug. There may well be TCP stacks around in other OS-es which don't do this, nor have keep-alive probing as an alternative way to solve the situation.

Signed-off-by: Jon Maloy

--- v2: - Using previously advertised window during retransmission, instead of highest send sequence number in the cycle. v3: - Rebased to newest code - Changes based on feedback from PASST team - Sending out empty probe message at timer expiration when we are not in retransmit situation. v4: - Some small changes based on feedback from PASST team. - Replaced fast retransmit with a one-time 'fast probe' when window is zero. --- tcp.c | 32 +++++++++++++++++++++++++++----- tcp_conn.h | 2 ++ 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/tcp.c b/tcp.c index 4163bf9..a33f494 100644 --- a/tcp.c +++ b/tcp.c @@ -1761,9 +1761,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn, */ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) { + uint32_t wnd_edge; + wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);

+ wnd_edge = conn->seq_ack_from_tap + wnd; + if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))

Here, cppcheck ('make cppcheck') says:

tcp.c:1770:6: style: Condition 'wnd' is always true [knownConditionTrueFalse]
 if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
     ^
tcp.c:1766:8: note: Assignment 'wnd=((1<<(16+8))<(wnd<<conn->ws_from_tap))?(1<<(16+8)):(wnd<<conn->ws_from_tap)', assigned value is less than 1
 wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
       ^
tcp.c:1770:6: note: Condition 'wnd' is always true
 if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
     ^

See the comment in tcp_update_seqack_wnd() and related suppression. It's clearly a false positive (if you omit the MIN() macro, it goes away), so we need that same suppression here.

...

+ conn->seq_wnd_edge_from_tap = wnd_edge; + /* FIXME: reflect the tap-side receiver's window back to the sock-side * sender by adjusting SO_RCVBUF? */ } @@ -1796,6 +1802,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;

conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; + conn->seq_wnd_edge_from_tap = conn->seq_to_tap; }

/** @@ -2205,13 +2212,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) { - uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap; int fill_bufs, send_bufs = 0, last_len, iov_rem = 0; int sendlen, len, dlen, v4 = CONN_V4(conn); + uint32_t already_sent, max_send, seq; int s = conn->sock, i, ret = 0; struct msghdr mh_sock = { 0 }; uint16_t mss = MSS_GET(conn); - uint32_t already_sent, seq; struct iovec *iov;

/* How much have we read/sent since last received ack ? */ @@ -2225,19 +2231,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn) tcp_set_peek_offset(s, 0); }

- if (!wnd_scaled || already_sent >= wnd_scaled) { + /* How much are we still allowed to send within current window ? */ + max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap; + if (SEQ_LE(max_send, 0)) { + flow_trace(conn, "Empty window: win_upper: %u, sent: %u",

This is not win_upper anymore, and the window is actually full rather than empty (it's... empty of space). Maybe: flow_trace(conn, "Window full: right edge: %u, sent: %u"

...

+ conn->seq_wnd_edge_from_tap, conn->seq_to_tap); + conn->seq_wnd_edge_from_tap = conn->seq_to_tap; conn_flag(c, conn, STALLED); conn_flag(c, conn, ACK_FROM_TAP_DUE); return 0; }

/* Set up buffer descriptors we'll fill completely and partially. */ - fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss); + fill_bufs = DIV_ROUND_UP(max_send, mss); if (fill_bufs > TCP_FRAMES) { fill_bufs = TCP_FRAMES; iov_rem = 0; } else { - iov_rem = (wnd_scaled - already_sent) % mss; + iov_rem = max_send % mss; }
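The buffer planning in the hunk above comes down to a ceiling division with a cap. A minimal sketch, assuming a hypothetical `plan_bufs()` helper and passing the frame limit as a parameter instead of using the TCP_FRAMES constant, whose value is not shown here:

```c
#include <assert.h>
#include <stdint.h>

/* Result of planning: how many buffers to offer, and the length of the
 * last, partially filled one (0 means every buffer is MSS-sized).
 */
struct buf_plan {
	int fill_bufs;
	int iov_rem;
};

/* Split max_send bytes into MSS-sized buffers, at most max_frames */
static struct buf_plan plan_bufs(uint32_t max_send, uint16_t mss,
				 int max_frames)
{
	struct buf_plan p;
	uint64_t n;

	assert(mss > 0 && max_frames > 0);

	/* Ceiling division, done in 64 bits to rule out overflow */
	n = ((uint64_t)max_send + mss - 1) / mss;

	if (n > (uint64_t)max_frames) {
		/* Capped: all offered buffers are filled completely */
		p.fill_bufs = max_frames;
		p.iov_rem = 0;
	} else {
		p.fill_bufs = (int)n;
		p.iov_rem = (int)(max_send % mss);
	}

	return p;
}
```

For example, 3000 bytes of room with an MSS of 1460 gives three buffers, the last one 80 bytes long.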

/* Prepare iov according to kernel capability */ @@ -2466,6 +2477,13 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn, conn->seq_to_tap = max_ack_seq; tcp_set_peek_offset(conn->sock, 0); tcp_data_from_sock(c, conn); + } else if (!max_ack_seq_wnd && SEQ_GT(conn->seq_to_tap, max_ack_seq)) { + /* Force peer to send new advertisement now, but only once */

Two questions:

- which advertisement? We're sending a zero-window probe, not forcing the peer to do much really. I would rather just state that we're sending a probe

- what guarantees it only happens once? If we get more data from the socket, we'll get again SEQ_GT(conn->seq_to_tap, max_ack_seq) in a bit, and send another ACK (duplicate) to the peer, without the peer necessarily ever advertising a non-zero window meanwhile.

I'm struggling a bit to understand how this can work "cleanly", a packet capture of this mechanism in action would certainly help.

...

+ flow_trace(conn, "fast probe, ACK: %u, previous sequence: %u", + max_ack_seq, conn->seq_to_tap); + tcp_send_flag(c, conn, ACK); + conn->seq_to_tap = max_ack_seq; + tcp_set_peek_offset(conn->sock, 0); }

if (!iov_i) @@ -2911,6 +2929,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref) flow_dbg(conn, "activity timeout"); tcp_rst(c, conn); } + /* No data exchanged recently? Keep connection alive. */

...I just spotted this from v3: this is not the reason why we're sending a keep-alive. We're sending a keep-alive segment because the peer advertised its window as zero. I also realised that this is not scheduled additionally, so it will just trigger on an activity timeout, I suppose. We should reschedule this after ACK_TIMEOUT, instead (that was my earlier suggestion, I didn't check anymore) when the peer advertises a zero window.

...

+ if (conn->seq_to_tap == conn->seq_ack_from_tap &&

...this part will only work if we reset seq_to_tap to seq_ack_from_tap earlier, and we have no pending data to send, which is not necessarily the case if we want to send a zero-window probe.

...

+ conn->seq_from_tap == conn->seq_ack_to_tap) + tcp_send_flag(c, conn, ACK);

I think the conditions should simply be:

- the window currently advertised by the peer is zero

- we don't have pending data to acknowledge (otherwise the peer can interpret our keep-alive as a duplicate ACK)
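One possible reading of the two conditions suggested above, written as a predicate. This is only a sketch: `struct probe_state` and `zero_window_probe_due()` are hypothetical, and equating "no pending data to acknowledge" with `seq_from_peer == seq_ack_to_peer` is an assumption, not something the review states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the state the check would need */
struct probe_state {
	uint16_t wnd_from_peer;		/* Window last advertised by the peer */
	uint32_t seq_from_peer;		/* Next sequence expected from the peer */
	uint32_t seq_ack_to_peer;	/* Last ACK number we sent to the peer */
};

/* True if an empty probe segment should be sent at the next timeout */
static bool zero_window_probe_due(const struct probe_state *st)
{
	assert(st);

	/* Peer window is closed, and everything received from the peer
	 * was already acknowledged, so the probe cannot be mistaken for
	 * a duplicate ACK carrying new information.
	 */
	return st->wnd_from_peer == 0 &&
	       st->seq_from_peer == st->seq_ack_to_peer;
}
```

Unlike the check in the patch, this does not require `seq_to_tap == seq_ack_from_tap`, so it would still fire when data is pending towards the peer.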

...

} }

diff --git a/tcp_conn.h b/tcp_conn.h index d280b22..5cbad2a 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -30,6 +30,7 @@ * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap * @seq_ack_from_tap: Last ACK number received from tap + * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap * @seq_from_tap: Next sequence for packets from tap (not actually sent) * @seq_ack_to_tap: Last ACK number sent to tap * @seq_init_from_tap: Initial sequence number from tap @@ -101,6 +102,7 @@ struct tcp_tap_conn {

uint32_t seq_to_tap; uint32_t seq_ack_from_tap; + uint32_t seq_wnd_edge_from_tap; uint32_t seq_from_tap; uint32_t seq_ack_to_tap; uint32_t seq_init_from_tap;

-- Stefano

Jon Maloy

16 May 16 May

1:10 a.m.

New subject: [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero

On 2024-05-15 16:24, Stefano Brivio wrote:

...

On Wed, 15 May 2024 11:34:29 -0400 Jon Maloy wrote:

...
A bug in kernel TCP may lead to a deadlock where a zero window is sent from the peer, while it is unable to send out window updates even after reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisements from the peer can only be triggered by packets arriving from this side.

However, such packets are never sent, because the zero-window condition currently prevents this side from sending out any packets whatsoever to the peer.

We notice that the above bug is triggered *only* after the peer has dropped an arriving packet because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that it goes against the RFC 9293 recommendation that a previously advertised window never should shrink.

RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find the following statement: "A TCP receiver SHOULD NOT shrink the window, i.e., move the right window edge to the left (SHLD-14). However, a sending TCP peer MUST be robust against window shrinking, which may cause the "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).

If this happens, the sender SHOULD NOT send new data (SHLD-15), but SHOULD retransmit normally the old unacknowledged data between SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data beyond SND.UNA+SND.WND (MAY-7)"

We never see the window become negative, but we interpret this as a recommendation to use the previously available window during retransmission even when the currently advertised window is zero.

We use the above mechanism only at timer-induced retransmits. In the case we receive duplicate ack and a zero window, but still know we have outstanding data acks waiting, we send out an empty "fast probe" instead of doing fast retransmit. This averts the risk of overwhelming a memory squeezed peer with retransmits, while still forcing it to send out a new window update when the probe is received. This entails a theoretical risk of redundant retransmits from the peer, but that is a risk worth taking.

In case of a zero-window non-retransmission situation where there is no new data to be sent, we also add a simple zero-window probing feature. By sending an empty packet at regular timeout events we resolve the situation described above, since the peer receives the necessary trigger to advertise its window once it becomes non-zero again.

It should be noted that although this solves the problem we have at hand, it is not a genuine solution to the kernel bug. There may well be TCP stacks around in other OS-es which don't do this, nor have keep-alive probing as an alternative way to solve the situation.

Signed-off-by: Jon Maloy

--- v2: - Using previously advertised window during retransmission, instead of highest send sequence number in the cycle. v3: - Rebased to newest code - Changes based on feedback from PASST team - Sending out empty probe message at timer expiration when we are not in retransmit situation. v4: - Some small changes based on feedback from PASST team. - Replaced fast retransmit with a one-time 'fast probe' when window is zero. --- tcp.c | 32 +++++++++++++++++++++++++++----- tcp_conn.h | 2 ++ 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/tcp.c b/tcp.c index 4163bf9..a33f494 100644 --- a/tcp.c +++ b/tcp.c @@ -1761,9 +1761,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn, */ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) { + uint32_t wnd_edge; + wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap); conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);

+ wnd_edge = conn->seq_ack_from_tap + wnd; + if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))

Here, cppcheck ('make cppcheck') says:

tcp.c:1770:6: style: Condition 'wnd' is always true [knownConditionTrueFalse]
 if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
     ^
tcp.c:1766:8: note: Assignment 'wnd=((1<<(16+8))<(wnd<<conn->ws_from_tap))?(1<<(16+8)):(wnd<<conn->ws_from_tap)', assigned value is less than 1
 wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
       ^
tcp.c:1770:6: note: Condition 'wnd' is always true
 if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
     ^

See the comment in tcp_update_seqack_wnd() and related suppression.

It's clearly a false positive (if you omit the MIN() macro, it goes away), so we need that same suppression here.

Ok. I'll change it. Still a little annoying when our tools cause us extra work because they aren't up to the task.

...
+		conn->seq_wnd_edge_from_tap = wnd_edge;
+
 	/* FIXME: reflect the tap-side receiver's window back to the sock-side
 	 * sender by adjusting SO_RCVBUF? */
 }
@@ -1796,6 +1802,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
 
 	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
+	conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
 }
 
 /**
@@ -2205,13 +2212,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 {
-	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
 	int sendlen, len, dlen, v4 = CONN_V4(conn);
+	uint32_t already_sent, max_send, seq;
 	int s = conn->sock, i, ret = 0;
 	struct msghdr mh_sock = { 0 };
 	uint16_t mss = MSS_GET(conn);
-	uint32_t already_sent, seq;
 	struct iovec *iov;
 
 	/* How much have we read/sent since last received ack ? */
@@ -2225,19 +2231,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 		tcp_set_peek_offset(s, 0);
 	}

-	if (!wnd_scaled || already_sent >= wnd_scaled) {
+	/* How much are we still allowed to send within current window ? */
+	max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap;
+	if (SEQ_LE(max_send, 0)) {
+		flow_trace(conn, "Empty window: win_upper: %u, sent: %u",

This is not win_upper anymore, and the window is actually full rather than empty (it's... empty of space). Maybe:

	flow_trace(conn, "Window full: right edge: %u, sent: %u"

yes.

...
+			   conn->seq_wnd_edge_from_tap, conn->seq_to_tap);
+		conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
 		conn_flag(c, conn, STALLED);
 		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
 	}

 	/* Set up buffer descriptors we'll fill completely and partially. */
-	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
+	fill_bufs = DIV_ROUND_UP(max_send, mss);
 	if (fill_bufs > TCP_FRAMES) {
 		fill_bufs = TCP_FRAMES;
 		iov_rem = 0;
 	} else {
-		iov_rem = (wnd_scaled - already_sent) % mss;
+		iov_rem = max_send % mss;
 	}
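To make the arithmetic in this hunk easier to follow, here is a small standalone sketch of how the send budget and the buffer split could be derived from the right window edge. It is not the passt implementation: plan_send(), struct send_plan and FRAMES_MAX (standing in for TCP_FRAMES, with an arbitrary value) are hypothetical, and the wrap-safe comparison uses a signed difference instead of the SEQ_LE() macro. The cast to int32_t assumes the usual two's complement conversion.

```c
#include <assert.h>
#include <stdint.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))
#define FRAMES_MAX		128	/* hypothetical stand-in for TCP_FRAMES */

/* Hypothetical result type: how much to send and how to split it */
struct send_plan {
	uint32_t max_send;	/* bytes that still fit in the window */
	int fill_bufs;		/* buffers to set up, last one may be partial */
	int iov_rem;		/* bytes in the last, partial buffer (0: none) */
};

static struct send_plan plan_send(uint32_t wnd_edge, uint32_t seq_to_tap,
				  uint16_t mss)
{
	struct send_plan p = { 0, 0, 0 };
	/* Modulo-2^32 difference read as signed: valid across wrap-around
	 * while the two sequence numbers are less than 2^31 apart.
	 */
	int32_t room = (int32_t)(wnd_edge - seq_to_tap);

	assert(mss > 0);

	if (room <= 0)
		return p;	/* window full: nothing may be sent */

	p.max_send = (uint32_t)room;
	p.fill_bufs = DIV_ROUND_UP(p.max_send, mss);
	if (p.fill_bufs > FRAMES_MAX) {
		p.fill_bufs = FRAMES_MAX;	/* all buffers filled completely */
		p.iov_rem = 0;
	} else {
		p.iov_rem = p.max_send % mss;
	}

	return p;
}
```

With a right edge of 5000, a next sequence of 1000 and an MSS of 1460, this yields three buffers, the last holding 1080 bytes.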

 	/* Prepare iov according to kernel capability */
@@ -2466,6 +2477,13 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 		conn->seq_to_tap = max_ack_seq;
 		tcp_set_peek_offset(conn->sock, 0);
 		tcp_data_from_sock(c, conn);
+	} else if (!max_ack_seq_wnd && SEQ_GT(conn->seq_to_tap, max_ack_seq)) {
+		/* Force peer to send new advertisement now, but only once */

Two questions:

- which advertisement? We're sending a zero-window probe, not forcing the peer to do much really. I would rather just state that we're sending a probe

Actually, it is only clear if you know the code of the (Linux) peer. I realized this was maybe too strong a statement, but this is really what happens.

...

- what guarantees it only happens once? If we get more data from the socket, we'll get again SEQ_GT(conn->seq_to_tap, max_ack_seq) in a bit, and send another ACK (duplicate) to the peer, without the peer necessarily ever advertising a non-zero window meanwhile

Right. I need to give this a new spin, but I think I am on the right track. I *really* want this to be solved here, and not need to fall back to the timeout except under exceptional conditions. This happens really often in my tests, and with this fix it actually works like a charm.

...

I'm struggling a bit to understand how this can work "cleanly", a packet capture of this mechanism in action would certainly help.

Ok. In v5.

...

...
+		flow_trace(conn, "fast probe, ACK: %u, previous sequence: %u",
+			   max_ack_seq, conn->seq_to_tap);
+		tcp_send_flag(c, conn, ACK);
+		conn->seq_to_tap = max_ack_seq;
+		tcp_set_peek_offset(conn->sock, 0);
 	}

 	if (!iov_i)
@@ -2911,6 +2929,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 			flow_dbg(conn, "activity timeout");
 			tcp_rst(c, conn);
 		}
+		/* No data exchanged recently? Keep connection alive. */

...I just spotted this from v3: this is not the reason why we're sending a keep-alive. We're sending a keep-alive segment because the peer advertised its window as zero.

I also realised that this is not scheduled additionally, so it will just trigger on an activity timeout, I suppose. We should reschedule this after ACK_TIMEOUT, instead (that was my earlier suggestion, I didn't check anymore) when the peer advertises a zero window.

...
+		if (conn->seq_to_tap == conn->seq_ack_from_tap &&

...this part will only work if we reset seq_to_tap to seq_ack_from_tap earlier, and we have no pending data to send, which is not necessarily the case if we want to send a zero-window probe.

...
+		    conn->seq_from_tap == conn->seq_ack_to_tap)
+			tcp_send_flag(c, conn, ACK);

I think the conditions should simply be:

- the window currently advertised by the peer is zero

- we don't have pending data to acknowledge (otherwise the peer can interpret our keep-alive as a duplicate ACK)
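The two conditions listed above could be expressed roughly as follows. This is a sketch only: struct conn_sketch and zero_window_probe_due() are hypothetical names, the fields merely mirror those documented in tcp_conn.h, and reading "no pending data to acknowledge" as seq_from_tap == seq_ack_to_tap is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct tcp_tap_conn */
struct conn_sketch {
	uint16_t wnd_from_tap;		/* window last advertised by the peer */
	uint32_t seq_from_tap;		/* next sequence expected from the peer */
	uint32_t seq_ack_to_tap;	/* last ACK number we sent to the peer */
};

/* Should an empty ACK be sent as a zero-window probe when the timer fires? */
static bool zero_window_probe_due(const struct conn_sketch *conn)
{
	assert(conn);

	if (conn->wnd_from_tap)		/* peer still advertises space */
		return false;

	/* Only probe if everything received so far was already acknowledged,
	 * so that the empty ACK carries no new acknowledgement.
	 */
	return conn->seq_from_tap == conn->seq_ack_to_tap;
}
```

Unlike the check in the patch, this does not look at seq_to_tap at all, so it would still fire while data is waiting to be sent.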

That is ok. The deadlock situation we have ended up in will anyway be resolved by the real retransmit that will happen before we get here. I am now wondering if this probe serves any purpose at all?

...

...
 	}
 }

diff --git a/tcp_conn.h b/tcp_conn.h
index d280b22..5cbad2a 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -30,6 +30,7 @@
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
  * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap
  * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
  * @seq_ack_to_tap:	Last ACK number sent to tap
  * @seq_init_from_tap:	Initial sequence number from tap
@@ -101,6 +102,7 @@ struct tcp_tap_conn {

 	uint32_t	seq_to_tap;
 	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_wnd_edge_from_tap;
 	uint32_t	seq_from_tap;
 	uint32_t	seq_ack_to_tap;
 	uint32_t	seq_init_from_tap;

I'll re-spin the first two ASAP, so you can apply them, and then I will try to improve this one further.

///jon

David Gibson

9:19 a.m.

New subject: [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero

On Wed, May 15, 2024 at 07:10:49PM -0400, Jon Maloy wrote:

...

On 2024-05-15 16:24, Stefano Brivio wrote:

...
On Wed, 15 May 2024 11:34:29 -0400 Jon Maloy wrote:

...
A bug in kernel TCP may lead to a deadlock where a zero window is sent from the peer, while it is unable to send out window updates even after reads have freed up enough buffer space to permit a larger window. In this situation, new window advertisements from the peer can only be triggered by packets arriving from this side.

However, such packets are never sent, because the zero-window condition currently prevents this side from sending out any packets whatsoever to the peer.

We notice that the above bug is triggered *only* after the peer has dropped an arriving packet because of severe memory squeeze, and that we hence always enter a retransmission situation when this occurs. This also means that it goes against the RFC 9293 recommendation that a previously advertised window never should shrink.

RFC 9293 gives the solution to this situation. In chapter 3.6.1 we find the following statement: "A TCP receiver SHOULD NOT shrink the window, i.e., move the right window edge to the left (SHLD-14). However, a sending TCP peer MUST be robust against window shrinking, which may cause the "usable window" (see Section 3.8.6.2.1) to become negative (MUST-34).

If this happens, the sender SHOULD NOT send new data (SHLD-15), but SHOULD retransmit normally the old unacknowledged data between SND.UNA and SND.UNA+SND.WND (SHLD-16). The sender MAY also retransmit old data beyond SND.UNA+SND.WND (MAY-7)"

We never see the window become negative, but we interpret this as a recommendation to use the previously available window during retransmission even when the currently advertised window is zero.
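As a rough illustration of the interpretation described above, the sketch below computes the usable window U = SND.UNA + SND.WND - SND.NXT from RFC 9293, Section 3.8.6.2.1, and a retransmission budget that falls back to a remembered right edge when the current window is zero. The function names and the last_edge parameter are hypothetical and only model the idea; they are not taken from the patch. The casts to int32_t assume the usual two's complement conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Usable window per RFC 9293, 3.8.6.2.1; negative if the window shrank */
static int32_t usable_window(uint32_t snd_una, uint32_t snd_wnd,
			     uint32_t snd_nxt)
{
	return (int32_t)(snd_una + snd_wnd - snd_nxt);
}

/* Bytes, counted from SND.UNA, that a timer-driven retransmit may cover.
 * With a non-zero window this is SND.WND itself; with a zero window, fall
 * back to the space up to the last right edge the peer advertised.
 */
static uint32_t retransmit_budget(uint32_t snd_una, uint32_t snd_wnd,
				  uint32_t last_edge)
{
	int32_t remembered = (int32_t)(last_edge - snd_una);

	assert(snd_wnd <= (1U << 30));	/* 65535 << 14 is the largest window */

	if (snd_wnd)
		return snd_wnd;

	return remembered > 0 ? (uint32_t)remembered : 0;
}
```

For example, with SND.UNA = 1000, SND.NXT = 1200 and a window that drops to zero, the usable window is -200, so no new data goes out, while a remembered edge of 1500 would still allow 500 bytes to be retransmitted.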

We use the above mechanism only at timer-induced retransmits. In the case we receive duplicate ack and a zero window, but still know we have outstanding data acks waiting, we send out an empty "fast probe" instead of doing fast retransmit. This averts the risk of overwhelming a memory squeezed peer with retransmits, while still forcing it to send out a new window update when the probe is received. This entails a theoretical risk of redundant retransmits from the peer, but that is a risk worth taking.

In case of a zero-window non-retransmission situation where there is no new data to be sent, we also add a simple zero-window probing feature. By sending an empty packet at regular timeout events we resolve the situation described above, since the peer receives the necessary trigger to advertise its window once it becomes non-zero again.

It should be noted that although this solves the problem we have at hand, it is not a genuine solution to the kernel bug. There may well be TCP stacks around in other OS-es which don't do this, nor have keep-alive probing as an alternative way to solve the situation.

Signed-off-by: Jon Maloy

---
v2:
- Using previously advertised window during retransmission, instead of the highest send sequence number in the cycle.
v3:
- Rebased to newest code
- Changes based on feedback from PASST team
- Sending out empty probe message at timer expiration when we are not in retransmit situation.
v4:
- Some small changes based on feedback from PASST team.
- Replaced fast retransmit with a one-time 'fast probe' when window is zero.
---
 tcp.c      | 32 +++++++++++++++++++++++++++-----
 tcp_conn.h |  2 ++
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/tcp.c b/tcp.c
index 4163bf9..a33f494 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1761,9 +1761,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
  */
 static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 {
+	uint32_t wnd_edge;
+
 	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
 	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
 
+	wnd_edge = conn->seq_ack_from_tap + wnd;
+	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))

Here, cppcheck ('make cppcheck') says:

tcp.c:1770:6: style: Condition 'wnd' is always true [knownConditionTrueFalse]
	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
	    ^
tcp.c:1766:8: note: Assignment 'wnd=((1<<(16+8))<(wnd<<conn->ws_from_tap))?(1<<(16+8)):(wnd<<conn->ws_from_tap)', assigned value is less than 1
	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
	      ^
tcp.c:1770:6: note: Condition 'wnd' is always true
	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
	    ^

See the comment in tcp_update_seqack_wnd() and related suppression.

It's clearly a false positive (if you omit the MIN() macro, it goes away), so we need that same suppression here.

Ok. I'll change it. Still a little annoying when our tools cause us extra work because they aren't up to the task.

Yeah, it's frustrating, particularly the fact that this was reported ages ago and there's no sign of motion on fixing it. But, I'm pretty sure cppcheck has caught considerably more than two bugs for me that might have taken a while to catch otherwise, so I still think it's worth using on balance.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you. NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson

Stefano Brivio

1:22 p.m.

New subject: [PATCH v4 3/3] tcp: allow retransmit when peer receive window is zero

On Wed, 15 May 2024 19:10:49 -0400 Jon Maloy wrote:

...

On 2024-05-15 16:24, Stefano Brivio wrote:

...
On Wed, 15 May 2024 11:34:29 -0400 Jon Maloy wrote:

...

diff --git a/tcp.c b/tcp.c
index 4163bf9..a33f494 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1761,9 +1761,15 @@ static void tcp_get_tap_ws(struct tcp_tap_conn *conn,
  */
 static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd)
 {
+	uint32_t wnd_edge;
+
 	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
 	conn->wnd_from_tap = MIN(wnd >> conn->ws_from_tap, USHRT_MAX);
 
+	wnd_edge = conn->seq_ack_from_tap + wnd;
+	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))

Here, cppcheck ('make cppcheck') says:

tcp.c:1770:6: style: Condition 'wnd' is always true [knownConditionTrueFalse]
	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
	    ^
tcp.c:1766:8: note: Assignment 'wnd=((1<<(16+8))<(wnd<<conn->ws_from_tap))?(1<<(16+8)):(wnd<<conn->ws_from_tap)', assigned value is less than 1
	wnd = MIN(MAX_WINDOW, wnd << conn->ws_from_tap);
	      ^
tcp.c:1770:6: note: Condition 'wnd' is always true
	if (wnd && SEQ_GT(wnd_edge, conn->seq_wnd_edge_from_tap))
	    ^

See the comment in tcp_update_seqack_wnd() and related suppression.

It's clearly a false positive (if you omit the MIN() macro, it goes away), so we need that same suppression here.

Ok. I'll change it. Still a little annoying when our tools cause us extra work because they aren't up to the task.

Just like in David's experience, cppcheck probably saved me a substantial number of embarrassing situations. I would say it's very much up to the task to be honest, a few false positives are something we can definitely expect.

...

...
...
+		conn->seq_wnd_edge_from_tap = wnd_edge;
+
 	/* FIXME: reflect the tap-side receiver's window back to the sock-side
 	 * sender by adjusting SO_RCVBUF? */
 }
@@ -1796,6 +1802,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 	ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5;
 
 	conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns;
+	conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
 }
 
 /**
@@ -2205,13 +2212,12 @@ static void tcp_data_to_tap(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 {
-	uint32_t wnd_scaled = conn->wnd_from_tap << conn->ws_from_tap;
 	int fill_bufs, send_bufs = 0, last_len, iov_rem = 0;
 	int sendlen, len, dlen, v4 = CONN_V4(conn);
+	uint32_t already_sent, max_send, seq;
 	int s = conn->sock, i, ret = 0;
 	struct msghdr mh_sock = { 0 };
 	uint16_t mss = MSS_GET(conn);
-	uint32_t already_sent, seq;
 	struct iovec *iov;
 
 	/* How much have we read/sent since last received ack ? */
@@ -2225,19 +2231,24 @@ static int tcp_data_from_sock(struct ctx *c, struct tcp_tap_conn *conn)
 		tcp_set_peek_offset(s, 0);
 	}
 
-	if (!wnd_scaled || already_sent >= wnd_scaled) {
+	/* How much are we still allowed to send within current window ? */
+	max_send = conn->seq_wnd_edge_from_tap - conn->seq_to_tap;
+	if (SEQ_LE(max_send, 0)) {
+		flow_trace(conn, "Empty window: win_upper: %u, sent: %u",

This is not win_upper anymore, and the window is actually full rather than empty (it's... empty of space). Maybe:

	flow_trace(conn, "Window full: right edge: %u, sent: %u"

yes.

...
...
+			   conn->seq_wnd_edge_from_tap, conn->seq_to_tap);
+		conn->seq_wnd_edge_from_tap = conn->seq_to_tap;
 		conn_flag(c, conn, STALLED);
 		conn_flag(c, conn, ACK_FROM_TAP_DUE);
 		return 0;
 	}

 	/* Set up buffer descriptors we'll fill completely and partially. */
-	fill_bufs = DIV_ROUND_UP(wnd_scaled - already_sent, mss);
+	fill_bufs = DIV_ROUND_UP(max_send, mss);
 	if (fill_bufs > TCP_FRAMES) {
 		fill_bufs = TCP_FRAMES;
 		iov_rem = 0;
 	} else {
-		iov_rem = (wnd_scaled - already_sent) % mss;
+		iov_rem = max_send % mss;
 	}

 	/* Prepare iov according to kernel capability */
@@ -2466,6 +2477,13 @@ static int tcp_data_from_tap(struct ctx *c, struct tcp_tap_conn *conn,
 		conn->seq_to_tap = max_ack_seq;
 		tcp_set_peek_offset(conn->sock, 0);
 		tcp_data_from_sock(c, conn);
+	} else if (!max_ack_seq_wnd && SEQ_GT(conn->seq_to_tap, max_ack_seq)) {
+		/* Force peer to send new advertisement now, but only once */

Two questions:

- which advertisement? We're sending a zero-window probe, not forcing the peer to do much really. I would rather just state that we're sending a probe

Actually, it is only clear if you know the code of the (Linux) peer. I realized this was maybe too strong a statement, but this is really what happens.

Our peer is not necessarily the Linux kernel, though, and, especially, not a specific (broken) version of it.

...

...
- what guarantees it only happens once? If we get more data from the socket, we'll get again SEQ_GT(conn->seq_to_tap, max_ack_seq) in a bit, and send another ACK (duplicate) to the peer, without the peer necessarily ever advertising a non-zero window meanwhile

Right. I need to give this a new spin, but I think I am on the right track. I *really* want this to be solved here, and not need to fall back to the timeout except under exceptional conditions. This happens really often in my tests, and with this fix it actually works like a charm.

...
I'm struggling a bit to understand how this can work "cleanly", a packet capture of this mechanism in action would certainly help.

Ok. In v5.

...
...
+		flow_trace(conn, "fast probe, ACK: %u, previous sequence: %u",
+			   max_ack_seq, conn->seq_to_tap);
+		tcp_send_flag(c, conn, ACK);
+		conn->seq_to_tap = max_ack_seq;
+		tcp_set_peek_offset(conn->sock, 0);
 	}

 	if (!iov_i)
@@ -2911,6 +2929,10 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref)
 			flow_dbg(conn, "activity timeout");
 			tcp_rst(c, conn);
 		}
+		/* No data exchanged recently? Keep connection alive. */

...I just spotted this from v3: this is not the reason why we're sending a keep-alive. We're sending a keep-alive segment because the peer advertised its window as zero.

I also realised that this is not scheduled additionally, so it will just trigger on an activity timeout, I suppose. We should reschedule this after ACK_TIMEOUT, instead (that was my earlier suggestion, I didn't check anymore) when the peer advertises a zero window.

...
+		if (conn->seq_to_tap == conn->seq_ack_from_tap &&

...this part will only work if we reset seq_to_tap to seq_ack_from_tap earlier, and we have no pending data to send, which is not necessarily the case if we want to send a zero-window probe.

...
+		    conn->seq_from_tap == conn->seq_ack_to_tap)
+			tcp_send_flag(c, conn, ACK);

I think the conditions should simply be:

- the window currently advertised by the peer is zero

- we don't have pending data to acknowledge (otherwise the peer can interpret our keep-alive as a duplicate ACK)

That is ok. The deadlock situation we have ended up in will anyway be resolved by the real retransmit that will happen before we get here. I am now wondering if this probe serves any purpose at all?

Well, RFC 9293 says we must implement it, so it's a good idea to do that, I guess. Let's say your first probe is lost for whatever reason (if the system is low on memory, that is actually likely to happen with tap devices).

...

...
...
 	}
 }

diff --git a/tcp_conn.h b/tcp_conn.h
index d280b22..5cbad2a 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -30,6 +30,7 @@
  * @wnd_to_tap:		Sending window advertised to tap, unscaled (as sent)
  * @seq_to_tap:		Next sequence for packets to tap
  * @seq_ack_from_tap:	Last ACK number received from tap
+ * @seq_wnd_edge_from_tap: Right edge of last non-zero window from tap
  * @seq_from_tap:	Next sequence for packets from tap (not actually sent)
  * @seq_ack_to_tap:	Last ACK number sent to tap
  * @seq_init_from_tap:	Initial sequence number from tap
@@ -101,6 +102,7 @@ struct tcp_tap_conn {

 	uint32_t	seq_to_tap;
 	uint32_t	seq_ack_from_tap;
+	uint32_t	seq_wnd_edge_from_tap;
 	uint32_t	seq_from_tap;
 	uint32_t	seq_ack_to_tap;
 	uint32_t	seq_init_from_tap;

I'll re-spin the first two ASAP, so you can apply them, and then I will try to improve this one further.

Just mind that if 2/3 makes the issue you're working around here more likely to happen (me, I've never seen it), we shouldn't go ahead with 2/3 without this, I guess.

-- 
Stefano
