[PATCH v2 00/11] tcp_splice: Better exploit symmetry between sides of connection

David Gibson

7 Nov 2023 7 Nov '23

3:42 a.m.

For spliced connections, both "sides" are sockets, and for many purposes how we deal with each side is symmetric. Currently, however, we track the information for each side in independent fields in the structure, meaning we can't easily exploit that symmetry. This makes a number of reorganizations of the tcp splice code so that we can explot that symmetry to reduce code size. This will have some additional advantages when we come to integrate with the in-progress unified flow table. Based on top of the interface identifiers and automatic forwarding cleanup series. Changes since v1: * Small updates to comments and commit messages David Gibson (11): tcp_splice: Remove redundant tcp_splice_epoll_ctl() tcp_splice: Correct error handling in tcp_splice_epoll_ctl() tcp_splice: Don't handle EPOLL_CTL_DEL as part of tcp_splice_epoll_ctl() tcp_splice: Remove unnecessary forward declaration tcp_splice: Avoid awkward temporaries in tcp_splice_epoll_ctl() tcp_splice: Don't pool pipes in pairs tcp_splice: Rename sides of connection from a/b to 0/1 tcp_splice: Exploit side symmetry in tcp_splice_timer() tcp_splice: Exploit side symmetry in tcp_splice_connect_finish() tcp_splice: Exploit side symmetry in tcp_splice_destroy() tcp_splice: Simplify selection of socket and pipe sides in socket handler tcp_conn.h | 45 +++--- tcp_splice.c | 389 +++++++++++++++++++++------------------------------ 2 files changed, 179 insertions(+), 255 deletions(-) -- 2.41.0

Show replies by date

David Gibson

7 Nov 7 Nov

3:42 a.m.

New subject: [PATCH v2 01/11] tcp_splice: Remove redundant tcp_splice_epoll_ctl()

tcp_splice_conn_update() calls tcp_splice_epoll_ctl() twice: first ignoring the return value, then checking it. This serves no purpose. If the first call succeeds, the second call will do exactly the same thing again, since nothing has changed in conn. If the first call fails, then tcp_splice_epoll_ctl() itself will EPOLL_CTL_DEL both fds, meaning when the second call tries to EPOLL_CTL_MOD them it will necessarily fail. It appears that this duplication was introduced by accident in an otherwise unrelated patch. Fixes: bb708111 ("treewide: Packet abstraction with mandatory boundary checks") Signed-off-by: David Gibson --- tcp_splice.c | 1 - 1 file changed, 1 deletion(-) diff --git a/tcp_splice.c b/tcp_splice.c index 133e825..87fbd50 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -255,7 +255,6 @@ static void conn_event_do(const struct ctx *c, struct tcp_splice_conn *conn, */ void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new) { - tcp_splice_epoll_ctl(c, new); if (tcp_splice_epoll_ctl(c, new)) conn_flag(c, new, CLOSING); } -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 02/11] tcp_splice: Correct error handling in tcp_splice_epoll_ctl()

If we get an error from epoll_ctl() in tcp_splice_epoll_ctl() we goto the 'delete' path where we remove both sockets from the epoll set and return an error. There are several problems with this: - We 'return -errno' after the EPOLL_CTL_DEL operations, which means the deleting epoll_ctl() calls may have overwritten the errno values which actually triggered the failures. - The call from conn_flag_do() occurs when the CLOSING flag is set, in which case we go do the delete path regardless of error. In that case the 'return errno' is meaningless since we don't expect the EPOLL_CTL_DEL operations to fail and we ignore the return code anyway. - All other calls to tcp_splice_epoll_ctl() check the return code and if non-zero immediately call conn_flag(..., CLOSING) which will call tcp_splice_epoll_ctl() again explicitly to remove the sockets from epoll. That means removing them when the error first occurs is redundant. - We never specifically report an error on the epoll_ctl() operations. We just set the connection to CLOSING, more or less silently killing it. This could make debugging difficult in the unlikely even that we get a failure here. Re-organise tcp_splice_epoll_ctl() to just log a message then return in the error case, and only EPOLL_CTL_DEL when explicitly asked to with the CLOSING flag. Signed-off-by: David Gibson --- tcp_splice.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 87fbd50..6c4c664 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -182,25 +182,27 @@ static int tcp_splice_epoll_ctl(const struct ctx *c, struct epoll_event ev_b = { .data.u64 = ref_b.u64 }; uint32_t events_a, events_b; - if (conn->flags & CLOSING) - goto delete; + if (conn->flags & CLOSING) { + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->a, &ev_a); + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->b, &ev_b); + return 0; + } tcp_splice_conn_epoll_events(conn->events, &events_a, &events_b); ev_a.events = events_a; ev_b.events = events_b; if (epoll_ctl(c->epollfd, m, conn->a, &ev_a) || - epoll_ctl(c->epollfd, m, conn->b, &ev_b)) - goto delete; + epoll_ctl(c->epollfd, m, conn->b, &ev_b)) { + int ret = -errno; + err("TCP (spliced): index %li, ERROR on epoll_ctl(): %s", + CONN_IDX(conn), strerror(errno)); + return ret; + } conn->in_epoll = true; return 0; - -delete: - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->a, &ev_a); - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->b, &ev_b); - return -errno; } /** -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 03/11] tcp_splice: Don't handle EPOLL_CTL_DEL as part of tcp_splice_epoll_ctl()

tcp_splice_epoll_ctl() removes both sockets from the epoll set if called when conn->flags & CLOSING. This will always happen immediately after setting that flag, since conn_flag_do() makes the call itself. That's also the _only_ time it can happen: we perform the EPOLL_CTL_DEL without clearing the conn->in_epoll flag, meaning that any further calls to tcp_splice_epoll_ctl() would attempt EPOLL_CTL_MOD, which would necessarily fail since the fds are no longer in the epoll. The EPOLL_CTL_DEL path in tcp_splice_epoll_ctl() has essentially zero overlap with anything else the function does, so just move them to be open coded in conn_flag_do(). This does require kernel 2.6.9 or later, in order to pass NULL as the event structure for epoll_ctl(). However, we already require at least 3.13 to allow unprivileged user namespaces. Given that, simply directly perform the EPOLL_CTL_DEL operations from conn_flag_do() rather than unnecessarily multiplexini Signed-off-by: David Gibson --- tcp_splice.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 6c4c664..570a8c6 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -152,8 +152,10 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn, } } - if (flag == CLOSING) - tcp_splice_epoll_ctl(c, conn); + if (flag == CLOSING) { + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->a, NULL); + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->b, NULL); + } } #define conn_flag(c, conn, flag) \ @@ -182,12 +184,6 @@ static int tcp_splice_epoll_ctl(const struct ctx *c, struct epoll_event ev_b = { .data.u64 = ref_b.u64 }; uint32_t events_a, events_b; - if (conn->flags & CLOSING) { - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->a, &ev_a); - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->b, &ev_b); - return 0; - } - tcp_splice_conn_epoll_events(conn->events, &events_a, &events_b); ev_a.events = events_a; ev_b.events = events_b; -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 04/11] tcp_splice: Remove unnecessary forward declaration

In tcp_splice.c we forward declare tcp_splice_epoll_ctl() then define it later on. However, there are no circular dependencies which prevent us from simply having the full definition in place of the forward declaration. Signed-off-by: David Gibson --- tcp_splice.c | 71 +++++++++++++++++++++++++--------------------------- 1 file changed, 34 insertions(+), 37 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 570a8c6..44d1e4a 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -116,8 +116,41 @@ static void tcp_splice_conn_epoll_events(uint16_t events, *b |= (events & B_OUT_WAIT) ? EPOLLOUT : 0; } +/** + * tcp_splice_epoll_ctl() - Add/modify/delete epoll state from connection events + * @c: Execution context + * @conn: Connection pointer + * + * Return: 0 on success, negative error code on failure (not on deletion) + */ static int tcp_splice_epoll_ctl(const struct ctx *c, - struct tcp_splice_conn *conn); + struct tcp_splice_conn *conn) +{ + int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; + union epoll_ref ref_a = { .type = EPOLL_TYPE_TCP, .fd = conn->a, + .tcp.index = CONN_IDX(conn) }; + union epoll_ref ref_b = { .type = EPOLL_TYPE_TCP, .fd = conn->b, + .tcp.index = CONN_IDX(conn) }; + struct epoll_event ev_a = { .data.u64 = ref_a.u64 }; + struct epoll_event ev_b = { .data.u64 = ref_b.u64 }; + uint32_t events_a, events_b; + + tcp_splice_conn_epoll_events(conn->events, &events_a, &events_b); + ev_a.events = events_a; + ev_b.events = events_b; + + if (epoll_ctl(c->epollfd, m, conn->a, &ev_a) || + epoll_ctl(c->epollfd, m, conn->b, &ev_b)) { + int ret = -errno; + err("TCP (spliced): index %li, ERROR on epoll_ctl(): %s", + CONN_IDX(conn), strerror(errno)); + return ret; + } + + conn->in_epoll = true; + + return 0; +} /** * conn_flag_do() - Set/unset given flag, log, update epoll on CLOSING flag @@ -165,42 +198,6 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn, conn_flag_do(c, conn, flag); \ } while (0) -/** - * tcp_splice_epoll_ctl() - Add/modify/delete epoll state from connection events - * @c: Execution context - * @conn: Connection pointer - * - * Return: 0 on success, negative error code on failure (not on deletion) - */ -static int tcp_splice_epoll_ctl(const struct ctx *c, - struct tcp_splice_conn *conn) -{ - int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; - union epoll_ref ref_a = { .type = EPOLL_TYPE_TCP, .fd = conn->a, - .tcp.index = CONN_IDX(conn) }; - union epoll_ref ref_b = { .type = EPOLL_TYPE_TCP, .fd = conn->b, - .tcp.index = CONN_IDX(conn) }; - struct epoll_event ev_a = { .data.u64 = ref_a.u64 }; - struct epoll_event ev_b = { .data.u64 = ref_b.u64 }; - uint32_t events_a, events_b; - - tcp_splice_conn_epoll_events(conn->events, &events_a, &events_b); - ev_a.events = events_a; - ev_b.events = events_b; - - if (epoll_ctl(c->epollfd, m, conn->a, &ev_a) || - epoll_ctl(c->epollfd, m, conn->b, &ev_b)) { - int ret = -errno; - err("TCP (spliced): index %li, ERROR on epoll_ctl(): %s", - CONN_IDX(conn), strerror(errno)); - return ret; - } - - conn->in_epoll = true; - - return 0; -} - /** * conn_event_do() - Set and log connection events, update epoll state * @c: Execution context -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 05/11] tcp_splice: Avoid awkward temporaries in tcp_splice_epoll_ctl()

We initialise the events_a and events_b variables with tcp_splice_conn_epoll_events() function, then immediately copy the values into ev_a.events and ev_b.events. We can't simply pass &ev_[ab].events to tcp_splice_conn_epoll_events(), because struct epoll_event is packed, leading to 'pointer may be unaligned' warnings if we attempt that. We can, however, make tcp_splice_conn_epoll_events() take struct epoll_event pointers rather than raw u32 pointers, avoiding the awkward temporaries. Signed-off-by: David Gibson --- tcp_splice.c | 24 +++++++++++------------- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 44d1e4a..a706ebb 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -95,25 +95,26 @@ static int tcp_sock_refill_ns(void *arg); /** * tcp_splice_conn_epoll_events() - epoll events masks for given state * @events: Connection event flags - * @a: Event mask for socket with accepted connection, set on return - * @b: Event mask for connection target socket, set on return + * @a: Event for socket with accepted connection, set on return + * @b: Event for connection target socket, set on return */ static void tcp_splice_conn_epoll_events(uint16_t events, - uint32_t *a, uint32_t *b) + struct epoll_event *a, + struct epoll_event *b) { - *a = *b = 0; + a->events = b->events = 0; if (events & SPLICE_ESTABLISHED) { if (!(events & B_FIN_SENT)) - *a = EPOLLIN | EPOLLRDHUP; + a->events = EPOLLIN | EPOLLRDHUP; if (!(events & A_FIN_SENT)) - *b = EPOLLIN | EPOLLRDHUP; + b->events = EPOLLIN | EPOLLRDHUP; } else if (events & SPLICE_CONNECT) { - *b = EPOLLOUT; + b->events = EPOLLOUT; } - *a |= (events & A_OUT_WAIT) ? EPOLLOUT : 0; - *b |= (events & B_OUT_WAIT) ? EPOLLOUT : 0; + a->events |= (events & A_OUT_WAIT) ? EPOLLOUT : 0; + b->events |= (events & B_OUT_WAIT) ? EPOLLOUT : 0; } /** @@ -133,11 +134,8 @@ static int tcp_splice_epoll_ctl(const struct ctx *c, .tcp.index = CONN_IDX(conn) }; struct epoll_event ev_a = { .data.u64 = ref_a.u64 }; struct epoll_event ev_b = { .data.u64 = ref_b.u64 }; - uint32_t events_a, events_b; - tcp_splice_conn_epoll_events(conn->events, &events_a, &events_b); - ev_a.events = events_a; - ev_b.events = events_b; + tcp_splice_conn_epoll_events(conn->events, &ev_a, &ev_b); if (epoll_ctl(c->epollfd, m, conn->a, &ev_a) || epoll_ctl(c->epollfd, m, conn->b, &ev_b)) { -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 06/11] tcp_splice: Don't pool pipes in pairs

To reduce latencies, the tcp splice code maintains a pool of pre-opened pipes to use for new connections. This is structured as an array of pairs of pipes, with each pipe, of course, being a pair of fds. Thus when we use the pool, a single pool "slot" provides both the a->b and b->a pipes. There's no strong reason to store the pool in pairs, though - we can with not much difficulty instead take the a->b and b->a pipes for a new connection independently from separate slots in the pool, or even take one from the the pool and create the other as we need it, if there's only one pipe left in the pool. This marginally increases the length of code, but simplifies the structure of the pipe pool. We should be able to re-shrink the code with later changes, too. In the process we also fix some minor bugs: - If we both failed to find a pipe in the pool and to create a new one, we didn't log an error and would silently drop the connection. That could make debugging such a situation difficult. Add in an error message for that case - When refilling the pool, if we were only able to open a single pipe in the pair, we attempted to rollback, but instead of closing the opened pipe, we instead closed the pipe we failed to open (probably leading to some ignored EBADFD errors). Signed-off-by: David Gibson --- tcp_splice.c | 60 +++++++++++++++++++++++++++------------------------- 1 file changed, 31 insertions(+), 29 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index a706ebb..7b36688 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -58,7 +58,7 @@ #include "tcp_conn.h" #define MAX_PIPE_SIZE (8UL * 1024 * 1024) -#define TCP_SPLICE_PIPE_POOL_SIZE 16 +#define TCP_SPLICE_PIPE_POOL_SIZE 32 #define TCP_SPLICE_CONN_PRESSURE 30 /* % of conn_count */ #define TCP_SPLICE_FILE_PRESSURE 30 /* % of c->nofile */ @@ -69,7 +69,7 @@ static int ns_sock_pool4 [TCP_SOCK_POOL_SIZE]; static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ -static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2]; +static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2]; #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) @@ -307,19 +307,16 @@ static int tcp_splice_connect_finish(const struct ctx *c, conn->pipe_a_b[1] = conn->pipe_b_a[1] = -1; for (i = 0; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) { - if (splice_pipe_pool[i][0][0] >= 0) { - SWAP(conn->pipe_a_b[0], splice_pipe_pool[i][0][0]); - SWAP(conn->pipe_a_b[1], splice_pipe_pool[i][0][1]); - - SWAP(conn->pipe_b_a[0], splice_pipe_pool[i][1][0]); - SWAP(conn->pipe_b_a[1], splice_pipe_pool[i][1][1]); + if (splice_pipe_pool[i][0] >= 0) { + SWAP(conn->pipe_a_b[0], splice_pipe_pool[i][0]); + SWAP(conn->pipe_a_b[1], splice_pipe_pool[i][1]); break; } } - if (conn->pipe_a_b[0] < 0) { - if (pipe2(conn->pipe_a_b, O_NONBLOCK | O_CLOEXEC) || - pipe2(conn->pipe_b_a, O_NONBLOCK | O_CLOEXEC)) { + if (pipe2(conn->pipe_a_b, O_NONBLOCK | O_CLOEXEC)) { + err("TCP (spliced): cannot create a->b pipe: %s", + strerror(errno)); conn_flag(c, conn, CLOSING); return -EIO; } @@ -328,6 +325,22 @@ static int tcp_splice_connect_finish(const struct ctx *c, trace("TCP (spliced): cannot set a->b pipe size to %lu", c->tcp.pipe_size); } + } + + for (; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) { + if (splice_pipe_pool[i][0] >= 0) { + SWAP(conn->pipe_b_a[0], splice_pipe_pool[i][0]); + SWAP(conn->pipe_b_a[1], splice_pipe_pool[i][1]); + break; + } + } + if (conn->pipe_b_a[0] < 0) { + if (pipe2(conn->pipe_b_a, O_NONBLOCK | O_CLOEXEC)) { + err("TCP (spliced): cannot create b->a pipe: %s", + strerror(errno)); + conn_flag(c, conn, CLOSING); + return -EIO; + } if (fcntl(conn->pipe_b_a[0], F_SETPIPE_SZ, c->tcp.pipe_size)) { trace("TCP (spliced): cannot set b->a pipe size to %lu", @@ -716,12 +729,12 @@ close: */ static void tcp_set_pipe_size(struct ctx *c) { - int probe_pipe[TCP_SPLICE_PIPE_POOL_SIZE * 2][2], i, j; + int probe_pipe[TCP_SPLICE_PIPE_POOL_SIZE][2], i, j; c->tcp.pipe_size = MAX_PIPE_SIZE; smaller: - for (i = 0; i < TCP_SPLICE_PIPE_POOL_SIZE * 2; i++) { + for (i = 0; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) { if (pipe2(probe_pipe[i], O_CLOEXEC)) { i++; break; @@ -736,7 +749,7 @@ smaller: close(probe_pipe[j][1]); } - if (i == TCP_SPLICE_PIPE_POOL_SIZE * 2) + if (i == TCP_SPLICE_PIPE_POOL_SIZE) return; if (!(c->tcp.pipe_size /= 2)) { @@ -756,25 +769,14 @@ static void tcp_splice_pipe_refill(const struct ctx *c) int i; for (i = 0; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) { - if (splice_pipe_pool[i][0][0] >= 0) + if (splice_pipe_pool[i][0] >= 0) break; - if (pipe2(splice_pipe_pool[i][0], O_NONBLOCK | O_CLOEXEC)) - continue; - if (pipe2(splice_pipe_pool[i][1], O_NONBLOCK | O_CLOEXEC)) { - close(splice_pipe_pool[i][1][0]); - close(splice_pipe_pool[i][1][1]); + if (pipe2(splice_pipe_pool[i], O_NONBLOCK | O_CLOEXEC)) continue; - } - if (fcntl(splice_pipe_pool[i][0][0], F_SETPIPE_SZ, + if (fcntl(splice_pipe_pool[i][0], F_SETPIPE_SZ, c->tcp.pipe_size)) { - trace("TCP (spliced): cannot set a->b pipe size to %lu", - c->tcp.pipe_size); - } - - if (fcntl(splice_pipe_pool[i][1][0], F_SETPIPE_SZ, - c->tcp.pipe_size)) { - trace("TCP (spliced): cannot set b->a pipe size to %lu", + trace("TCP (spliced): cannot set pool pipe size to %lu", c->tcp.pipe_size); } } -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 07/11] tcp_splice: Rename sides of connection from a/b to 0/1

Each spliced connection has two mostly, although not entirely, symmetric sides. We currently call those "a" and "b" and have different fields in the connection structure for each one. We can better exploit that symmetry if we use two element arrays rather thatn separately named fields. Do that in the places we can, and for the others change the "a"/"b" terminology to 0/1 to match. Signed-off-by: David Gibson --- tcp_conn.h | 45 +++++------ tcp_splice.c | 224 +++++++++++++++++++++++++-------------------------- 2 files changed, 130 insertions(+), 139 deletions(-) diff --git a/tcp_conn.h b/tcp_conn.h index 0751e00..01d31d4 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -119,54 +119,47 @@ struct tcp_tap_conn { uint32_t seq_init_from_tap; }; +#define SIDES 2 /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @c: Fields common with tcp_tap_conn * @in_epoll: Is the connection in the epoll set? - * @a: File descriptor number of socket for accepted connection - * @pipe_a_b: Pipe ends for splice() from @a to @b - * @b: File descriptor number of peer connected socket - * @pipe_b_a: Pipe ends for splice() from @b to @a + * @s: File descriptor for sockets + * @pipe: File descriptors for pipes * @events: Events observed/actions performed on connection * @flags: Connection flags (attributes, not events) - * @a_read: Bytes read from @a (not fully written to @b in one shot) - * @a_written: Bytes written to @a (not fully written from one @b read) - * @b_read: Bytes read from @b (not fully written to @a in one shot) - * @b_written: Bytes written to @b (not fully written from one @a read) + * @read: Bytes read (not fully written to other side in one shot) + * @written: Bytes written (not fully written from one other side read) */ struct tcp_splice_conn { /* Must be first element to match tcp_tap_conn */ struct tcp_conn_common c; bool in_epoll :1; - int a; - int pipe_a_b[2]; - int b; - int pipe_b_a[2]; + int s[SIDES]; + int pipe[SIDES][2]; uint8_t events; #define SPLICE_CLOSED 0 #define SPLICE_CONNECT BIT(0) #define SPLICE_ESTABLISHED BIT(1) -#define A_OUT_WAIT BIT(2) -#define B_OUT_WAIT BIT(3) -#define A_FIN_RCVD BIT(4) -#define B_FIN_RCVD BIT(5) -#define A_FIN_SENT BIT(6) -#define B_FIN_SENT BIT(7) +#define OUT_WAIT_0 BIT(2) +#define OUT_WAIT_1 BIT(3) +#define FIN_RCVD_0 BIT(4) +#define FIN_RCVD_1 BIT(5) +#define FIN_SENT_0 BIT(6) +#define FIN_SENT_1 BIT(7) uint8_t flags; #define SPLICE_V6 BIT(0) -#define RCVLOWAT_SET_A BIT(1) -#define RCVLOWAT_SET_B BIT(2) -#define RCVLOWAT_ACT_A BIT(3) -#define RCVLOWAT_ACT_B BIT(4) +#define RCVLOWAT_SET_0 BIT(1) +#define RCVLOWAT_SET_1 BIT(2) +#define RCVLOWAT_ACT_0 BIT(3) +#define RCVLOWAT_ACT_1 BIT(4) #define CLOSING BIT(5) - uint32_t a_read; - uint32_t a_written; - uint32_t b_read; - uint32_t b_written; + uint32_t read[SIDES]; + uint32_t written[SIDES]; }; /** diff --git a/tcp_splice.c b/tcp_splice.c index 7b36688..f405184 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -21,12 +21,12 @@ * * - SPLICE_CONNECT: connection accepted, connecting to target * - SPLICE_ESTABLISHED: connection to target established - * - A_OUT_WAIT: pipe to accepted socket full, wait for EPOLLOUT - * - B_OUT_WAIT: pipe to target socket full, wait for EPOLLOUT - * - A_FIN_RCVD: FIN (EPOLLRDHUP) seen from accepted socket - * - B_FIN_RCVD: FIN (EPOLLRDHUP) seen from target socket - * - A_FIN_RCVD: FIN (write shutdown) sent to accepted socket - * - B_FIN_RCVD: FIN (write shutdown) sent to target socket + * - OUT_WAIT_0: pipe to accepted socket full, wait for EPOLLOUT + * - OUT_WAIT_1: pipe to target socket full, wait for EPOLLOUT + * - FIN_RCVD_0: FIN (EPOLLRDHUP) seen from accepted socket + * - FIN_RCVD_1: FIN (EPOLLRDHUP) seen from target socket + * - FIN_SENT_0: FIN (write shutdown) sent to accepted socket + * - FIN_SENT_1: FIN (write shutdown) sent to target socket * * #syscalls:pasta pipe2|pipe fcntl armv6l:fcntl64 armv7l:fcntl64 ppc64:fcntl64 */ @@ -79,14 +79,14 @@ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2]; /* Display strings for connection events */ static const char *tcp_splice_event_str[] __attribute((__unused__)) = { - "SPLICE_CONNECT", "SPLICE_ESTABLISHED", "A_OUT_WAIT", "B_OUT_WAIT", - "A_FIN_RCVD", "B_FIN_RCVD", "A_FIN_SENT", "B_FIN_SENT", + "SPLICE_CONNECT", "SPLICE_ESTABLISHED", "OUT_WAIT_0", "OUT_WAIT_1", + "FIN_RCVD_0", "FIN_RCVD_1", "FIN_SENT_0", "FIN_SENT_1", }; /* Display strings for connection flags */ static const char *tcp_splice_flag_str[] __attribute((__unused__)) = { - "SPLICE_V6", "RCVLOWAT_SET_A", "RCVLOWAT_SET_B", "RCVLOWAT_ACT_A", - "RCVLOWAT_ACT_B", "CLOSING", + "SPLICE_V6", "RCVLOWAT_SET_0", "RCVLOWAT_SET_1", "RCVLOWAT_ACT_0", + "RCVLOWAT_ACT_1", "CLOSING", }; /* Forward declaration */ @@ -95,26 +95,24 @@ static int tcp_sock_refill_ns(void *arg); /** * tcp_splice_conn_epoll_events() - epoll events masks for given state * @events: Connection event flags - * @a: Event for socket with accepted connection, set on return - * @b: Event for connection target socket, set on return + * @ev: Events to fill in, 0 is accepted socket, 1 is connecting socket */ static void tcp_splice_conn_epoll_events(uint16_t events, - struct epoll_event *a, - struct epoll_event *b) + struct epoll_event ev[]) { - a->events = b->events = 0; + ev[0].events = ev[1].events = 0; if (events & SPLICE_ESTABLISHED) { - if (!(events & B_FIN_SENT)) - a->events = EPOLLIN | EPOLLRDHUP; - if (!(events & A_FIN_SENT)) - b->events = EPOLLIN | EPOLLRDHUP; + if (!(events & FIN_SENT_1)) + ev[0].events = EPOLLIN | EPOLLRDHUP; + if (!(events & FIN_SENT_0)) + ev[1].events = EPOLLIN | EPOLLRDHUP; } else if (events & SPLICE_CONNECT) { - b->events = EPOLLOUT; + ev[1].events = EPOLLOUT; } - a->events |= (events & A_OUT_WAIT) ? EPOLLOUT : 0; - b->events |= (events & B_OUT_WAIT) ? EPOLLOUT : 0; + ev[0].events |= (events & OUT_WAIT_0) ? EPOLLOUT : 0; + ev[1].events |= (events & OUT_WAIT_1) ? EPOLLOUT : 0; } /** @@ -128,17 +126,17 @@ static int tcp_splice_epoll_ctl(const struct ctx *c, struct tcp_splice_conn *conn) { int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD; - union epoll_ref ref_a = { .type = EPOLL_TYPE_TCP, .fd = conn->a, - .tcp.index = CONN_IDX(conn) }; - union epoll_ref ref_b = { .type = EPOLL_TYPE_TCP, .fd = conn->b, - .tcp.index = CONN_IDX(conn) }; - struct epoll_event ev_a = { .data.u64 = ref_a.u64 }; - struct epoll_event ev_b = { .data.u64 = ref_b.u64 }; + union epoll_ref ref[SIDES] = { + { .type = EPOLL_TYPE_TCP, .fd = conn->s[0], .tcp.index = CONN_IDX(conn) }, + { .type = EPOLL_TYPE_TCP, .fd = conn->s[1], .tcp.index = CONN_IDX(conn) } + }; + struct epoll_event ev[SIDES] = { { .data.u64 = ref[0].u64 }, + { .data.u64 = ref[1].u64 } }; - tcp_splice_conn_epoll_events(conn->events, &ev_a, &ev_b); + tcp_splice_conn_epoll_events(conn->events, ev); - if (epoll_ctl(c->epollfd, m, conn->a, &ev_a) || - epoll_ctl(c->epollfd, m, conn->b, &ev_b)) { + if (epoll_ctl(c->epollfd, m, conn->s[0], &ev[0]) || + epoll_ctl(c->epollfd, m, conn->s[1], &ev[1])) { int ret = -errno; err("TCP (spliced): index %li, ERROR on epoll_ctl(): %s", CONN_IDX(conn), strerror(errno)); @@ -184,8 +182,8 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn, } if (flag == CLOSING) { - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->a, NULL); - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->b, NULL); + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[0], NULL); + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->s[1], NULL); } } @@ -263,26 +261,26 @@ void tcp_splice_destroy(struct ctx *c, union tcp_conn *conn_union) if (conn->events & SPLICE_ESTABLISHED) { /* Flushing might need to block: don't recycle them. */ - if (conn->pipe_a_b[0] != -1) { - close(conn->pipe_a_b[0]); - close(conn->pipe_a_b[1]); - conn->pipe_a_b[0] = conn->pipe_a_b[1] = -1; + if (conn->pipe[0][0] != -1) { + close(conn->pipe[0][0]); + close(conn->pipe[0][1]); + conn->pipe[0][0] = conn->pipe[0][1] = -1; } - if (conn->pipe_b_a[0] != -1) { - close(conn->pipe_b_a[0]); - close(conn->pipe_b_a[1]); - conn->pipe_b_a[0] = conn->pipe_b_a[1] = -1; + if (conn->pipe[1][0] != -1) { + close(conn->pipe[1][0]); + close(conn->pipe[1][1]); + conn->pipe[1][0] = conn->pipe[1][1] = -1; } } if (conn->events & SPLICE_CONNECT) { - close(conn->b); - conn->b = -1; + close(conn->s[1]); + conn->s[1] = -1; } - close(conn->a); - conn->a = -1; - conn->a_read = conn->a_written = conn->b_read = conn->b_written = 0; + close(conn->s[0]); + conn->s[0] = -1; + conn->read[0] = conn->written[0] = conn->read[1] = conn->written[1] = 0; conn->events = SPLICE_CLOSED; conn->flags = 0; @@ -303,47 +301,47 @@ static int tcp_splice_connect_finish(const struct ctx *c, { int i; - conn->pipe_a_b[0] = conn->pipe_b_a[0] = -1; - conn->pipe_a_b[1] = conn->pipe_b_a[1] = -1; + conn->pipe[0][0] = conn->pipe[1][0] = -1; + conn->pipe[0][1] = conn->pipe[1][1] = -1; for (i = 0; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) { if (splice_pipe_pool[i][0] >= 0) { - SWAP(conn->pipe_a_b[0], splice_pipe_pool[i][0]); - SWAP(conn->pipe_a_b[1], splice_pipe_pool[i][1]); + SWAP(conn->pipe[0][0], splice_pipe_pool[i][0]); + SWAP(conn->pipe[0][1], splice_pipe_pool[i][1]); break; } } - if (conn->pipe_a_b[0] < 0) { - if (pipe2(conn->pipe_a_b, O_NONBLOCK | O_CLOEXEC)) { - err("TCP (spliced): cannot create a->b pipe: %s", + if (conn->pipe[0][0] < 0) { + if (pipe2(conn->pipe[0], O_NONBLOCK | O_CLOEXEC)) { + err("TCP (spliced): cannot create 0->1 pipe: %s", strerror(errno)); conn_flag(c, conn, CLOSING); return -EIO; } - if (fcntl(conn->pipe_a_b[0], F_SETPIPE_SZ, c->tcp.pipe_size)) { - trace("TCP (spliced): cannot set a->b pipe size to %lu", + if (fcntl(conn->pipe[0][0], F_SETPIPE_SZ, c->tcp.pipe_size)) { + trace("TCP (spliced): cannot set 0->1 pipe size to %lu", c->tcp.pipe_size); } } for (; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) { if (splice_pipe_pool[i][0] >= 0) { - SWAP(conn->pipe_b_a[0], splice_pipe_pool[i][0]); - SWAP(conn->pipe_b_a[1], splice_pipe_pool[i][1]); + SWAP(conn->pipe[1][0], splice_pipe_pool[i][0]); + SWAP(conn->pipe[1][1], splice_pipe_pool[i][1]); break; } } - if (conn->pipe_b_a[0] < 0) { - if (pipe2(conn->pipe_b_a, O_NONBLOCK | O_CLOEXEC)) { - err("TCP (spliced): cannot create b->a pipe: %s", + if (conn->pipe[1][0] < 0) { + if (pipe2(conn->pipe[1], O_NONBLOCK | O_CLOEXEC)) { + err("TCP (spliced): cannot create 1->0 pipe: %s", strerror(errno)); conn_flag(c, conn, CLOSING); return -EIO; } - if (fcntl(conn->pipe_b_a[0], F_SETPIPE_SZ, c->tcp.pipe_size)) { - trace("TCP (spliced): cannot set b->a pipe size to %lu", + if (fcntl(conn->pipe[1][0], F_SETPIPE_SZ, c->tcp.pipe_size)) { + trace("TCP (spliced): cannot set 1->0 pipe size to %lu", c->tcp.pipe_size); } } @@ -379,12 +377,12 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, const struct sockaddr *sa; socklen_t sl; - conn->b = sock_conn; + conn->s[1] = sock_conn; - if (setsockopt(conn->b, SOL_TCP, TCP_QUICKACK, + if (setsockopt(conn->s[1], SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int))) { trace("TCP (spliced): failed to set TCP_QUICKACK on socket %i", - conn->b); + conn->s[1]); } if (CONN_V6(conn)) { @@ -395,7 +393,7 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, sl = sizeof(addr4); } - if (connect(conn->b, sa, sl)) { + if (connect(conn->s[1], sa, sl)) { if (errno != EINPROGRESS) { int ret = -errno; @@ -473,13 +471,13 @@ static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, { if (!reverse) { *from = ref_sock; - *to = (*from == conn->a) ? conn->b : conn->a; + *to = (*from == conn->s[0]) ? conn->s[1] : conn->s[0]; } else { *to = ref_sock; - *from = (*to == conn->a) ? conn->b : conn->a; + *from = (*to == conn->s[0]) ? conn->s[1] : conn->s[0]; } - *pipes = *from == conn->a ? conn->pipe_a_b : conn->pipe_b_a; + *pipes = *from == conn->s[0] ? conn->pipe[0] : conn->pipe[1]; } /** @@ -521,7 +519,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, trace("TCP (spliced): failed to set TCP_QUICKACK on %i", s); conn->c.spliced = true; - conn->a = s; + conn->s[0] = s; if (tcp_splice_new(c, conn, ref.port, ref.pif)) conn_flag(c, conn, CLOSING); @@ -559,10 +557,10 @@ void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn, } if (events & EPOLLOUT) { - if (s == conn->a) - conn_event(c, conn, ~A_OUT_WAIT); + if (s == conn->s[0]) + conn_event(c, conn, ~OUT_WAIT_0); else - conn_event(c, conn, ~B_OUT_WAIT); + conn_event(c, conn, ~OUT_WAIT_1); tcp_splice_dir(conn, s, 1, &from, &to, &pipes); } else { @@ -570,33 +568,33 @@ void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn, } if (events & EPOLLRDHUP) { - if (s == conn->a) - conn_event(c, conn, A_FIN_RCVD); + if (s == conn->s[0]) + conn_event(c, conn, FIN_RCVD_0); else - conn_event(c, conn, B_FIN_RCVD); + conn_event(c, conn, FIN_RCVD_1); } if (events & EPOLLHUP) { - if (s == conn->a) - conn_event(c, conn, A_FIN_SENT); /* Fake, but implied */ + if (s == conn->s[0]) + conn_event(c, conn, FIN_SENT_0); /* Fake, but implied */ else - conn_event(c, conn, B_FIN_SENT); + conn_event(c, conn, FIN_SENT_1); } swap: eof = 0; never_read = 1; - if (from == conn->a) { - seq_read = &conn->a_read; - seq_write = &conn->a_written; - lowat_set_flag = RCVLOWAT_SET_A; - lowat_act_flag = RCVLOWAT_ACT_A; + if (from == conn->s[0]) { + seq_read = &conn->read[0]; + seq_write = &conn->written[0]; + lowat_set_flag = RCVLOWAT_SET_0; + lowat_act_flag = RCVLOWAT_ACT_0; } else { - seq_read = &conn->b_read; - seq_write = &conn->b_written; - lowat_set_flag = RCVLOWAT_SET_B; - lowat_act_flag = RCVLOWAT_ACT_B; + seq_read = &conn->read[1]; + seq_write = &conn->written[1]; + lowat_set_flag = RCVLOWAT_SET_1; + lowat_act_flag = RCVLOWAT_ACT_1; } while (1) { @@ -666,10 +664,10 @@ eintr: if (never_read) break; - if (to == conn->a) - conn_event(c, conn, A_OUT_WAIT); + if (to == conn->s[0]) + conn_event(c, conn, OUT_WAIT_0); else - conn_event(c, conn, B_OUT_WAIT); + conn_event(c, conn, OUT_WAIT_1); break; } @@ -685,31 +683,31 @@ eintr: break; } - if ((conn->events & A_FIN_RCVD) && !(conn->events & B_FIN_SENT)) { + if ((conn->events & FIN_RCVD_0) && !(conn->events & FIN_SENT_1)) { if (*seq_read == *seq_write && eof) { - shutdown(conn->b, SHUT_WR); - conn_event(c, conn, B_FIN_SENT); + shutdown(conn->s[1], SHUT_WR); + conn_event(c, conn, FIN_SENT_1); } } - if ((conn->events & B_FIN_RCVD) && !(conn->events & A_FIN_SENT)) { + if ((conn->events & FIN_RCVD_1) && !(conn->events & FIN_SENT_0)) { if (*seq_read == *seq_write && eof) { - shutdown(conn->a, SHUT_WR); - conn_event(c, conn, A_FIN_SENT); + shutdown(conn->s[0], SHUT_WR); + conn_event(c, conn, FIN_SENT_0); } } - if (CONN_HAS(conn, A_FIN_SENT | B_FIN_SENT)) + if (CONN_HAS(conn, FIN_SENT_0 | FIN_SENT_1)) goto close; if ((events & (EPOLLIN | EPOLLOUT)) == (EPOLLIN | EPOLLOUT)) { events = EPOLLIN; SWAP(from, to); - if (pipes == conn->pipe_a_b) - pipes = conn->pipe_b_a; + if (pipes == conn->pipe[0]) + pipes = conn->pipe[1]; else - pipes = conn->pipe_a_b; + pipes = conn->pipe[0]; goto swap; } @@ -843,26 +841,26 @@ void tcp_splice_timer(struct ctx *c, union tcp_conn *conn_union) return; } - if ( (conn->flags & RCVLOWAT_SET_A) && - !(conn->flags & RCVLOWAT_ACT_A)) { - if (setsockopt(conn->a, SOL_SOCKET, SO_RCVLOWAT, + if ( (conn->flags & RCVLOWAT_SET_0) && + !(conn->flags & RCVLOWAT_ACT_0)) { + if (setsockopt(conn->s[0], SOL_SOCKET, SO_RCVLOWAT, &((int){ 1 }), sizeof(int))) { trace("TCP (spliced): can't set SO_RCVLOWAT on " - "%i", conn->a); + "%i", conn->s[0]); } - conn_flag(c, conn, ~RCVLOWAT_SET_A); + conn_flag(c, conn, ~RCVLOWAT_SET_0); } - if ( (conn->flags & RCVLOWAT_SET_B) && - !(conn->flags & RCVLOWAT_ACT_B)) { - if (setsockopt(conn->b, SOL_SOCKET, SO_RCVLOWAT, + if ( (conn->flags & RCVLOWAT_SET_1) && + !(conn->flags & RCVLOWAT_ACT_1)) { + if (setsockopt(conn->s[1], SOL_SOCKET, SO_RCVLOWAT, &((int){ 1 }), sizeof(int))) { trace("TCP (spliced): can't set SO_RCVLOWAT on " - "%i", conn->b); + "%i", conn->s[1]); } - conn_flag(c, conn, ~RCVLOWAT_SET_B); + conn_flag(c, conn, ~RCVLOWAT_SET_1); } - conn_flag(c, conn, ~RCVLOWAT_ACT_A); - conn_flag(c, conn, ~RCVLOWAT_ACT_B); + conn_flag(c, conn, ~RCVLOWAT_ACT_0); + conn_flag(c, conn, ~RCVLOWAT_ACT_1); } -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 08/11] tcp_splice: Exploit side symmetry in tcp_splice_timer()

tcp_splice_timer() has two very similar blocks one after another that handle the SO_RCVLOWAT flags for the two sides of the connection. We can deduplicate this with a loop across the two sides. Signed-off-by: David Gibson --- tcp_splice.c | 27 +++++++++++---------------- 1 file changed, 11 insertions(+), 16 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index f405184..cadad32 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -835,30 +835,25 @@ void tcp_splice_init(struct ctx *c) void tcp_splice_timer(struct ctx *c, union tcp_conn *conn_union) { struct tcp_splice_conn *conn = &conn_union->splice; + int side; if (conn->flags & CLOSING) { tcp_splice_destroy(c, conn_union); return; } - if ( (conn->flags & RCVLOWAT_SET_0) && - !(conn->flags & RCVLOWAT_ACT_0)) { - if (setsockopt(conn->s[0], SOL_SOCKET, SO_RCVLOWAT, - &((int){ 1 }), sizeof(int))) { - trace("TCP (spliced): can't set SO_RCVLOWAT on " - "%i", conn->s[0]); - } - conn_flag(c, conn, ~RCVLOWAT_SET_0); - } + for (side = 0; side < SIDES; side++) { + uint8_t set = side == 0 ? RCVLOWAT_SET_0 : RCVLOWAT_SET_1; + uint8_t act = side == 0 ? RCVLOWAT_ACT_0 : RCVLOWAT_ACT_1; - if ( (conn->flags & RCVLOWAT_SET_1) && - !(conn->flags & RCVLOWAT_ACT_1)) { - if (setsockopt(conn->s[1], SOL_SOCKET, SO_RCVLOWAT, - &((int){ 1 }), sizeof(int))) { - trace("TCP (spliced): can't set SO_RCVLOWAT on " - "%i", conn->s[1]); + if ((conn->flags & set) && !(conn->flags & act)) { + if (setsockopt(conn->s[side], SOL_SOCKET, SO_RCVLOWAT, + &((int){ 1 }), sizeof(int))) { + trace("TCP (spliced): can't set SO_RCVLOWAT on " + "%i", conn->s[side]); + } + conn_flag(c, conn, ~set); } - conn_flag(c, conn, ~RCVLOWAT_SET_1); } conn_flag(c, conn, ~RCVLOWAT_ACT_0); -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 09/11] tcp_splice: Exploit side symmetry in tcp_splice_connect_finish()

tcp_splice_connect_finish() has two very similar blocks opening the two pipes for each direction of the connection. We can deduplicate this with a loop across the two sides. Signed-off-by: David Gibson --- tcp_splice.c | 65 ++++++++++++++++++++-------------------------------- 1 file changed, 25 insertions(+), 40 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index cadad32..214bf22 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -299,50 +299,35 @@ void tcp_splice_destroy(struct ctx *c, union tcp_conn *conn_union) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { - int i; - - conn->pipe[0][0] = conn->pipe[1][0] = -1; - conn->pipe[0][1] = conn->pipe[1][1] = -1; - - for (i = 0; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) { - if (splice_pipe_pool[i][0] >= 0) { - SWAP(conn->pipe[0][0], splice_pipe_pool[i][0]); - SWAP(conn->pipe[0][1], splice_pipe_pool[i][1]); - break; - } - } - if (conn->pipe[0][0] < 0) { - if (pipe2(conn->pipe[0], O_NONBLOCK | O_CLOEXEC)) { - err("TCP (spliced): cannot create 0->1 pipe: %s", - strerror(errno)); - conn_flag(c, conn, CLOSING); - return -EIO; - } + int i = 0; + int side; - if (fcntl(conn->pipe[0][0], F_SETPIPE_SZ, c->tcp.pipe_size)) { - trace("TCP (spliced): cannot set 0->1 pipe size to %lu", - c->tcp.pipe_size); + for (side = 0; side < SIDES; side++) { + conn->pipe[side][0] = conn->pipe[side][1] = -1; + + for (; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) { + if (splice_pipe_pool[i][0] >= 0) { + SWAP(conn->pipe[side][0], + splice_pipe_pool[i][0]); + SWAP(conn->pipe[side][1], + splice_pipe_pool[i][1]); + break; + } } - } - for (; i < TCP_SPLICE_PIPE_POOL_SIZE; i++) { - if (splice_pipe_pool[i][0] >= 0) { - SWAP(conn->pipe[1][0], splice_pipe_pool[i][0]); - SWAP(conn->pipe[1][1], splice_pipe_pool[i][1]); - break; - } - } - if (conn->pipe[1][0] < 0) { - if (pipe2(conn->pipe[1], O_NONBLOCK | O_CLOEXEC)) { - err("TCP (spliced): cannot create 1->0 pipe: %s", - strerror(errno)); - conn_flag(c, conn, CLOSING); - return -EIO; - } + if (conn->pipe[side][0] < 0) { + if (pipe2(conn->pipe[side], O_NONBLOCK | O_CLOEXEC)) { + err("TCP (spliced): cannot create %d->%d pipe: %s", + side, !side, strerror(errno)); + conn_flag(c, conn, CLOSING); + return -EIO; + } - if (fcntl(conn->pipe[1][0], F_SETPIPE_SZ, c->tcp.pipe_size)) { - trace("TCP (spliced): cannot set 1->0 pipe size to %lu", - c->tcp.pipe_size); + if (fcntl(conn->pipe[side][0], F_SETPIPE_SZ, + c->tcp.pipe_size)) { + trace("TCP (spliced): cannot set %d->%d pipe size to %lu", + side, !side, c->tcp.pipe_size); + } } } -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 10/11] tcp_splice: Exploit side symmetry in tcp_splice_destroy()

tcp_splice_destroy() has some close-to-duplicated logic handling closing of the socket and pipes for each side of the connection. We can use a loop across the sides to reduce the duplication. Signed-off-by: David Gibson --- tcp_splice.c | 32 ++++++++++++++------------------ 1 file changed, 14 insertions(+), 18 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 214bf22..9f84d4f 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -258,30 +258,26 @@ void tcp_splice_conn_update(const struct ctx *c, struct tcp_splice_conn *new) void tcp_splice_destroy(struct ctx *c, union tcp_conn *conn_union) { struct tcp_splice_conn *conn = &conn_union->splice; + int side; - if (conn->events & SPLICE_ESTABLISHED) { - /* Flushing might need to block: don't recycle them. */ - if (conn->pipe[0][0] != -1) { - close(conn->pipe[0][0]); - close(conn->pipe[0][1]); - conn->pipe[0][0] = conn->pipe[0][1] = -1; + for (side = 0; side < SIDES; side++) { + if (conn->events & SPLICE_ESTABLISHED) { + /* Flushing might need to block: don't recycle them. */ + if (conn->pipe[side][0] != -1) { + close(conn->pipe[side][0]); + close(conn->pipe[side][1]); + conn->pipe[side][0] = conn->pipe[side][1] = -1; + } } - if (conn->pipe[1][0] != -1) { - close(conn->pipe[1][0]); - close(conn->pipe[1][1]); - conn->pipe[1][0] = conn->pipe[1][1] = -1; + + if (side == 0 || conn->events & SPLICE_CONNECT) { + close(conn->s[side]); + conn->s[side] = -1; } - } - if (conn->events & SPLICE_CONNECT) { - close(conn->s[1]); - conn->s[1] = -1; + conn->read[side] = conn->written[side] = 0; } - close(conn->s[0]); - conn->s[0] = -1; - conn->read[0] = conn->written[0] = conn->read[1] = conn->written[1] = 0; - conn->events = SPLICE_CLOSED; conn->flags = 0; debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn)); -- 2.41.0

David Gibson

3:42 a.m.

New subject: [PATCH v2 11/11] tcp_splice: Simplify selection of socket and pipe sides in socket handler

tcp_splice_sock_handler() uses the tcp_splice_dir() helper to select which of the socket, pipe and counter fields to use depending on which side of the connection the socket event is coming from. Now that we are using arrays for the two sides, rather than separate named fields, we can instead just use a variable indicating the side and use that to index the arrays whever we need a particular side's field. Signed-off-by: David Gibson --- tcp_splice.c | 81 ++++++++++++++-------------------------------------- 1 file changed, 22 insertions(+), 59 deletions(-) diff --git a/tcp_splice.c b/tcp_splice.c index 9f84d4f..a5c1332 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -438,29 +438,6 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn, return tcp_splice_connect(c, conn, s, port); } -/** - * tcp_splice_dir() - Set sockets/pipe pointers reflecting flow direction - * @conn: Connection pointers - * @ref_sock: Socket returned as reference from epoll - * @reverse: Reverse direction: @ref_sock is used as destination - * @from: Destination socket pointer to set - * @to: Source socket pointer to set - * @pipes: Pipe set, assigned on return - */ -static void tcp_splice_dir(struct tcp_splice_conn *conn, int ref_sock, - int reverse, int *from, int *to, int **pipes) -{ - if (!reverse) { - *from = ref_sock; - *to = (*from == conn->s[0]) ? conn->s[1] : conn->s[0]; - } else { - *to = ref_sock; - *from = (*to == conn->s[0]) ? conn->s[1] : conn->s[0]; - } - - *pipes = *from == conn->s[0] ? conn->pipe[0] : conn->pipe[1]; -} - /** * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context @@ -521,8 +498,7 @@ void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn, int s, uint32_t events) { uint8_t lowat_set_flag, lowat_act_flag; - int from, to, *pipes, eof, never_read; - uint32_t *seq_read, *seq_write; + int fromside, eof, never_read; if (conn->events == SPLICE_CLOSED) return; @@ -538,14 +514,15 @@ void tcp_splice_sock_handler(struct ctx *c, struct tcp_splice_conn *conn, } if (events & EPOLLOUT) { - if (s == conn->s[0]) + if (s == conn->s[0]) { conn_event(c, conn, ~OUT_WAIT_0); - else + fromside = 1; + } else { conn_event(c, conn, ~OUT_WAIT_1); - - tcp_splice_dir(conn, s, 1, &from, &to, &pipes); + fromside = 0; + } } else { - tcp_splice_dir(conn, s, 0, &from, &to, &pipes); + fromside = s == conn->s[0] ? 0 : 1; } if (events & EPOLLRDHUP) { @@ -566,24 +543,16 @@ swap: eof = 0; never_read = 1; - if (from == conn->s[0]) { - seq_read = &conn->read[0]; - seq_write = &conn->written[0]; - lowat_set_flag = RCVLOWAT_SET_0; - lowat_act_flag = RCVLOWAT_ACT_0; - } else { - seq_read = &conn->read[1]; - seq_write = &conn->written[1]; - lowat_set_flag = RCVLOWAT_SET_1; - lowat_act_flag = RCVLOWAT_ACT_1; - } + lowat_set_flag = fromside == 0 ? RCVLOWAT_SET_0 : RCVLOWAT_SET_1; + lowat_act_flag = fromside == 0 ? RCVLOWAT_ACT_0 : RCVLOWAT_ACT_1; while (1) { ssize_t readlen, to_write = 0, written; int more = 0; retry: - readlen = splice(from, NULL, pipes[1], NULL, c->tcp.pipe_size, + readlen = splice(conn->s[fromside], NULL, + conn->pipe[fromside][1], NULL, c->tcp.pipe_size, SPLICE_F_MOVE | SPLICE_F_NONBLOCK); trace("TCP (spliced): %li from read-side call", readlen); if (readlen < 0) { @@ -608,7 +577,8 @@ retry: } eintr: - written = splice(pipes[0], NULL, to, NULL, to_write, + written = splice(conn->pipe[fromside][0], NULL, + conn->s[!fromside], NULL, to_write, SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK); trace("TCP (spliced): %li from write-side call (passed %lu)", written, to_write); @@ -622,8 +592,8 @@ eintr: readlen > (long)c->tcp.pipe_size / 10) { int lowat = c->tcp.pipe_size / 4; - setsockopt(from, SOL_SOCKET, SO_RCVLOWAT, - &lowat, sizeof(lowat)); + setsockopt(conn->s[fromside], SOL_SOCKET, + SO_RCVLOWAT, &lowat, sizeof(lowat)); conn_flag(c, conn, lowat_set_flag); conn_flag(c, conn, lowat_act_flag); @@ -632,8 +602,8 @@ eintr: break; } - *seq_read += readlen > 0 ? readlen : 0; - *seq_write += written > 0 ? written : 0; + conn->read[fromside] += readlen > 0 ? readlen : 0; + conn->written[fromside] += written > 0 ? written : 0; if (written < 0) { if (errno == EINTR) @@ -645,10 +615,8 @@ eintr: if (never_read) break; - if (to == conn->s[0]) - conn_event(c, conn, OUT_WAIT_0); - else - conn_event(c, conn, OUT_WAIT_1); + conn_event(c, conn, + fromside == 0 ? OUT_WAIT_1 : OUT_WAIT_0); break; } @@ -665,14 +633,14 @@ eintr: } if ((conn->events & FIN_RCVD_0) && !(conn->events & FIN_SENT_1)) { - if (*seq_read == *seq_write && eof) { + if (conn->read[fromside] == conn->written[fromside] && eof) { shutdown(conn->s[1], SHUT_WR); conn_event(c, conn, FIN_SENT_1); } } if ((conn->events & FIN_RCVD_1) && !(conn->events & FIN_SENT_0)) { - if (*seq_read == *seq_write && eof) { + if (conn->read[fromside] == conn->written[fromside] && eof) { shutdown(conn->s[0], SHUT_WR); conn_event(c, conn, FIN_SENT_0); } @@ -684,12 +652,7 @@ eintr: if ((events & (EPOLLIN | EPOLLOUT)) == (EPOLLIN | EPOLLOUT)) { events = EPOLLIN; - SWAP(from, to); - if (pipes == conn->pipe[0]) - pipes = conn->pipe[1]; - else - pipes = conn->pipe[0]; - + fromside = !fromside; goto swap; } -- 2.41.0

Stefano Brivio

1:45 p.m.

On Tue, 7 Nov 2023 13:42:39 +1100 David Gibson wrote:

...

For spliced connections, both "sides" are sockets, and for many purposes how we deal with each side is symmetric. Currently, however, we track the information for each side in independent fields in the structure, meaning we can't easily exploit that symmetry.

This makes a number of reorganizations of the tcp splice code so that we can explot that symmetry to reduce code size. This will have some additional advantages when we come to integrate with the in-progress unified flow table.

Based on top of the interface identifiers and automatic forwarding cleanup series.

Changes since v1: * Small updates to comments and commit messages

David Gibson (11): tcp_splice: Remove redundant tcp_splice_epoll_ctl() tcp_splice: Correct error handling in tcp_splice_epoll_ctl() tcp_splice: Don't handle EPOLL_CTL_DEL as part of tcp_splice_epoll_ctl() tcp_splice: Remove unnecessary forward declaration tcp_splice: Avoid awkward temporaries in tcp_splice_epoll_ctl() tcp_splice: Don't pool pipes in pairs tcp_splice: Rename sides of connection from a/b to 0/1 tcp_splice: Exploit side symmetry in tcp_splice_timer() tcp_splice: Exploit side symmetry in tcp_splice_connect_finish() tcp_splice: Exploit side symmetry in tcp_splice_destroy() tcp_splice: Simplify selection of socket and pipe sides in socket handler

Applied. -- Stefano

622

Age (days ago)

622

Last active (days ago)

List overview

Download

12 comments

2 participants

participants (2)

David Gibson
Stefano Brivio

[PATCH v2 00/11] tcp_splice: Better exploit symmetry between sides of connection

tags

participants (2)