[PATCH v2 00/32] Use dual stack sockets to listen for inbound TCP connections
When forwarding many ports, passt can consume a lot of kernel memory
because of the many listening sockets it opens. There are not a lot of
ways we can reduce that, but here's one.

Currently we create separate listening sockets for each port for both
IPv4 and IPv6. However, in Linux (and probably other platforms), it's
possible to listen for both IPv4 and IPv6 connections on an IPv6
socket. This series uses such dual stack sockets to halve the number
of listening sockets needed for TCP.

When forwarding all TCP and UDP ports, this reduces the kernel memory
used from around 677 MiB to around 487 MiB (kernel 6.0.8 on an x86_64
Fedora 37 machine).

This should also be possible for UDP, but that will require a mostly
separate implementation.

Changes since v2:
 * Assorted minor polishing based on Stefano's review

David Gibson (32):
  clang-tidy: Suppress warning about assignments in if statements
  style: Minor corrections to function comments
  tcp_splice: #include tcp_splice.h in tcp_splice.c
  tcp: Remove unused TCP_MAX_SOCKS constant
  tcp: Better helpers for converting between connection pointer and index
  tcp_splice: Helpers for converting from index to/from tcp_splice_conn
  tcp: Move connection state structures into a shared header
  tcp: Add connection union type
  tcp: Improved helpers to update connections after moving
  tcp: Unify spliced and non-spliced connection tables
  tcp: Unify tcp_defer_handler and tcp_splice_defer_handler()
  tcp: Partially unify tcp_timer() and tcp_splice_timer()
  tcp: Unify the IN_EPOLL flag
  tcp: Separate helpers to create ns listening sockets
  tcp: Unify part of spliced and non-spliced conn_from_sock path
  tcp: Use the same sockets to listen for spliced and non-spliced connections
  tcp: Remove splice from tcp_epoll_ref
  tcp: Don't store hash bucket in connection structures
  inany: Helper functions for handling addresses which could be IPv4 or IPv6
  tcp: Hash IPv4 and IPv4-mapped-IPv6 addresses the same
  tcp: Take tcp_hash_insert() address from struct tcp_conn
  tcp: Simplify tcp_hash_match() to take an inany_addr
  tcp: Unify initial sequence number calculation for IPv4 and IPv6
  tcp: Have tcp_seq_init() take its parameters from struct tcp_conn
  tcp: Fix small errors in tcp_seq_init() time handling
  tcp: Remove v6 flag from tcp_epoll_ref
  tcp: NAT IPv4-mapped IPv6 addresses like IPv4 addresses
  tcp_splice: Allow splicing of connections from IPv4-mapped loopback
  tcp: Consolidate tcp_sock_init[46]
  util: Allow sock_l4() to open dual stack sockets
  util: Always return -1 on error in sock_l4()
  tcp: Use dual stack sockets for port forwarding when possible

 Makefile     |  15 +-
 conf.c       |  12 +-
 inany.h      |  94 +++++
 siphash.c    |   2 +
 tap.c        |   6 +-
 tcp.c        | 981 ++++++++++++++++++++++----------------------
 tcp.h        |  11 +-
 tcp_conn.h   | 192 ++++++++++
 tcp_splice.c | 333 +++++++----------
 tcp_splice.h |  12 +-
 util.c       |  19 +-
 11 files changed, 891 insertions(+), 786 deletions(-)
 create mode 100644 inany.h
 create mode 100644 tcp_conn.h

-- 
2.38.1
clang-tools 15.0.0 appears to have added a new warning that will always
complain about assignments in if statements, which we use in a number of
places in passt/pasta. Encountered on Fedora 37 with
clang-tools-extra-15.0.0-3.fc37.x86_64.
Suppress the new warning so that we can compile and test.
Signed-off-by: David Gibson
Some style issues and a typo.
Signed-off-by: David Gibson
This obvious include was omitted, which means that declarations in the
header weren't checked against definitions in the .c file. This shows up
an old declaration for a function that is now static, and a duplicate
#define.
Signed-off-by: David Gibson
Presumably it meant something in the past, but it's no longer used.
Signed-off-by: David Gibson
The macro CONN_OR_NULL() is used to look up connections by index with
bounds checking. Replace it with an inline function, which means:
- Better type checking
- No danger of multiple evaluation of an @index with side effects
Also add a helper to perform the reverse translation: from connection
pointer to index. Introduce a macro for this which will make later
cleanups easier and safer.
Signed-off-by: David Gibson
Like we already have for non-spliced connections, create a CONN_IDX()
macro for looking up the index of spliced connection structures. Change
the name of the array of spliced connections to be different from that for
non-spliced connections (even though they're in different modules). This
will make subsequent changes a bit safer.
Signed-off-by: David Gibson
Currently spliced and non-spliced connections use completely independent
tracking structures. We want to unify these, so as a preliminary step move
the definitions for both variants into a new tcp_conn.h header, shared by
tcp.c and tcp_splice.c.
This requires renaming some #defines with the same name but different
meanings between the two cases. In the process we correct some places that
are slightly out of sync between the comments and the code for various
event bit names.
Signed-off-by: David Gibson
Currently, the tables for spliced and non-spliced connections are entirely
separate, with different types in different arrays. We want to unify them.
As a first step, create a union type which can represent either a spliced
or non-spliced connection. For them to be distinguishable, the individual
types need to have a common header added, with a bit indicating which type
this structure is.
This comes at the cost of increasing the size of tcp_tap_conn to over one
(64 byte) cacheline. This isn't ideal, but it makes things simpler for now
and we'll re-optimize this later.
Signed-off-by: David Gibson
On Thu, 17 Nov 2022 16:58:44 +1100
David Gibson
[...]
---
 tcp.c        |  4 ++++
 tcp_conn.h   | 30 ++++++++++++++++++++++++++++++
 tcp_splice.c |  2 ++
 3 files changed, 36 insertions(+)

diff --git a/tcp.c b/tcp.c
index 189041c..05eed85 100644
--- a/tcp.c
+++ b/tcp.c
@@ -288,6 +288,7 @@
 #include
 #include
 #include
+#include
 #include

 /* For struct tcp_info */
@@ -601,6 +602,7 @@ static inline struct tcp_tap_conn *conn_at_idx(int index)
 {
 	if ((index < 0) || (index >= TCP_MAX_CONNS))
 		return NULL;
+	assert(!(CONN(index)->c.spliced));
 	return CONN(index);
 }
@@ -2096,6 +2098,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
 	}

 	conn = CONN(c->tcp.conn_count++);
+	conn->c.spliced = false;
 	conn->sock = s;
 	conn->timer = -1;
 	conn_event(c, conn, TAP_SYN_RCVD);
@@ -2764,6 +2767,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 		return;

 	conn = CONN(c->tcp.conn_count++);
+	conn->c.spliced = false;
 	conn->sock = s;
 	conn->timer = -1;
 	conn->ws_to_tap = conn->ws_from_tap = 0;
diff --git a/tcp_conn.h b/tcp_conn.h
index db4c2d9..39d104a 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -11,8 +11,19 @@
 #define TCP_HASH_BUCKET_BITS	(TCP_CONN_INDEX_BITS + 1)

+/**
+ * struct tcp_conn_common - Common fields for spliced and non-spliced
+ * @spliced:	Is this a spliced connection?
+ */
+struct tcp_conn_common {
+	bool spliced	:1;
+};
+
+extern const char *tcp_common_flag_str[];
+
 /**
  * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
+ * @c:			Fields common with tcp_splice_conn
  * @next_index:		Connection index of next item in hash chain, -1 for none
  * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
  * @sock:		Socket descriptor number
@@ -40,6 +51,9 @@
  * @seq_init_from_tap:	Initial sequence number from tap
  */
 struct tcp_tap_conn {
+	/* Must be first element to match tcp_splice_conn */
+	struct tcp_conn_common c;
+
 	int	next_index	:TCP_CONN_INDEX_BITS + 2;

 #define TCP_RETRANS_BITS	3
@@ -122,6 +136,7 @@ struct tcp_tap_conn {

 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
+ * @c:			Fields common with tcp_tap_conn
  * @a:			File descriptor number of socket for accepted connection
  * @pipe_a_b:		Pipe ends for splice() from @a to @b
  * @b:			File descriptor number of peer connected socket
@@ -134,6 +149,9 @@ struct tcp_tap_conn {
  * @b_written:		Bytes written to @b (not fully written from one @a read)
  */
 struct tcp_splice_conn {
+	/* Must be first element to match tcp_tap_conn */
+	struct tcp_conn_common c;
+
 	int a;
 	int pipe_a_b[2];
 	int b;
@@ -165,4 +183,16 @@ struct tcp_splice_conn {
 	uint32_t b_written;
 };

+/**
+ * union tcp_conn - Descriptor for a TCP connection (spliced or non-spliced)
+ * @c:		Fields common between all variants
+ * @tap:	Fields specific to non-spliced connections
+ * @splice:	Fields specific to spliced connections
+ */
+union tcp_conn {
+	struct tcp_conn_common c;
+	struct tcp_tap_conn tap;
+	struct tcp_splice_conn splice;
+};
Sorry, I could have noticed earlier: I understand that this is needed
to end up, at the end of the series, with a 64-byte tcp_conn, but it
doesn't really look like the most natural way of doing things. I would
have expected something like:
struct tcp_conn {
struct tcp_conn_common c;
union {
struct tcp_tap_conn tap;
struct tcp_splice_conn splice;
} u;
};
but sure, if we do this, then we have 3 bytes between 'c' and 'u', and
struct tcp_conn becomes 68 bytes long.
It also confuses Coverity Scan, because in tcp_table_compact() we have:
memset(hole, 0, sizeof(*hole));
and while the prototype is:
void tcp_table_compact(struct ctx *c, union tcp_conn *hole)
it sees that we're passing, from tcp_splice_destroy(), something
smaller than that (48 bytes), but we're zeroing the whole thing.
Of course, it's not a real issue, that space is reserved for a
connection slot anyway, but given there are no other issues reported,
I'd try to keep Coverity happy if possible.
First try, failed: check hole->c.spliced and, if set, zero only
sizeof(struct tcp_splice_conn) bytes. This looks like a false
positive.
Another try, which should probably work (I just hit the daily build
submission quota, grr): explicitly pass the union tcp_conn containing
our struct tcp_splice_conn. This patch does it:
---
diff --git a/tcp.c b/tcp.c
index 8874789..d635a8e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -591,7 +591,7 @@ static size_t tcp6_l2_flags_buf_bytes;
union tcp_conn tc[TCP_MAX_CONNS];
#define CONN(index) (&tc[(index)].tap)
-#define CONN_IDX(conn) ((union tcp_conn *)(conn) - tc)
+#define CONN_IDX(conn) (TCP_TAP_TO_COMMON(conn) - tc)
/** conn_at_idx() - Find a connection by index, if present
* @index: Index of connection to lookup
@@ -1385,7 +1385,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn)
close(conn->timer);
tcp_hash_remove(c, conn);
- tcp_table_compact(c, (union tcp_conn *)conn);
+ tcp_table_compact(c, TCP_TAP_TO_COMMON(conn));
}
static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
diff --git a/tcp_conn.h b/tcp_conn.h
index 4a8be29..fa407ad 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -176,6 +176,12 @@ union tcp_conn {
struct tcp_splice_conn splice;
};
+#define TCP_TAP_TO_COMMON(x) \
+ ((union tcp_conn *)((char *)(x) - offsetof(union tcp_conn, tap)))
+
+#define TCP_SPLICE_TO_COMMON(x) \
+ ((union tcp_conn *)((char *)(x) - offsetof(union tcp_conn, splice)))
+
/* TCP connections */
extern union tcp_conn tc[];
diff --git a/tcp_splice.c b/tcp_splice.c
index e2f0ce1..7d3f17e 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -37,6 +37,7 @@
#include
On Fri, Nov 18, 2022 at 01:25:18AM +0100, Stefano Brivio wrote:
On Thu, 17 Nov 2022 16:58:44 +1100 David Gibson
wrote:

[...]
Sorry, I could have noticed earlier: I understand that this is needed to end up, at the end of the series, with a 64-byte tcp_conn, but it doesn't really look like the most natural way of doing things. I would have expected something like:
struct tcp_conn {
	struct tcp_conn_common c;
	union {
		struct tcp_tap_conn tap;
		struct tcp_splice_conn splice;
	} u;
};
Heh... so... I actually went back and forth between these two options
about a dozen times while figuring out how to do this. Getting the
thing back down to 64 bytes is not actually the main reason I settled
on the current approach, although it does also make that easier.

The biggest reason I didn't use the union-in-struct approach is that I
wasn't sure if we wanted to introduce container_of mechanics (as you do
in the patch below). Without that we'd have to pass tcp_conn to a lot
of places that mostly just want a tap or splice variant, just so they
can get the index. That in turn also means we have to explicitly
reference into the union in a heap of places - I probably spent an hour
or two attempting this, getting bogged down in trivial text changes,
before giving up and going a different way.

And finally, as you note, with the union-of-structs approach the
compiler is able to pack other bitfields into the spare bits left by
the common header, which makes it much, much easier to get the whole
thing down to 64 bytes again.

In general I don't love the "make sure these structs have the same
header" approach over enforcing it with a union-in-struct, but that's
why I chose it here. Note also that we are already dealing with a
similar pattern for sockaddrs.
but sure, if we do this, then we have 3 bytes between 'c' and 'u', and struct tcp_conn becomes 68 bytes long.
Right. I toyed with a bunch of different options for trying to pull various fields out into the common part to fill that gap, and they were all really painful.
It also confuses Coverity Scan, because in tcp_table_compact() we have:
memset(hole, 0, sizeof(*hole));
and while the prototype is:
void tcp_table_compact(struct ctx *c, union tcp_conn *hole)
it sees that we're passing, from tcp_splice_destroy(), something smaller than that (48 bytes), but we're zeroing the whole thing.
Bother.
Of course, it's not a real issue, that space is reserved for a connection slot anyway, but given there are no other issues reported, I'd try to keep Coverity happy if possible.
First try, failed: check hole->c.spliced and, if set, zero only sizeof(struct tcp_splice_conn) bytes. This looks like a false positive.
Another try, which should probably work (I just hit the daily build submission quota, grr): explicitly pass the union tcp_conn containing our struct tcp_splice_conn. This patch does it:
---
diff --git a/tcp.c b/tcp.c
index 8874789..d635a8e 100644
--- a/tcp.c
+++ b/tcp.c
@@ -591,7 +591,7 @@ static size_t tcp6_l2_flags_buf_bytes;
 union tcp_conn tc[TCP_MAX_CONNS];

 #define CONN(index)		(&tc[(index)].tap)
-#define CONN_IDX(conn)		((union tcp_conn *)(conn) - tc)
+#define CONN_IDX(conn)		(TCP_TAP_TO_COMMON(conn) - tc)

 /** conn_at_idx() - Find a connection by index, if present
  * @index:	Index of connection to lookup
@@ -1385,7 +1385,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn)
 		close(conn->timer);

 	tcp_hash_remove(c, conn);
-	tcp_table_compact(c, (union tcp_conn *)conn);
+	tcp_table_compact(c, TCP_TAP_TO_COMMON(conn));
 }

 static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
diff --git a/tcp_conn.h b/tcp_conn.h
index 4a8be29..fa407ad 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -176,6 +176,12 @@ union tcp_conn {
 	struct tcp_splice_conn splice;
 };

+#define TCP_TAP_TO_COMMON(x) \
+	((union tcp_conn *)((char *)(x) - offsetof(union tcp_conn, tap)))
+
+#define TCP_SPLICE_TO_COMMON(x) \
+	((union tcp_conn *)((char *)(x) - offsetof(union tcp_conn, splice)))
+
 /* TCP connections */
 extern union tcp_conn tc[];

diff --git a/tcp_splice.c b/tcp_splice.c
index e2f0ce1..7d3f17e 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -74,7 +75,7 @@ static int splice_pipe_pool	[TCP_SPLICE_PIPE_POOL_SIZE][2][2];
 #define CONN_V4(x)		(!CONN_V6(x))
 #define CONN_HAS(conn, set)	((conn->events & (set)) == (set))
 #define CONN(index)		(&tc[(index)].splice)
-#define CONN_IDX(conn)		((union tcp_conn *)(conn) - tc)
+#define CONN_IDX(conn)		(TCP_SPLICE_TO_COMMON(conn) - tc)

 /* Display strings for connection events */
 static const char *tcp_splice_event_str[] __attribute((__unused__)) = {
@@ -283,7 +284,7 @@ void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 	debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn));

 	c->tcp.splice_conn_count--;
-	tcp_table_compact(c, (union tcp_conn *)conn);
+	tcp_table_compact(c, TCP_SPLICE_TO_COMMON(conn));
 }
Hmm... so TAP_TO_COMMON and SPLICE_TO_COMMON don't actually change the
pointer any more than the cast does, but you're hoping that will be
enough to convince Coverity? I'm a bit skeptical, but I guess we can
try it. I wonder if an explicit Coverity lint override might be a
better option here. Or, we could add padding to tcp_splice_conn so it
has the same size as tcp_tap_conn. I think that would probably
convince Coverity.
---
I can add it on top if you agree, assuming it works.
I also tried to actually turn tcp_conn into a struct. It takes 68 bytes, so I'm not pursuing this approach, but I'm including the diff just in case you have some quick idea to fix it up:
Like I said, I tried this in a bunch of ways, and just got mired in
irritating and not very productive code frobbing.
diff --git a/tcp.c b/tcp.c
index 8874789..6ee5675 100644
--- a/tcp.c
+++ b/tcp.c
@@ -588,10 +588,10 @@ static unsigned int tcp6_l2_flags_buf_used;
 static size_t tcp6_l2_flags_buf_bytes;

 /* TCP connections */
-union tcp_conn tc[TCP_MAX_CONNS];
+struct tcp_conn tc[TCP_MAX_CONNS];

-#define CONN(index)		(&tc[(index)].tap)
-#define CONN_IDX(conn)		((union tcp_conn *)(conn) - tc)
+#define CONN(index)		(&tc[(index)].u.tap)
+#define CONN_IDX(conn)		(TO_TCP_CONN(conn) - tc)

 /** conn_at_idx() - Find a connection by index, if present
  * @index:	Index of connection to lookup
@@ -602,7 +602,7 @@ static inline struct tcp_tap_conn *conn_at_idx(int index)
 {
 	if ((index < 0) || (index >= TCP_MAX_CONNS))
 		return NULL;
-	assert(!(CONN(index)->c.spliced));
+	assert(!TO_TCP_CONN(CONN(index))->c.spliced);
 	return CONN(index);
 }
@@ -660,13 +660,13 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 {
-	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = TO_TCP_CONN(conn)->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .r.proto = IPPROTO_TCP, .r.s = conn->sock,
 				.r.p.tcp.tcp.index = CONN_IDX(conn) };
 	struct epoll_event ev = { .data.u64 = ref.u64 };

 	if (conn->events == CLOSED) {
-		if (conn->c.in_epoll)
+		if (TO_TCP_CONN(conn)->c.in_epoll)
 			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->sock, &ev);
 		if (conn->timer != -1)
 			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->timer, &ev);
@@ -678,7 +678,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
 		return -errno;

-	conn->c.in_epoll = true;
+	TO_TCP_CONN(conn)->c.in_epoll = true;

 	if (conn->timer != -1) {
 		union epoll_ref ref_t = { .r.proto = IPPROTO_TCP,
@@ -1347,9 +1347,9 @@ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c,
  * @c:		Execution context
  * @hole:	Pointer to recently closed connection
  */
-void tcp_table_compact(struct ctx *c, union tcp_conn *hole)
+void tcp_table_compact(struct ctx *c, struct tcp_conn *hole)
 {
-	union tcp_conn *from;
+	struct tcp_conn *from;

 	if (CONN_IDX(hole) == --c->tcp.conn_count) {
 		debug("TCP: table compaction: maximum index was %li (%p)",
@@ -1361,14 +1361,15 @@ void tcp_table_compact(struct ctx *c, struct tcp_conn *hole)
 	from = tc + c->tcp.conn_count;
 	memcpy(hole, from, sizeof(*hole));

-	if (from->c.spliced)
-		tcp_splice_conn_update(c, &hole->splice);
+	if (TO_TCP_CONN(from)->c.spliced)
+		tcp_splice_conn_update(c, &hole->u.splice);
 	else
-		tcp_tap_conn_update(c, &from->tap, &hole->tap);
+		tcp_tap_conn_update(c, &from->u.tap, &hole->u.tap);

 	debug("TCP: table compaction (spliced=%d): old index %li, new index %li, "
 	      "from: %p, to: %p",
-	      from->c.spliced, CONN_IDX(from), CONN_IDX(hole), from, hole);
+	      TO_TCP_CONN(from)->c.spliced, CONN_IDX(from), CONN_IDX(hole),
+	      from, hole);

 	memset(from, 0, sizeof(*from));
 }
@@ -1385,7 +1386,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn)
 		close(conn->timer);

 	tcp_hash_remove(c, conn);
-	tcp_table_compact(c, (union tcp_conn *)conn);
+	tcp_table_compact(c, TO_TCP_CONN(conn));
 }

 static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn);
@@ -1523,7 +1524,7 @@ void tcp_defer_handler(struct ctx *c)
 {
 	int max_conns = c->tcp.conn_count / 100 * TCP_CONN_PRESSURE;
 	int max_files = c->nofile / 100 * TCP_FILE_PRESSURE;
-	union tcp_conn *conn;
+	struct tcp_conn *conn;

 	tcp_l2_flags_buf_flush(c);
 	tcp_l2_data_buf_flush(c);
@@ -1533,12 +1534,12 @@ void tcp_defer_handler(struct ctx *c)
 		return;

 	for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) {
-		if (conn->c.spliced) {
-			if (conn->splice.flags & CLOSING)
-				tcp_splice_destroy(c, &conn->splice);
+		if (TO_TCP_CONN(conn)->c.spliced) {
+			if (conn->u.splice.flags & CLOSING)
+				tcp_splice_destroy(c, &conn->u.splice);
 		} else {
-			if (conn->tap.events == CLOSED)
-				tcp_conn_destroy(c, &conn->tap);
+			if (conn->u.tap.events == CLOSED)
+				tcp_conn_destroy(c, &conn->u.tap);
 		}
 	}

@@ -2086,7 +2087,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr,
 	}

 	conn = CONN(c->tcp.conn_count++);
-	conn->c.spliced = false;
+	TO_TCP_CONN(conn)->c.spliced = false;
 	conn->sock = s;
 	conn->timer = -1;
 	conn_event(c, conn, TAP_SYN_RCVD);
@@ -2770,7 +2771,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, union epoll_ref ref,
 				   struct sockaddr *sa,
 				   const struct timespec *now)
 {
-	conn->c.spliced = false;
+	TO_TCP_CONN(conn)->c.spliced = false;
 	conn->sock = s;
 	conn->timer = -1;
 	conn->ws_to_tap = conn->ws_from_tap = 0;
@@ -2804,7 +2805,7 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 			       const struct timespec *now)
 {
 	struct sockaddr_storage sa;
-	union tcp_conn *conn;
+	struct tcp_conn *conn;
 	socklen_t sl;
 	int s;

@@ -2826,11 +2827,11 @@ static void tcp_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	conn = tc + c->tcp.conn_count++;

 	if (c->mode == MODE_PASTA &&
-	    tcp_splice_conn_from_sock(c, ref, &conn->splice,
+	    tcp_splice_conn_from_sock(c, ref, &conn->u.splice,
 				      s, (struct sockaddr *)&sa))
 		return;

-	tcp_tap_conn_from_sock(c, ref, &conn->tap, s,
+	tcp_tap_conn_from_sock(c, ref, &conn->u.tap, s,
 			       (struct sockaddr *)&sa, now);
 }

@@ -2961,7 +2962,7 @@ static void tcp_tap_sock_handler(struct ctx *c, struct tcp_tap_conn *conn,
 void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 		      const struct timespec *now)
 {
-	union tcp_conn *conn;
+	struct tcp_conn *conn;

 	if (ref.r.p.tcp.tcp.timer) {
 		tcp_timer_handler(c, ref);
@@ -2975,10 +2976,10 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,

 	conn = tc + ref.r.p.tcp.tcp.index;

-	if (conn->c.spliced)
-		tcp_splice_sock_handler(c, &conn->splice, ref.r.s, events);
+	if (TO_TCP_CONN(conn)->c.spliced)
+		tcp_splice_sock_handler(c, &conn->u.splice, ref.r.s, events);
 	else
-		tcp_tap_sock_handler(c, &conn->tap, events);
+		tcp_tap_sock_handler(c, &conn->u.tap, events);
 }

 /**
@@ -3370,7 +3371,7 @@ static int tcp_port_rebind(void *arg)
 void tcp_timer(struct ctx *c, const struct timespec *ts)
 {
 	struct tcp_sock_refill_arg refill_arg = { c, 0 };
-	union tcp_conn *conn;
+	struct tcp_conn *conn;

 	(void)ts;

@@ -3394,11 +3395,11 @@ void tcp_timer(struct ctx *c, const struct timespec *ts)
 	}

 	for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) {
-		if (conn->c.spliced) {
-			tcp_splice_timer(c, &conn->splice);
+		if (TO_TCP_CONN(conn)->c.spliced) {
+			tcp_splice_timer(c, &conn->u.splice);
 		} else {
-			if (conn->tap.events == CLOSED)
-				tcp_conn_destroy(c, &conn->tap);
+			if (conn->u.tap.events == CLOSED)
+				tcp_conn_destroy(c, &conn->u.tap);
 		}
 	}

diff --git a/tcp_conn.h b/tcp_conn.h
index 4a8be29..3df7905 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -23,7 +23,6 @@ extern const char *tcp_common_flag_str[];

 /**
  * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
- * @c:			Fields common with tcp_splice_conn
  * @next_index:		Connection index of next item in hash chain, -1 for none
  * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
  * @sock:		Socket descriptor number
@@ -47,9 +46,6 @@ extern const char *tcp_common_flag_str[];
  * @seq_init_from_tap:	Initial sequence number from tap
  */
 struct tcp_tap_conn {
-	/* Must be first element to match tcp_splice_conn */
-	struct tcp_conn_common c;
-
 	int	next_index	:TCP_CONN_INDEX_BITS + 2;

 #define TCP_RETRANS_BITS	3
@@ -118,7 +114,6 @@ struct tcp_tap_conn {

 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
- * @c:			Fields common with tcp_tap_conn
  * @a:			File descriptor number of socket for accepted connection
  * @pipe_a_b:		Pipe ends for splice() from @a to @b
  * @b:			File descriptor number of peer connected socket
@@ -131,9 +126,6 @@ struct tcp_tap_conn {
  * @b_written:		Bytes written to @b (not fully written from one @a read)
  */
 struct tcp_splice_conn {
-	/* Must be first element to match tcp_tap_conn */
-	struct tcp_conn_common c;
-
 	int a;
 	int pipe_a_b[2];
 	int b;
@@ -165,22 +157,27 @@ struct tcp_splice_conn {
 };

 /**
- * union tcp_conn - Descriptor for a TCP connection (spliced or non-spliced)
+ * struct tcp_conn - Descriptor for a TCP connection (spliced or non-spliced)
  * @c:		Fields common between all variants
- * @tap:	Fields specific to non-spliced connections
- * @splice:	Fields specific to spliced connections
+ * @u.tap:	Fields specific to non-spliced connections
+ * @u.splice:	Fields specific to spliced connections
  */
-union tcp_conn {
+struct tcp_conn {
 	struct tcp_conn_common c;
-	struct tcp_tap_conn tap;
-	struct tcp_splice_conn splice;
+	union {
+		struct tcp_tap_conn tap;
+		struct tcp_splice_conn splice;
+	} u;
 };

+#define TO_TCP_CONN(x) \
+	((struct tcp_conn *)((char *)(x) - offsetof(struct tcp_conn, u)))
+
 /* TCP connections */
-extern union tcp_conn tc[];
+extern struct tcp_conn tc[];

 void tcp_splice_conn_update(struct ctx *c, struct tcp_splice_conn *new);
-void tcp_table_compact(struct ctx *c, union tcp_conn *hole);
+void tcp_table_compact(struct ctx *c, struct tcp_conn *hole);
 void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn);
 void tcp_splice_timer(struct ctx *c, struct tcp_splice_conn *conn);
 void tcp_splice_pipe_refill(const struct ctx *c);
diff --git a/tcp_splice.c b/tcp_splice.c
index e2f0ce1..04fc513 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -73,8 +74,8 @@ static int splice_pipe_pool	[TCP_SPLICE_PIPE_POOL_SIZE][2][2];
 #define CONN_V6(x)		(x->flags & SPLICE_V6)
 #define CONN_V4(x)		(!CONN_V6(x))
 #define CONN_HAS(conn, set)	((conn->events & (set)) == (set))
-#define CONN(index)		(&tc[(index)].splice)
-#define CONN_IDX(conn)		((union tcp_conn *)(conn) - tc)
+#define CONN(index)		(&tc[(index)].u.splice)
+#define CONN_IDX(conn)		(TO_TCP_CONN(conn) - tc)

 /* Display strings for connection events */
 static const char *tcp_splice_event_str[] __attribute((__unused__)) = {
@@ -165,7 +166,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 static int tcp_splice_epoll_ctl(const struct ctx *c,
 				struct tcp_splice_conn *conn)
 {
-	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = TO_TCP_CONN(conn)->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref_a = { .r.proto = IPPROTO_TCP, .r.s = conn->a,
 				  .r.p.tcp.tcp.index = CONN_IDX(conn) };
 	union epoll_ref ref_b = { .r.proto = IPPROTO_TCP, .r.s = conn->b,
@@ -185,7 +186,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 	    epoll_ctl(c->epollfd, m, conn->b, &ev_b))
 		goto delete;

-	conn->c.in_epoll = true;
+	TO_TCP_CONN(conn)->c.in_epoll = true;

 	return 0;

@@ -283,7 +284,7 @@ void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn)
 	debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn));

 	c->tcp.splice_conn_count--;
-	tcp_table_compact(c, (union tcp_conn *)conn);
+	tcp_table_compact(c, TO_TCP_CONN(conn));
 }

 /**
@@ -535,7 +536,7 @@ bool tcp_splice_conn_from_sock(struct ctx *c, union epoll_ref ref,
 	if (setsockopt(s, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int)))
 		trace("TCP (spliced): failed to set TCP_QUICKACK on %i", s);

-	conn->c.spliced = true;
+	TO_TCP_CONN(conn)->c.spliced = true;
 	c->tcp.splice_conn_count++;
 	conn->a = s;
---
I'm fine with this series in any case. If you don't have other ideas, I would just try to get rid of that warning (Out-of-bounds access, CWE-119) with the first diff here, or something similar.
-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Fri, 18 Nov 2022 12:10:47 +1100
David Gibson
On Fri, Nov 18, 2022 at 01:25:18AM +0100, Stefano Brivio wrote:
[...]
It also confuses Coverity Scan, because in tcp_table_compact() we have:
memset(hole, 0, sizeof(*hole));
and while the prototype is:
void tcp_table_compact(struct ctx *c, union tcp_conn *hole)
it sees that we're passing, from tcp_splice_destroy(), something smaller than that (48 bytes), but we're zeroing the whole thing.
Bother.
Of course, it's not a real issue, that space is reserved for a connection slot anyway, but given there are no other issues reported, I'd try to keep Coverity happy if possible.
First try, failed: check hole->c.spliced and, if set, zero only sizeof(struct tcp_splice_conn) bytes. This looks like a false positive.
Another try, which should probably work (I just hit the daily build submission quota, grr): explicitly pass the union tcp_conn containing our struct tcp_splice_conn. This patch does it:
--- diff --git a/tcp.c b/tcp.c index 8874789..d635a8e 100644 --- a/tcp.c +++ b/tcp.c @@ -591,7 +591,7 @@ static size_t tcp6_l2_flags_buf_bytes; union tcp_conn tc[TCP_MAX_CONNS];
#define CONN(index) (&tc[(index)].tap) -#define CONN_IDX(conn) ((union tcp_conn *)(conn) - tc) +#define CONN_IDX(conn) (TCP_TAP_TO_COMMON(conn) - tc)
/** conn_at_idx() - Find a connection by index, if present * @index: Index of connection to lookup @@ -1385,7 +1385,7 @@ static void tcp_conn_destroy(struct ctx *c, struct tcp_tap_conn *conn) close(conn->timer);
tcp_hash_remove(c, conn); - tcp_table_compact(c, (union tcp_conn *)conn); + tcp_table_compact(c, TCP_TAP_TO_COMMON(conn)); }
static void tcp_rst_do(struct ctx *c, struct tcp_tap_conn *conn); diff --git a/tcp_conn.h b/tcp_conn.h index 4a8be29..fa407ad 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -176,6 +176,12 @@ union tcp_conn { struct tcp_splice_conn splice; };
+#define TCP_TAP_TO_COMMON(x) \ + ((union tcp_conn *)((char *)(x) - offsetof(union tcp_conn, tap))) + +#define TCP_SPLICE_TO_COMMON(x) \ + ((union tcp_conn *)((char *)(x) - offsetof(union tcp_conn, splice))) + /* TCP connections */ extern union tcp_conn tc[];
diff --git a/tcp_splice.c b/tcp_splice.c index e2f0ce1..7d3f17e 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -37,6 +37,7 @@ #include
#include #include +#include #include #include #include @@ -74,7 +75,7 @@ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2][2]; #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) #define CONN(index) (&tc[(index)].splice) -#define CONN_IDX(conn) ((union tcp_conn *)(conn) - tc) +#define CONN_IDX(conn) (TCP_SPLICE_TO_COMMON(conn) - tc) /* Display strings for connection events */ static const char *tcp_splice_event_str[] __attribute((__unused__)) = { @@ -283,7 +284,7 @@ void tcp_splice_destroy(struct ctx *c, struct tcp_splice_conn *conn) debug("TCP (spliced): index %li, CLOSED", CONN_IDX(conn));
c->tcp.splice_conn_count--; - tcp_table_compact(c, (union tcp_conn *)conn); + tcp_table_compact(c, TCP_SPLICE_TO_COMMON(conn)); }
Hmm.. so TAP_TO_COMMON and SPLICE_TO_COMMON don't actually change the pointer any more than the cast does, but you're hoping that will be enough to convince coverity? I'm a bit skeptical, but I guess we can try it.
Right, that didn't actually work.
I wonder if an explicit Coverity lint override might be a better option here. Or, we could add padding to tcp_splice_conn so it has the same size as tcp_tap_conn. I think that would probably convince Coverity.
Padding would need to be calculated manually, and a specific override would need its own maintenance, with references to a proprietary tool. I'm not really fond of either. Given that we already have the pointers to the container union in the callers, we could just pass those instead, and use struct tcp_{tap,splice}_conn only in the splice- or "tap"-specific logic, and just there. I just posted a patch doing this.
-- 
Stefano
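The conversion discussed above leans on a basic property of C unions: every member starts at offset 0 of the containing object, so `offsetof(union tcp_conn, splice)` is zero and the `*_TO_COMMON` macros are pure type adjustments rather than pointer arithmetic. A minimal sketch of the pattern, with simplified stand-in structures (not the real passt definitions):

```c
#include <stddef.h>

/* Simplified stand-ins for the real passt connection structures */
struct tap_conn    { unsigned char flags; int sock; };
struct splice_conn { unsigned char flags; int a, b; };

/* Every union member lives at offset 0, so recovering the containing
 * union from a member pointer never moves the pointer: the macro below
 * only changes its type. */
union conn {
	struct tap_conn tap;
	struct splice_conn splice;
};

#define SPLICE_TO_COMMON(x) \
	((union conn *)((char *)(x) - offsetof(union conn, splice)))

static union conn table[8];

/* Index of a spliced connection's slot in the unified table */
static ptrdiff_t splice_idx(struct splice_conn *sc)
{
	return SPLICE_TO_COMMON(sc) - table;
}
```

This is the same round trip `CONN_IDX()` performs in the patches: member pointer, back to the union, then pointer subtraction against the table base.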
When we compact the connection tables (both spliced and non-spliced) we
need to move entries from one slot to another. That requires some updates
in the entries themselves. Add helpers to make all the necessary updates
for the spliced and non-spliced cases. This will simplify later cleanups.
Signed-off-by: David Gibson
Currently spliced and non-spliced connections are stored in completely
separate tables, so there are completely independent limits on the number
of spliced and non-spliced connections. This is a bit counter-intuitive.
More importantly, the fact that the tables are separate prevents us from
unifying some other logic between the two cases. So, merge these two
tables into one, using the 'c.spliced' common field to distinguish between
them when necessary.
For now we keep a common limit of 128k connections, whether they're spliced
or non-spliced, which means we save memory overall. If necessary we could
increase this to a 256k or higher total, which would cost memory but give
some more flexibility.
For now, the code paths which need to step through all extant connections
are still separate for the two cases, just skipping over entries which
aren't for them. We'll improve that in later patches.
Signed-off-by: David Gibson
These two functions each step through non-spliced and spliced connections
respectively and clean up entries for closed connections. To avoid
scanning the connection table twice, we merge these into a single function
which scans the unified table and performs the appropriate sort of cleanup
action on each one.
Signed-off-by: David Gibson
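The merged handler amounts to a single pass over the unified table that dispatches on the shared "spliced" flag. A simplified model of that shape (hypothetical names; the real code calls tcp_splice_destroy() or tcp_conn_destroy() per entry):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one entry in the unified connection table */
struct conn { bool spliced; bool closed; };

/* Scan the table once, applying the appropriate cleanup per entry.
 * Returns splice_cleanups * 10 + tap_cleanups so a caller can check
 * both counters from a single value. */
static int scan_once(const struct conn *table, size_t n)
{
	int splice_cleanups = 0, tap_cleanups = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!table[i].closed)
			continue;
		if (table[i].spliced)
			splice_cleanups++;	/* would be tcp_splice_destroy() */
		else
			tap_cleanups++;		/* would be tcp_conn_destroy() */
	}
	return splice_cleanups * 10 + tap_cleanups;
}
```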
These two functions scan all the non-spliced and spliced connections
respectively and perform timed updates on them. Avoid scanning the now
unified table twice, by having tcp_timer scan it once calling the
relevant per-connection function for each one.
Signed-off-by: David Gibson
There is very little common between the tcp_tap_conn and tcp_splice_conn
structures. However, both do have an IN_EPOLL flag which has the same
meaning in each case, though it's stored in a different location.
Simplify things slightly by moving this bit into the common header of the
two structures.
Signed-off-by: David Gibson
tcp_sock_init*() can create either sockets listening on the host, or in
the pasta network namespace (with @ns==1). There are, however, a number
of differences in how these two cases work in practice. "ns" sockets
are only used in pasta mode, and they always lead to spliced connections
only. The functions are also only ever called in "ns" mode with a NULL
address and interface name, and it doesn't really make sense for them to
be called any other way.

Later changes will introduce further differences in behaviour between
these two cases, so it makes more sense to use separate functions for
creating the ns listening sockets than the regular external/host
listening sockets.
---
 conf.c |   6 +--
 tcp.c  | 132 ++++++++++++++++++++++++++++++++++++++-------------------
 tcp.h  |   4 +-
 3 files changed, 93 insertions(+), 49 deletions(-)

diff --git a/conf.c b/conf.c
index b07d661..4721c97 100644
--- a/conf.c
+++ b/conf.c
@@ -209,7 +209,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
 		for (i = 0; i < PORT_EPHEMERAL_MIN; i++) {
 			if (optname == 't')
-				tcp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i);
+				tcp_sock_init(c, AF_UNSPEC, NULL, NULL, i);
 			else if (optname == 'u')
 				udp_sock_init(c, 0, AF_UNSPEC, NULL, NULL, i);
 		}
@@ -287,7 +287,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
 			bitmap_set(fwd->map, i);
 			if (optname == 't')
-				tcp_sock_init(c, 0, af, addr, ifname, i);
+				tcp_sock_init(c, af, addr, ifname, i);
 			else if (optname == 'u')
 				udp_sock_init(c, 0, af, addr, ifname, i);
 		}
@@ -333,7 +333,7 @@ static int conf_ports(const struct ctx *c, char optname, const char *optarg,
 		fwd->delta[i] = mapped_range.first - orig_range.first;
 		if (optname == 't')
-			tcp_sock_init(c, 0, af, addr, ifname, i);
+			tcp_sock_init(c, af, addr, ifname, i);
 		else if (optname == 'u')
 			udp_sock_init(c, 0, af, addr, ifname, i);
 	}
diff --git a/tcp.c b/tcp.c
index 35fca31..306f928 100644
--- a/tcp.c
+++ b/tcp.c
@@ -2987,15 +2987,15 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 /**
  * tcp_sock_init4() - Initialise listening sockets for a given IPv4 port
  * @c:		Execution context
- * @ns:		In pasta mode, if set, bind with loopback address in namespace
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  */
-static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *addr,
+static void tcp_sock_init4(const struct ctx *c, const struct in_addr *addr,
 			   const char *ifname, in_port_t port)
 {
-	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns };
+	in_port_t idx = port + c->tcp.fwd_in.delta[port];
+	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.index = idx };
 	bool spliced = false, tap = true;
 	int s;
@@ -3006,14 +3006,9 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad
 		if (!addr)
 			addr = &c->ip4.addr;
 
-		tap = !ns && !IN4_IS_ADDR_LOOPBACK(addr);
+		tap = !IN4_IS_ADDR_LOOPBACK(addr);
 	}
 
-	if (ns)
-		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]);
-	else
-		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]);
-
 	if (tap) {
 		s = sock_l4(c, AF_INET, IPPROTO_TCP, addr, ifname, port,
 			    tref.u32);
@@ -3039,29 +3034,25 @@ static void tcp_sock_init4(const struct ctx *c, int ns, const struct in_addr *ad
 		else
 			s = -1;
 
-		if (c->tcp.fwd_out.mode == FWD_AUTO) {
-			if (ns)
-				tcp_sock_ns[port][V4] = s;
-			else
-				tcp_sock_init_lo[port][V4] = s;
-		}
+		if (c->tcp.fwd_out.mode == FWD_AUTO)
+			tcp_sock_init_lo[port][V4] = s;
 	}
 }
 
 /**
  * tcp_sock_init6() - Initialise listening sockets for a given IPv6 port
  * @c:		Execution context
- * @ns:		In pasta mode, if set, bind with loopback address in namespace
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
  * @port:	Port, host order
  */
-static void tcp_sock_init6(const struct ctx *c, int ns,
+static void tcp_sock_init6(const struct ctx *c,
 			   const struct in6_addr *addr,
 			   const char *ifname, in_port_t port)
 {
-	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = ns,
-				     .tcp.v6 = 1 };
+	in_port_t idx = port + c->tcp.fwd_in.delta[port];
+	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.v6 = 1,
+				     .tcp.index = idx };
 	bool spliced = false, tap = true;
 	int s;
@@ -3073,14 +3064,9 @@ static void tcp_sock_init6(const struct ctx *c, int ns,
 		if (!addr)
 			addr = &c->ip6.addr;
 
-		tap = !ns && !IN6_IS_ADDR_LOOPBACK(addr);
+		tap = !IN6_IS_ADDR_LOOPBACK(addr);
 	}
 
-	if (ns)
-		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_out.delta[port]);
-	else
-		tref.tcp.index = (in_port_t)(port + c->tcp.fwd_in.delta[port]);
-
 	if (tap) {
 		s = sock_l4(c, AF_INET6, IPPROTO_TCP, addr, ifname, port,
 			    tref.u32);
@@ -3105,40 +3091,99 @@ static void tcp_sock_init6(const struct ctx *c, int ns,
 		else
 			s = -1;
 
-		if (c->tcp.fwd_out.mode == FWD_AUTO) {
-			if (ns)
-				tcp_sock_ns[port][V6] = s;
-			else
-				tcp_sock_init_lo[port][V6] = s;
-		}
+		if (c->tcp.fwd_out.mode == FWD_AUTO)
+			tcp_sock_init_lo[port][V6] = s;
 	}
 }
 
 /**
- * tcp_sock_init() - Initialise listening sockets for a given port
+ * tcp_sock_init() - Create listening sockets for a given host ("inbound") port
  * @c:		Execution context
- * @ns:		In pasta mode, if set, bind with loopback address in namespace
  * @af:		Address family to select a specific IP version, or AF_UNSPEC
  * @addr:	Pointer to address for binding, NULL if not configured
  * @ifname:	Name of interface to bind to, NULL if not configured
 * @port:	Port, host order
  */
-void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af,
-		   const void *addr, const char *ifname, in_port_t port)
+void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
+		   const char *ifname, in_port_t port)
 {
 	if ((af == AF_INET || af == AF_UNSPEC) && c->ifi4)
-		tcp_sock_init4(c, ns, addr, ifname, port);
+		tcp_sock_init4(c, addr, ifname, port);
 	if ((af == AF_INET6 || af == AF_UNSPEC) && c->ifi6)
-		tcp_sock_init6(c, ns, addr, ifname, port);
+		tcp_sock_init6(c, addr, ifname, port);
+}
+
+/**
+ * tcp_ns_sock_init4() - Init socket to listen for outbound IPv4 connections
+ * @c:		Execution context
+ * @port:	Port, host order
+ */
+static void tcp_ns_sock_init4(const struct ctx *c, in_port_t port)
+{
+	in_port_t idx = port + c->tcp.fwd_out.delta[port];
+	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
+				     .tcp.splice = 1, .tcp.index = idx };
+	struct in_addr loopback = { htonl(INADDR_LOOPBACK) };
+	int s;
+
+	assert(c->mode == MODE_PASTA);
+
+	s = sock_l4(c, AF_INET, IPPROTO_TCP, &loopback, NULL, port, tref.u32);
+	if (s >= 0)
+		tcp_sock_set_bufsize(c, s);
+	else
+		s = -1;
+
+	if (c->tcp.fwd_out.mode == FWD_AUTO)
+		tcp_sock_ns[port][V4] = s;
 }
 
 /**
- * tcp_sock_init_ns() - Bind sockets in namespace for outbound connections
+ * tcp_ns_sock_init6() - Init socket to listen for outbound IPv6 connections
+ * @c:		Execution context
+ * @port:	Port, host order
+ */
+static void tcp_ns_sock_init6(const struct ctx *c, in_port_t port)
+{
+	in_port_t idx = port + c->tcp.fwd_out.delta[port];
+	union tcp_epoll_ref tref = { .tcp.listen = 1, .tcp.outbound = 1,
+				     .tcp.splice = 1, .tcp.v6 = 1,
+				     .tcp.index = idx };
+	int s;
+
+	assert(c->mode == MODE_PASTA);
+
+	s = sock_l4(c, AF_INET6, IPPROTO_TCP, &in6addr_loopback, NULL, port,
+		    tref.u32);
+	if (s >= 0)
+		tcp_sock_set_bufsize(c, s);
+	else
+		s = -1;
+
+	if (c->tcp.fwd_out.mode == FWD_AUTO)
+		tcp_sock_ns[port][V6] = s;
+}
+
+/**
+ * tcp_ns_sock_init() - Init socket to listen for spliced outbound connections
+ * @c:		Execution context
+ * @port:	Port, host order
+ */
+void tcp_ns_sock_init(const struct ctx *c, in_port_t port)
+{
+	if (c->ifi4)
+		tcp_ns_sock_init4(c, port);
+	if (c->ifi6)
+		tcp_ns_sock_init6(c, port);
+}
+
+/**
+ * tcp_ns_socks_init() - Bind sockets in namespace for outbound connections
  * @arg:	Execution context
  *
  * Return: 0
  */
-static int tcp_sock_init_ns(void *arg)
+static int tcp_ns_socks_init(void *arg)
 {
 	struct ctx *c = (struct ctx *)arg;
 	unsigned port;
@@ -3149,7 +3194,7 @@ static int tcp_sock_init_ns(void *arg)
 		if (!bitmap_isset(c->tcp.fwd_out.map, port))
 			continue;
 
-		tcp_sock_init(c, 1, AF_UNSPEC, NULL, NULL, port);
+		tcp_ns_sock_init(c, port);
 	}
 
 	return 0;
@@ -3279,7 +3324,7 @@ int tcp_init(struct ctx *c)
 	if (c->mode == MODE_PASTA) {
 		tcp_splice_init(c);
 
-		NS_CALL(tcp_sock_init_ns, c);
+		NS_CALL(tcp_ns_socks_init, c);
 
 		refill_arg.ns = 1;
 		NS_CALL(tcp_sock_refill, &refill_arg);
@@ -3364,8 +3409,7 @@ static int tcp_port_rebind(void *arg)
 			if ((a->c->ifi4 && tcp_sock_ns[port][V4] == -1) ||
 			    (a->c->ifi6 && tcp_sock_ns[port][V6] == -1))
-				tcp_sock_init(a->c, 1, AF_UNSPEC, NULL, NULL,
-					      port);
+				tcp_ns_sock_init(a->c, port);
 		}
 	} else {
 		for (port = 0; port < NUM_PORTS; port++) {
@@ -3398,7 +3442,7 @@ static int tcp_port_rebind(void *arg)
 			if ((a->c->ifi4 && tcp_sock_init_ext[port][V4] == -1) ||
 			    (a->c->ifi6 && tcp_sock_init_ext[port][V6] == -1))
-				tcp_sock_init(a->c, 0, AF_UNSPEC, NULL, NULL,
+				tcp_sock_init(a->c, AF_UNSPEC, NULL, NULL,
 					      port);
 		}
 	}
diff --git a/tcp.h b/tcp.h
index 49738ef..f4ed298 100644
--- a/tcp.h
+++ b/tcp.h
@@ -19,8 +19,8 @@ void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events,
 		      const struct timespec *now);
 int tcp_tap_handler(struct ctx *c, int af, const void *addr,
 		    const struct pool *p, const struct timespec *now);
-void tcp_sock_init(const struct ctx *c, int ns, sa_family_t af,
-		   const void *addr, const char *ifname, in_port_t port);
+void tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr,
+		   const char *ifname, in_port_t port);
 int tcp_init(struct ctx *c);
 void tcp_timer(struct ctx *c, const struct timespec *ts);
 void tcp_defer_handler(struct ctx *c);
-- 
2.38.1
In tcp_sock_handler() we split off to handle spliced sockets before
checking anything else. However the first steps of the "new connection"
path for each case are the same: allocate a connection entry and accept()
the connection.
Remove this duplication by making tcp_conn_from_sock() handle both spliced
and non-spliced cases, with help from more specific tcp_tap_conn_from_sock
and tcp_splice_conn_from_sock functions for the later stages which differ.
Signed-off-by: David Gibson
In pasta mode, tcp_sock_init[46]() create separate sockets to listen for
spliced connections (these are bound to localhost) and non-spliced
connections (these are bound to the host address). This introduces a
subtle behavioural difference between pasta and passt: by default, pasta
will listen only on a single host address, whereas passt will listen on
all addresses (0.0.0.0 or ::). This also prevents us using some additional
optimizations that only work with the unspecified (0.0.0.0 or ::) address.
However, it turns out we don't need to do this. We can splice a connection
if and only if it originates from the loopback address. Currently we
ensure this by having the "spliced" listening sockets listening only on
loopback. Instead, defer the decision about whether to splice a connection
until after accept(), by checking if the connection was made from the
loopback address.
Signed-off-by: David Gibson
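The post-accept() decision described above boils down to classifying the peer address: splice if and only if the connection originates from loopback. A hedged sketch of such a check, using standard sockaddr types (the helper name is invented, and the real series uses its inany helpers instead):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>

/* Decide after accept() whether a connection could be spliced: only
 * connections coming from a loopback address qualify.  On a dual stack
 * socket, IPv4 peers show up as IPv4-mapped IPv6 addresses. */
static bool can_splice(const struct sockaddr *sa)
{
	if (sa->sa_family == AF_INET) {
		const struct sockaddr_in *s4 = (const struct sockaddr_in *)sa;

		return ntohl(s4->sin_addr.s_addr) >> 24 == 127; /* 127/8 */
	}
	if (sa->sa_family == AF_INET6) {
		const struct sockaddr_in6 *s6 =
			(const struct sockaddr_in6 *)sa;
		const struct in6_addr *a = &s6->sin6_addr;

		if (IN6_IS_ADDR_V4MAPPED(a))	/* ::ffff:127.0.0.0/8 */
			return a->s6_addr[12] == 127;
		return !memcmp(a, &in6addr_loopback, sizeof(*a));
	}
	return false;
}
```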
Currently the epoll reference for tcp sockets includes a bit indicating
whether the socket maps to a spliced connection. However, the reference
also has the index of the connection structure which also indicates whether
it is spliced. We can therefore avoid the splice bit in the epoll_ref by
unifying the first part of the non-spliced and spliced handlers where we
look up the connection state.
Signed-off-by: David Gibson
Currently when we insert a connection into the hash table, we store its
bucket number so we can find it when removing entries. However, we can
recompute the hash value from other contents of the structure so we don't
need to store it. This brings the size of tcp_tap_conn down to 64 bytes
again, which means it will fit in a single cacheline on common machines.
This change also removes a non-obvious constraint that the hash table have
fewer than twice TCP_MAX_CONNS buckets, because of the way
TCP_HASH_BUCKET_BITS was constructed.
Signed-off-by: David Gibson
struct tcp_conn stores an address which could be IPv6 or IPv4 using a
union. We can do this without an additional tag by encoding IPv4 addresses
as IPv4-mapped IPv6 addresses.
This approach is useful more widely than this one place in tcp_conn, so
expose a new 'union inany_addr' like this from a new inany.h. Along with
that create a number of helper functions to make working with these "inany"
addresses easier.
Signed-off-by: David Gibson
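The idea is a 16-byte union that always holds an IPv6 address, with IPv4 addresses encoded as IPv4-mapped addresses (::ffff:a.b.c.d), so no discriminating tag is needed. A simplified sketch of what such a type and helpers might look like (names modelled on the series; the actual inany.h may differ in detail):

```c
#include <netinet/in.h>
#include <stdbool.h>
#include <string.h>

/* Address that can be IPv4 or IPv6: IPv4 is stored IPv4-mapped, so the
 * 16 bytes alone identify the address, with no separate family tag. */
union inany_addr {
	struct in6_addr a6;
	struct {
		unsigned char zero[10];
		unsigned char one[2];	/* 0xff, 0xff for IPv4-mapped */
		struct in_addr a4;
	} v4mapped;
};

/* Store an IPv4 address as ::ffff:a.b.c.d */
static void inany_from_v4(union inany_addr *ia, const struct in_addr *a4)
{
	memset(ia->v4mapped.zero, 0, sizeof(ia->v4mapped.zero));
	memset(ia->v4mapped.one, 0xff, sizeof(ia->v4mapped.one));
	ia->v4mapped.a4 = *a4;
}

/* Return the IPv4 part if this is an (IPv4-mapped) IPv4 address */
static const struct in_addr *inany_v4(const union inany_addr *ia)
{
	if (!IN6_IS_ADDR_V4MAPPED(&ia->a6))
		return NULL;
	return &ia->v4mapped.a4;
}

/* Since the representation is canonical, equality is a plain memcmp() */
static bool inany_equals(const union inany_addr *a,
			 const union inany_addr *b)
{
	return !memcmp(a, b, sizeof(*a));
}
```

Because IPv4 and IPv4-mapped IPv6 peers end up as the same bytes, hashing or comparing an inany_addr trivially treats the two cases identically, which is exactly what the following patches rely on.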
In the tcp_conn structure, we represent the address with an inany_addr
which could be an IPv4 or IPv6 address. However, we have different paths
which will calculate different hashes for IPv4 and equivalent IPv4-mapped
IPv6 addresses. This will cause problems for some future changes.
Make the hash function work the same for these two cases, by taking an
inany_addr directly. Since this represents IPv4 and IPv4-mapped IPv6
addresses the same way, it will trivially hash the same for both cases.
Callers are changed to construct an inany_addr from whatever they have.
Some of that will be elided in later changes.
Signed-off-by: David Gibson
tcp_hash_insert() takes an address to control which hash bucket the
connection will go into. However, an inany_addr representation of that
address is already stored in struct tcp_conn.
Now that we've made the hashing of IPv4 and IPv4-mapped IPv6 addresses
equivalent, we can simplify tcp_hash_insert() to use the address in
struct tcp_conn, rather than taking it as an extra parameter.
Signed-off-by: David Gibson
tcp_hash_match() can take either an IPv4 (struct in_addr) or IPv6 (struct
in6_addr) address. It has two different paths for each of those cases.
However, its only caller has already constructed an equivalent inany
representation of the address, so we can have tcp_hash_match take that
directly and use a simpler comparison with the inany_equals() helper.
Signed-off-by: David Gibson
tcp_seq_init() has separate paths for IPv4 and IPv6 addresses, which means
we will calculate different sequence numbers for IPv4 and equivalent
IPv4-mapped IPv6 addresses.
Change it to treat these the same by always converting the input address
into an inany_addr representation and use that to calculate the sequence
number.
Signed-off-by: David Gibson
tcp_seq_init() takes a number of parameters for the connection, but at
every call site, these are already populated in the tcp_conn structure.
Likewise we always store the result into the @seq_to_tap field.
Use this to simplify tcp_seq_init().
Signed-off-by: David Gibson
It looks like tcp_seq_init() is supposed to advance the sequence number
by one every 32ns. However we only right shift the ns part of the timespec
not the seconds part, meaning that we'll advance by an extra 32 steps on
each second.
I don't know if that's exploitable in any way, but it doesn't appear to be
the intent, nor what RFC 6528 suggests.
In addition, we convert from seconds to nanoseconds with a multiplication
by '1E9'. In C '1E9' is a floating point constant, forcing a conversion
to floating point and back for what should be an integer calculation
(confirmed with objdump and Makefile default compiler flags). Spell out
1000000000 in full to avoid that.
Signed-off-by: David Gibson
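In other words, the whole timestamp should be converted to nanoseconds first and only then shifted, using integer arithmetic throughout. A sketch of the corrected calculation (simplified names, not the actual tcp_seq_init() code):

```c
#include <stdint.h>
#include <time.h>

/* RFC 6528-style ISN time component: advance by one unit every 32 ns.
 * Convert the whole timespec to nanoseconds before the right shift,
 * rather than shifting tv_nsec alone, and spell out 1000000000 as an
 * integer: the constant 1E9 would silently force a round trip through
 * floating point. */
static uint32_t isn_offset(const struct timespec *now)
{
	uint64_t ns = (uint64_t)now->tv_sec * 1000000000ULL + now->tv_nsec;

	return (uint32_t)(ns >> 5);
}
```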
This bit in the TCP specific epoll reference indicates whether the
connection is IPv6 or IPv4. However the sites which refer to it are
already calling accept() which (optionally) returns an address for the
remote end of the connection. We can use the sa_family field in that
address to determine the connection type independent of the epoll
reference.
This does have a cost: for the spliced case, it means we now need to get
that address from accept() which introduces an extra copy_to_user().
However, in future we want to allow handling IPv4 connections through IPv6
sockets, which means we won't be able to determine the IP version at the
time we create the listening socket and epoll reference. So, at some point
we'll have to pay this cost anyway.
Signed-off-by: David Gibson
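Deriving the IP version from the address accept() fills in, rather than from the epoll reference, can be sketched like this (generic sockets code with an invented helper name, not the actual passt handler):

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Accept a connection and learn the IP version from the peer address
 * instead of encoding it in the listening socket's epoll reference.
 * Returns the accepted socket, with *af set to AF_INET or AF_INET6. */
static int accept_and_get_family(int listen_fd, int *af)
{
	struct sockaddr_storage sa;
	socklen_t sl = sizeof(sa);
	int s = accept(listen_fd, (struct sockaddr *)&sa, &sl);

	if (s >= 0)
		*af = sa.ss_family;
	return s;
}
```

With dual stack listeners this is the only option anyway: one AF_INET6 socket can yield both families, so the version simply cannot be fixed at listen time.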
passt usually doesn't NAT, but it does do so for the remapping of the
gateway address to refer to the host. Currently we perform this NAT with
slightly different rules on both IPv4 addresses and IPv6 addresses, but not
on IPv4-mapped IPv6 addresses. This means we won't correctly handle the
case of an IPv4 connection over an IPv6 socket, which is possible on Linux
(and probably other platforms).
Refactor tcp_conn_from_sock() to perform the NAT after converting either
address family into an inany_addr, so IPv4 and IPv4-mapped addresses
have the same representation.
With two new helpers this lets us remove the IPv4 and IPv6 specific paths
from tcp_conn_from_sock().
Signed-off-by: David Gibson
For non-spliced connections we now treat IPv4-mapped IPv6 addresses the
same as the corresponding IPv4 addresses. However currently we won't
splice a connection from ::ffff:127.0.0.1 the way we would one from
127.0.0.1. Correct this so that we can splice connections from IPv4
localhost that have been received on an IPv6 dual stack socket.
Signed-off-by: David Gibson
Previous cleanups mean that tcp_sock_init4() and tcp_sock_init6() are
almost identical, and the remaining differences can be easily
parameterized. Combine both into a single tcp_sock_init_af() function.
Signed-off-by: David Gibson
Currently, when instructed to open an IPv6 socket, sock_l4() explicitly
sets the IPV6_V6ONLY socket option so that the socket will only respond to
IPv6 connections. Linux (and probably other platforms) allow "dual stack"
sockets: IPv6 sockets which can also accept IPv4 connections.
Extend sock_l4() to be able to make such sockets, by passing AF_UNSPEC as
the address family and no bind address (binding to a specific address would
defeat the purpose). We add a Makefile define 'DUAL_STACK_SOCKETS' to
indicate availability of this feature on the target platform.
Signed-off-by: David Gibson
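The core of the change amounts to: create an AF_INET6 socket, clear IPV6_V6ONLY, and bind the unspecified address, so one socket accepts both families. A generic Linux sketch of that sequence (not sock_l4() itself):

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a dual stack TCP listener: a single IPv6 socket bound to ::,
 * with IPV6_V6ONLY cleared so it also accepts IPv4 connections (whose
 * peers then appear as IPv4-mapped IPv6 addresses). */
static int listen_dual_stack(in_port_t port)
{
	struct sockaddr_in6 sa = {
		.sin6_family = AF_INET6,
		.sin6_addr = IN6ADDR_ANY_INIT,
		.sin6_port = htons(port),
	};
	int no = 0;
	int s = socket(AF_INET6, SOCK_STREAM, 0);

	if (s < 0)
		return -1;

	/* The kernel default depends on net.ipv6.bindv6only, so clear
	 * the option explicitly rather than relying on it */
	if (setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &no, sizeof(no)) ||
	    bind(s, (struct sockaddr *)&sa, sizeof(sa)) ||
	    listen(s, SOMAXCONN)) {
		close(s);
		return -1;
	}
	return s;
}
```

Binding a specific address would defeat the purpose, as the commit message notes: only the unspecified address can receive traffic for both families.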
According to its doc comments, sock_l4() returns -1 on error. It does,
except in one case where it returns -EIO. Fix this inconsistency to match
the docs and always return -1.
Signed-off-by: David Gibson
Platforms like Linux allow IPv6 sockets to listen for IPv4 connections as
well as native IPv6 connections. By doing this we halve the number of
listening sockets we need for TCP (assuming passt/pasta is listening on the
same ports for IPv4 and IPv6). When forwarding many ports (e.g. -t all)
this can significantly reduce the amount of kernel memory that passt
consumes.
When forwarding all TCP and UDP ports for both IPv4 and IPv6 (-t all
-u all), this reduces kernel memory usage from ~677MiB to ~487MiB
(kernel version 6.0.8 on Fedora 37, x86_64).
Signed-off-by: David Gibson
On Thu, 17 Nov 2022 16:58:36 +1100
David Gibson
When forwarding many ports, passt can consume a lot of kernel memory because of the many listening sockets it opens. There are not a lot of ways we can reduce that, but here's one.
Currently we create separate listening sockets for each port for both IPv4 and IPv6. However in Linux (and probably other platforms), it's possible to listen for both IPv4 and IPv6 connections on an IPv6 socket. This series uses such dual stack sockets to halve the number of listening sockets needed for TCP. When forwarding all TCP and UDP ports, this reduces the kernel memory used from around 677 MiB to around 487 MiB (kernel 6.0.8 on an x86_64 Fedora 37 machine).
Applied, thanks! -- Stefano