[PATCH v3 00/15] RFC: Unified flow table
This is a third draft of the first steps in implementing more general "connection" tracking, as described at: https://pad.passt.top/p/NewForwardingModel This series changes the TCP connection table and hash table into a more general flow table that can track other protocols as well. Each flow uniformly keeps track of all the relevant addresses and ports, which will allow for more robust control of NAT and port forwarding. ICMP is converted to use the new flow table. Caveats: * We significantly increase the size of a connection/flow entry - Can probably be mitigated, but I haven't investigated much yet * We perform a number of extra getsockname() calls to know some of the socket endpoints - Haven't yet measured how much performance impact that has - Can be mitigated in at least some cases, but again, haven't tried yet * UDP is not converted yet Changes since v2: * Cosmetic fixes based on review * Extra doc comments for enum flow_type * Rename flowside to flowaddrs which turns out to make more sense in light of future changes * Fix bug where the socket flowaddrs for tap initiated connections wasn't initialised to match the socket address we were using in the case of map-gw NAT * New flowaddrs_from_sock() helper used in most cases which is cleaner and should avoid bugs like the above * Using newer centralised workarounds for clang-tidy issue 58992 * Remove duplicate definition of FLOW_MAX as maximum flow type and maximum number of tracked flows * Rebased on newer versions of preliminary work (ICMP, flow based dispatch and allocation, bind/address cleanups) * Unified hash table as well as base flow table * Integrated ICMP Changes since v1: * Terminology changes - "Endpoint" address/port instead of "correspondent" address/port - "flowside" instead of "demiflow" * Actually move the connection table to a new flow table structure in new files * Significant rearrangement of earlier patchs on top of that new table, to reduce churn David Gibson (15): flow: Common data structures for tracking flow addresses tcp, flow: Maintain guest side flow information tcp, flow: Maintain host side flow information tcp_splice,flow: Maintain flow information for spliced connections flow, tcp, tcp_splice: Uniform debug helpers for new flows tcp, flow: Replace TCP specific hash function with general flow hash flow: Add helper to determine a flow's protocol flow, tcp: Generalise TCP hash table to general flow hash table tcp: Re-use flow hash for initial sequence number generation icmp: Store ping socket information in the flow table icmp: Populate guest side information for ping flows icmp: Populate and use host side flow information icmp: Use 'flowside' epoll references for ping sockets icmp: Merge EPOLL_TYPE_ICMP and EPOLL_TYPE_ICMPV6 icmp: Eliminate icmp_id_map Makefile | 6 +- flow.c | 260 ++++++++++++++++++++++++++++++++++++++++++ flow.h | 104 +++++++++++++++++ flow_table.h | 2 + icmp.c | 211 +++++++++++++++++++--------------- icmp.h | 15 +-- icmp_flow.h | 31 +++++ passt.c | 15 +-- passt.h | 9 +- tap.c | 11 -- tap.h | 1 - tcp.c | 313 +++++++++++++++------------------------------------ tcp_conn.h | 9 -- tcp_splice.c | 63 ++++++++--- tcp_splice.h | 3 +- util.c | 4 +- util.h | 18 +++ 17 files changed, 683 insertions(+), 392 deletions(-) create mode 100644 icmp_flow.h -- 2.43.0
Handling of each protocol needs some degree of tracking of the addresses
and ports at the end of each connection or flow. Sometimes that's explicit
(as in the guest visible addresses for TCP connections), sometimes implicit
(the bound and connected addresses of sockets).
To allow more general and robust handling, and more consistency across
protocols we want to uniformly track the address and port at each end of
the connection. Furthermore, because we allow port remapping, and we
sometimes need to apply NAT, the addresses and ports can be different as
seen by the guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of common information related to
one side of each flow. For now that's the addresses, ports and the pif id.
Store two of these in the common fields of a flow to track that information
for both sides.
For now we just introduce the structure and fields themselves, along with
a simple helper. Later patches will actually use these to store useful
information.
Signed-off-by: David Gibson
On Thu, 21 Dec 2023 18:02:23 +1100
David Gibson
Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).
To allow more general and robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of common information related to one side of each flow. For now that's the addresses, ports and the pif id. Store two of these in the common fields of a flow to track that information for both sides.
For now we just introduce the structure and fields themselves, along with a simple helper. Later patches will actually use these to store useful information.
Signed-off-by: David Gibson
--- flow.h | 33 +++++++++++++++++++++++++++++++++ passt.h | 2 ++ tcp_conn.h | 1 - 3 files changed, 35 insertions(+), 1 deletion(-) diff --git a/flow.h b/flow.h index 48a0ab4..e090ba0 100644 --- a/flow.h +++ b/flow.h @@ -27,11 +27,44 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type < FLOW_NUM_TYPES ? flow_type_str[(f)->type] : "?")
+/** + * struct flowside - Common information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + * @pif: pif ID on which this side of the flow exists + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; + uint8_t pif; +}; +static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), + "Unexpected alignment for struct flowside"); +
Nits:
+/** flowside_complete - Check if flowside is fully initialized
flowside_complete(). By the way, shouldn't we call it something more descriptive such as flowside_is_complete()? It's not obvious this is a check otherwise.
+ * @fside: flowside to check
* Return: true if pif, addresses and ports are set ...or something of that sort.
+ */ +static inline bool flowside_complete(const struct flowside *fside) +{ + return fside->pif != PIF_NONE && + !IN6_IS_ADDR_UNSPECIFIED(&fside->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fside->eaddr) && + fside->fport != 0 && fside->eport != 0;
I would align everything after 'return ', that is: return fside->pif != PIF_NONE && !IN6_IS_ADDR_UNSPECIFIED(&fside->faddr) && !IN6_IS_ADDR_UNSPECIFIED(&fside->eaddr) && fside->fport != 0 && fside->eport != 0; mostly for consistency -- not a strong preference. -- Stefano
On Sat, Jan 13, 2024 at 11:50:40PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:23 +1100 David Gibson
wrote: Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).
To allow more general and robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of common information related to one side of each flow. For now that's the addresses, ports and the pif id. Store two of these in the common fields of a flow to track that information for both sides.
For now we just introduce the structure and fields themselves, along with a simple helper. Later patches will actually use these to store useful information.
Signed-off-by: David Gibson
--- flow.h | 33 +++++++++++++++++++++++++++++++++ passt.h | 2 ++ tcp_conn.h | 1 - 3 files changed, 35 insertions(+), 1 deletion(-) diff --git a/flow.h b/flow.h index 48a0ab4..e090ba0 100644 --- a/flow.h +++ b/flow.h @@ -27,11 +27,44 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type < FLOW_NUM_TYPES ? flow_type_str[(f)->type] : "?")
+/** + * struct flowside - Common information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + * @pif: pif ID on which this side of the flow exists + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; + uint8_t pif; +}; +static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), + "Unexpected alignment for struct flowside"); +
Nits:
+/** flowside_complete - Check if flowside is fully initialized
flowside_complete(). By the way, shouldn't we call it something more descriptive such as flowside_is_complete()? It's not obvious this is a check otherwise.
+ * @fside: flowside to check
* Return: true if pif, addresses and ports are set
...or something of that sort.
+ */ +static inline bool flowside_complete(const struct flowside *fside) +{ + return fside->pif != PIF_NONE && + !IN6_IS_ADDR_UNSPECIFIED(&fside->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fside->eaddr) && + fside->fport != 0 && fside->eport != 0;
I would align everything after 'return ', that is:
return fside->pif != PIF_NONE && !IN6_IS_ADDR_UNSPECIFIED(&fside->faddr) && !IN6_IS_ADDR_UNSPECIFIED(&fside->eaddr) && fside->fport != 0 && fside->eport != 0;
mostly for consistency -- not a strong preference.
All seems reasonable, updated accordingly. I've also updated 'flow_complete' to 'flow_is_complete' later in the series. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
tcp_tap_conn has several fields to track addresses and ports as seen by
the guest/namespace. However we now have general fields for this in the
common flowside struct. Use those instead of protocol specific fields.
So far we've only explicitly tracked the guest side forwarding address
in the TCP connection - the remote address from the guest's point of
view. The tap side endpoint address - the local address from the
guest's point of view - was assumed to always be one of the handful of
guest addresses we track as addr_seen (one each for IPv4, IPv6 global
and IPv6 link-local).
struct flowside expects both addresses, and we will want to use the
endpoint address in future. So, we need to add code to determine that
address and record it in the flowside.
Signed-off-by: David Gibson
On Thu, 21 Dec 2023 18:02:24 +1100
David Gibson
tcp_tap_conn has several fields to track addresses and ports as seen by the guest/namespace. However we now have general fields for this in the common flowside struct. Use those instead of protocol specific fields.
So far we've only explicitly tracked the guest side forwarding address in the TCP connection - the remote address from the guest's point of view. The tap side endpoint address - the local address from the guest's point of view - was assumed to always be one of the handful of guest addresses we track as addr_seen (one each for IPv4, IPv6 global and IPv6 link-local).
struct flowside expects both addresses, and we will want to use the endpoint address in future. So, we need to add code to determine that address and record it in the flowside.
Signed-off-by: David Gibson
--- flow.h | 20 +++++++++++++++ tcp.c | 75 ++++++++++++++++++++++++++++-------------------------- tcp_conn.h | 8 ------ 3 files changed, 59 insertions(+), 44 deletions(-) diff --git a/flow.h b/flow.h index e090ba0..e7126e4 100644 --- a/flow.h +++ b/flow.h @@ -45,6 +45,26 @@ struct flowside { static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), "Unexpected alignment for struct flowside");
+/** flowside_from_af - Initialize flowside from addresses
flowside_from_af() - ...from addresses and ports?
+ * @fside: flowside to initialize + * @pif: pif if of this flowside
s/if/id/ or s/if/ID/.
+ * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fside, uint8_t pif, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + fside->pif = pif; + inany_from_af(&fside->faddr, af, faddr); + inany_from_af(&fside->eaddr, af, eaddr); + fside->fport = fport; + fside->eport = eport; +} + /** flowside_complete - Check if flowside is fully initialized * @fside: flowside to check */ diff --git a/tcp.c b/tcp.c index ddabc34..7ef20b1 100644 --- a/tcp.c +++ b/tcp.c @@ -394,7 +394,9 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_SACK 5 #define OPT_TS 8
-#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) +#define TAPFSIDE(conn) (&(conn)->f.side[TAPSIDE])
After reviewing the patch I understand what TAPFSIDE() is for... but I still think the name is pretty obscure. I'm struggling to come up with a better idea though. Perhaps FLOWSIDE_TAP(conn), or: #define FLOWSIDE(conn, _side) (&(conn)->f.side[(_side)]) ? Then sure, TAPFSIDE ans SOCKFSIDE have the advantage of brevity... but for instance they don't spare any ugliness in tcp_tap_conn_from_sock() compared to FLOWSIDE_TAP. Maybe they do in other places I didn't check.
+ +#define CONN_V4(conn) (!!inany_v4(&TAPFSIDE(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ @@ -845,7 +847,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) int i;
for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return 1;
return 0; @@ -867,7 +869,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return;
for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -879,7 +881,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return;
- low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = TAPFSIDE(conn)->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -1144,8 +1146,8 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + if (inany_equals(&TAPFSIDE(conn)->faddr, faddr) && + TAPFSIDE(conn)->eport == eport && TAPFSIDE(conn)->fport == fport) return 1;
return 0; @@ -1179,7 +1181,8 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + return tcp_hash(c, &TAPFSIDE(conn)->faddr, + TAPFSIDE(conn)->eport, TAPFSIDE(conn)->fport); }
/** @@ -1365,13 +1368,13 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c, void *p, size_t plen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct in_addr *a4 = inany_v4(&TAPFSIDE(conn)->faddr); size_t ip_len, tlen;
#define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq) \ do { \ - b->th.source = htons(conn->fport); \ - b->th.dest = htons(conn->eport); \ + b->th.source = htons(TAPFSIDE(conn)->fport); \ + b->th.dest = htons(TAPFSIDE(conn)->eport); \ b->th.seq = htonl(seq); \ b->th.ack_seq = htonl(conn->seq_ack_to_tap); \ if (conn->events & ESTABLISHED) { \ @@ -1389,7 +1392,7 @@ do { \ ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr); b->iph.tot_len = htons(ip_len); b->iph.saddr = a4->s_addr; - b->iph.daddr = c->ip4.addr_seen.s_addr; + b->iph.daddr = inany_v4(&TAPFSIDE(conn)->eaddr)->s_addr;
if (check) b->iph.check = *check; @@ -1407,11 +1410,8 @@ do { \ ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr)); - b->ip6h.saddr = conn->faddr.a6; - if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr)) - b->ip6h.daddr = c->ip6.addr_ll_seen; - else - b->ip6h.daddr = c->ip6.addr_seen; + b->ip6h.saddr = TAPFSIDE(conn)->faddr.a6; + b->ip6h.daddr = TAPFSIDE(conn)->eaddr.a6;
As I was checking these assignments, it occurred to me that perhaps laddr/raddr (local/remote), instead of faddr/eaddr, used in the sense of the comment to struct flowside: * @eaddr: Endpoint address (remote address from passt's PoV) * @faddr: Forwarding address (local address from passt's PoV) might be more obvious...? I'm not sure at this point.
memset(b->ip6h.flow_lbl, 0, 3);
@@ -1739,26 +1739,21 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) /** * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * @conn: TCP connection, with faddr, fport, eaddr, eport populated * @now: Current timestamp */ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); - union inany_addr aany; uint64_t hash; uint32_t ns;
- if (CONN_V4(conn)) - inany_from_af(&aany, AF_INET, &c->ip4.addr); - else - inany_from_af(&aany, AF_INET6, &c->ip6.addr); - - inany_siphash_feed(&state, &conn->faddr); - inany_siphash_feed(&state, &aany); + inany_siphash_feed(&state, &TAPFSIDE(conn)->faddr); + inany_siphash_feed(&state, &TAPFSIDE(conn)->eaddr); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)TAPFSIDE(conn)->fport << 16 | + TAPFSIDE(conn)->eport);
/* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1925,8 +1920,6 @@ static void tcp_conn_from_tap(struct ctx *c, socklen_t sl; int s, mss;
- (void)saddr; - if (!(flow = flow_alloc())) return;
@@ -1953,6 +1946,10 @@ static void tcp_conn_from_tap(struct ctx *c,
conn = &flow->tcp; conn->f.type = FLOW_TCP; + flowside_from_af(TAPFSIDE(conn), PIF_TAP, af, + daddr, ntohs(th->dest), saddr, ntohs(th->source)); + ASSERT(flowside_complete(TAPFSIDE(conn))); + conn->sock = s; conn->timer = -1; conn_event(c, conn, TAP_SYN_RCVD); @@ -1972,8 +1969,6 @@ static void tcp_conn_from_tap(struct ctx *c, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1;
- inany_from_af(&conn->faddr, af, daddr); - if (af == AF_INET) { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); @@ -1982,9 +1977,6 @@ static void tcp_conn_from_tap(struct ctx *c, sl = sizeof(addr6); }
- conn->fport = ntohs(th->dest); - conn->eport = ntohs(th->source); - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2674,10 +2666,21 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = ref.port; + TAPFSIDE(conn)->pif = PIF_HOST; + inany_from_sockaddr(&TAPFSIDE(conn)->faddr, &TAPFSIDE(conn)->fport, sa); + tcp_snat_inbound(c, &TAPFSIDE(conn)->faddr); + + if (CONN_V4(conn)) { + inany_from_af(&TAPFSIDE(conn)->eaddr, AF_INET, &c->ip4.addr_seen); + } else { + if (IN6_IS_ADDR_LINKLOCAL(&TAPFSIDE(conn)->faddr.a6)) + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_ll_seen; + else + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_seen; + } + TAPFSIDE(conn)->eport = ref.port;
- tcp_snat_inbound(c, &conn->faddr); + ASSERT(flowside_complete(TAPFSIDE(conn)));
tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 1a4dcf2..8d9c7bd 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -23,9 +23,6 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -91,11 +88,6 @@ struct tcp_tap_conn {
uint8_t seq_dup_ack_approx;
- - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap;
The rest of this patch looks all good to me, but I'm still reviewing this series, currently at 6/15. -- Stefano
On Sat, Jan 13, 2024 at 11:51:05PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:24 +1100 David Gibson
wrote: tcp_tap_conn has several fields to track addresses and ports as seen by the guest/namespace. However we now have general fields for this in the common flowside struct. Use those instead of protocol specific fields.
So far we've only explicitly tracked the guest side forwarding address in the TCP connection - the remote address from the guest's point of view. The tap side endpoint address - the local address from the guest's point of view - was assumed to always be one of the handful of guest addresses we track as addr_seen (one each for IPv4, IPv6 global and IPv6 link-local).
struct flowside expects both addresses, and we will want to use the endpoint address in future. So, we need to add code to determine that address and record it in the flowside.
Signed-off-by: David Gibson
--- flow.h | 20 +++++++++++++++ tcp.c | 75 ++++++++++++++++++++++++++++-------------------------- tcp_conn.h | 8 ------ 3 files changed, 59 insertions(+), 44 deletions(-) diff --git a/flow.h b/flow.h index e090ba0..e7126e4 100644 --- a/flow.h +++ b/flow.h @@ -45,6 +45,26 @@ struct flowside { static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), "Unexpected alignment for struct flowside");
+/** flowside_from_af - Initialize flowside from addresses
flowside_from_af() - ...from addresses and ports?
Yes. I couldn't think of a way of spelling it out more directly without making it very long. I hoped this was clear enough by way of analogy with inany_from_af().
+ * @fside: flowside to initialize + * @pif: pif if of this flowside
s/if/id/ or s/if/ID/.
Fixed.
+ * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fside, uint8_t pif, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + fside->pif = pif; + inany_from_af(&fside->faddr, af, faddr); + inany_from_af(&fside->eaddr, af, eaddr); + fside->fport = fport; + fside->eport = eport; +} + /** flowside_complete - Check if flowside is fully initialized * @fside: flowside to check */ diff --git a/tcp.c b/tcp.c index ddabc34..7ef20b1 100644 --- a/tcp.c +++ b/tcp.c @@ -394,7 +394,9 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_SACK 5 #define OPT_TS 8
-#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) +#define TAPFSIDE(conn) (&(conn)->f.side[TAPSIDE])
After reviewing the patch I understand what TAPFSIDE() is for... but I still think the name is pretty obscure. I'm struggling to come up with a better idea though. Perhaps FLOWSIDE_TAP(conn), or:
#define FLOWSIDE(conn, _side) (&(conn)->f.side[(_side)])
?
Then sure, TAPFSIDE ans SOCKFSIDE have the advantage of brevity... but for instance they don't spare any ugliness in tcp_tap_conn_from_sock() compared to FLOWSIDE_TAP. Maybe they do in other places I didn't check.
Right, the whole point of these macros is brevity - without it expressions we use it in become unreadably verbose. I think FLOWSIDE(conn, TAPSIDE) is too long. FLOWSIDE_TAP(conn) might be barely acceptable.
+ +#define CONN_V4(conn) (!!inany_v4(&TAPFSIDE(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ @@ -845,7 +847,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) int i;
for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return 1;
return 0; @@ -867,7 +869,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return;
for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -879,7 +881,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return;
- low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = TAPFSIDE(conn)->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -1144,8 +1146,8 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + if (inany_equals(&TAPFSIDE(conn)->faddr, faddr) && + TAPFSIDE(conn)->eport == eport && TAPFSIDE(conn)->fport == fport) return 1;
return 0; @@ -1179,7 +1181,8 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + return tcp_hash(c, &TAPFSIDE(conn)->faddr, + TAPFSIDE(conn)->eport, TAPFSIDE(conn)->fport); }
/** @@ -1365,13 +1368,13 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c, void *p, size_t plen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct in_addr *a4 = inany_v4(&TAPFSIDE(conn)->faddr); size_t ip_len, tlen;
#define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq) \ do { \ - b->th.source = htons(conn->fport); \ - b->th.dest = htons(conn->eport); \ + b->th.source = htons(TAPFSIDE(conn)->fport); \ + b->th.dest = htons(TAPFSIDE(conn)->eport); \ b->th.seq = htonl(seq); \ b->th.ack_seq = htonl(conn->seq_ack_to_tap); \ if (conn->events & ESTABLISHED) { \ @@ -1389,7 +1392,7 @@ do { \ ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr); b->iph.tot_len = htons(ip_len); b->iph.saddr = a4->s_addr; - b->iph.daddr = c->ip4.addr_seen.s_addr; + b->iph.daddr = inany_v4(&TAPFSIDE(conn)->eaddr)->s_addr;
if (check) b->iph.check = *check; @@ -1407,11 +1410,8 @@ do { \ ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);
b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr)); - b->ip6h.saddr = conn->faddr.a6; - if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr)) - b->ip6h.daddr = c->ip6.addr_ll_seen; - else - b->ip6h.daddr = c->ip6.addr_seen; + b->ip6h.saddr = TAPFSIDE(conn)->faddr.a6; + b->ip6h.daddr = TAPFSIDE(conn)->eaddr.a6;
As I was checking these assignments, it occurred to me that perhaps laddr/raddr (local/remote), instead of faddr/eaddr, used in the sense of the comment to struct flowside:
* @eaddr: Endpoint address (remote address from passt's PoV) * @faddr: Forwarding address (local address from passt's PoV)
might be more obvious...? I'm not sure at this point.
I don't think so. I introduced the endpoint/forwarding terminology specifically because local and remote often weren't clear. In this one case it might be ok, but there are lots of places where it's not necessarily obvious whether we're talking about local/remote w.r.t. passt or local/remote w.r.t. the guest.
memset(b->ip6h.flow_lbl, 0, 3);
@@ -1739,26 +1739,21 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) /** * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * @conn: TCP connection, with faddr, fport, eaddr, eport populated * @now: Current timestamp */ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); - union inany_addr aany; uint64_t hash; uint32_t ns;
- if (CONN_V4(conn)) - inany_from_af(&aany, AF_INET, &c->ip4.addr); - else - inany_from_af(&aany, AF_INET6, &c->ip6.addr); - - inany_siphash_feed(&state, &conn->faddr); - inany_siphash_feed(&state, &aany); + inany_siphash_feed(&state, &TAPFSIDE(conn)->faddr); + inany_siphash_feed(&state, &TAPFSIDE(conn)->eaddr); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)TAPFSIDE(conn)->fport << 16 | + TAPFSIDE(conn)->eport);
/* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1925,8 +1920,6 @@ static void tcp_conn_from_tap(struct ctx *c, socklen_t sl; int s, mss;
- (void)saddr; - if (!(flow = flow_alloc())) return;
@@ -1953,6 +1946,10 @@ static void tcp_conn_from_tap(struct ctx *c,
conn = &flow->tcp; conn->f.type = FLOW_TCP; + flowside_from_af(TAPFSIDE(conn), PIF_TAP, af, + daddr, ntohs(th->dest), saddr, ntohs(th->source)); + ASSERT(flowside_complete(TAPFSIDE(conn))); + conn->sock = s; conn->timer = -1; conn_event(c, conn, TAP_SYN_RCVD); @@ -1972,8 +1969,6 @@ static void tcp_conn_from_tap(struct ctx *c, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1;
- inany_from_af(&conn->faddr, af, daddr); - if (af == AF_INET) { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); @@ -1982,9 +1977,6 @@ static void tcp_conn_from_tap(struct ctx *c, sl = sizeof(addr6); }
- conn->fport = ntohs(th->dest); - conn->eport = ntohs(th->source); - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2674,10 +2666,21 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = ref.port; + TAPFSIDE(conn)->pif = PIF_HOST; + inany_from_sockaddr(&TAPFSIDE(conn)->faddr, &TAPFSIDE(conn)->fport, sa); + tcp_snat_inbound(c, &TAPFSIDE(conn)->faddr); + + if (CONN_V4(conn)) { + inany_from_af(&TAPFSIDE(conn)->eaddr, AF_INET, &c->ip4.addr_seen); + } else { + if (IN6_IS_ADDR_LINKLOCAL(&TAPFSIDE(conn)->faddr.a6)) + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_ll_seen; + else + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_seen; + } + TAPFSIDE(conn)->eport = ref.port;
- tcp_snat_inbound(c, &conn->faddr); + ASSERT(flowside_complete(TAPFSIDE(conn)));
tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 1a4dcf2..8d9c7bd 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -23,9 +23,6 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -91,11 +88,6 @@ struct tcp_tap_conn {
uint8_t seq_dup_ack_approx;
- - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap;
The rest of this patch looks all good to me, but I'm still reviewing this series, currently at 6/15.
-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
We now maintain a struct flowside describing each TCP connection as it
appears to the guest. We don't yet store the same information for the
connections as they appear to the host. Rather, that information is
implicit in the state of the host side socket. For future generalisations
of flow/connection tracking, we're going to need to use this information
more heavily, so properly populate the other flowside in each flow table
entry with this information.
This does require an additional getsockname() call for each new connection.
We hope to optimise that away for at least some cases in future.
Signed-off-by: David Gibson
Every flow in the flow table now has space for the the addresses as seen by
both the host and guest side. We fill that information in for regular
"tap" TCP connections, but not for spliced connections.
Fill in that information for spliced connections too, so it's now uniformly
available for all flow types (that are implemented so far).
Signed-off-by: David Gibson
On Thu, 21 Dec 2023 18:02:26 +1100
David Gibson
Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.
Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that are implemented so far).
I wonder if carrying the address for spliced connections is in any way useful -- other than being obviously useful as a simplification (which justifies this of course). That is, for a spliced connection, addresses and ports are kind of meaningless to us once the connection is established: we operate exclusively above Layer 4. Also, conceptually, all that's there to represent for a spliced connection is that addresses are loopback. To be clear: I'm not suggesting any change to this -- I just want to raise the conceptual inconsistency if it didn't occur to you.
Signed-off-by: David Gibson
--- tcp.c | 35 +++++++++++++---------------- tcp_splice.c | 62 +++++++++++++++++++++++++++++++++++++--------------- tcp_splice.h | 3 +-- 3 files changed, 60 insertions(+), 40 deletions(-) diff --git a/tcp.c b/tcp.c index 18ab3ac..6d77cf6 100644 --- a/tcp.c +++ b/tcp.c @@ -2658,32 +2658,23 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPFSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - const struct sockaddr *sa, const struct timespec *now) { + ASSERT(flowside_complete(SOCKFSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- if (flowside_from_sock(SOCKFSIDE(conn), PIF_HOST, s, NULL, sa) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - - ASSERT(flowside_complete(SOCKFSIDE(conn))); - TAPFSIDE(conn)->pif = PIF_TAP; TAPFSIDE(conn)->faddr = SOCKFSIDE(conn)->eaddr; TAPFSIDE(conn)->fport = SOCKFSIDE(conn)->eport; @@ -2712,8 +2703,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);
tcp_get_sndbuf(conn); - - return true; }
/** @@ -2737,15 +2726,21 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel;
- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_from_sock(&flow->f.side[0], ref.tcp_listen.pif, s, + NULL, &sa) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); + flow_alloc_cancel(flow); return; + }
- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;
+ tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); + return; + cancel: /* Failed to create the connection */ if (s >= 0) diff --git a/tcp_splice.c b/tcp_splice.c index eec02fe..0faeb1b 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -72,6 +72,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2];
+#define FSIDE0(conn) (&(conn)->f.side[0]) +#define FSIDE1(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -280,9 +283,21 @@ bool tcp_splice_flow_defer(union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { + struct sockaddr_storage sa; + socklen_t sl = sizeof(sa); unsigned side; int i = 0;
+ if (getsockname(conn->s[1], (struct sockaddr *)&sa, &sl) < 0) { + int ret = -errno; + conn_flag(c, conn, CLOSING); + return ret; + } + inany_from_sockaddr(&FSIDE1(conn)->faddr, &FSIDE1(conn)->fport, + (struct sockaddr *)&sa); + + ASSERT(flowside_complete(FSIDE1(conn))); + for (side = 0; side < SIDES; side++) { conn->pipe[side][0] = conn->pipe[side][1] = -1;
@@ -352,13 +367,24 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); }
+ /* It would be nicer if we could initialise FSIDE1 all at once with + * flowaddrs_from_af() or flowaddrs_from_sock(). However, we can't get + * the forwarding port until the connect() has finished and we don't + * want to block to wait for it. Meanwhile we have the endpoint address
[...] endpoint address and port [...]. Or, if "address" includes the port too, then the comment should also say "forwarding address", not "forwarding port". It's confusing otherwise: why is there anything special with the endpoint *address* as opposed to the forwarding *port*?
+ * here, but don't have a place to stash it other than in the flowaddrs + * itself. So, initialisation of FSIDE1 is split between here and + * tcp_splice_connect_finish(). Ugly but necessary. + */ if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET, &addr4.sin_addr); } + FSIDE1(conn)->eport = port;
-- Stefano
On Wed, Jan 17, 2024 at 08:59:14PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:26 +1100 David Gibson
wrote: Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.
Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that are implemented so far).
I wonder if carrying the address for spliced connections is in any way useful -- other than being obviously useful as a simplification (which justifies this of course).
The simplification / consistency is even more important than it seems right here. One of the big aims of this (though I haven't implemented it yet) is to allow our NAT to be done generically, rather that per-protocol. That requires having the addressing information in the common structure regardless of flow type.
That is, for a spliced connection, addresses and ports are kind of meaningless to us once the connection is established: we operate exclusively above Layer 4.
Also, conceptually, all that's there to represent for a spliced connection is that addresses are loopback.
To be clear: I'm not suggesting any change to this -- I just want to raise the conceptual inconsistency if it didn't occur to you.
Hmm.. I would say in this case it's conceptual consistency leading to redundancy of information in practice rather than a conceptual inconsistency.
Signed-off-by: David Gibson
--- tcp.c | 35 +++++++++++++---------------- tcp_splice.c | 62 +++++++++++++++++++++++++++++++++++++--------------- tcp_splice.h | 3 +-- 3 files changed, 60 insertions(+), 40 deletions(-) diff --git a/tcp.c b/tcp.c index 18ab3ac..6d77cf6 100644 --- a/tcp.c +++ b/tcp.c @@ -2658,32 +2658,23 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPFSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - const struct sockaddr *sa, const struct timespec *now) { + ASSERT(flowside_complete(SOCKFSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- if (flowside_from_sock(SOCKFSIDE(conn), PIF_HOST, s, NULL, sa) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - - ASSERT(flowside_complete(SOCKFSIDE(conn))); - TAPFSIDE(conn)->pif = PIF_TAP; TAPFSIDE(conn)->faddr = SOCKFSIDE(conn)->eaddr; TAPFSIDE(conn)->fport = SOCKFSIDE(conn)->eport; @@ -2712,8 +2703,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);
tcp_get_sndbuf(conn); - - return true; }
/** @@ -2737,15 +2726,21 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel;
- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_from_sock(&flow->f.side[0], ref.tcp_listen.pif, s, + NULL, &sa) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); + flow_alloc_cancel(flow); return; + }
- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;
+ tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); + return; + cancel: /* Failed to create the connection */ if (s >= 0) diff --git a/tcp_splice.c b/tcp_splice.c index eec02fe..0faeb1b 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -72,6 +72,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2];
+#define FSIDE0(conn) (&(conn)->f.side[0]) +#define FSIDE1(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -280,9 +283,21 @@ bool tcp_splice_flow_defer(union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { + struct sockaddr_storage sa; + socklen_t sl = sizeof(sa); unsigned side; int i = 0;
+ if (getsockname(conn->s[1], (struct sockaddr *)&sa, &sl) < 0) { + int ret = -errno; + conn_flag(c, conn, CLOSING); + return ret; + } + inany_from_sockaddr(&FSIDE1(conn)->faddr, &FSIDE1(conn)->fport, + (struct sockaddr *)&sa); + + ASSERT(flowside_complete(FSIDE1(conn))); + for (side = 0; side < SIDES; side++) { conn->pipe[side][0] = conn->pipe[side][1] = -1;
@@ -352,13 +367,24 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); }
+ /* It would be nicer if we could initialise FSIDE1 all at once with + * flowaddrs_from_af() or flowaddrs_from_sock(). However, we can't get + * the forwarding port until the connect() has finished and we don't + * want to block to wait for it. Meanwhile we have the endpoint address
[...] endpoint address and port [...]. Or, if "address" includes the port too, then the comment should also say "forwarding address", not "forwarding port".
It's confusing otherwise: why is there anything special with the endpoint *address* as opposed to the forwarding *port*?
+ * here, but don't have a place to stash it other than in the flowaddrs + * itself. So, initialisation of FSIDE1 is split between here and + * tcp_splice_connect_finish(). Ugly but necessary. + */ if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET, &addr4.sin_addr); } + FSIDE1(conn)->eport = port;
-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
When debugging passt/pasta, and the flow table in particular, one of the
most obvious things to know is when a new flow is initiated, along with the
details of its interface, addresses and ports. Once we've determined to
what interface the flow should be forwarded, it's useful to know the
details of how it will appear on that other interface.
To help present that information uniformly, introduce FLOW_NEW_DBG() and
FLOW_FWD_DBG() helpers and use them for TCP connections, both "tap" and
spliced.
Signed-off-by: David Gibson
On Thu, 21 Dec 2023 18:02:27 +1100
David Gibson
When debugging passt/pasta, and the flow table in particular, one of the most obvious things to know is when a new flow is initiated, along with the details of its interface, addresses and ports. Once we've determined to what interface the flow should be forwarded, it's useful to know the details of how it will appear on that other interface.
To help present that information uniformly, introduce FLOW_NEW_DBG() and FLOW_FWD_DBG() helpers and use them for TCP connections, both "tap" and spliced.
Signed-off-by: David Gibson
--- flow.c | 40 ++++++++++++++++++++++++++++++++++++++++ flow.h | 16 ++++++++++++++++ tcp.c | 11 +++++++++-- tcp_splice.c | 3 ++- 4 files changed, 67 insertions(+), 3 deletions(-) diff --git a/flow.c b/flow.c index b9c4a18..bc8cfc6 100644 --- a/flow.c +++ b/flow.c @@ -9,6 +9,7 @@ #include
#include #include +#include #include "util.h" #include "passt.h" @@ -50,6 +51,45 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); }
+/** + * flow_new_dbg() - Print debug message for new flow + * @f: Common flow structure + * @side: Which side initiated the new flow + */ +void flow_new_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + flow_log_(f, LOG_DEBUG, "New %s from %s/%u: [%s]:%hu <-> [%s]:%hu",
I think we should always print the connection index too (passed from the macro, or passing 'conn' as argument here) -- especially if we want to correlate this to what flow_fwd_dbg() will print later. It's also useful if anything goes wrong with the flow table itself. Sure, address/ports pairs should now be sufficient to uniquely identify a flow, and the flow table should be otherwise transparent, but the idea behind debugging is that there's a bug somewhere.
+ flow_type_str[f->type], pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + +/** + * flow_fwd_dbg() - Print debug message for newly forwarded flow + * @f: Common flow structure + * @side: Which side was the flow forwarded to + */ +void flow_fwd_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)); + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)); + + flow_log_(f, LOG_DEBUG, "Forwarded to %s/%u: [%s]:%hu <-> [%s]:%hu", + pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + /** flowside_from_sock - Initialize flowside to match an existing socket * @fside: flowside to initialize * @pif: pif id of this flowside diff --git a/flow.h b/flow.h index 37885b2..e7c4484 100644 --- a/flow.h +++ b/flow.h @@ -97,6 +97,22 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */
+/** flow_complete - Check if common parts of flow are fully initialized + * @flow: flow to check + */ +static inline bool flow_complete(const struct flow_common *f) +{ + return f->type != FLOW_TYPE_NONE && f->type < FLOW_NUM_TYPES && + flowside_complete(&f->side[0]) && + flowside_complete(&f->side[1]);
Preferably align next lines after "return ". -- Stefano
On Wed, Jan 17, 2024 at 08:59:22PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:27 +1100 David Gibson
wrote: When debugging passt/pasta, and the flow table in particular, one of the most obvious things to know is when a new flow is initiated, along with the details of its interface, addresses and ports. Once we've determined to what interface the flow should be forwarded, it's useful to know the details of how it will appear on that other interface.
To help present that information uniformly, introduce FLOW_NEW_DBG() and FLOW_FWD_DBG() helpers and use them for TCP connections, both "tap" and spliced.
Signed-off-by: David Gibson
--- flow.c | 40 ++++++++++++++++++++++++++++++++++++++++ flow.h | 16 ++++++++++++++++ tcp.c | 11 +++++++++-- tcp_splice.c | 3 ++- 4 files changed, 67 insertions(+), 3 deletions(-) diff --git a/flow.c b/flow.c index b9c4a18..bc8cfc6 100644 --- a/flow.c +++ b/flow.c @@ -9,6 +9,7 @@ #include
#include #include +#include #include "util.h" #include "passt.h" @@ -50,6 +51,45 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); }
+/** + * flow_new_dbg() - Print debug message for new flow + * @f: Common flow structure + * @side: Which side initiated the new flow + */ +void flow_new_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + flow_log_(f, LOG_DEBUG, "New %s from %s/%u: [%s]:%hu <-> [%s]:%hu",
I think we should always print the connection index too (passed from the macro, or passing 'conn' as argument here) -- especially if we want to correlate this to what flow_fwd_dbg() will print later.
Printing the index is built into flow_log_(), so we're already doing that.
It's also useful if anything goes wrong with the flow table itself.
Sure, address/ports pairs should now be sufficient to uniquely identify a flow, and the flow table should be otherwise transparent, but the idea behind debugging is that there's a bug somewhere.
+ flow_type_str[f->type], pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + +/** + * flow_fwd_dbg() - Print debug message for newly forwarded flow + * @f: Common flow structure + * @side: Which side was the flow forwarded to + */ +void flow_fwd_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)); + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)); + + flow_log_(f, LOG_DEBUG, "Forwarded to %s/%u: [%s]:%hu <-> [%s]:%hu", + pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + /** flowside_from_sock - Initialize flowside to match an existing socket * @fside: flowside to initialize * @pif: pif id of this flowside diff --git a/flow.h b/flow.h index 37885b2..e7c4484 100644 --- a/flow.h +++ b/flow.h @@ -97,6 +97,22 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */
+/** flow_complete - Check if common parts of flow are fully initialized + * @flow: flow to check + */ +static inline bool flow_complete(const struct flow_common *f) +{ + return f->type != FLOW_TYPE_NONE && f->type < FLOW_NUM_TYPES && + flowside_complete(&f->side[0]) && + flowside_complete(&f->side[1]);
Preferably align next lines after "return ".
Done. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 18 Jan 2024 12:04:15 +1100
David Gibson
On Wed, Jan 17, 2024 at 08:59:22PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:27 +1100 David Gibson
wrote: When debugging passt/pasta, and the flow table in particular, one of the most obvious things to know is when a new flow is initiated, along with the details of its interface, addresses and ports. Once we've determined to what interface the flow should be forwarded, it's useful to know the details of how it will appear on that other interface.
To help present that information uniformly, introduce FLOW_NEW_DBG() and FLOW_FWD_DBG() helpers and use them for TCP connections, both "tap" and spliced.
Signed-off-by: David Gibson
--- flow.c | 40 ++++++++++++++++++++++++++++++++++++++++ flow.h | 16 ++++++++++++++++ tcp.c | 11 +++++++++-- tcp_splice.c | 3 ++- 4 files changed, 67 insertions(+), 3 deletions(-) diff --git a/flow.c b/flow.c index b9c4a18..bc8cfc6 100644 --- a/flow.c +++ b/flow.c @@ -9,6 +9,7 @@ #include
#include #include +#include #include "util.h" #include "passt.h" @@ -50,6 +51,45 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); }
+/** + * flow_new_dbg() - Print debug message for new flow + * @f: Common flow structure + * @side: Which side initiated the new flow + */ +void flow_new_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + flow_log_(f, LOG_DEBUG, "New %s from %s/%u: [%s]:%hu <-> [%s]:%hu",
I think we should always print the connection index too (passed from the macro, or passing 'conn' as argument here) -- especially if we want to correlate this to what flow_fwd_dbg() will print later.
Printing the index is built into flow_log_(), so we're already doing that.
Oh, right, sorry, I forgot about it. -- Stefano
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports. We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well. We also want a unified hash table
to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow. This includes all the
information from struct flowside, plus the L4 protocol number.
Signed-off-by: David Gibson
On Thu, 21 Dec 2023 18:02:28 +1100
David Gibson
Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the L4 protocol number.
Signed-off-by: David Gibson
--- flow.c | 21 +++++++++++++++++++++ flow.h | 19 +++++++++++++++++++ tcp.c | 59 +++++++++++----------------------------------------------- 3 files changed, 51 insertions(+), 48 deletions(-) diff --git a/flow.c b/flow.c index bc8cfc6..263633e 100644 --- a/flow.c +++ b/flow.c @@ -229,6 +229,27 @@ void flow_alloc_cancel(union flow *flow) flow_first_free = FLOW_IDX(flow); }
+/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @fside: Flowside + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + ASSERT(flowside_complete(fside)); + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr);
Customary newline here.
+ return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)fside->pif << 32 | + fside->fport << 16 | fside->eport);
If we add the fields from the 'tail' part (not the whole fside) to an anonymous struct in a similar way to what we had before fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions"), then we could drop those shift and masks, and use sizeof(that) + sizeof(fside->faddr) + sizeof(fside->eaddr) instead of '38'.
+} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index e7c4484..72ded54 100644 --- a/flow.h +++ b/flow.h @@ -81,6 +81,22 @@ static inline bool flowside_complete(const struct flowside *fside)
#define SIDES 2
+/** + * flowside_eq() - Check if two flowsides are equal
...this raises the question: if the protocol number is now used in the hash, shouldn't it eventually become part of struct flowside -- and compared here, too? I guess it's useful iff we allow flowsides for the same flow to have different protocol numbers. Now, forwarding between TCP and UDP endpoints might not make a lot of sense, because we would have to make so many arbitrary assumptions as to make it probably not generic enough to be useful. But TCP to stream-oriented UNIX domain socket makes sense, and we also had user requests in that sense. Oh and... what would be the second protocol number in that case? Same here, I'm not proposing a concrete change here, I'm rather raising this if you didn't think about it (assuming it makes any sense).
+ * @left, @right: Flowsides to compare + * + * Return: true if equal, false otherwise + */ +static inline bool flowside_eq(const struct flowside *left, + const struct flowside *right) +{ + return left->pif == right->pif && + inany_equals(&left->eaddr, &right->eaddr) && + left->eport == right->eport && + inany_equals(&left->faddr, &right->faddr) && + left->fport == right->fport;
Preferably align under "return ". -- Stefano
On Wed, Jan 17, 2024 at 08:59:34PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:28 +1100 David Gibson
wrote: Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the L4 protocol number.
Signed-off-by: David Gibson
--- flow.c | 21 +++++++++++++++++++++ flow.h | 19 +++++++++++++++++++ tcp.c | 59 +++++++++++----------------------------------------------- 3 files changed, 51 insertions(+), 48 deletions(-) diff --git a/flow.c b/flow.c index bc8cfc6..263633e 100644 --- a/flow.c +++ b/flow.c @@ -229,6 +229,27 @@ void flow_alloc_cancel(union flow *flow) flow_first_free = FLOW_IDX(flow); }
+/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @fside: Flowside + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + ASSERT(flowside_complete(fside)); + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr);
Customary newline here.
Done.
+ return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)fside->pif << 32 | + fside->fport << 16 | fside->eport);
If we add the fields from the 'tail' part (not the whole fside) to an anonymous struct in a similar way to what we had before fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions"), then we could drop those shift and masks, and use sizeof(that) + sizeof(fside->faddr) + sizeof(fside->eaddr) instead of '38'.
Hrm, I guess so. There are some wrinkles: * we'd need a union so we can get the actual u64 value we need to pass * we'd need to make sure that the the remaining (64-38 == 26) bytes of that union are consistently initialised * the struct part would need to be packed, or padding will mess with us * the exact value would now depend on the host endianness, which is probably fine, but worth noting
+} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index e7c4484..72ded54 100644 --- a/flow.h +++ b/flow.h @@ -81,6 +81,22 @@ static inline bool flowside_complete(const struct flowside *fside)
#define SIDES 2
+/** + * flowside_eq() - Check if two flowsides are equal
...this raises the question: if the protocol number is now used in the hash, shouldn't it eventually become part of struct flowside -- and compared here, too?
I guess it's useful iff we allow flowsides for the same flow to have different protocol numbers.
Right, which is not something I'm planning on doing, or at least not very soon.
Now, forwarding between TCP and UDP endpoints might not make a lot of sense, because we would have to make so many arbitrary assumptions as to make it probably not generic enough to be useful.
Exactly.
But TCP to stream-oriented UNIX domain socket makes sense, and we also had user requests in that sense. Oh and... what would be the second protocol number in that case?
Right, that's a possible future extension. But if we're going outside the realm of IP, a number of things need to be changed. I think that's something for another day.
Same here, I'm not proposing a concrete change here, I'm rather raising this if you didn't think about it (assuming it makes any sense).
+ * @left, @right: Flowsides to compare + * + * Return: true if equal, false otherwise + */ +static inline bool flowside_eq(const struct flowside *left, + const struct flowside *right) +{ + return left->pif == right->pif && + inany_equals(&left->eaddr, &right->eaddr) && + left->eport == right->eport && + inany_equals(&left->faddr, &right->faddr) && + left->fport == right->fport;
Preferably align under "return ".
Done. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 18 Jan 2024 12:15:12 +1100
David Gibson
On Wed, Jan 17, 2024 at 08:59:34PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:28 +1100 David Gibson
wrote: Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the L4 protocol number.
Signed-off-by: David Gibson
--- flow.c | 21 +++++++++++++++++++++ flow.h | 19 +++++++++++++++++++ tcp.c | 59 +++++++++++----------------------------------------------- 3 files changed, 51 insertions(+), 48 deletions(-) diff --git a/flow.c b/flow.c index bc8cfc6..263633e 100644 --- a/flow.c +++ b/flow.c @@ -229,6 +229,27 @@ void flow_alloc_cancel(union flow *flow) flow_first_free = FLOW_IDX(flow); }
+/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @fside: Flowside + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + ASSERT(flowside_complete(fside)); + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr);
Customary newline here.
Done.
+ return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)fside->pif << 32 | + fside->fport << 16 | fside->eport);
If we add the fields from the 'tail' part (not the whole fside) to an anonymous struct in a similar way to what we had before fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions"), then we could drop those shift and masks, and use sizeof(that) + sizeof(fside->faddr) + sizeof(fside->eaddr) instead of '38'.
Hrm, I guess so. There are some wrinkles: * we'd need a union so we can get the actual u64 value we need to pass
I haven't tried, but I guess you can just cast a (packed) struct.
* we'd need to make sure that the the remaining (64-38 == 26) bytes of that union are consistently initialised
Where do the 64 _bytes_ come from?
* the struct part would need to be packed, or padding will mess with us
Right, yes, just see tcp_seq_init() before fc8f0f8c48ef.
* the exact value would now depend on the host endianness, which is probably fine, but worth noting
Oh, I didn't even think of that when I wrote the old tcp_seq_init(). Anyway, yes, it doesn't matter.
+} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index e7c4484..72ded54 100644 --- a/flow.h +++ b/flow.h @@ -81,6 +81,22 @@ static inline bool flowside_complete(const struct flowside *fside)
#define SIDES 2
+/** + * flowside_eq() - Check if two flowsides are equal
...this raises the question: if the protocol number is now used in the hash, shouldn't it eventually become part of struct flowside -- and compared here, too?
I guess it's useful iff we allow flowsides for the same flow to have different protocol numbers.
Right, which is not something I'm planning on doing, or at least not very soon.
Now, forwarding between TCP and UDP endpoints might not make a lot of sense, because we would have to make so many arbitrary assumptions as to make it probably not generic enough to be useful.
Exactly.
But TCP to stream-oriented UNIX domain socket makes sense, and we also had user requests in that sense. Oh and... what would be the second protocol number in that case?
Right, that's a possible future extension. But if we're going outside the realm of IP, a number of things need to be changed. I think that's something for another day.
I'm not suggesting to support that right away, I was just wondering if it actually makes sense, right from the beginning, to keep the hash information "consistent" with the flow table information, including having the protocol number in the flow table. But I didn't really think it through, you know better. -- Stefano
On Thu, Jan 18, 2024 at 04:42:34PM +0100, Stefano Brivio wrote:
On Thu, 18 Jan 2024 12:15:12 +1100 David Gibson
wrote: On Wed, Jan 17, 2024 at 08:59:34PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:28 +1100 David Gibson
wrote: Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the L4 protocol number.
Signed-off-by: David Gibson
--- flow.c | 21 +++++++++++++++++++++ flow.h | 19 +++++++++++++++++++ tcp.c | 59 +++++++++++----------------------------------------------- 3 files changed, 51 insertions(+), 48 deletions(-) diff --git a/flow.c b/flow.c index bc8cfc6..263633e 100644 --- a/flow.c +++ b/flow.c @@ -229,6 +229,27 @@ void flow_alloc_cancel(union flow *flow) flow_first_free = FLOW_IDX(flow); }
+/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @fside: Flowside + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + ASSERT(flowside_complete(fside)); + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr);
Customary newline here.
Done.
+ return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)fside->pif << 32 | + fside->fport << 16 | fside->eport);
If we add the fields from the 'tail' part (not the whole fside) to an anonymous struct in a similar way to what we had before fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions"), then we could drop those shift and masks, and use sizeof(that) + sizeof(fside->faddr) + sizeof(fside->eaddr) instead of '38'.
Hrm, I guess so. There are some wrinkles: * we'd need a union so we can get the actual u64 value we need to pass
I haven't tried, but I guess you can just cast a (packed) struct.
No... I don't think you can:
$ cat pack64.c
#include
* we'd need to make sure that the the remaining (64-38 == 26) bytes of that union are consistently initialised
Where do the 64 _bytes_ come from?
Duh, sorry. I meant the remaining (8 - 6 == 2) bytes of the union.
* the struct part would need to be packed, or padding will mess with us
Right, yes, just see tcp_seq_init() before fc8f0f8c48ef.
* the exact value would now depend on the host endianness, which is probably fine, but worth noting
Oh, I didn't even think of that when I wrote the old tcp_seq_init(). Anyway, yes, it doesn't matter.
+} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index e7c4484..72ded54 100644 --- a/flow.h +++ b/flow.h @@ -81,6 +81,22 @@ static inline bool flowside_complete(const struct flowside *fside)
#define SIDES 2
+/** + * flowside_eq() - Check if two flowsides are equal
...this raises the question: if the protocol number is now used in the hash, shouldn't it eventually become part of struct flowside -- and compared here, too?
I guess it's useful iff we allow flowsides for the same flow to have different protocol numbers.
Right, which is not something I'm planning on doing, or at least not very soon.
Now, forwarding between TCP and UDP endpoints might not make a lot of sense, because we would have to make so many arbitrary assumptions as to make it probably not generic enough to be useful.
Exactly.
But TCP to stream-oriented UNIX domain socket makes sense, and we also had user requests in that sense. Oh and... what would be the second protocol number in that case?
Right, that's a possible future extension. But if we're going outside the realm of IP, a number of things need to be changed. I think that's something for another day.
I'm not suggesting to support that right away, I was just wondering if it actually makes sense, right from the beginning, to keep the hash information "consistent" with the flow table information, including having the protocol number in the flow table.
But I didn't really think it through, you know better.
Yeah, I think this would at most make things very slightly easier down the track when we want to add that, at the cost of making things significantly less natural now. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Each flow already has a type field. This implies the protocol the flow
represents, but also has more information: we have two ways to represent
TCP flows, "tap" and "spliced". In order to generalise some of the flow
mechanics, we'll need to determine a flow's protocol in terms of the IP
(L4) protocol number.
Introduce a constant table and helper macro to derive this from the flow
type.
Signed-off-by: David Gibson
Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows. This is
largely code motion and straightforward renames. There are two semantic
changes:
* flow_hash_lookup() now needs to verify that the entry has a matching
protocol as well as matching addresses, ports and interface
* When moving entries in the flow table tcp_tap_conn_update() could
previously assume that only the TAP side of the TCP connection was
hashed and might need an update. For the general flow hash table either
side of a flow could be hashed, so when we move an entry in the flow
table we need to allow for updating both sides in the hash table.
Signed-off-by: David Gibson
We generate TCP initial sequence numbers, when we need them, from a hash
of the source and destination addresses and ports, plus a timestamp.
Moments later, we generate another hash of the same information, plus a few
extras to slot the connection into the flow table.
With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one. It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.
Signed-off-by: David Gibson
Currently icmp_id_map[][] stores information about ping sockets in a
bespoke structure. Move the same information into a new type of flow that
lives in the flow table. To match that change, replace the existing ICMP
timer with a flow-based timer for expiring ping sockets. This has the
advantage that we only need to scan the active flows, not all possible ids.
For now we still index the flow table entries with icmp_id_map[][], and we
don't populate all the common flow information. This is kind of an abuse
of the flow table, but it's a useful interim step towards full flow table
integration for ICMP.
Signed-off-by: David Gibson
ping flows now have flow table entries, but we don't populate most of the
common information yet. Start doing this by populating the guest side
addresses. We set the "port" for both ends of the flow to the ICMP id.
With that populated, rather than looking up the correct flow in the
icmp_id_map[] array, we can use the flow hash table. Furthermore, we can
use the addresses stored in the flow table to direct returning packets,
without having to rely on tap_ip[46]_daddr() to reobtain the guest address.
This marks the last user of tap_ip4_daddr() so it is also removed.
Signed-off-by: David Gibson
Complete filling out the common flow information for "ping" flows by
storing the host side information for the ping socket. We can only
retrieve this from the socket after we send the echo-request, because
that's when the kernel will assign an ID.
Signed-off-by: David Gibson
On Thu, 21 Dec 2023 18:02:34 +1100
David Gibson
Complete filling out the common flow information for "ping" flows by storing the host side information for the ping socket. We can only retrieve this from the socket after we send the echo-request, because that's when the kernel will assign an ID.
Signed-off-by: David Gibson
--- icmp.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-) diff --git a/icmp.c b/icmp.c index 53ad087..917c810 100644 --- a/icmp.c +++ b/icmp.c @@ -310,11 +310,41 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); - } else { - debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, - pname, id, seq); + if (flow) + goto cancel; + } + + debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, + pname, id, seq); + + if (!flow) + /* Nothing more to do for an existing flow */ + return 1; + + /* We need to wait until after the sendto() to fill in the SOCKSIDE + * information, so that we can find out the host side id the kernel + * assigned. If there's no bind address specified, this will still have + * 0.0.0.0 or :: as the host side forwarding address. There's not + * really anything we can do to fill that in, which means we can never + * insert the SOCKSIDE of a ping flow into the hash table. + */
Well, if it was just because of that we could argue at some point that, with wildcards or suchlike, we could insert that in the hash table as well. But that wouldn't make sense anyway, right? That is, unless I'm missing something, we would have no use for a hash table lookup of the SOCKSIDE of a ping flow anyway. Patches 13/15 to 15/15 all look good to me. -- Stefano
On Wed, Jan 17, 2024 at 08:59:41PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:34 +1100 David Gibson
wrote: Complete filling out the common flow information for "ping" flows by storing the host side information for the ping socket. We can only retrieve this from the socket after we send the echo-request, because that's when the kernel will assign an ID.
Signed-off-by: David Gibson
--- icmp.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-) diff --git a/icmp.c b/icmp.c index 53ad087..917c810 100644 --- a/icmp.c +++ b/icmp.c @@ -310,11 +310,41 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); - } else { - debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, - pname, id, seq); + if (flow) + goto cancel; + } + + debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, + pname, id, seq); + + if (!flow) + /* Nothing more to do for an existing flow */ + return 1; + + /* We need to wait until after the sendto() to fill in the SOCKSIDE + * information, so that we can find out the host side id the kernel + * assigned. If there's no bind address specified, this will still have + * 0.0.0.0 or :: as the host side forwarding address. There's not + * really anything we can do to fill that in, which means we can never + * insert the SOCKSIDE of a ping flow into the hash table. + */
Well, if it was just because of that we could argue at some point that, with wildcards or suchlike, we could insert that in the hash table as well.
If we allow wildcards, it's probably not a hash table any more, or at least not a normal one. Hashing is pretty inherently tied to a single value lookup.
But that wouldn't make sense anyway, right? That is, unless I'm missing something, we would have no use for a hash table lookup of the SOCKSIDE of a ping flow anyway.
That's correct, which is why we can get away with this. Likewise SOCKSIDE of tcp flows is also never hashed. However in some cases we may want to hash the host sides of flows: I think we'll want to for UDP, for example, because we can receive multiple flows via the same bound socket. That's why I want to be very clear about when and why we're constructing a flowside that can't be hashed. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
On Thu, 18 Jan 2024 12:22:29 +1100
David Gibson
On Wed, Jan 17, 2024 at 08:59:41PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:34 +1100 David Gibson
wrote: Complete filling out the common flow information for "ping" flows by storing the host side information for the ping socket. We can only retrieve this from the socket after we send the echo-request, because that's when the kernel will assign an ID.
Signed-off-by: David Gibson
--- icmp.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-) diff --git a/icmp.c b/icmp.c index 53ad087..917c810 100644 --- a/icmp.c +++ b/icmp.c @@ -310,11 +310,41 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); - } else { - debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, - pname, id, seq); + if (flow) + goto cancel; + } + + debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, + pname, id, seq); + + if (!flow) + /* Nothing more to do for an existing flow */ + return 1; + + /* We need to wait until after the sendto() to fill in the SOCKSIDE + * information, so that we can find out the host side id the kernel + * assigned. If there's no bind address specified, this will still have + * 0.0.0.0 or :: as the host side forwarding address. There's not + * really anything we can do to fill that in, which means we can never + * insert the SOCKSIDE of a ping flow into the hash table. + */
Well, if it was just because of that we could argue at some point that, with wildcards or suchlike, we could insert that in the hash table as well.
If we allow wildcards, it's probably not a hash table any more, or at least not a normal one. Hashing is pretty inherently tied to a single value lookup.
Well, unless you consider 0.0.0.0 / :: as the actual value (which makes it a wildcard, but not in the hash table sense). I'm not suggesting we should, though (at least for the moment being).
But that wouldn't make sense anyway, right? That is, unless I'm missing something, we would have no use for a hash table lookup of the SOCKSIDE of a ping flow anyway.
That's correct, which is why we can get away with this. Likewise SOCKSIDE of tcp flows is also never hashed. However in some cases we may want to hash the host sides of flows: I think we'll want to for UDP, for example, because we can receive multiple flows via the same bound socket. That's why I want to be very clear about when and why we're constructing a flowside that can't be hashed.
The comment makes this sound like a compromise on functionality, or a problem we have (and that's what mostly caught my attention: can we "solve" this?). Maybe you could add something like "(not that there's any use for it)" at the end, or similar. -- Stefano
On Thu, Jan 18, 2024 at 04:43:05PM +0100, Stefano Brivio wrote:
On Thu, 18 Jan 2024 12:22:29 +1100 David Gibson
wrote: On Wed, Jan 17, 2024 at 08:59:41PM +0100, Stefano Brivio wrote:
On Thu, 21 Dec 2023 18:02:34 +1100 David Gibson
wrote: Complete filling out the common flow information for "ping" flows by storing the host side information for the ping socket. We can only retrieve this from the socket after we send the echo-request, because that's when the kernel will assign an ID.
Signed-off-by: David Gibson
--- icmp.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-) diff --git a/icmp.c b/icmp.c index 53ad087..917c810 100644 --- a/icmp.c +++ b/icmp.c @@ -310,11 +310,41 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); - } else { - debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, - pname, id, seq); + if (flow) + goto cancel; + } + + debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, + pname, id, seq); + + if (!flow) + /* Nothing more to do for an existing flow */ + return 1; + + /* We need to wait until after the sendto() to fill in the SOCKSIDE + * information, so that we can find out the host side id the kernel + * assigned. If there's no bind address specified, this will still have + * 0.0.0.0 or :: as the host side forwarding address. There's not + * really anything we can do to fill that in, which means we can never + * insert the SOCKSIDE of a ping flow into the hash table. + */
Well, if it was just because of that we could argue at some point that, with wildcards or suchlike, we could insert that in the hash table as well.
If we allow wildcards, it's probably not a hash table any more, or at least not a normal one. Hashing is pretty inherently tied to a single value lookup.
Well, unless you consider 0.0.0.0 / :: as the actual value (which makes it a wildcard, but not in the hash table sense). I'm not suggesting we should, though (at least for the moment being).
Right, ok. So, I've been looking at what we need to do flow based NAT, and I am now leaning towards revising the existing series so that it's less insistent on "complete" flowside information, except in the case where we really are hashing. That should also let us remove many of the getsockname() calls.
But that wouldn't make sense anyway, right? That is, unless I'm missing something, we would have no use for a hash table lookup of the SOCKSIDE of a ping flow anyway.
That's correct, which is why we can get away with this. Likewise SOCKSIDE of tcp flows is also never hashed. However in some cases we may want to hash the host sides of flows: I think we'll want to for UDP, for example, because we can receive multiple flows via the same bound socket. That's why I want to be very clear about when and why we're constructing a flowside that can't be hashed.
The comment makes this sound like a compromise on functionality, or a problem we have (and that's what mostly caught my attention: can we "solve" this?).
Maybe you could add something like "(not that there's any use for it)" at the end, or similar.
Right, the revision I'm thinking of should make that clearer, I hope. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
Currently ping sockets use a custom epoll reference type which includes
the ICMP id. However, now that we have entries in the flow table for
ping flows, finding that is sufficient to get everything else we want,
including the id. Therefore remove the icmp_epoll_ref type and use the
general 'flowside' field for ping sockets.
Signed-off-by: David Gibson
We have separate epoll reference types for ICMP and ICMPv6 ping sockets.
Now that ping socket references point to a flow table entry, we can easily
determine whether we're dealing with ICMP or ICMPv6 from the flow type,
making the two epoll types redundant.
Merge both types into EPOLL_TYPE_PING.
Signed-off-by: David Gibson
With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything. Eliminate it.
Signed-off-by: David Gibson
participants (2)
-
David Gibson
-
Stefano Brivio