[PATCH v3 00/15] RFC: Unified flow table

David Gibson

21 Dec 2023 21 Dec '23

8:02 a.m.

This is a third draft of the first steps in implementing more general "connection" tracking, as described at: https://pad.passt.top/p/NewForwardingModel This series changes the TCP connection table and hash table into a more general flow table that can track other protocols as well. Each flow uniformly keeps track of all the relevant addresses and ports, which will allow for more robust control of NAT and port forwarding. ICMP is converted to use the new flow table. Caveats: * We significantly increase the size of a connection/flow entry - Can probably be mitigated, but I haven't investigated much yet * We perform a number of extra getsockname() calls to know some of the socket endpoints - Haven't yet measured how much performance impact that has - Can be mitigated in at least some cases, but again, haven't tried yet * UDP is not converted yet Changes since v2: * Cosmetic fixes based on review * Extra doc comments for enum flow_type * Rename flowside to flowaddrs which turns out to make more sense in light of future changes * Fix bug where the socket flowaddrs for tap initiated connections wasn't initialised to match the socket address we were using in the case of map-gw NAT * New flowaddrs_from_sock() helper used in most cases which is cleaner and should avoid bugs like the above * Using newer centralised workarounds for clang-tidy issue 58992 * Remove duplicate definition of FLOW_MAX as maximum flow type and maximum number of tracked flows * Rebased on newer versions of preliminary work (ICMP, flow based dispatch and allocation, bind/address cleanups) * Unified hash table as well as base flow table * Integrated ICMP Changes since v1: * Terminology changes - "Endpoint" address/port instead of "correspondent" address/port - "flowside" instead of "demiflow" * Actually move the connection table to a new flow table structure in new files * Significant rearrangement of earlier patchs on top of that new table, to reduce churn David Gibson (15): flow: Common data structures for tracking flow addresses tcp, flow: Maintain guest side flow information tcp, flow: Maintain host side flow information tcp_splice,flow: Maintain flow information for spliced connections flow, tcp, tcp_splice: Uniform debug helpers for new flows tcp, flow: Replace TCP specific hash function with general flow hash flow: Add helper to determine a flow's protocol flow, tcp: Generalise TCP hash table to general flow hash table tcp: Re-use flow hash for initial sequence number generation icmp: Store ping socket information in the flow table icmp: Populate guest side information for ping flows icmp: Populate and use host side flow information icmp: Use 'flowside' epoll references for ping sockets icmp: Merge EPOLL_TYPE_ICMP and EPOLL_TYPE_ICMPV6 icmp: Eliminate icmp_id_map Makefile | 6 +- flow.c | 260 ++++++++++++++++++++++++++++++++++++++++++ flow.h | 104 +++++++++++++++++ flow_table.h | 2 + icmp.c | 211 +++++++++++++++++++--------------- icmp.h | 15 +-- icmp_flow.h | 31 +++++ passt.c | 15 +-- passt.h | 9 +- tap.c | 11 -- tap.h | 1 - tcp.c | 313 +++++++++++++++------------------------------------ tcp_conn.h | 9 -- tcp_splice.c | 63 ++++++++--- tcp_splice.h | 3 +- util.c | 4 +- util.h | 18 +++ 17 files changed, 683 insertions(+), 392 deletions(-) create mode 100644 icmp_flow.h -- 2.43.0

Show replies by date

David Gibson

21 Dec 21 Dec

8:02 a.m.

New subject: [PATCH v3 01/15] flow: Common data structures for tracking flow addresses

Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets). To allow more general and robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host. Introduce 'struct flowside' to keep track of common information related to one side of each flow. For now that's the addresses, ports and the pif id. Store two of these in the common fields of a flow to track that information for both sides. For now we just introduce the structure and fields themselves, along with a simple helper. Later patches will actually use these to store useful information. Signed-off-by: David Gibson --- flow.h | 33 +++++++++++++++++++++++++++++++++ passt.h | 2 ++ tcp_conn.h | 1 - 3 files changed, 35 insertions(+), 1 deletion(-) diff --git a/flow.h b/flow.h index 48a0ab4..e090ba0 100644 --- a/flow.h +++ b/flow.h @@ -27,11 +27,44 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type < FLOW_NUM_TYPES ? flow_type_str[(f)->type] : "?") +/** + * struct flowside - Common information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + * @pif: pif ID on which this side of the flow exists + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; + uint8_t pif; +}; +static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), + "Unexpected alignment for struct flowside"); + +/** flowside_complete - Check if flowside is fully initialized + * @fside: flowside to check + */ +static inline bool flowside_complete(const struct flowside *fside) +{ + return fside->pif != PIF_NONE && + !IN6_IS_ADDR_UNSPECIFIED(&fside->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fside->eaddr) && + fside->fport != 0 && fside->eport != 0; +} + +#define SIDES 2 + /** * struct flow_common - Common fields for packet flows + * @side[]: Information for each side of the flow * @type: Type of packet flow */ struct flow_common { + struct flowside side[SIDES]; uint8_t type; }; diff --git a/passt.h b/passt.h index a9e8f15..db0cd10 100644 --- a/passt.h +++ b/passt.h @@ -37,6 +37,8 @@ union epoll_ref; #include "pif.h" #include "packet.h" +#include "siphash.h" +#include "inany.h" #include "flow.h" #include "icmp.h" #include "port_fwd.h" diff --git a/tcp_conn.h b/tcp_conn.h index a5f5cfe..1a4dcf2 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -106,7 +106,6 @@ struct tcp_tap_conn { uint32_t seq_init_from_tap; }; -#define SIDES 2 /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information -- 2.43.0

Stefano Brivio

13 Jan 13 Jan

11:50 p.m.

New subject: [PATCH v3 01/15] flow: Common data structures for tracking flow addresses

On Thu, 21 Dec 2023 18:02:23 +1100 David Gibson wrote:

...

Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).

To allow more general and robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of common information related to one side of each flow. For now that's the addresses, ports and the pif id. Store two of these in the common fields of a flow to track that information for both sides.

For now we just introduce the structure and fields themselves, along with a simple helper. Later patches will actually use these to store useful information.

Signed-off-by: David Gibson --- flow.h | 33 +++++++++++++++++++++++++++++++++ passt.h | 2 ++ tcp_conn.h | 1 - 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/flow.h b/flow.h index 48a0ab4..e090ba0 100644 --- a/flow.h +++ b/flow.h @@ -27,11 +27,44 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type < FLOW_NUM_TYPES ? flow_type_str[(f)->type] : "?")

+/** + * struct flowside - Common information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + * @pif: pif ID on which this side of the flow exists + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; + uint8_t pif; +}; +static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), + "Unexpected alignment for struct flowside"); +

Nits:

...

+/** flowside_complete - Check if flowside is fully initialized

flowside_complete(). By the way, shouldn't we call it something more descriptive such as flowside_is_complete()? It's not obvious this is a check otherwise.

...

+ * @fside: flowside to check

* Return: true if pif, addresses and ports are set ...or something of that sort.

...

+ */ +static inline bool flowside_complete(const struct flowside *fside) +{ + return fside->pif != PIF_NONE && + !IN6_IS_ADDR_UNSPECIFIED(&fside->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fside->eaddr) && + fside->fport != 0 && fside->eport != 0;

I would align everything after 'return ', that is: return fside->pif != PIF_NONE && !IN6_IS_ADDR_UNSPECIFIED(&fside->faddr) && !IN6_IS_ADDR_UNSPECIFIED(&fside->eaddr) && fside->fport != 0 && fside->eport != 0; mostly for consistency -- not a strong preference. -- Stefano

David Gibson

16 Jan 16 Jan

7:14 a.m.

New subject: [PATCH v3 01/15] flow: Common data structures for tracking flow addresses

On Sat, Jan 13, 2024 at 11:50:40PM +0100, Stefano Brivio wrote:

...

On Thu, 21 Dec 2023 18:02:23 +1100 David Gibson wrote:

...
Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).

To allow more general and robust handling, and more consistency across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.

Introduce 'struct flowside' to keep track of common information related to one side of each flow. For now that's the addresses, ports and the pif id. Store two of these in the common fields of a flow to track that information for both sides.

For now we just introduce the structure and fields themselves, along with a simple helper. Later patches will actually use these to store useful information.

Signed-off-by: David Gibson --- flow.h | 33 +++++++++++++++++++++++++++++++++ passt.h | 2 ++ tcp_conn.h | 1 - 3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/flow.h b/flow.h index 48a0ab4..e090ba0 100644 --- a/flow.h +++ b/flow.h @@ -27,11 +27,44 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type < FLOW_NUM_TYPES ? flow_type_str[(f)->type] : "?")

+/** + * struct flowside - Common information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + * @pif: pif ID on which this side of the flow exists + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; + uint8_t pif; +}; +static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), + "Unexpected alignment for struct flowside"); +

Nits:

...
+/** flowside_complete - Check if flowside is fully initialized

flowside_complete(). By the way, shouldn't we call it something more descriptive such as flowside_is_complete()? It's not obvious this is a check otherwise.

...
+ * @fside: flowside to check

* Return: true if pif, addresses and ports are set

...or something of that sort.

...
+ */ +static inline bool flowside_complete(const struct flowside *fside) +{ + return fside->pif != PIF_NONE && + !IN6_IS_ADDR_UNSPECIFIED(&fside->faddr) && + !IN6_IS_ADDR_UNSPECIFIED(&fside->eaddr) && + fside->fport != 0 && fside->eport != 0;

I would align everything after 'return ', that is:

return fside->pif != PIF_NONE && !IN6_IS_ADDR_UNSPECIFIED(&fside->faddr) && !IN6_IS_ADDR_UNSPECIFIED(&fside->eaddr) && fside->fport != 0 && fside->eport != 0;

mostly for consistency -- not a strong preference.

All seems reasonable, updated accordingly. I've also updated 'flow_complete' to 'flow_is_complete' later in the series. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

David Gibson

21 Dec 21 Dec

8:02 a.m.

New subject: [PATCH v3 02/15] tcp, flow: Maintain guest side flow information

tcp_tap_conn has several fields to track addresses and ports as seen by the guest/namespace. However we now have general fields for this in the common flowside struct. Use those instead of protocol specific fields. So far we've only explicitly tracked the guest side forwarding address in the TCP connection - the remote address from the guest's point of view. The tap side endpoint address - the local address from the guest's point of view - was assumed to always be one of the handful of guest addresses we track as addr_seen (one each for IPv4, IPv6 global and IPv6 link-local). struct flowside expects both addresses, and we will want to use the endpoint address in future. So, we need to add code to determine that address and record it in the flowside. Signed-off-by: David Gibson --- flow.h | 20 +++++++++++++++ tcp.c | 75 ++++++++++++++++++++++++++++-------------------------- tcp_conn.h | 8 ------ 3 files changed, 59 insertions(+), 44 deletions(-) diff --git a/flow.h b/flow.h index e090ba0..e7126e4 100644 --- a/flow.h +++ b/flow.h @@ -45,6 +45,26 @@ struct flowside { static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), "Unexpected alignment for struct flowside"); +/** flowside_from_af - Initialize flowside from addresses + * @fside: flowside to initialize + * @pif: pif if of this flowside + * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fside, uint8_t pif, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + fside->pif = pif; + inany_from_af(&fside->faddr, af, faddr); + inany_from_af(&fside->eaddr, af, eaddr); + fside->fport = fport; + fside->eport = eport; +} + /** flowside_complete - Check if flowside is fully initialized * @fside: flowside to check */ diff --git a/tcp.c b/tcp.c index ddabc34..7ef20b1 100644 --- a/tcp.c +++ b/tcp.c @@ -394,7 +394,9 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) +#define TAPFSIDE(conn) (&(conn)->f.side[TAPSIDE]) + +#define CONN_V4(conn) (!!inany_v4(&TAPFSIDE(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ @@ -845,7 +847,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) int i; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return 1; return 0; @@ -867,7 +869,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -879,7 +881,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return; - low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = TAPFSIDE(conn)->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -1144,8 +1146,8 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + if (inany_equals(&TAPFSIDE(conn)->faddr, faddr) && + TAPFSIDE(conn)->eport == eport && TAPFSIDE(conn)->fport == fport) return 1; return 0; @@ -1179,7 +1181,8 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + return tcp_hash(c, &TAPFSIDE(conn)->faddr, + TAPFSIDE(conn)->eport, TAPFSIDE(conn)->fport); } /** @@ -1365,13 +1368,13 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c, void *p, size_t plen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct in_addr *a4 = inany_v4(&TAPFSIDE(conn)->faddr); size_t ip_len, tlen; #define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq) \ do { \ - b->th.source = htons(conn->fport); \ - b->th.dest = htons(conn->eport); \ + b->th.source = htons(TAPFSIDE(conn)->fport); \ + b->th.dest = htons(TAPFSIDE(conn)->eport); \ b->th.seq = htonl(seq); \ b->th.ack_seq = htonl(conn->seq_ack_to_tap); \ if (conn->events & ESTABLISHED) { \ @@ -1389,7 +1392,7 @@ do { \ ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr); b->iph.tot_len = htons(ip_len); b->iph.saddr = a4->s_addr; - b->iph.daddr = c->ip4.addr_seen.s_addr; + b->iph.daddr = inany_v4(&TAPFSIDE(conn)->eaddr)->s_addr; if (check) b->iph.check = *check; @@ -1407,11 +1410,8 @@ do { \ ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr); b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr)); - b->ip6h.saddr = conn->faddr.a6; - if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr)) - b->ip6h.daddr = c->ip6.addr_ll_seen; - else - b->ip6h.daddr = c->ip6.addr_seen; + b->ip6h.saddr = TAPFSIDE(conn)->faddr.a6; + b->ip6h.daddr = TAPFSIDE(conn)->eaddr.a6; memset(b->ip6h.flow_lbl, 0, 3); @@ -1739,26 +1739,21 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) /** * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * @conn: TCP connection, with faddr, fport, eaddr, eport populated * @now: Current timestamp */ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); - union inany_addr aany; uint64_t hash; uint32_t ns; - if (CONN_V4(conn)) - inany_from_af(&aany, AF_INET, &c->ip4.addr); - else - inany_from_af(&aany, AF_INET6, &c->ip6.addr); - - inany_siphash_feed(&state, &conn->faddr); - inany_siphash_feed(&state, &aany); + inany_siphash_feed(&state, &TAPFSIDE(conn)->faddr); + inany_siphash_feed(&state, &TAPFSIDE(conn)->eaddr); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)TAPFSIDE(conn)->fport << 16 | + TAPFSIDE(conn)->eport); /* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1925,8 +1920,6 @@ static void tcp_conn_from_tap(struct ctx *c, socklen_t sl; int s, mss; - (void)saddr; - if (!(flow = flow_alloc())) return; @@ -1953,6 +1946,10 @@ static void tcp_conn_from_tap(struct ctx *c, conn = &flow->tcp; conn->f.type = FLOW_TCP; + flowside_from_af(TAPFSIDE(conn), PIF_TAP, af, + daddr, ntohs(th->dest), saddr, ntohs(th->source)); + ASSERT(flowside_complete(TAPFSIDE(conn))); + conn->sock = s; conn->timer = -1; conn_event(c, conn, TAP_SYN_RCVD); @@ -1972,8 +1969,6 @@ static void tcp_conn_from_tap(struct ctx *c, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->faddr, af, daddr); - if (af == AF_INET) { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); @@ -1982,9 +1977,6 @@ static void tcp_conn_from_tap(struct ctx *c, sl = sizeof(addr6); } - conn->fport = ntohs(th->dest); - conn->eport = ntohs(th->source); - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2674,10 +2666,21 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = ref.port; + TAPFSIDE(conn)->pif = PIF_HOST; + inany_from_sockaddr(&TAPFSIDE(conn)->faddr, &TAPFSIDE(conn)->fport, sa); + tcp_snat_inbound(c, &TAPFSIDE(conn)->faddr); + + if (CONN_V4(conn)) { + inany_from_af(&TAPFSIDE(conn)->eaddr, AF_INET, &c->ip4.addr_seen); + } else { + if (IN6_IS_ADDR_LINKLOCAL(&TAPFSIDE(conn)->faddr.a6)) + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_ll_seen; + else + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_seen; + } + TAPFSIDE(conn)->eport = ref.port; - tcp_snat_inbound(c, &conn->faddr); + ASSERT(flowside_complete(TAPFSIDE(conn))); tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 1a4dcf2..8d9c7bd 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -23,9 +23,6 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -91,11 +88,6 @@ struct tcp_tap_conn { uint8_t seq_dup_ack_approx; - - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap; -- 2.43.0

Stefano Brivio

13 Jan 13 Jan

11:51 p.m.

New subject: [PATCH v3 02/15] tcp, flow: Maintain guest side flow information

On Thu, 21 Dec 2023 18:02:24 +1100 David Gibson wrote:

...

tcp_tap_conn has several fields to track addresses and ports as seen by the guest/namespace. However we now have general fields for this in the common flowside struct. Use those instead of protocol specific fields.

So far we've only explicitly tracked the guest side forwarding address in the TCP connection - the remote address from the guest's point of view. The tap side endpoint address - the local address from the guest's point of view - was assumed to always be one of the handful of guest addresses we track as addr_seen (one each for IPv4, IPv6 global and IPv6 link-local).

struct flowside expects both addresses, and we will want to use the endpoint address in future. So, we need to add code to determine that address and record it in the flowside.

Signed-off-by: David Gibson --- flow.h | 20 +++++++++++++++ tcp.c | 75 ++++++++++++++++++++++++++++-------------------------- tcp_conn.h | 8 ------ 3 files changed, 59 insertions(+), 44 deletions(-)

diff --git a/flow.h b/flow.h index e090ba0..e7126e4 100644 --- a/flow.h +++ b/flow.h @@ -45,6 +45,26 @@ struct flowside { static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), "Unexpected alignment for struct flowside");

+/** flowside_from_af - Initialize flowside from addresses

flowside_from_af() - ...from addresses and ports?

...

+ * @fside: flowside to initialize + * @pif: pif if of this flowside

s/if/id/ or s/if/ID/.

...

+ * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fside, uint8_t pif, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + fside->pif = pif; + inany_from_af(&fside->faddr, af, faddr); + inany_from_af(&fside->eaddr, af, eaddr); + fside->fport = fport; + fside->eport = eport; +} + /** flowside_complete - Check if flowside is fully initialized * @fside: flowside to check */ diff --git a/tcp.c b/tcp.c index ddabc34..7ef20b1 100644 --- a/tcp.c +++ b/tcp.c @@ -394,7 +394,9 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_SACK 5 #define OPT_TS 8

-#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) +#define TAPFSIDE(conn) (&(conn)->f.side[TAPSIDE])

After reviewing the patch I understand what TAPFSIDE() is for... but I still think the name is pretty obscure. I'm struggling to come up with a better idea though. Perhaps FLOWSIDE_TAP(conn), or: #define FLOWSIDE(conn, _side) (&(conn)->f.side[(_side)]) ? Then sure, TAPFSIDE ans SOCKFSIDE have the advantage of brevity... but for instance they don't spare any ugliness in tcp_tap_conn_from_sock() compared to FLOWSIDE_TAP. Maybe they do in other places I didn't check.

...

+ +#define CONN_V4(conn) (!!inany_v4(&TAPFSIDE(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ @@ -845,7 +847,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) int i;

for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return 1;

return 0; @@ -867,7 +869,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return;

for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -879,7 +881,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return;

- low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = TAPFSIDE(conn)->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -1144,8 +1146,8 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + if (inany_equals(&TAPFSIDE(conn)->faddr, faddr) && + TAPFSIDE(conn)->eport == eport && TAPFSIDE(conn)->fport == fport) return 1;

return 0; @@ -1179,7 +1181,8 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + return tcp_hash(c, &TAPFSIDE(conn)->faddr, + TAPFSIDE(conn)->eport, TAPFSIDE(conn)->fport); }

/** @@ -1365,13 +1368,13 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c, void *p, size_t plen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct in_addr *a4 = inany_v4(&TAPFSIDE(conn)->faddr); size_t ip_len, tlen;

#define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq) \ do { \ - b->th.source = htons(conn->fport); \ - b->th.dest = htons(conn->eport); \ + b->th.source = htons(TAPFSIDE(conn)->fport); \ + b->th.dest = htons(TAPFSIDE(conn)->eport); \ b->th.seq = htonl(seq); \ b->th.ack_seq = htonl(conn->seq_ack_to_tap); \ if (conn->events & ESTABLISHED) { \ @@ -1389,7 +1392,7 @@ do { \ ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr); b->iph.tot_len = htons(ip_len); b->iph.saddr = a4->s_addr; - b->iph.daddr = c->ip4.addr_seen.s_addr; + b->iph.daddr = inany_v4(&TAPFSIDE(conn)->eaddr)->s_addr;

if (check) b->iph.check = *check; @@ -1407,11 +1410,8 @@ do { \ ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);

b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr)); - b->ip6h.saddr = conn->faddr.a6; - if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr)) - b->ip6h.daddr = c->ip6.addr_ll_seen; - else - b->ip6h.daddr = c->ip6.addr_seen; + b->ip6h.saddr = TAPFSIDE(conn)->faddr.a6; + b->ip6h.daddr = TAPFSIDE(conn)->eaddr.a6;

As I was checking these assignments, it occurred to me that perhaps laddr/raddr (local/remote), instead of faddr/eaddr, used in the sense of the comment to struct flowside: * @eaddr: Endpoint address (remote address from passt's PoV) * @faddr: Forwarding address (local address from passt's PoV) might be more obvious...? I'm not sure at this point.

...

memset(b->ip6h.flow_lbl, 0, 3);

@@ -1739,26 +1739,21 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) /** * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * @conn: TCP connection, with faddr, fport, eaddr, eport populated * @now: Current timestamp */ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); - union inany_addr aany; uint64_t hash; uint32_t ns;

- if (CONN_V4(conn)) - inany_from_af(&aany, AF_INET, &c->ip4.addr); - else - inany_from_af(&aany, AF_INET6, &c->ip6.addr); - - inany_siphash_feed(&state, &conn->faddr); - inany_siphash_feed(&state, &aany); + inany_siphash_feed(&state, &TAPFSIDE(conn)->faddr); + inany_siphash_feed(&state, &TAPFSIDE(conn)->eaddr); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)TAPFSIDE(conn)->fport << 16 | + TAPFSIDE(conn)->eport);

/* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1925,8 +1920,6 @@ static void tcp_conn_from_tap(struct ctx *c, socklen_t sl; int s, mss;

- (void)saddr; - if (!(flow = flow_alloc())) return;

@@ -1953,6 +1946,10 @@ static void tcp_conn_from_tap(struct ctx *c,

conn = &flow->tcp; conn->f.type = FLOW_TCP; + flowside_from_af(TAPFSIDE(conn), PIF_TAP, af, + daddr, ntohs(th->dest), saddr, ntohs(th->source)); + ASSERT(flowside_complete(TAPFSIDE(conn))); + conn->sock = s; conn->timer = -1; conn_event(c, conn, TAP_SYN_RCVD); @@ -1972,8 +1969,6 @@ static void tcp_conn_from_tap(struct ctx *c, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1;

- inany_from_af(&conn->faddr, af, daddr); - if (af == AF_INET) { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); @@ -1982,9 +1977,6 @@ static void tcp_conn_from_tap(struct ctx *c, sl = sizeof(addr6); }

- conn->fport = ntohs(th->dest); - conn->eport = ntohs(th->source); - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2674,10 +2666,21 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);

- inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = ref.port; + TAPFSIDE(conn)->pif = PIF_HOST; + inany_from_sockaddr(&TAPFSIDE(conn)->faddr, &TAPFSIDE(conn)->fport, sa); + tcp_snat_inbound(c, &TAPFSIDE(conn)->faddr); + + if (CONN_V4(conn)) { + inany_from_af(&TAPFSIDE(conn)->eaddr, AF_INET, &c->ip4.addr_seen); + } else { + if (IN6_IS_ADDR_LINKLOCAL(&TAPFSIDE(conn)->faddr.a6)) + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_ll_seen; + else + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_seen; + } + TAPFSIDE(conn)->eport = ref.port;

- tcp_snat_inbound(c, &conn->faddr); + ASSERT(flowside_complete(TAPFSIDE(conn)));

tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 1a4dcf2..8d9c7bd 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -23,9 +23,6 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -91,11 +88,6 @@ struct tcp_tap_conn {

uint8_t seq_dup_ack_approx;

- - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap;

The rest of this patch looks all good to me, but I'm still reviewing this series, currently at 6/15. -- Stefano

David Gibson

16 Jan 16 Jan

7:23 a.m.

New subject: [PATCH v3 02/15] tcp, flow: Maintain guest side flow information

On Sat, Jan 13, 2024 at 11:51:05PM +0100, Stefano Brivio wrote:

...

On Thu, 21 Dec 2023 18:02:24 +1100 David Gibson wrote:

...
tcp_tap_conn has several fields to track addresses and ports as seen by the guest/namespace. However we now have general fields for this in the common flowside struct. Use those instead of protocol specific fields.

So far we've only explicitly tracked the guest side forwarding address in the TCP connection - the remote address from the guest's point of view. The tap side endpoint address - the local address from the guest's point of view - was assumed to always be one of the handful of guest addresses we track as addr_seen (one each for IPv4, IPv6 global and IPv6 link-local).

struct flowside expects both addresses, and we will want to use the endpoint address in future. So, we need to add code to determine that address and record it in the flowside.

Signed-off-by: David Gibson --- flow.h | 20 +++++++++++++++ tcp.c | 75 ++++++++++++++++++++++++++++-------------------------- tcp_conn.h | 8 ------ 3 files changed, 59 insertions(+), 44 deletions(-)

diff --git a/flow.h b/flow.h index e090ba0..e7126e4 100644 --- a/flow.h +++ b/flow.h @@ -45,6 +45,26 @@ struct flowside { static_assert(_Alignof(struct flowside) == _Alignof(uint32_t), "Unexpected alignment for struct flowside");

+/** flowside_from_af - Initialize flowside from addresses

flowside_from_af() - ...from addresses and ports?

Yes. I couldn't think of a way of spelling it out more directly without making it very long. I hoped this was clear enough by way of analogy with inany_from_af().

...

...
+ * @fside: flowside to initialize + * @pif: pif if of this flowside

s/if/id/ or s/if/ID/.

Fixed.

...

...
+ * @af: Address family (AF_INET or AF_INET6) + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + */ +static inline void flowside_from_af(struct flowside *fside, uint8_t pif, int af, + const void *faddr, in_port_t fport, + const void *eaddr, in_port_t eport) +{ + fside->pif = pif; + inany_from_af(&fside->faddr, af, faddr); + inany_from_af(&fside->eaddr, af, eaddr); + fside->fport = fport; + fside->eport = eport; +} + /** flowside_complete - Check if flowside is fully initialized * @fside: flowside to check */ diff --git a/tcp.c b/tcp.c index ddabc34..7ef20b1 100644 --- a/tcp.c +++ b/tcp.c @@ -394,7 +394,9 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_SACK 5 #define OPT_TS 8

-#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) +#define TAPFSIDE(conn) (&(conn)->f.side[TAPSIDE])

After reviewing the patch I understand what TAPFSIDE() is for... but I still think the name is pretty obscure. I'm struggling to come up with a better idea though. Perhaps FLOWSIDE_TAP(conn), or:

#define FLOWSIDE(conn, _side) (&(conn)->f.side[(_side)])

?

Then sure, TAPFSIDE ans SOCKFSIDE have the advantage of brevity... but for instance they don't spare any ugliness in tcp_tap_conn_from_sock() compared to FLOWSIDE_TAP. Maybe they do in other places I didn't check.

Right, the whole point of these macros is brevity - without it expressions we use it in become unreadably verbose. I think FLOWSIDE(conn, TAPSIDE) is too long. FLOWSIDE_TAP(conn) might be barely acceptable.

...

...
+ +#define CONN_V4(conn) (!!inany_v4(&TAPFSIDE(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ @@ -845,7 +847,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) int i;

for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return 1;

return 0; @@ -867,7 +869,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return;

for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&TAPFSIDE(conn)->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -879,7 +881,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return;

- low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = TAPFSIDE(conn)->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -1144,8 +1146,8 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + if (inany_equals(&TAPFSIDE(conn)->faddr, faddr) && + TAPFSIDE(conn)->eport == eport && TAPFSIDE(conn)->fport == fport) return 1;

return 0; @@ -1179,7 +1181,8 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + return tcp_hash(c, &TAPFSIDE(conn)->faddr, + TAPFSIDE(conn)->eport, TAPFSIDE(conn)->fport); }

/** @@ -1365,13 +1368,13 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c, void *p, size_t plen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct in_addr *a4 = inany_v4(&TAPFSIDE(conn)->faddr); size_t ip_len, tlen;

#define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq) \ do { \ - b->th.source = htons(conn->fport); \ - b->th.dest = htons(conn->eport); \ + b->th.source = htons(TAPFSIDE(conn)->fport); \ + b->th.dest = htons(TAPFSIDE(conn)->eport); \ b->th.seq = htonl(seq); \ b->th.ack_seq = htonl(conn->seq_ack_to_tap); \ if (conn->events & ESTABLISHED) { \ @@ -1389,7 +1392,7 @@ do { \ ip_len = plen + sizeof(struct iphdr) + sizeof(struct tcphdr); b->iph.tot_len = htons(ip_len); b->iph.saddr = a4->s_addr; - b->iph.daddr = c->ip4.addr_seen.s_addr; + b->iph.daddr = inany_v4(&TAPFSIDE(conn)->eaddr)->s_addr;

if (check) b->iph.check = *check; @@ -1407,11 +1410,8 @@ do { \ ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr);

b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr)); - b->ip6h.saddr = conn->faddr.a6; - if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr)) - b->ip6h.daddr = c->ip6.addr_ll_seen; - else - b->ip6h.daddr = c->ip6.addr_seen; + b->ip6h.saddr = TAPFSIDE(conn)->faddr.a6; + b->ip6h.daddr = TAPFSIDE(conn)->eaddr.a6;

As I was checking these assignments, it occurred to me that perhaps laddr/raddr (local/remote), instead of faddr/eaddr, used in the sense of the comment to struct flowside:

* @eaddr: Endpoint address (remote address from passt's PoV) * @faddr: Forwarding address (local address from passt's PoV)

might be more obvious...? I'm not sure at this point.

I don't think so. I introduced the endpoint/forwarding terminology specifically because local and remote often weren't clear. In this one case it might be ok, but there are lots of places where it's not necessarily obvious whether we're talking about local/remote w.r.t. passt or local/remote w.r.t. the guest.

...

...
memset(b->ip6h.flow_lbl, 0, 3);

@@ -1739,26 +1739,21 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) /** * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 * @c: Execution context - * @conn: TCP connection, with faddr, fport and eport populated + * @conn: TCP connection, with faddr, fport, eaddr, eport populated * @now: Current timestamp */ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); - union inany_addr aany; uint64_t hash; uint32_t ns;

- if (CONN_V4(conn)) - inany_from_af(&aany, AF_INET, &c->ip4.addr); - else - inany_from_af(&aany, AF_INET6, &c->ip6.addr); - - inany_siphash_feed(&state, &conn->faddr); - inany_siphash_feed(&state, &aany); + inany_siphash_feed(&state, &TAPFSIDE(conn)->faddr); + inany_siphash_feed(&state, &TAPFSIDE(conn)->eaddr); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)TAPFSIDE(conn)->fport << 16 | + TAPFSIDE(conn)->eport);

/* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1925,8 +1920,6 @@ static void tcp_conn_from_tap(struct ctx *c, socklen_t sl; int s, mss;

- (void)saddr; - if (!(flow = flow_alloc())) return;

@@ -1953,6 +1946,10 @@ static void tcp_conn_from_tap(struct ctx *c,

conn = &flow->tcp; conn->f.type = FLOW_TCP; + flowside_from_af(TAPFSIDE(conn), PIF_TAP, af, + daddr, ntohs(th->dest), saddr, ntohs(th->source)); + ASSERT(flowside_complete(TAPFSIDE(conn))); + conn->sock = s; conn->timer = -1; conn_event(c, conn, TAP_SYN_RCVD); @@ -1972,8 +1969,6 @@ static void tcp_conn_from_tap(struct ctx *c, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1;

- inany_from_af(&conn->faddr, af, daddr); - if (af == AF_INET) { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); @@ -1982,9 +1977,6 @@ static void tcp_conn_from_tap(struct ctx *c, sl = sizeof(addr6); }

- conn->fport = ntohs(th->dest); - conn->eport = ntohs(th->source); - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2674,10 +2666,21 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);

- inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = ref.port; + TAPFSIDE(conn)->pif = PIF_HOST; + inany_from_sockaddr(&TAPFSIDE(conn)->faddr, &TAPFSIDE(conn)->fport, sa); + tcp_snat_inbound(c, &TAPFSIDE(conn)->faddr); + + if (CONN_V4(conn)) { + inany_from_af(&TAPFSIDE(conn)->eaddr, AF_INET, &c->ip4.addr_seen); + } else { + if (IN6_IS_ADDR_LINKLOCAL(&TAPFSIDE(conn)->faddr.a6)) + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_ll_seen; + else + TAPFSIDE(conn)->eaddr.a6 = c->ip6.addr_seen; + } + TAPFSIDE(conn)->eport = ref.port;

- tcp_snat_inbound(c, &conn->faddr); + ASSERT(flowside_complete(TAPFSIDE(conn)));

tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 1a4dcf2..8d9c7bd 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -23,9 +23,6 @@ * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -91,11 +88,6 @@ struct tcp_tap_conn {

uint8_t seq_dup_ack_approx;

- - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap;

The rest of this patch looks all good to me, but I'm still reviewing this series, currently at 6/15.

-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

David Gibson

21 Dec 21 Dec

8:02 a.m.

New subject: [PATCH v3 03/15] tcp, flow: Maintain host side flow information

We now maintain a struct flowside describing each TCP connection as it appears to the guest. We don't yet store the same information for the connections as they appear to the host. Rather, that information is implicit in the state of the host side socket. For future generalisations of flow/connection tracking, we're going to need to use this information more heavily, so properly populate the other flowside in each flow table entry with this information. This does require an additional getsockname() call for each new connection. We hope to optimise that away for at least some cases in future. Signed-off-by: David Gibson --- flow.c | 41 +++++++++++++++++++++++++++++++++++++++++ flow.h | 3 +++ tcp.c | 36 ++++++++++++++++++++++++++++++------ util.h | 18 ++++++++++++++++++ 4 files changed, 92 insertions(+), 6 deletions(-) diff --git a/flow.c b/flow.c index 421e6b5..b9c4a18 100644 --- a/flow.c +++ b/flow.c @@ -8,6 +8,7 @@ #include #include #include +#include #include "util.h" #include "passt.h" @@ -49,6 +50,46 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); } +/** flowside_from_sock - Initialize flowside to match an existing socket + * @fside: flowside to initialize + * @pif: pif id of this flowside + * @s: socket + * @fsa: Local addr of @s as sockaddr_in or sockaddr_in6, or NULL + * @esa: Remote addr of @s as sockaddr_in or sockaddr_in6, or NULL + * + * If NULL is passed for either @fsa/@esa, we use getsockname()/getpeername() to + * obtain the information from the @s. + * + * #syscalls getsockname getpeername + */ +int flowside_from_sock(struct flowside *fside, uint8_t pif, int s, + const void *fsa, const void *esa) +{ + struct sockaddr_storage sa; + + fside->pif = pif; + + if (!fsa) { + socklen_t sl = sizeof(sa); + if (getsockname(s, (struct sockaddr *)&sa, &sl) < 0) + return -errno; + fsa = &sa; + } + inany_from_sockaddr(&fside->faddr, &fside->fport, + (const struct sockaddr *)fsa); + + if (!esa) { + socklen_t sl = sizeof(sa); + if (getpeername(s, (struct sockaddr *)&sa, &sl) < 0) + return -errno; + esa = &sa; + } + inany_from_sockaddr(&fside->eaddr, &fside->eport, + (const struct sockaddr *)esa); + + return 0; +} + /** * DOC: Theory of Operation - allocation and freeing of flow entries * diff --git a/flow.h b/flow.h index e7126e4..37885b2 100644 --- a/flow.h +++ b/flow.h @@ -65,6 +65,9 @@ static inline void flowside_from_af(struct flowside *fside, uint8_t pif, int af, fside->eport = eport; } +int flowside_from_sock(struct flowside *fside, uint8_t pif, int s, + const void *fsa, const void *esa); + /** flowside_complete - Check if flowside is fully initialized * @fside: flowside to check */ diff --git a/tcp.c b/tcp.c index 7ef20b1..18ab3ac 100644 --- a/tcp.c +++ b/tcp.c @@ -395,6 +395,7 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_TS 8 #define TAPFSIDE(conn) (&(conn)->f.side[TAPSIDE]) +#define SOCKFSIDE(conn) (&(conn)->f.side[SOCKSIDE]) #define CONN_V4(conn) (!!inany_v4(&TAPFSIDE(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) @@ -2014,6 +2015,14 @@ static void tcp_conn_from_tap(struct ctx *c, conn_event(c, conn, TAP_SYN_ACK_SENT); } + if (flowside_from_sock(SOCKFSIDE(conn), PIF_HOST, s, NULL, sa) < 0) { + err("tcp: Failed to get local name for outgoing connection"); + tcp_rst(c, conn); + return; + } + + ASSERT(flowside_complete(SOCKFSIDE(conn))); + tcp_epoll_ctl(c, conn); return; @@ -2653,8 +2662,10 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * @s: Accepted socket * @sa: Peer socket address (from accept()) * @now: Current timestamp + * + * Return: true if able to create a tap connection, false otherwise */ -static void tcp_tap_conn_from_sock(struct ctx *c, +static bool tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, const struct sockaddr *sa, @@ -2666,8 +2677,16 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - TAPFSIDE(conn)->pif = PIF_HOST; - inany_from_sockaddr(&TAPFSIDE(conn)->faddr, &TAPFSIDE(conn)->fport, sa); + if (flowside_from_sock(SOCKFSIDE(conn), PIF_HOST, s, NULL, sa) < 0) { + err("tcp: Failed to get local name, connection dropped"); + return false; + } + + ASSERT(flowside_complete(SOCKFSIDE(conn))); + + TAPFSIDE(conn)->pif = PIF_TAP; + TAPFSIDE(conn)->faddr = SOCKFSIDE(conn)->eaddr; + TAPFSIDE(conn)->fport = SOCKFSIDE(conn)->eport; tcp_snat_inbound(c, &TAPFSIDE(conn)->faddr); if (CONN_V4(conn)) { @@ -2693,6 +2712,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE); tcp_get_sndbuf(conn); + + return true; } /** @@ -2721,11 +2742,14 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, s, (struct sockaddr *)&sa)) return; - tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now); - return; + if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, + (struct sockaddr *)&sa, now)) + return; cancel: + /* Failed to create the connection */ + if (s >= 0) + close(s); flow_alloc_cancel(flow); } diff --git a/util.h b/util.h index d2320f8..13d7353 100644 --- a/util.h +++ b/util.h @@ -298,4 +298,22 @@ static inline int wrap_accept4(int sockfd, struct sockaddr *addr, #define accept4(s, addr, addrlen, flags) \ wrap_accept4((s), (addr), (addrlen), (flags)) +static inline int wrap_getsockname(int sockfd, struct sockaddr *addr, + socklen_t *addrlen) +{ + sa_init(addr, addrlen); + return getsockname(sockfd, addr, addrlen); +} +#define getsockname(s, addr, addrlen) \ + wrap_getsockname((s), (addr), (addrlen)) + +static inline int wrap_getpeername(int sockfd, struct sockaddr *addr, + socklen_t *addrlen) +{ + sa_init(addr, addrlen); + return getpeername(sockfd, addr, addrlen); +} +#define getpeername(s, addr, addrlen) \ + wrap_getpeername((s), (addr), (addrlen)) + #endif /* UTIL_H */ -- 2.43.0

David Gibson

8:02 a.m.

New subject: [PATCH v3 04/15] tcp_splice,flow: Maintain flow information for spliced connections

Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections. Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that are implemented so far). Signed-off-by: David Gibson --- tcp.c | 35 +++++++++++++---------------- tcp_splice.c | 62 +++++++++++++++++++++++++++++++++++++--------------- tcp_splice.h | 3 +-- 3 files changed, 60 insertions(+), 40 deletions(-) diff --git a/tcp.c b/tcp.c index 18ab3ac..6d77cf6 100644 --- a/tcp.c +++ b/tcp.c @@ -2658,32 +2658,23 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPFSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - const struct sockaddr *sa, const struct timespec *now) { + ASSERT(flowside_complete(SOCKFSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - if (flowside_from_sock(SOCKFSIDE(conn), PIF_HOST, s, NULL, sa) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - - ASSERT(flowside_complete(SOCKFSIDE(conn))); - TAPFSIDE(conn)->pif = PIF_TAP; TAPFSIDE(conn)->faddr = SOCKFSIDE(conn)->eaddr; TAPFSIDE(conn)->fport = SOCKFSIDE(conn)->eport; @@ -2712,8 +2703,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE); tcp_get_sndbuf(conn); - - return true; } /** @@ -2737,15 +2726,21 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel; - if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_from_sock(&flow->f.side[0], ref.tcp_listen.pif, s, + NULL, &sa) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); + flow_alloc_cancel(flow); return; + } - if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return; + tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); + return; + cancel: /* Failed to create the connection */ if (s >= 0) diff --git a/tcp_splice.c b/tcp_splice.c index eec02fe..0faeb1b 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -72,6 +72,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2]; +#define FSIDE0(conn) (&(conn)->f.side[0]) +#define FSIDE1(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -280,9 +283,21 @@ bool tcp_splice_flow_defer(union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { + struct sockaddr_storage sa; + socklen_t sl = sizeof(sa); unsigned side; int i = 0; + if (getsockname(conn->s[1], (struct sockaddr *)&sa, &sl) < 0) { + int ret = -errno; + conn_flag(c, conn, CLOSING); + return ret; + } + inany_from_sockaddr(&FSIDE1(conn)->faddr, &FSIDE1(conn)->fport, + (struct sockaddr *)&sa); + + ASSERT(flowside_complete(FSIDE1(conn))); + for (side = 0; side < SIDES; side++) { conn->pipe[side][0] = conn->pipe[side][1] = -1; @@ -352,13 +367,24 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); } + /* It would be nicer if we could initialise FSIDE1 all at once with + * flowaddrs_from_af() or flowaddrs_from_sock(). However, we can't get + * the forwarding port until the connect() has finished and we don't + * want to block to wait for it. Meanwhile we have the endpoint address + * here, but don't have a place to stash it other than in the flowaddrs + * itself. So, initialisation of FSIDE1 is split between here and + * tcp_splice_connect_finish(). Ugly but necessary. + */ if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET, &addr4.sin_addr); } + FSIDE1(conn)->eport = port; if (connect(conn->s[1], sa, sl)) { if (errno != EINPROGRESS) { @@ -381,13 +407,13 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, * @c: Execution context * @conn: Connection pointer * @port: Destination port, host order - * @pif: Originating pif of the splice * * Return: return code from connect() */ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn, - in_port_t port, uint8_t pif) + in_port_t port) { + uint8_t pif0 = FSIDE0(conn)->pif, pif1; int s = -1; /* If the pool is empty we take slightly different approaches @@ -397,17 +423,19 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn, * entering the ns anyway, so we might as well refill the * pool. */ - if (pif == PIF_SPLICE) { + if (pif0 == PIF_SPLICE) { int *p = CONN_V6(conn) ? init_sock_pool6 : init_sock_pool4; int af = CONN_V6(conn) ? AF_INET6 : AF_INET; + pif1 = PIF_HOST; s = tcp_conn_pool_sock(p); if (s < 0) s = tcp_conn_new_sock(c, af); } else { int *p = CONN_V6(conn) ? ns_sock_pool6 : ns_sock_pool4; - ASSERT(pif == PIF_HOST); + ASSERT(pif0 == PIF_HOST); + pif1 = PIF_SPLICE; /* If pool is empty, refill it first */ if (p[TCP_SOCK_POOL_SIZE-1] < 0) @@ -421,6 +449,7 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn, return s; } + FSIDE1(conn)->pif = pif1; return tcp_splice_connect(c, conn, s, port); } @@ -428,34 +457,31 @@ static int tcp_splice_new(const struct ctx *c, struct tcp_splice_conn *conn, * tcp_splice_conn_from_sock() - Attempt to init state for a spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with FSIDE0(@conn) completed) * @s: Accepted socket - * @sa: Peer address of connection * * Return: true if able to create a spliced connection, false otherwise * #syscalls:pasta setsockopt */ bool tcp_splice_conn_from_sock(const struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa) + struct tcp_splice_conn *conn, int s) { - const struct in_addr *a4; - union inany_addr aany; - in_port_t port; + const struct in_addr *e4 = inany_v4(&FSIDE0(conn)->eaddr); + const struct in_addr *f4 = inany_v4(&FSIDE0(conn)->faddr); ASSERT(c->mode == MODE_PASTA); + ASSERT(flowside_complete(FSIDE0(conn))); - inany_from_sockaddr(&aany, &port, sa); - a4 = inany_v4(&aany); - - if (a4) { - if (!IN4_IS_ADDR_LOOPBACK(a4)) + if (e4) { + if (!IN4_IS_ADDR_LOOPBACK(e4)) return false; + ASSERT(f4 && IN4_IS_ADDR_LOOPBACK(f4)); conn->flags = 0; } else { - if (!IN6_IS_ADDR_LOOPBACK(&aany.a6)) + if (!IN6_IS_ADDR_LOOPBACK(&FSIDE0(conn)->eaddr.a6)) return false; + ASSERT(IN6_IS_ADDR_LOOPBACK(&FSIDE0(conn)->faddr.a6)); conn->flags = SPLICE_V6; } @@ -465,7 +491,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, conn->f.type = FLOW_TCP_SPLICE; conn->s[0] = s; - if (tcp_splice_new(c, conn, ref.port, ref.pif)) + if (tcp_splice_new(c, conn, ref.port)) conn_flag(c, conn, CLOSING); return true; diff --git a/tcp_splice.h b/tcp_splice.h index 18193e4..e1863a9 100644 --- a/tcp_splice.h +++ b/tcp_splice.h @@ -12,8 +12,7 @@ void tcp_splice_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events); bool tcp_splice_conn_from_sock(const struct ctx *c, union tcp_listen_epoll_ref ref, - struct tcp_splice_conn *conn, int s, - const struct sockaddr *sa); + struct tcp_splice_conn *conn, int s); void tcp_splice_init(struct ctx *c); #endif /* TCP_SPLICE_H */ -- 2.43.0

Stefano Brivio

17 Jan 17 Jan

8:59 p.m.

New subject: [PATCH v3 04/15] tcp_splice,flow: Maintain flow information for spliced connections

On Thu, 21 Dec 2023 18:02:26 +1100 David Gibson wrote:

...

Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.

Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that are implemented so far).

I wonder if carrying the address for spliced connections is in any way useful -- other than being obviously useful as a simplification (which justifies this of course). That is, for a spliced connection, addresses and ports are kind of meaningless to us once the connection is established: we operate exclusively above Layer 4. Also, conceptually, all that's there to represent for a spliced connection is that addresses are loopback. To be clear: I'm not suggesting any change to this -- I just want to raise the conceptual inconsistency if it didn't occur to you.

...

Signed-off-by: David Gibson --- tcp.c | 35 +++++++++++++---------------- tcp_splice.c | 62 +++++++++++++++++++++++++++++++++++++--------------- tcp_splice.h | 3 +-- 3 files changed, 60 insertions(+), 40 deletions(-)

diff --git a/tcp.c b/tcp.c index 18ab3ac..6d77cf6 100644 --- a/tcp.c +++ b/tcp.c @@ -2658,32 +2658,23 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPFSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - const struct sockaddr *sa, const struct timespec *now) { + ASSERT(flowside_complete(SOCKFSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);

- if (flowside_from_sock(SOCKFSIDE(conn), PIF_HOST, s, NULL, sa) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - - ASSERT(flowside_complete(SOCKFSIDE(conn))); - TAPFSIDE(conn)->pif = PIF_TAP; TAPFSIDE(conn)->faddr = SOCKFSIDE(conn)->eaddr; TAPFSIDE(conn)->fport = SOCKFSIDE(conn)->eport; @@ -2712,8 +2703,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);

tcp_get_sndbuf(conn); - - return true; }

/** @@ -2737,15 +2726,21 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel;

- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_from_sock(&flow->f.side[0], ref.tcp_listen.pif, s, + NULL, &sa) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); + flow_alloc_cancel(flow); return; + }

- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;

+ tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); + return; + cancel: /* Failed to create the connection */ if (s >= 0) diff --git a/tcp_splice.c b/tcp_splice.c index eec02fe..0faeb1b 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -72,6 +72,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2];

+#define FSIDE0(conn) (&(conn)->f.side[0]) +#define FSIDE1(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -280,9 +283,21 @@ bool tcp_splice_flow_defer(union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { + struct sockaddr_storage sa; + socklen_t sl = sizeof(sa); unsigned side; int i = 0;

+ if (getsockname(conn->s[1], (struct sockaddr *)&sa, &sl) < 0) { + int ret = -errno; + conn_flag(c, conn, CLOSING); + return ret; + } + inany_from_sockaddr(&FSIDE1(conn)->faddr, &FSIDE1(conn)->fport, + (struct sockaddr *)&sa); + + ASSERT(flowside_complete(FSIDE1(conn))); + for (side = 0; side < SIDES; side++) { conn->pipe[side][0] = conn->pipe[side][1] = -1;

@@ -352,13 +367,24 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); }

+ /* It would be nicer if we could initialise FSIDE1 all at once with + * flowaddrs_from_af() or flowaddrs_from_sock(). However, we can't get + * the forwarding port until the connect() has finished and we don't + * want to block to wait for it. Meanwhile we have the endpoint address

[...] endpoint address and port [...]. Or, if "address" includes the port too, then the comment should also say "forwarding address", not "forwarding port". It's confusing otherwise: why is there anything special with the endpoint *address* as opposed to the forwarding *port*?

...

+ * here, but don't have a place to stash it other than in the flowaddrs + * itself. So, initialisation of FSIDE1 is split between here and + * tcp_splice_connect_finish(). Ugly but necessary. + */ if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET, &addr4.sin_addr); } + FSIDE1(conn)->eport = port;

-- Stefano

David Gibson

18 Jan 18 Jan

2:01 a.m.

New subject: [PATCH v3 04/15] tcp_splice,flow: Maintain flow information for spliced connections

On Wed, Jan 17, 2024 at 08:59:14PM +0100, Stefano Brivio wrote:

...

On Thu, 21 Dec 2023 18:02:26 +1100 David Gibson wrote:

...
Every flow in the flow table now has space for the the addresses as seen by both the host and guest side. We fill that information in for regular "tap" TCP connections, but not for spliced connections.

Fill in that information for spliced connections too, so it's now uniformly available for all flow types (that are implemented so far).

I wonder if carrying the address for spliced connections is in any way useful -- other than being obviously useful as a simplification (which justifies this of course).

The simplification / consistency is even more important than it seems right here. One of the big aims of this (though I haven't implemented it yet) is to allow our NAT to be done generically, rather that per-protocol. That requires having the addressing information in the common structure regardless of flow type.

...

That is, for a spliced connection, addresses and ports are kind of meaningless to us once the connection is established: we operate exclusively above Layer 4.

Also, conceptually, all that's there to represent for a spliced connection is that addresses are loopback.

To be clear: I'm not suggesting any change to this -- I just want to raise the conceptual inconsistency if it didn't occur to you.

Hmm.. I would say in this case it's conceptual consistency leading to redundancy of information in practice rather than a conceptual inconsistency.

...

...
Signed-off-by: David Gibson --- tcp.c | 35 +++++++++++++---------------- tcp_splice.c | 62 +++++++++++++++++++++++++++++++++++++--------------- tcp_splice.h | 3 +-- 3 files changed, 60 insertions(+), 40 deletions(-)

diff --git a/tcp.c b/tcp.c index 18ab3ac..6d77cf6 100644 --- a/tcp.c +++ b/tcp.c @@ -2658,32 +2658,23 @@ static void tcp_snat_inbound(const struct ctx *c, union inany_addr *addr) * tcp_tap_conn_from_sock() - Initialize state for non-spliced connection * @c: Execution context * @ref: epoll reference of listening socket - * @conn: connection structure to initialize + * @conn: connection structure (with TAPFSIDE(@conn) completed) * @s: Accepted socket - * @sa: Peer socket address (from accept()) * @now: Current timestamp - * - * Return: true if able to create a tap connection, false otherwise */ -static bool tcp_tap_conn_from_sock(struct ctx *c, +static void tcp_tap_conn_from_sock(struct ctx *c, union tcp_listen_epoll_ref ref, struct tcp_tap_conn *conn, int s, - const struct sockaddr *sa, const struct timespec *now) { + ASSERT(flowside_complete(SOCKFSIDE(conn))); + conn->f.type = FLOW_TCP; conn->sock = s; conn->timer = -1; conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);

- if (flowside_from_sock(SOCKFSIDE(conn), PIF_HOST, s, NULL, sa) < 0) { - err("tcp: Failed to get local name, connection dropped"); - return false; - } - - ASSERT(flowside_complete(SOCKFSIDE(conn))); - TAPFSIDE(conn)->pif = PIF_TAP; TAPFSIDE(conn)->faddr = SOCKFSIDE(conn)->eaddr; TAPFSIDE(conn)->fport = SOCKFSIDE(conn)->eport; @@ -2712,8 +2703,6 @@ static bool tcp_tap_conn_from_sock(struct ctx *c, conn_flag(c, conn, ACK_FROM_TAP_DUE);

tcp_get_sndbuf(conn); - - return true; }

/** @@ -2737,15 +2726,21 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel;

- if (c->mode == MODE_PASTA && - tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, - s, (struct sockaddr *)&sa)) + if (flowside_from_sock(&flow->f.side[0], ref.tcp_listen.pif, s, + NULL, &sa) < 0) { + err("tcp: Failed to get local name, connection dropped"); + close(s); + flow_alloc_cancel(flow); return; + }

- if (tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, - (struct sockaddr *)&sa, now)) + if (c->mode == MODE_PASTA && + tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return;

+ tcp_tap_conn_from_sock(c, ref.tcp_listen, &flow->tcp, s, now); + return; + cancel: /* Failed to create the connection */ if (s >= 0) diff --git a/tcp_splice.c b/tcp_splice.c index eec02fe..0faeb1b 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -72,6 +72,9 @@ static int ns_sock_pool6 [TCP_SOCK_POOL_SIZE]; /* Pool of pre-opened pipes */ static int splice_pipe_pool [TCP_SPLICE_PIPE_POOL_SIZE][2];

+#define FSIDE0(conn) (&(conn)->f.side[0]) +#define FSIDE1(conn) (&(conn)->f.side[1]) + #define CONN_V6(x) (x->flags & SPLICE_V6) #define CONN_V4(x) (!CONN_V6(x)) #define CONN_HAS(conn, set) ((conn->events & (set)) == (set)) @@ -280,9 +283,21 @@ bool tcp_splice_flow_defer(union flow *flow) static int tcp_splice_connect_finish(const struct ctx *c, struct tcp_splice_conn *conn) { + struct sockaddr_storage sa; + socklen_t sl = sizeof(sa); unsigned side; int i = 0;

+ if (getsockname(conn->s[1], (struct sockaddr *)&sa, &sl) < 0) { + int ret = -errno; + conn_flag(c, conn, CLOSING); + return ret; + } + inany_from_sockaddr(&FSIDE1(conn)->faddr, &FSIDE1(conn)->fport, + (struct sockaddr *)&sa); + + ASSERT(flowside_complete(FSIDE1(conn))); + for (side = 0; side < SIDES; side++) { conn->pipe[side][0] = conn->pipe[side][1] = -1;

@@ -352,13 +367,24 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); }

+ /* It would be nicer if we could initialise FSIDE1 all at once with + * flowaddrs_from_af() or flowaddrs_from_sock(). However, we can't get + * the forwarding port until the connect() has finished and we don't + * want to block to wait for it. Meanwhile we have the endpoint address

[...] endpoint address and port [...]. Or, if "address" includes the port too, then the comment should also say "forwarding address", not "forwarding port".

It's confusing otherwise: why is there anything special with the endpoint *address* as opposed to the forwarding *port*?

...
+ * here, but don't have a place to stash it other than in the flowaddrs + * itself. So, initialisation of FSIDE1 is split between here and + * tcp_splice_connect_finish(). Ugly but necessary. + */ if (CONN_V6(conn)) { sa = (struct sockaddr *)&addr6; sl = sizeof(addr6); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET6, &addr6.sin6_addr); } else { sa = (struct sockaddr *)&addr4; sl = sizeof(addr4); + inany_from_af(&FSIDE1(conn)->eaddr, AF_INET, &addr4.sin_addr); } + FSIDE1(conn)->eport = port;

-- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

David Gibson

21 Dec 21 Dec

8:02 a.m.

New subject: [PATCH v3 05/15] flow, tcp, tcp_splice: Uniform debug helpers for new flows

When debugging passt/pasta, and the flow table in particular, one of the most obvious things to know is when a new flow is initiated, along with the details of its interface, addresses and ports. Once we've determined to what interface the flow should be forwarded, it's useful to know the details of how it will appear on that other interface. To help present that information uniformly, introduce FLOW_NEW_DBG() and FLOW_FWD_DBG() helpers and use them for TCP connections, both "tap" and spliced. Signed-off-by: David Gibson --- flow.c | 40 ++++++++++++++++++++++++++++++++++++++++ flow.h | 16 ++++++++++++++++ tcp.c | 11 +++++++++-- tcp_splice.c | 3 ++- 4 files changed, 67 insertions(+), 3 deletions(-) diff --git a/flow.c b/flow.c index b9c4a18..bc8cfc6 100644 --- a/flow.c +++ b/flow.c @@ -9,6 +9,7 @@ #include #include #include +#include #include "util.h" #include "passt.h" @@ -50,6 +51,45 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); } +/** + * flow_new_dbg() - Print debug message for new flow + * @f: Common flow structure + * @side: Which side initiated the new flow + */ +void flow_new_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + flow_log_(f, LOG_DEBUG, "New %s from %s/%u: [%s]:%hu <-> [%s]:%hu", + flow_type_str[f->type], pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + +/** + * flow_fwd_dbg() - Print debug message for newly forwarded flow + * @f: Common flow structure + * @side: Which side was the flow forwarded to + */ +void flow_fwd_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)); + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)); + + flow_log_(f, LOG_DEBUG, "Forwarded to %s/%u: [%s]:%hu <-> [%s]:%hu", + pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + /** flowside_from_sock - Initialize flowside to match an existing socket * @fside: flowside to initialize * @pif: pif id of this flowside diff --git a/flow.h b/flow.h index 37885b2..e7c4484 100644 --- a/flow.h +++ b/flow.h @@ -97,6 +97,22 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */ +/** flow_complete - Check if common parts of flow are fully initialized + * @flow: flow to check + */ +static inline bool flow_complete(const struct flow_common *f) +{ + return f->type != FLOW_TYPE_NONE && f->type < FLOW_NUM_TYPES && + flowside_complete(&f->side[0]) && + flowside_complete(&f->side[1]); +} + +void flow_new_dbg(const struct flow_common *f, unsigned side); +#define FLOW_NEW_DBG(flow, side) (flow_new_dbg(&(flow)->f, (side))) + +void flow_fwd_dbg(const struct flow_common *f, unsigned side); +#define FLOW_FWD_DBG(flow, side) (flow_fwd_dbg(&(flow)->f, (side))) + /** * struct flow_sidx - ID for one side of a specific flow * @side: Side referenced (0 or 1) diff --git a/tcp.c b/tcp.c index 6d77cf6..954c24f 100644 --- a/tcp.c +++ b/tcp.c @@ -1949,6 +1949,8 @@ static void tcp_conn_from_tap(struct ctx *c, conn->f.type = FLOW_TCP; flowside_from_af(TAPFSIDE(conn), PIF_TAP, af, daddr, ntohs(th->dest), saddr, ntohs(th->source)); + + FLOW_NEW_DBG(conn, TAPSIDE); ASSERT(flowside_complete(TAPFSIDE(conn))); conn->sock = s; @@ -2021,7 +2023,8 @@ static void tcp_conn_from_tap(struct ctx *c, return; } - ASSERT(flowside_complete(SOCKFSIDE(conn))); + FLOW_FWD_DBG(conn, SOCKSIDE); + ASSERT(flow_complete(&conn->f)); tcp_epoll_ctl(c, conn); return; @@ -2690,7 +2693,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, } TAPFSIDE(conn)->eport = ref.port; - ASSERT(flowside_complete(TAPFSIDE(conn))); + FLOW_FWD_DBG(conn, TAPSIDE); + ASSERT(flow_complete(&conn->f)); tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); @@ -2734,6 +2738,9 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, return; } + FLOW_NEW_DBG(flow, 0); + ASSERT(flowside_complete(&flow->f.side[0])); + if (c->mode == MODE_PASTA && tcp_splice_conn_from_sock(c, ref.tcp_listen, &flow->tcp_splice, s)) return; diff --git a/tcp_splice.c b/tcp_splice.c index 0faeb1b..b59ac90 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -296,7 +296,8 @@ static int tcp_splice_connect_finish(const struct ctx *c, inany_from_sockaddr(&FSIDE1(conn)->faddr, &FSIDE1(conn)->fport, (struct sockaddr *)&sa); - ASSERT(flowside_complete(FSIDE1(conn))); + FLOW_FWD_DBG(conn, 1); + ASSERT(flow_complete(&conn->f)); for (side = 0; side < SIDES; side++) { conn->pipe[side][0] = conn->pipe[side][1] = -1; -- 2.43.0

Stefano Brivio

17 Jan 17 Jan

8:59 p.m.

New subject: [PATCH v3 05/15] flow, tcp, tcp_splice: Uniform debug helpers for new flows

On Thu, 21 Dec 2023 18:02:27 +1100 David Gibson wrote:

...

When debugging passt/pasta, and the flow table in particular, one of the most obvious things to know is when a new flow is initiated, along with the details of its interface, addresses and ports. Once we've determined to what interface the flow should be forwarded, it's useful to know the details of how it will appear on that other interface.

To help present that information uniformly, introduce FLOW_NEW_DBG() and FLOW_FWD_DBG() helpers and use them for TCP connections, both "tap" and spliced.

Signed-off-by: David Gibson --- flow.c | 40 ++++++++++++++++++++++++++++++++++++++++ flow.h | 16 ++++++++++++++++ tcp.c | 11 +++++++++-- tcp_splice.c | 3 ++- 4 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/flow.c b/flow.c index b9c4a18..bc8cfc6 100644 --- a/flow.c +++ b/flow.c @@ -9,6 +9,7 @@ #include #include #include +#include

#include "util.h" #include "passt.h" @@ -50,6 +51,45 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); }

+/** + * flow_new_dbg() - Print debug message for new flow + * @f: Common flow structure + * @side: Which side initiated the new flow + */ +void flow_new_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + flow_log_(f, LOG_DEBUG, "New %s from %s/%u: [%s]:%hu <-> [%s]:%hu",

I think we should always print the connection index too (passed from the macro, or passing 'conn' as argument here) -- especially if we want to correlate this to what flow_fwd_dbg() will print later. It's also useful if anything goes wrong with the flow table itself. Sure, address/ports pairs should now be sufficient to uniquely identify a flow, and the flow table should be otherwise transparent, but the idea behind debugging is that there's a bug somewhere.

...

+ flow_type_str[f->type], pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + +/** + * flow_fwd_dbg() - Print debug message for newly forwarded flow + * @f: Common flow structure + * @side: Which side was the flow forwarded to + */ +void flow_fwd_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)); + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)); + + flow_log_(f, LOG_DEBUG, "Forwarded to %s/%u: [%s]:%hu <-> [%s]:%hu", + pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + /** flowside_from_sock - Initialize flowside to match an existing socket * @fside: flowside to initialize * @pif: pif id of this flowside diff --git a/flow.h b/flow.h index 37885b2..e7c4484 100644 --- a/flow.h +++ b/flow.h @@ -97,6 +97,22 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */

+/** flow_complete - Check if common parts of flow are fully initialized + * @flow: flow to check + */ +static inline bool flow_complete(const struct flow_common *f) +{ + return f->type != FLOW_TYPE_NONE && f->type < FLOW_NUM_TYPES && + flowside_complete(&f->side[0]) && + flowside_complete(&f->side[1]);

Preferably align next lines after "return ". -- Stefano

David Gibson

18 Jan 18 Jan

2:04 a.m.

New subject: [PATCH v3 05/15] flow, tcp, tcp_splice: Uniform debug helpers for new flows

On Wed, Jan 17, 2024 at 08:59:22PM +0100, Stefano Brivio wrote:

...

On Thu, 21 Dec 2023 18:02:27 +1100 David Gibson wrote:

...
When debugging passt/pasta, and the flow table in particular, one of the most obvious things to know is when a new flow is initiated, along with the details of its interface, addresses and ports. Once we've determined to what interface the flow should be forwarded, it's useful to know the details of how it will appear on that other interface.

To help present that information uniformly, introduce FLOW_NEW_DBG() and FLOW_FWD_DBG() helpers and use them for TCP connections, both "tap" and spliced.

Signed-off-by: David Gibson --- flow.c | 40 ++++++++++++++++++++++++++++++++++++++++ flow.h | 16 ++++++++++++++++ tcp.c | 11 +++++++++-- tcp_splice.c | 3 ++- 4 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/flow.c b/flow.c index b9c4a18..bc8cfc6 100644 --- a/flow.c +++ b/flow.c @@ -9,6 +9,7 @@ #include #include #include +#include

#include "util.h" #include "passt.h" @@ -50,6 +51,45 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); }

+/** + * flow_new_dbg() - Print debug message for new flow + * @f: Common flow structure + * @side: Which side initiated the new flow + */ +void flow_new_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + flow_log_(f, LOG_DEBUG, "New %s from %s/%u: [%s]:%hu <-> [%s]:%hu",

I think we should always print the connection index too (passed from the macro, or passing 'conn' as argument here) -- especially if we want to correlate this to what flow_fwd_dbg() will print later.

Printing the index is built into flow_log_(), so we're already doing that.

...

It's also useful if anything goes wrong with the flow table itself.

Sure, address/ports pairs should now be sufficient to uniquely identify a flow, and the flow table should be otherwise transparent, but the idea behind debugging is that there's a bug somewhere.

...
+ flow_type_str[f->type], pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + +/** + * flow_fwd_dbg() - Print debug message for newly forwarded flow + * @f: Common flow structure + * @side: Which side was the flow forwarded to + */ +void flow_fwd_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)); + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)); + + flow_log_(f, LOG_DEBUG, "Forwarded to %s/%u: [%s]:%hu <-> [%s]:%hu", + pif_name(fside->pif), side, + inet_ntop(AF_INET6, &fside->faddr, fbuf, sizeof(fbuf)), + fside->fport, + inet_ntop(AF_INET6, &fside->eaddr, ebuf, sizeof(ebuf)), + fside->eport); +} + /** flowside_from_sock - Initialize flowside to match an existing socket * @fside: flowside to initialize * @pif: pif id of this flowside diff --git a/flow.h b/flow.h index 37885b2..e7c4484 100644 --- a/flow.h +++ b/flow.h @@ -97,6 +97,22 @@ struct flow_common { #define FLOW_TABLE_PRESSURE 30 /* % of FLOW_MAX */ #define FLOW_FILE_PRESSURE 30 /* % of c->nofile */

+/** flow_complete - Check if common parts of flow are fully initialized + * @flow: flow to check + */ +static inline bool flow_complete(const struct flow_common *f) +{ + return f->type != FLOW_TYPE_NONE && f->type < FLOW_NUM_TYPES && + flowside_complete(&f->side[0]) && + flowside_complete(&f->side[1]);

Preferably align next lines after "return ".

Done. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

Stefano Brivio

4:40 p.m.

New subject: [PATCH v3 05/15] flow, tcp, tcp_splice: Uniform debug helpers for new flows

On Thu, 18 Jan 2024 12:04:15 +1100 David Gibson wrote:

...

On Wed, Jan 17, 2024 at 08:59:22PM +0100, Stefano Brivio wrote:

...
On Thu, 21 Dec 2023 18:02:27 +1100 David Gibson wrote:

...
When debugging passt/pasta, and the flow table in particular, one of the most obvious things to know is when a new flow is initiated, along with the details of its interface, addresses and ports. Once we've determined to what interface the flow should be forwarded, it's useful to know the details of how it will appear on that other interface.

To help present that information uniformly, introduce FLOW_NEW_DBG() and FLOW_FWD_DBG() helpers and use them for TCP connections, both "tap" and spliced.

Signed-off-by: David Gibson --- flow.c | 40 ++++++++++++++++++++++++++++++++++++++++ flow.h | 16 ++++++++++++++++ tcp.c | 11 +++++++++-- tcp_splice.c | 3 ++- 4 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/flow.c b/flow.c index b9c4a18..bc8cfc6 100644 --- a/flow.c +++ b/flow.c @@ -9,6 +9,7 @@ #include #include #include +#include

#include "util.h" #include "passt.h" @@ -50,6 +51,45 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) logmsg(pri, "Flow %u (%s): %s", flow_idx(f), FLOW_TYPE(f), msg); }

+/** + * flow_new_dbg() - Print debug message for new flow + * @f: Common flow structure + * @side: Which side initiated the new flow + */ +void flow_new_dbg(const struct flow_common *f, unsigned side) +{ + char ebuf[INET6_ADDRSTRLEN], fbuf[INET6_ADDRSTRLEN]; + const struct flowside *fside = &f->side[side]; + + flow_log_(f, LOG_DEBUG, "New %s from %s/%u: [%s]:%hu <-> [%s]:%hu",

I think we should always print the connection index too (passed from the macro, or passing 'conn' as argument here) -- especially if we want to correlate this to what flow_fwd_dbg() will print later.

Printing the index is built into flow_log_(), so we're already doing that.

Oh, right, sorry, I forgot about it. -- Stefano

David Gibson

21 Dec 21 Dec

8:02 a.m.

New subject: [PATCH v3 06/15] tcp, flow: Replace TCP specific hash function with general flow hash

Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP. Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the L4 protocol number. Signed-off-by: David Gibson --- flow.c | 21 +++++++++++++++++++++ flow.h | 19 +++++++++++++++++++ tcp.c | 59 +++++++++++----------------------------------------------- 3 files changed, 51 insertions(+), 48 deletions(-) diff --git a/flow.c b/flow.c index bc8cfc6..263633e 100644 --- a/flow.c +++ b/flow.c @@ -229,6 +229,27 @@ void flow_alloc_cancel(union flow *flow) flow_first_free = FLOW_IDX(flow); } +/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @fside: Flowside + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + ASSERT(flowside_complete(fside)); + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr); + return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)fside->pif << 32 | + fside->fport << 16 | fside->eport); +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index e7c4484..72ded54 100644 --- a/flow.h +++ b/flow.h @@ -81,6 +81,22 @@ static inline bool flowside_complete(const struct flowside *fside) #define SIDES 2 +/** + * flowside_eq() - Check if two flowsides are equal + * @left, @right: Flowsides to compare + * + * Return: true if equal, false otherwise + */ +static inline bool flowside_eq(const struct flowside *left, + const struct flowside *right) +{ + return left->pif == right->pif && + inany_equals(&left->eaddr, &right->eaddr) && + left->eport == right->eport && + inany_equals(&left->faddr, &right->faddr) && + left->fport == right->fport; +} + /** * struct flow_common - Common fields for packet flows * @side[]: Information for each side of the flow @@ -113,6 +129,9 @@ void flow_new_dbg(const struct flow_common *f, unsigned side); void flow_fwd_dbg(const struct flow_common *f, unsigned side); #define FLOW_FWD_DBG(flow, side) (flow_fwd_dbg(&(flow)->f, (side))) +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside); + /** * struct flow_sidx - ID for one side of a specific flow * @side: Side referenced (0 or 1) diff --git a/tcp.c b/tcp.c index 954c24f..a7907f2 100644 --- a/tcp.c +++ b/tcp.c @@ -575,7 +575,7 @@ static unsigned int tcp6_l2_flags_buf_used; #define CONN(idx) (&(FLOW(idx)->tcp)) -/* Table for lookup from remote address, local port, remote port */ +/* Table for lookup from flowside information */ static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE]; static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, @@ -1134,44 +1134,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, return -1; } -/** - * tcp_hash_match() - Check if a connection entry matches address and ports - * @conn: Connection entry to match against - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port - * - * Return: 1 on match, 0 otherwise - */ -static int tcp_hash_match(const struct tcp_tap_conn *conn, - const union inany_addr *faddr, - in_port_t eport, in_port_t fport) -{ - if (inany_equals(&TAPFSIDE(conn)->faddr, faddr) && - TAPFSIDE(conn)->eport == eport && TAPFSIDE(conn)->fport == fport) - return 1; - - return 0; -} - -/** - * tcp_hash() - Calculate hash value for connection given address and ports - * @c: Execution context - * @faddr: Guest side forwarding address - * @eport: Guest side endpoint port - * @fport: Guest side forwarding port - * - * Return: hash value, needs to be adjusted for table size - */ -static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, - in_port_t eport, in_port_t fport) -{ - struct siphash_state state = SIPHASH_INIT(c->hash_secret); - - inany_siphash_feed(&state, faddr); - return siphash_final(&state, 20, (uint64_t)eport << 16 | fport); -} - /** * tcp_conn_hash() - Calculate hash bucket of an existing connection * @c: Execution context @@ -1182,8 +1144,7 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &TAPFSIDE(conn)->faddr, - TAPFSIDE(conn)->eport, TAPFSIDE(conn)->fport); + return flow_hash(c, IPPROTO_TCP, TAPFSIDE(conn)); } /** @@ -1258,25 +1219,26 @@ static void tcp_hash_remove(const struct ctx *c, * tcp_hash_lookup() - Look up connection given remote address and ports * @c: Execution context * @af: Address family, AF_INET or AF_INET6 + * @eaddr: Guest side endpoint address (guest local address) * @faddr: Guest side forwarding address (guest remote address) * @eport: Guest side endpoint port (guest local port) * @fport: Guest side forwarding port (guest remote port) * * Return: connection pointer, if found, -ENOENT otherwise */ -static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, - int af, const void *faddr, +static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, int af, + const void *eaddr, const void *faddr, in_port_t eport, in_port_t fport) { - union inany_addr aany; + struct flowside fside; union flow *flow; unsigned b; - inany_from_af(&aany, af, faddr); + flowside_from_af(&fside, PIF_TAP, af, faddr, fport, eaddr, eport); - b = tcp_hash(c, &aany, eport, fport) % TCP_HASH_TABLE_SIZE; + b = flow_hash(c, IPPROTO_TCP, &fside) % TCP_HASH_TABLE_SIZE; while ((flow = flow_at_sidx(tc_hash[b])) && - !tcp_hash_match(&flow->tcp, &aany, eport, fport)) + !flowside_eq(&flow->f.side[TAPSIDE], &fside)) b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); return &flow->tcp; @@ -2506,7 +2468,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, idx, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, daddr, ntohs(th->source), ntohs(th->dest)); + conn = tcp_hash_lookup(c, af, saddr, daddr, + ntohs(th->source), ntohs(th->dest)); /* New connection from tap */ if (!conn) { -- 2.43.0

Stefano Brivio

17 Jan 17 Jan

8:59 p.m.

New subject: [PATCH v3 06/15] tcp, flow: Replace TCP specific hash function with general flow hash

On Thu, 21 Dec 2023 18:02:28 +1100 David Gibson wrote:

...

Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP.

Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the L4 protocol number.

Signed-off-by: David Gibson --- flow.c | 21 +++++++++++++++++++++ flow.h | 19 +++++++++++++++++++ tcp.c | 59 +++++++++++----------------------------------------------- 3 files changed, 51 insertions(+), 48 deletions(-)

diff --git a/flow.c b/flow.c index bc8cfc6..263633e 100644 --- a/flow.c +++ b/flow.c @@ -229,6 +229,27 @@ void flow_alloc_cancel(union flow *flow) flow_first_free = FLOW_IDX(flow); }

+/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @fside: Flowside + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + ASSERT(flowside_complete(fside)); + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr);

Customary newline here.

...

+ return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)fside->pif << 32 | + fside->fport << 16 | fside->eport);

If we add the fields from the 'tail' part (not the whole fside) to an anonymous struct in a similar way to what we had before fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions"), then we could drop those shift and masks, and use sizeof(that) + sizeof(fside->faddr) + sizeof(fside->eaddr) instead of '38'.

...

+} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index e7c4484..72ded54 100644 --- a/flow.h +++ b/flow.h @@ -81,6 +81,22 @@ static inline bool flowside_complete(const struct flowside *fside)

#define SIDES 2

+/** + * flowside_eq() - Check if two flowsides are equal

...this raises the question: if the protocol number is now used in the hash, shouldn't it eventually become part of struct flowside -- and compared here, too? I guess it's useful iff we allow flowsides for the same flow to have different protocol numbers. Now, forwarding between TCP and UDP endpoints might not make a lot of sense, because we would have to make so many arbitrary assumptions as to make it probably not generic enough to be useful. But TCP to stream-oriented UNIX domain socket makes sense, and we also had user requests in that sense. Oh and... what would be the second protocol number in that case? Same here, I'm not proposing a concrete change here, I'm rather raising this if you didn't think about it (assuming it makes any sense).

...

+ * @left, @right: Flowsides to compare + * + * Return: true if equal, false otherwise + */ +static inline bool flowside_eq(const struct flowside *left, + const struct flowside *right) +{ + return left->pif == right->pif && + inany_equals(&left->eaddr, &right->eaddr) && + left->eport == right->eport && + inany_equals(&left->faddr, &right->faddr) && + left->fport == right->fport;

Preferably align under "return ". -- Stefano

David Gibson

18 Jan 18 Jan

2:15 a.m.

New subject: [PATCH v3 06/15] tcp, flow: Replace TCP specific hash function with general flow hash

On Wed, Jan 17, 2024 at 08:59:34PM +0100, Stefano Brivio wrote:

...

On Thu, 21 Dec 2023 18:02:28 +1100 David Gibson wrote:

...
Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP.

Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the L4 protocol number.

Signed-off-by: David Gibson --- flow.c | 21 +++++++++++++++++++++ flow.h | 19 +++++++++++++++++++ tcp.c | 59 +++++++++++----------------------------------------------- 3 files changed, 51 insertions(+), 48 deletions(-)

diff --git a/flow.c b/flow.c index bc8cfc6..263633e 100644 --- a/flow.c +++ b/flow.c @@ -229,6 +229,27 @@ void flow_alloc_cancel(union flow *flow) flow_first_free = FLOW_IDX(flow); }

+/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @fside: Flowside + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + ASSERT(flowside_complete(fside)); + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr);

Customary newline here.

Done.

...

...
+ return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)fside->pif << 32 | + fside->fport << 16 | fside->eport);

If we add the fields from the 'tail' part (not the whole fside) to an anonymous struct in a similar way to what we had before fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions"), then we could drop those shift and masks, and use sizeof(that) + sizeof(fside->faddr) + sizeof(fside->eaddr) instead of '38'.

Hrm, I guess so. There are some wrinkles: * we'd need a union so we can get the actual u64 value we need to pass * we'd need to make sure that the the remaining (64-38 == 26) bytes of that union are consistently initialised * the struct part would need to be packed, or padding will mess with us * the exact value would now depend on the host endianness, which is probably fine, but worth noting

...

...
+} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index e7c4484..72ded54 100644 --- a/flow.h +++ b/flow.h @@ -81,6 +81,22 @@ static inline bool flowside_complete(const struct flowside *fside)

#define SIDES 2

+/** + * flowside_eq() - Check if two flowsides are equal

...this raises the question: if the protocol number is now used in the hash, shouldn't it eventually become part of struct flowside -- and compared here, too?

...

I guess it's useful iff we allow flowsides for the same flow to have different protocol numbers.

Right, which is not something I'm planning on doing, or at least not very soon.

...

Now, forwarding between TCP and UDP endpoints might not make a lot of sense, because we would have to make so many arbitrary assumptions as to make it probably not generic enough to be useful.

Exactly.

...

But TCP to stream-oriented UNIX domain socket makes sense, and we also had user requests in that sense. Oh and... what would be the second protocol number in that case?

Right, that's a possible future extension. But if we're going outside the realm of IP, a number of things need to be changed. I think that's something for another day.

...

Same here, I'm not proposing a concrete change here, I'm rather raising this if you didn't think about it (assuming it makes any sense).

...
+ * @left, @right: Flowsides to compare + * + * Return: true if equal, false otherwise + */ +static inline bool flowside_eq(const struct flowside *left, + const struct flowside *right) +{ + return left->pif == right->pif && + inany_equals(&left->eaddr, &right->eaddr) && + left->eport == right->eport && + inany_equals(&left->faddr, &right->faddr) && + left->fport == right->fport;

Preferably align under "return ".

Done. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

Stefano Brivio

4:42 p.m.

New subject: [PATCH v3 06/15] tcp, flow: Replace TCP specific hash function with general flow hash

On Thu, 18 Jan 2024 12:15:12 +1100 David Gibson wrote:

...

On Wed, Jan 17, 2024 at 08:59:34PM +0100, Stefano Brivio wrote:

...
On Thu, 21 Dec 2023 18:02:28 +1100 David Gibson wrote:

...
Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP.

Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the L4 protocol number.

Signed-off-by: David Gibson --- flow.c | 21 +++++++++++++++++++++ flow.h | 19 +++++++++++++++++++ tcp.c | 59 +++++++++++----------------------------------------------- 3 files changed, 51 insertions(+), 48 deletions(-)

diff --git a/flow.c b/flow.c index bc8cfc6..263633e 100644 --- a/flow.c +++ b/flow.c @@ -229,6 +229,27 @@ void flow_alloc_cancel(union flow *flow) flow_first_free = FLOW_IDX(flow); }

+/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @fside: Flowside + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + ASSERT(flowside_complete(fside)); + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr);

Customary newline here.

Done.

...
...
+ return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)fside->pif << 32 | + fside->fport << 16 | fside->eport);

If we add the fields from the 'tail' part (not the whole fside) to an anonymous struct in a similar way to what we had before fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions"), then we could drop those shift and masks, and use sizeof(that) + sizeof(fside->faddr) + sizeof(fside->eaddr) instead of '38'.

Hrm, I guess so. There are some wrinkles: * we'd need a union so we can get the actual u64 value we need to pass

I haven't tried, but I guess you can just cast a (packed) struct.

...

* we'd need to make sure that the the remaining (64-38 == 26) bytes of that union are consistently initialised

Where do the 64 _bytes_ come from?

...

* the struct part would need to be packed, or padding will mess with us

Right, yes, just see tcp_seq_init() before fc8f0f8c48ef.

...

* the exact value would now depend on the host endianness, which is probably fine, but worth noting

Oh, I didn't even think of that when I wrote the old tcp_seq_init(). Anyway, yes, it doesn't matter.

...

...
...
+} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index e7c4484..72ded54 100644 --- a/flow.h +++ b/flow.h @@ -81,6 +81,22 @@ static inline bool flowside_complete(const struct flowside *fside)

#define SIDES 2

+/** + * flowside_eq() - Check if two flowsides are equal

...this raises the question: if the protocol number is now used in the hash, shouldn't it eventually become part of struct flowside -- and compared here, too?

...
I guess it's useful iff we allow flowsides for the same flow to have different protocol numbers.

Right, which is not something I'm planning on doing, or at least not very soon.

...
Now, forwarding between TCP and UDP endpoints might not make a lot of sense, because we would have to make so many arbitrary assumptions as to make it probably not generic enough to be useful.

Exactly.

...
But TCP to stream-oriented UNIX domain socket makes sense, and we also had user requests in that sense. Oh and... what would be the second protocol number in that case?

Right, that's a possible future extension. But if we're going outside the realm of IP, a number of things need to be changed. I think that's something for another day.

I'm not suggesting to support that right away, I was just wondering if it actually makes sense, right from the beginning, to keep the hash information "consistent" with the flow table information, including having the protocol number in the flow table. But I didn't really think it through, you know better. -- Stefano

David Gibson

19 Jan 19 Jan

12:55 a.m.

New subject: [PATCH v3 06/15] tcp, flow: Replace TCP specific hash function with general flow hash

On Thu, Jan 18, 2024 at 04:42:34PM +0100, Stefano Brivio wrote:

...

On Thu, 18 Jan 2024 12:15:12 +1100 David Gibson wrote:

...
On Wed, Jan 17, 2024 at 08:59:34PM +0100, Stefano Brivio wrote:

...
On Thu, 21 Dec 2023 18:02:28 +1100 David Gibson wrote:

...
Currently we match TCP packets received on the tap connection to a TCP connection via a hash table based on the forwarding address and both ports. We hope in future to allow for multiple guest side addresses, or for multiple interfaces which means we may need to distinguish based on the endpoint address and pif as well. We also want a unified hash table to cover multiple protocols, not just TCP.

Replace the TCP specific hash function with one suitable for general flows, or rather for one side of a general flow. This includes all the information from struct flowside, plus the L4 protocol number.

Signed-off-by: David Gibson --- flow.c | 21 +++++++++++++++++++++ flow.h | 19 +++++++++++++++++++ tcp.c | 59 +++++++++++----------------------------------------------- 3 files changed, 51 insertions(+), 48 deletions(-)

diff --git a/flow.c b/flow.c index bc8cfc6..263633e 100644 --- a/flow.c +++ b/flow.c @@ -229,6 +229,27 @@ void flow_alloc_cancel(union flow *flow) flow_first_free = FLOW_IDX(flow); }

+/** + * flow_hash() - Calculate hash value for one side of a flow + * @c: Execution context + * @proto: Protocol of this flow (IP L4 protocol number) + * @fside: Flowside + * + * Return: hash value + */ +uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) +{ + struct siphash_state state = SIPHASH_INIT(c->hash_secret); + + ASSERT(flowside_complete(fside)); + inany_siphash_feed(&state, &fside->faddr); + inany_siphash_feed(&state, &fside->eaddr);

Customary newline here.

Done.

...
...
+ return siphash_final(&state, 38, (uint64_t)proto << 40 | + (uint64_t)fside->pif << 32 | + fside->fport << 16 | fside->eport);

If we add the fields from the 'tail' part (not the whole fside) to an anonymous struct in a similar way to what we had before fc8f0f8c48ef ("siphash: Use incremental rather than all-at-once siphash functions"), then we could drop those shift and masks, and use sizeof(that) + sizeof(fside->faddr) + sizeof(fside->eaddr) instead of '38'.

Hrm, I guess so. There are some wrinkles: * we'd need a union so we can get the actual u64 value we need to pass

I haven't tried, but I guess you can just cast a (packed) struct.

No... I don't think you can: $ cat pack64.c #include #include #include int main(int argc, char *argv[]) { struct foo { uint16_t a, b; uint8_t c; } bar = {1, 2, 3}; printf("bar is %zd bytes and has 64-bit value %08lx\n", sizeof(bar), (uint64_t)bar); exit(0); } $ gcc pack64.c pack64.c: In function ‘main’: pack64.c:13:16: error: aggregate value used where an integer was expected 13 | sizeof(bar), (uint64_t)bar); | ^~~~~~

...

...
* we'd need to make sure that the the remaining (64-38 == 26) bytes of that union are consistently initialised

Where do the 64 _bytes_ come from?

Duh, sorry. I meant the remaining (8 - 6 == 2) bytes of the union.

...

...
* the struct part would need to be packed, or padding will mess with us

Right, yes, just see tcp_seq_init() before fc8f0f8c48ef.

...
* the exact value would now depend on the host endianness, which is probably fine, but worth noting

Oh, I didn't even think of that when I wrote the old tcp_seq_init(). Anyway, yes, it doesn't matter.

...
...
...
+} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context diff --git a/flow.h b/flow.h index e7c4484..72ded54 100644 --- a/flow.h +++ b/flow.h @@ -81,6 +81,22 @@ static inline bool flowside_complete(const struct flowside *fside)

#define SIDES 2

+/** + * flowside_eq() - Check if two flowsides are equal

...this raises the question: if the protocol number is now used in the hash, shouldn't it eventually become part of struct flowside -- and compared here, too?

...
I guess it's useful iff we allow flowsides for the same flow to have different protocol numbers.

Right, which is not something I'm planning on doing, or at least not very soon.

...
Now, forwarding between TCP and UDP endpoints might not make a lot of sense, because we would have to make so many arbitrary assumptions as to make it probably not generic enough to be useful.

Exactly.

...
But TCP to stream-oriented UNIX domain socket makes sense, and we also had user requests in that sense. Oh and... what would be the second protocol number in that case?

Right, that's a possible future extension. But if we're going outside the realm of IP, a number of things need to be changed. I think that's something for another day.

I'm not suggesting to support that right away, I was just wondering if it actually makes sense, right from the beginning, to keep the hash information "consistent" with the flow table information, including having the protocol number in the flow table.

But I didn't really think it through, you know better.

Yeah, I think this would at most make things very slightly easier down the track when we want to add that, at the cost of making things significantly less natural now. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

David Gibson

21 Dec 21 Dec

8:02 a.m.

New subject: [PATCH v3 07/15] flow: Add helper to determine a flow's protocol

Each flow already has a type field. This implies the protocol the flow represents, but also has more information: we have two ways to represent TCP flows, "tap" and "spliced". In order to generalise some of the flow mechanics, we'll need to determine a flow's protocol in terms of the IP (L4) protocol number. Introduce a constant table and helper macro to derive this from the flow type. Signed-off-by: David Gibson --- flow.c | 7 +++++++ flow.h | 4 ++++ 2 files changed, 11 insertions(+) diff --git a/flow.c b/flow.c index 263633e..70036fc 100644 --- a/flow.c +++ b/flow.c @@ -26,6 +26,13 @@ const char *flow_type_str[] = { static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); +const uint8_t flow_proto[] = { + [FLOW_TCP] = IPPROTO_TCP, + [FLOW_TCP_SPLICE] = IPPROTO_TCP, +}; +static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, + "flow_proto[] doesn't match enum flow_type"); + /* Global Flow Table */ unsigned flow_first_free; union flow flowtab[FLOW_MAX]; diff --git a/flow.h b/flow.h index 72ded54..ab831d1 100644 --- a/flow.h +++ b/flow.h @@ -27,6 +27,10 @@ extern const char *flow_type_str[]; #define FLOW_TYPE(f) \ ((f)->type < FLOW_NUM_TYPES ? flow_type_str[(f)->type] : "?") +extern const uint8_t flow_proto[]; +#define FLOW_PROTO(f) \ + ((f)->type < FLOW_NUM_TYPES ? flow_proto[(f)->type] : 0) + /** * struct flowside - Common information for one side of a flow * @eaddr: Endpoint address (remote address from passt's PoV) -- 2.43.0

David Gibson

8:02 a.m.

New subject: [PATCH v3 08/15] flow, tcp: Generalise TCP hash table to general flow hash table

Move the data structures and helper functions for the TCP hash table to flow.c, making it a general hash table indexing sides of flows. This is largely code motion and straightforward renames. There are two semantic changes: * flow_hash_lookup() now needs to verify that the entry has a matching protocol as well as matching addresses, ports and interface * When moving entries in the flow table tcp_tap_conn_update() could previously assume that only the TAP side of the TCP connection was hashed and might need an update. For the general flow hash table either side of a flow could be hashed, so when we move an entry in the flow table we need to allow for updating both sides in the hash table. Signed-off-by: David Gibson --- flow.c | 128 +++++++++++++++++++++++++++++++++++++++++++++++++- flow.h | 11 +++-- tcp.c | 146 +++++---------------------------------------------------- 3 files changed, 147 insertions(+), 138 deletions(-) diff --git a/flow.c b/flow.c index 70036fc..3d016c3 100644 --- a/flow.c +++ b/flow.c @@ -40,6 +40,16 @@ union flow flowtab[FLOW_MAX]; /* Last time the flow timers ran */ static struct timespec flow_timer_run; +/* Hash table to index it */ +#define FLOW_HASH_LOAD 70 /* % */ +#define FLOW_HASH_SIZE ((2 * FLOW_MAX * 100 / FLOW_HASH_LOAD)) + +/* Table for lookup from flowside information */ +static flow_sidx_t flow_hashtab[FLOW_HASH_SIZE]; + +static_assert(ARRAY_SIZE(flow_hashtab) >= 2 * FLOW_MAX, +"Safe linear probing requires hash table with more entries than the number of sides in the flow table"); + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority @@ -244,8 +254,8 @@ void flow_alloc_cancel(union flow *flow) * * Return: hash value */ -uint64_t flow_hash(const struct ctx *c, uint8_t proto, - const struct flowside *fside) +static uint64_t flow_hash(const struct ctx *c, uint8_t proto, + const struct flowside *fside) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); @@ -257,6 +267,115 @@ uint64_t flow_hash(const struct ctx *c, uint8_t proto, fside->fport << 16 | fside->eport); } +/** + * flow_sidx_hash() - Calculate hash value for given side of a given flow + * @c: Execution context + * @sidx: Flow & side index to get hash for + * + * Return: hash value, of the flow & side represented by @sidx + */ +static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx) +{ + const struct flow_common *f = &flow_at_sidx(sidx)->f; + return flow_hash(c, FLOW_PROTO(f), &f->side[sidx.side]); +} + +/** + * flow_hash_probe() - Find hash bucket for a flow + * @c: Execution context + * @sidx: Flow and side to find bucket for + * + * Return: If @sidx is in the hash table, its current bucket, otherwise a + * suitable free bucket for it. + */ +static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE; + + /* Linear probing */ + while (!flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE) && + !flow_sidx_eq(flow_hashtab[b], sidx)) + b = mod_sub(b, 1, FLOW_HASH_SIZE); + + return b; +} + +/** + * flow_hash_insert() - Insert side of a flow into into hash table + * @c: Execution context + * @sidx: Flow & side index + */ +void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_hash_probe(c, sidx); + + flow_hashtab[b] = sidx; + flow_dbg(flow_at_sidx(sidx), "hash table insert: bucket: %u", b); +} + +/** + * flow_hash_remove() - Drop side of a flow from the hash table + * @c: Execution context + * @sidx: Side of flow to remove + */ +void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx) +{ + unsigned b = flow_hash_probe(c, sidx), s; + + if (flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE)) + return; /* Redundant remove */ + + flow_dbg(flow_at_sidx(sidx), "hash table remove: bucket: %u", b); + + /* Scan the remainder of the cluster */ + for (s = mod_sub(b, 1, FLOW_HASH_SIZE); + !flow_sidx_eq(flow_hashtab[s], FLOW_SIDX_NONE); + s = mod_sub(s, 1, FLOW_HASH_SIZE)) { + unsigned h = flow_sidx_hash(c, flow_hashtab[s]) % FLOW_HASH_SIZE; + + if (!mod_between(h, s, b, FLOW_HASH_SIZE)) { + /* flow_hashtab[s] can live in flow_hashtab[b]'s slot */ + debug("hash table remove: shuffle %u -> %u", s, b); + flow_hashtab[b] = flow_hashtab[s]; + b = s; + } + } + + flow_hashtab[b] = FLOW_SIDX_NONE; +} + +/** + * flow_hash_lookup() - Look up a flow given addressing information + * @c: Execution context + * @proto: Protocol of the flow (IP L4 protocol number) + * @pif: Interface of the flow + * @af: Address family, AF_INET or AF_INET6 + * @eaddr: Guest side endpoint address (guest local address) + * @faddr: Guest side forwarding address (guest remote address) + * @eport: Guest side endpoint port (guest local port) + * @fport: Guest side forwarding port (guest remote port) + * + * Return: sidx of the matching flow & side, FLOW_SIDX_NONE if not found + */ +flow_sidx_t flow_hash_lookup(const struct ctx *c, uint8_t proto, uint8_t pif, + int af, const void *eaddr, const void *faddr, + in_port_t eport, in_port_t fport) +{ + struct flowside fside; + union flow *flow; + int b; + + flowside_from_af(&fside, pif, af, faddr, fport, eaddr, eport); + + b = flow_hash(c, proto, &fside) % FLOW_HASH_SIZE; + while ((flow = flow_at_sidx(flow_hashtab[b])) && + FLOW_PROTO(&flow->f) == proto && + !flowside_eq(&flow->f.side[flow_hashtab[b].side], &fside)) + b = (b + 1) % FLOW_HASH_SIZE; + + return flow_hashtab[b]; +} + /** * flow_defer_handler() - Handler for per-flow deferred and timed tasks * @c: Execution context @@ -334,7 +453,12 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) */ void flow_init(void) { + unsigned b; + /* Initial state is a single free block containing the whole table */ flowtab[0].free.n = FLOW_MAX; flowtab[0].free.next = FLOW_MAX; + + for (b = 0; b < FLOW_HASH_SIZE; b++) + flow_hashtab[b] = FLOW_SIDX_NONE; } diff --git a/flow.h b/flow.h index ab831d1..adb5d44 100644 --- a/flow.h +++ b/flow.h @@ -133,9 +133,6 @@ void flow_new_dbg(const struct flow_common *f, unsigned side); void flow_fwd_dbg(const struct flow_common *f, unsigned side); #define FLOW_FWD_DBG(flow, side) (flow_fwd_dbg(&(flow)->f, (side))) -uint64_t flow_hash(const struct ctx *c, uint8_t proto, - const struct flowside *fside); - /** * struct flow_sidx - ID for one side of a specific flow * @side: Side referenced (0 or 1) @@ -161,6 +158,12 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) return (a.flow == b.flow) && (a.side == b.side); } +void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); +void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx); +flow_sidx_t flow_hash_lookup(const struct ctx *c, uint8_t proto, uint8_t pif, + int af, const void *eaddr, const void *faddr, + in_port_t eport, in_port_t fport); + union flow; void flow_init(void); @@ -180,4 +183,6 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) flow_dbg((f), __VA_ARGS__); \ } while (0) +void flow_init(void); + #endif /* FLOW_H */ diff --git a/tcp.c b/tcp.c index a7907f2..b6d046f 100644 --- a/tcp.c +++ b/tcp.c @@ -308,9 +308,6 @@ #define TCP_FRAMES \ (c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1) -#define TCP_HASH_TABLE_LOAD 70 /* % */ -#define TCP_HASH_TABLE_SIZE (FLOW_MAX * 100 / TCP_HASH_TABLE_LOAD) - #define MAX_WS 8 #define MAX_WINDOW (1 << (16 + (MAX_WS))) @@ -575,12 +572,6 @@ static unsigned int tcp6_l2_flags_buf_used; #define CONN(idx) (&(FLOW(idx)->tcp)) -/* Table for lookup from flowside information */ -static flow_sidx_t tc_hash[TCP_HASH_TABLE_SIZE]; - -static_assert(ARRAY_SIZE(tc_hash) >= FLOW_MAX, - "Safe linear probing requires hash table larger than connection table"); - /* Pools for pre-opened sockets (in init) */ int init_sock_pool4 [TCP_SOCK_POOL_SIZE]; int init_sock_pool6 [TCP_SOCK_POOL_SIZE]; @@ -773,9 +764,6 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn, tcp_timer_ctl(c, conn); } -static void tcp_hash_remove(const struct ctx *c, - const struct tcp_tap_conn *conn); - /** * conn_event_do() - Set and log connection events, update epoll state * @c: Execution context @@ -821,7 +809,7 @@ static void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, num == -1 ? "CLOSED" : tcp_event_str[num]); if (event == CLOSED) - tcp_hash_remove(c, conn); + flow_hash_remove(c, FLOW_SIDX(conn, TAPSIDE)); else if ((event == TAP_FIN_RCVD) && !(conn->events & SOCK_FIN_RCVD)) conn_flag(c, conn, ACTIVE_CLOSE); else @@ -1134,116 +1122,6 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, return -1; } -/** - * tcp_conn_hash() - Calculate hash bucket of an existing connection - * @c: Execution context - * @conn: Connection - * - * Return: hash value, needs to be adjusted for table size - */ -static uint64_t tcp_conn_hash(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - return flow_hash(c, IPPROTO_TCP, TAPFSIDE(conn)); -} - -/** - * tcp_hash_probe() - Find hash bucket for a connection - * @c: Execution context - * @conn: Connection to find bucket for - * - * Return: If @conn is in the table, its current bucket, otherwise a suitable - * free bucket for it. - */ -static inline unsigned tcp_hash_probe(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - flow_sidx_t sidx = FLOW_SIDX(conn, TAPSIDE); - unsigned b = tcp_conn_hash(c, conn) % TCP_HASH_TABLE_SIZE; - - /* Linear probing */ - while (!flow_sidx_eq(tc_hash[b], FLOW_SIDX_NONE) && - !flow_sidx_eq(tc_hash[b], sidx)) - b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - - return b; -} - -/** - * tcp_hash_insert() - Insert connection into hash table, chain link - * @c: Execution context - * @conn: Connection pointer - */ -static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn) -{ - unsigned b = tcp_hash_probe(c, conn); - - tc_hash[b] = FLOW_SIDX(conn, TAPSIDE); - flow_dbg(conn, "hash table insert: sock %i, bucket: %u", conn->sock, b); -} - -/** - * tcp_hash_remove() - Drop connection from hash table, chain unlink - * @c: Execution context - * @conn: Connection pointer - */ -static void tcp_hash_remove(const struct ctx *c, - const struct tcp_tap_conn *conn) -{ - unsigned b = tcp_hash_probe(c, conn), s; - union flow *flow = flow_at_sidx(tc_hash[b]); - - if (!flow) - return; /* Redundant remove */ - - flow_dbg(conn, "hash table remove: sock %i, bucket: %u", conn->sock, b); - - /* Scan the remainder of the cluster */ - for (s = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - (flow = flow_at_sidx(tc_hash[s])); - s = mod_sub(s, 1, TCP_HASH_TABLE_SIZE)) { - unsigned h = tcp_conn_hash(c, &flow->tcp) % TCP_HASH_TABLE_SIZE; - - if (!mod_between(h, s, b, TCP_HASH_TABLE_SIZE)) { - /* tc_hash[s] can live in tc_hash[b]'s slot */ - debug("hash table remove: shuffle %u -> %u", s, b); - tc_hash[b] = tc_hash[s]; - b = s; - } - } - - tc_hash[b] = FLOW_SIDX_NONE; -} - -/** - * tcp_hash_lookup() - Look up connection given remote address and ports - * @c: Execution context - * @af: Address family, AF_INET or AF_INET6 - * @eaddr: Guest side endpoint address (guest local address) - * @faddr: Guest side forwarding address (guest remote address) - * @eport: Guest side endpoint port (guest local port) - * @fport: Guest side forwarding port (guest remote port) - * - * Return: connection pointer, if found, -ENOENT otherwise - */ -static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, int af, - const void *eaddr, const void *faddr, - in_port_t eport, in_port_t fport) -{ - struct flowside fside; - union flow *flow; - unsigned b; - - flowside_from_af(&fside, PIF_TAP, af, faddr, fport, eaddr, eport); - - b = flow_hash(c, IPPROTO_TCP, &fside) % TCP_HASH_TABLE_SIZE; - while ((flow = flow_at_sidx(tc_hash[b])) && - !flowside_eq(&flow->f.side[TAPSIDE], &fside)) - b = mod_sub(b, 1, TCP_HASH_TABLE_SIZE); - - return &flow->tcp; -} - /** * tcp_flow_defer() - Deferred per-flow handling (clean up closed connections) * @flow: Flow table entry for this connection @@ -1949,7 +1827,7 @@ static void tcp_conn_from_tap(struct ctx *c, tcp_seq_init(c, conn, now); conn->seq_ack_from_tap = conn->seq_to_tap; - tcp_hash_insert(c, conn); + flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE)); if (!bind(s, sa, sl)) { tcp_rst(c, conn); /* Nobody is listening then */ @@ -2452,6 +2330,8 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af, struct tcp_tap_conn *conn; size_t optlen, len; struct tcphdr *th; + union flow *flow; + flow_sidx_t sidx; int ack_due = 0; char *opts; int count; @@ -2468,17 +2348,22 @@ int tcp_tap_handler(struct ctx *c, uint8_t pif, int af, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, idx, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, saddr, daddr, - ntohs(th->source), ntohs(th->dest)); + sidx = flow_hash_lookup(c, IPPROTO_TCP, PIF_TAP, af, saddr, daddr, + ntohs(th->source), ntohs(th->dest)); + flow = flow_at_sidx(sidx); /* New connection from tap */ - if (!conn) { + if (!flow) { if (opts && th->syn && !th->ack) tcp_conn_from_tap(c, af, saddr, daddr, th, opts, optlen, now); return 1; } + ASSERT(flow->f.type == FLOW_TCP); + ASSERT(sidx.side == TAPSIDE); + conn = &flow->tcp; + flow_trace(conn, "packet length %zu from tap", len); if (th->rst) { @@ -2660,7 +2545,7 @@ static void tcp_tap_conn_from_sock(struct ctx *c, ASSERT(flow_complete(&conn->f)); tcp_seq_init(c, conn, now); - tcp_hash_insert(c, conn); + flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE)); conn->seq_ack_from_tap = conn->seq_to_tap; @@ -3032,11 +2917,6 @@ static void tcp_sock_refill_init(const struct ctx *c) */ int tcp_init(struct ctx *c) { - unsigned b; - - for (b = 0; b < TCP_HASH_TABLE_SIZE; b++) - tc_hash[b] = FLOW_SIDX_NONE; - if (c->ifi4) tcp_sock4_iov_init(c); -- 2.43.0

David Gibson

8:02 a.m.

New subject: [PATCH v3 09/15] tcp: Re-use flow hash for initial sequence number generation

We generate TCP initial sequence numbers, when we need them, from a hash of the source and destination addresses and ports, plus a timestamp. Moments later, we generate another hash of the same information, plus a few extras to slot the connection into the flow table. With some tweaks to the flow_hash_insert() interface and changing the order we can re-use that hash table hash for the initial sequence number, rather than calculating another one. It won't generate identical results, but that doesn't matter as long as the sequence numbers are well scattered. Signed-off-by: David Gibson --- flow.c | 30 ++++++++++++++++++++++++------ flow.h | 2 +- tcp.c | 35 +++++++++++++---------------------- 3 files changed, 38 insertions(+), 29 deletions(-) diff --git a/flow.c b/flow.c index 3d016c3..e4ea931 100644 --- a/flow.c +++ b/flow.c @@ -281,16 +281,16 @@ static uint64_t flow_sidx_hash(const struct ctx *c, flow_sidx_t sidx) } /** - * flow_hash_probe() - Find hash bucket for a flow - * @c: Execution context + * flow_hash_probe_() - Find hash bucket for a flow, given hash + * @hash: Raw hash value for flow & side * @sidx: Flow and side to find bucket for * * Return: If @sidx is in the hash table, its current bucket, otherwise a * suitable free bucket for it. */ -static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +static inline unsigned flow_hash_probe_(uint64_t hash, flow_sidx_t sidx) { - unsigned b = flow_sidx_hash(c, sidx) % FLOW_HASH_SIZE; + unsigned b = hash % FLOW_HASH_SIZE; /* Linear probing */ while (!flow_sidx_eq(flow_hashtab[b], FLOW_SIDX_NONE) && @@ -300,17 +300,35 @@ static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) return b; } +/** + * flow_hash_probe() - Find hash bucket for a flow + * @c: Execution context + * @sidx: Flow and side to find bucket for + * + * Return: If @sidx is in the hash table, its current bucket, otherwise a + * suitable free bucket for it. + */ +static inline unsigned flow_hash_probe(const struct ctx *c, flow_sidx_t sidx) +{ + return flow_hash_probe_(flow_sidx_hash(c, sidx), sidx); +} + /** * flow_hash_insert() - Insert side of a flow into into hash table * @c: Execution context * @sidx: Flow & side index + * + * Return: raw (un-modded) hash value of side of flow */ -void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) +uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx) { - unsigned b = flow_hash_probe(c, sidx); + uint64_t hash = flow_sidx_hash(c, sidx); + unsigned b = flow_hash_probe_(hash, sidx); flow_hashtab[b] = sidx; flow_dbg(flow_at_sidx(sidx), "hash table insert: bucket: %u", b); + + return hash; } /** diff --git a/flow.h b/flow.h index adb5d44..e1aeeed 100644 --- a/flow.h +++ b/flow.h @@ -158,7 +158,7 @@ static inline bool flow_sidx_eq(flow_sidx_t a, flow_sidx_t b) return (a.flow == b.flow) && (a.side == b.side); } -void flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); +uint64_t flow_hash_insert(const struct ctx *c, flow_sidx_t sidx); void flow_hash_remove(const struct ctx *c, flow_sidx_t sidx); flow_sidx_t flow_hash_lookup(const struct ctx *c, uint8_t proto, uint8_t pif, int af, const void *eaddr, const void *faddr, diff --git a/tcp.c b/tcp.c index b6d046f..008eb0d 100644 --- a/tcp.c +++ b/tcp.c @@ -1578,28 +1578,16 @@ static void tcp_tap_window_update(struct tcp_tap_conn *conn, unsigned wnd) } /** - * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 - * @c: Execution context - * @conn: TCP connection, with faddr, fport, eaddr, eport populated + * tcp_init_seq() - Calculate initial sequence number according to RFC 6528 + * @hash: Hash of connection details * @now: Current timestamp */ -static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, - const struct timespec *now) +static uint32_t tcp_init_seq(uint64_t hash, const struct timespec *now) { - struct siphash_state state = SIPHASH_INIT(c->hash_secret); - uint64_t hash; - uint32_t ns; - - inany_siphash_feed(&state, &TAPFSIDE(conn)->faddr); - inany_siphash_feed(&state, &TAPFSIDE(conn)->eaddr); - hash = siphash_final(&state, 36, - (uint64_t)TAPFSIDE(conn)->fport << 16 | - TAPFSIDE(conn)->eport); - /* 32ns ticks, overflows 32 bits every 137s */ - ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; + uint32_t ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; - conn->seq_to_tap = ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; + return ((uint32_t)(hash >> 32) ^ (uint32_t)hash) + ns; } /** @@ -1758,6 +1746,7 @@ static void tcp_conn_from_tap(struct ctx *c, const struct sockaddr *sa; struct tcp_tap_conn *conn; union flow *flow; + uint64_t hash; socklen_t sl; int s, mss; @@ -1824,10 +1813,10 @@ static void tcp_conn_from_tap(struct ctx *c, conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; - tcp_seq_init(c, conn, now); - conn->seq_ack_from_tap = conn->seq_to_tap; + hash = flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE)); - flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE)); + conn->seq_to_tap = tcp_init_seq(hash, now); + conn->seq_ack_from_tap = conn->seq_to_tap; if (!bind(s, sa, sl)) { tcp_rst(c, conn); /* Nobody is listening then */ @@ -2518,6 +2507,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, struct tcp_tap_conn *conn, int s, const struct timespec *now) { + uint64_t hash; + ASSERT(flowside_complete(SOCKFSIDE(conn))); conn->f.type = FLOW_TCP; @@ -2544,8 +2535,8 @@ static void tcp_tap_conn_from_sock(struct ctx *c, FLOW_FWD_DBG(conn, TAPSIDE); ASSERT(flow_complete(&conn->f)); - tcp_seq_init(c, conn, now); - flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE)); + hash = flow_hash_insert(c, FLOW_SIDX(conn, TAPSIDE)); + conn->seq_to_tap = tcp_init_seq(hash, now); conn->seq_ack_from_tap = conn->seq_to_tap; -- 2.43.0

David Gibson

8:02 a.m.

New subject: [PATCH v3 10/15] icmp: Store ping socket information in the flow table

Currently icmp_id_map[][] stores information about ping sockets in a bespoke structure. Move the same information into a new type of flow that lives in the flow table. To match that change, replace the existing ICMP timer with a flow-based timer for expiring ping sockets. This has the advantage that we only need to scan the active flows, not all possible ids. For now we still index the flow table entries with icmp_id_map[][], and we don't populate all the common flow information. This is kind of an abuse of the flow table, but it's a useful interim step towards full flow table integration for ICMP. Signed-off-by: David Gibson --- Makefile | 6 +-- flow.c | 9 ++++ flow.h | 4 ++ flow_table.h | 2 + icmp.c | 127 +++++++++++++++++++++++---------------------------- icmp.h | 2 +- icmp_flow.h | 31 +++++++++++++ passt.c | 5 -- 8 files changed, 106 insertions(+), 80 deletions(-) create mode 100644 icmp_flow.h diff --git a/Makefile b/Makefile index af4fa87..610ab96 100644 --- a/Makefile +++ b/Makefile @@ -54,9 +54,9 @@ SRCS = $(PASST_SRCS) $(QRAP_SRCS) MANPAGES = passt.1 pasta.1 qrap.1 PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h \ - flow_table.h icmp.h inany.h isolation.h lineread.h log.h ndp.h \ - netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h siphash.h \ - tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h + flow_table.h icmp.h icmp_flow.h inany.h isolation.h lineread.h log.h \ + ndp.h netlink.h packet.h passt.h pasta.h pcap.h pif.h port_fwd.h \ + siphash.h tap.h tcp.h tcp_conn.h tcp_splice.h udp.h util.h HEADERS = $(PASST_HEADERS) seccomp.h C := \#include \nstruct tcp_info x = { .tcpi_snd_wnd = 0 }; diff --git a/flow.c b/flow.c index e4ea931..ef146dc 100644 --- a/flow.c +++ b/flow.c @@ -22,6 +22,8 @@ const char *flow_type_str[] = { [FLOW_TYPE_NONE] = "<none>", [FLOW_TCP] = "TCP connection", [FLOW_TCP_SPLICE] = "TCP connection (spliced)", + [FLOW_PING4] = "ICMP ping sequence", + [FLOW_PING6] = "ICMPv6 ping sequence", }; static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, "flow_type_str[] doesn't match enum flow_type"); @@ -29,6 +31,8 @@ static_assert(ARRAY_SIZE(flow_type_str) == FLOW_NUM_TYPES, const uint8_t flow_proto[] = { [FLOW_TCP] = IPPROTO_TCP, [FLOW_TCP_SPLICE] = IPPROTO_TCP, + [FLOW_PING4] = IPPROTO_ICMP, + [FLOW_PING6] = IPPROTO_ICMPV6, }; static_assert(ARRAY_SIZE(flow_proto) == FLOW_NUM_TYPES, "flow_proto[] doesn't match enum flow_type"); @@ -438,6 +442,11 @@ void flow_defer_handler(const struct ctx *c, const struct timespec *now) if (!closed && timer) tcp_splice_timer(c, flow); break; + case FLOW_PING4: + case FLOW_PING6: + if (timer) + closed = icmp_ping_timer(c, flow, now); + break; default: /* Assume other flow types don't need any handling */ ; diff --git a/flow.h b/flow.h index e1aeeed..da2a629 100644 --- a/flow.h +++ b/flow.h @@ -19,6 +19,10 @@ enum flow_type { FLOW_TCP, /* A TCP connection between a host socket and ns socket */ FLOW_TCP_SPLICE, + /* ICMP echo requests from guest to host and matching replies back */ + FLOW_PING4, + /* ICMPv6 echo requests from guest to host and matching replies back */ + FLOW_PING6, FLOW_NUM_TYPES, }; diff --git a/flow_table.h b/flow_table.h index 34fc679..091b430 100644 --- a/flow_table.h +++ b/flow_table.h @@ -8,6 +8,7 @@ #define FLOW_TABLE_H #include "tcp_conn.h" +#include "icmp_flow.h" /** * struct flow_free_block - Information about a block of free entries @@ -33,6 +34,7 @@ union flow { struct flow_free_block free; struct tcp_tap_conn tcp; struct tcp_splice_conn tcp_splice; + struct icmp_ping_flow ping; }; /* Global Flow Table */ diff --git a/icmp.c b/icmp.c index d669351..27b150d 100644 --- a/icmp.c +++ b/icmp.c @@ -37,24 +37,15 @@ #include "tap.h" #include "log.h" #include "icmp.h" +#include "flow_table.h" #define ICMP_ECHO_TIMEOUT 60 /* s, timeout for ICMP socket activity */ #define ICMP_NUM_IDS (1U << 16) -/** - * struct icmp_id_sock - Tracking information for single ICMP echo identifier - * @sock: Bound socket for identifier - * @seq: Last sequence number sent to tap, host order, -1: not sent yet - * @ts: Last associated activity from tap, seconds - */ -struct icmp_id_sock { - int sock; - int seq; - time_t ts; -}; +#define PINGF(idx) (&(FLOW(idx)->ping)) /* Indexed by ICMP echo identifier */ -static struct icmp_id_sock icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; +static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; /** * icmp_sock_handler() - Handle new data from ICMP or ICMPv6 socket @@ -64,8 +55,8 @@ static struct icmp_id_sock icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; */ void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref) { - struct icmp_id_sock *const id_map = af == AF_INET - ? &icmp_id_map[V4][ref.icmp.id] : &icmp_id_map[V6][ref.icmp.id]; + struct icmp_ping_flow *pingf = af == AF_INET + ? icmp_id_map[V4][ref.icmp.id] : icmp_id_map[V6][ref.icmp.id]; const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6"; char buf[USHRT_MAX]; union { @@ -80,6 +71,8 @@ void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref) if (c->no_icmp) return; + ASSERT(pingf); + n = recvfrom(ref.fd, buf, sizeof(buf), 0, &sr.sa, &sl); if (n < 0) { warn("%s: recvfrom() error on ping socket: %s", @@ -113,10 +106,10 @@ void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref) } if (c->mode == MODE_PASTA) { - if (id_map->seq == seq) + if (pingf->seq == seq) return; - id_map->seq = seq; + pingf->seq = seq; } debug("%s: echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, pname, @@ -133,16 +126,19 @@ unexpected: } /** - * icmp_ping_close() - Close out and cleanup a ping sequence + * icmp_ping_close() - Close out and cleanup a ping flow * @c: Execution context - * @id_map: id map entry of the sequence to close + * @pingf: ping flow entry to close */ -static void icmp_ping_close(const struct ctx *c, struct icmp_id_sock *id_map) +static void icmp_ping_close(const struct ctx *c, struct icmp_ping_flow *pingf) { - epoll_ctl(c->epollfd, EPOLL_CTL_DEL, id_map->sock, NULL); - close(id_map->sock); - id_map->sock = -1; - id_map->seq = -1; + epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); + close(pingf->sock); + + if (pingf->f.type == FLOW_PING4) + icmp_id_map[V4][pingf->id] = NULL; + else + icmp_id_map[V6][pingf->id] = NULL; } /** @@ -152,18 +148,24 @@ static void icmp_ping_close(const struct ctx *c, struct icmp_id_sock *id_map) * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new sequence * - * Return: Newly opened ping socket fd, or -1 on failure + * Return: Newly opened ping flow, or NULL on failure */ -static int icmp_ping_new(const struct ctx *c, struct icmp_id_sock *id_map, - int af, uint16_t id) +static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, + struct icmp_ping_flow **id_map, + int af, uint16_t id) { - uint8_t proto = af == AF_INET ? IPPROTO_ICMP : IPPROTO_ICMPV6; const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6"; + uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; union icmp_epoll_ref iref = { .id = id }; + union flow *flow = flow_alloc(); + struct icmp_ping_flow *pingf; const void *bind_addr; const char *bind_if; int s; + if (!flow) + return NULL; + if (af == AF_INET) { bind_addr = &c->ip4.addr_out; bind_if = c->ip4.ifname_out; @@ -172,7 +174,7 @@ static int icmp_ping_new(const struct ctx *c, struct icmp_id_sock *id_map, bind_if = c->ip6.ifname_out; } - s = sock_l4(c, af, proto, bind_addr, bind_if, 0, iref.u32); + s = sock_l4(c, af, flow_proto[flowtype], bind_addr, bind_if, 0, iref.u32); if (s < 0) { warn("Cannot open \"ping\" socket. You might need to:"); @@ -184,16 +186,23 @@ static int icmp_ping_new(const struct ctx *c, struct icmp_id_sock *id_map, if (s > FD_REF_MAX) goto cancel; - id_map->sock = s; + pingf = &flow->ping; + pingf->f.type = flowtype; + pingf->seq = -1; + pingf->sock = s; + pingf->id = id; + + *id_map = pingf; debug("%s: new socket %i for echo ID %"PRIu16, pname, s, id); - return s; + return pingf; cancel: if (s >= 0) close(s); - return -1; + flow_alloc_cancel(flow); + return NULL; } /** @@ -219,11 +228,10 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, struct sockaddr_in6 sa6; } sa = { .sa.sa_family = af }; const socklen_t sl = af == AF_INET ? sizeof(sa.sa4) : sizeof(sa.sa6); - struct icmp_id_sock *id_map; + struct icmp_ping_flow *pingf, **id_map; uint16_t id, seq; size_t plen; void *pkt; - int s; (void)saddr; (void)pif; @@ -263,13 +271,13 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, ASSERT(0); } - if ((s = id_map->sock) < 0) - if ((s = icmp_ping_new(c, id_map, af, id)) < 0) + if (!(pingf = *id_map)) + if (!(pingf = icmp_ping_new(c, id_map, af, id))) return 1; - id_map->ts = now->tv_sec; + pingf->ts = now->tv_sec; - if (sendto(s, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { + if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); } else { @@ -281,44 +289,21 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, } /** - * icmp_timer_one() - Handler for timed events related to a given identifier - * @c: Execution context - * @id_map: id socket information to check timers for - * @now: Current timestamp - */ -static void icmp_timer_one(const struct ctx *c, struct icmp_id_sock *id_map, - const struct timespec *now) -{ - if (id_map->sock < 0 || now->tv_sec - id_map->ts <= ICMP_ECHO_TIMEOUT) - return; - - icmp_ping_close(c, id_map); -} - -/** - * icmp_timer() - Scan activity bitmap for identifiers with timed events + * icmp_ping_timer() - Handler for timed events related to a given flow * @c: Execution context + * @flow: flow table entry to check for timeout * @now: Current timestamp + * + * Return: true if the flow is ready to free, false otherwise */ -void icmp_timer(const struct ctx *c, const struct timespec *now) +bool icmp_ping_timer(const struct ctx *c, union flow *flow, + const struct timespec *now) { - unsigned int i; - - for (i = 0; i < ICMP_NUM_IDS; i++) { - icmp_timer_one(c, &icmp_id_map[V4][i], now); - icmp_timer_one(c, &icmp_id_map[V6][i], now); - } -} + struct icmp_ping_flow *pingf = &flow->ping; -/** - * icmp_init() - Initialise sequences in ID map to -1 (no sequence sent yet) - */ -void icmp_init(void) -{ - unsigned i; + if (now->tv_sec - pingf->ts <= ICMP_ECHO_TIMEOUT) + return false; - for (i = 0; i < ICMP_NUM_IDS; i++) { - icmp_id_map[V4][i].seq = icmp_id_map[V6][i].seq = -1; - icmp_id_map[V4][i].sock = icmp_id_map[V6][i].sock = -1; - } + icmp_ping_close(c, pingf); + return true; } diff --git a/icmp.h b/icmp.h index 0083597..2897f31 100644 --- a/icmp.h +++ b/icmp.h @@ -9,12 +9,12 @@ #define ICMP_TIMER_INTERVAL 10000 /* ms */ struct ctx; +struct icmp_ping_flow; void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref); int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now); -void icmp_timer(const struct ctx *c, const struct timespec *now); void icmp_init(void); /** diff --git a/icmp_flow.h b/icmp_flow.h new file mode 100644 index 0000000..7b7f4ee --- /dev/null +++ b/icmp_flow.h @@ -0,0 +1,31 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later + * Copyright Red Hat + * Author: David Gibson + * + * ICMP flow tracking data structures + */ +#ifndef ICMP_FLOW_H +#define ICMP_FLOW_H + +/** + * struct icmp_ping_flow - Descriptor for a flow of ping requests/replies + * @f: Generic flow information + * @seq: Last sequence number sent to tap, host order, -1: not sent yet + * @sock: "ping" socket + * @ts: Last associated activity from tap, seconds + * @id: Guest side ICMP ping id for this flow + */ +struct icmp_ping_flow { + /* Must be first element */ + struct flow_common f; + + int seq; + int sock; + time_t ts; + uint16_t id; +}; + +bool icmp_ping_timer(const struct ctx *c, union flow *flow, + const struct timespec *now); + +#endif /* ICMP_FLOW_H */ diff --git a/passt.c b/passt.c index 44d3a0b..9191404 100644 --- a/passt.c +++ b/passt.c @@ -105,8 +105,6 @@ static void post_handler(struct ctx *c, const struct timespec *now) CALL_PROTO_HANDLER(c, now, tcp, TCP); /* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */ CALL_PROTO_HANDLER(c, now, udp, UDP); - /* NOLINTNEXTLINE(bugprone-branch-clone): intervals can be the same */ - CALL_PROTO_HANDLER(c, now, icmp, ICMP); flow_defer_handler(c, now); #undef CALL_PROTO_HANDLER @@ -290,9 +288,6 @@ int main(int argc, char **argv) if ((!c.no_udp && udp_init(&c)) || (!c.no_tcp && tcp_init(&c))) exit(EXIT_FAILURE); - if (!c.no_icmp) - icmp_init(); - proto_update_l2_buf(c.mac_guest, c.mac); if (c.ifi4 && !c.no_dhcp) -- 2.43.0

David Gibson

8:02 a.m.

New subject: [PATCH v3 11/15] icmp: Populate guest side information for ping flows

ping flows now have flow table entries, but we don't populate most of the common information yet. Start doing this by populating the guest side addresses. We set the "port" for both ends of the flow to the ICMP id. With that populated, rather than looking up the correct flow in the icmp_id_map[] array, we can use the flow hash table. Furthermore, we can use the addresses stored in the flow table to direct returning packets, without having to rely on tap_ip[46]_daddr() to reobtain the guest address. This marks the last user of tap_ip4_daddr() so it is also removed. Signed-off-by: David Gibson --- icmp.c | 48 +++++++++++++++++++++++++++++++++++++++--------- tap.c | 11 ----------- tap.h | 1 - 3 files changed, 39 insertions(+), 21 deletions(-) diff --git a/icmp.c b/icmp.c index 27b150d..53ad087 100644 --- a/icmp.c +++ b/icmp.c @@ -42,8 +42,15 @@ #define ICMP_ECHO_TIMEOUT 60 /* s, timeout for ICMP socket activity */ #define ICMP_NUM_IDS (1U << 16) +/* Sides of a flow as we use them for ping streams */ +#define SOCKSIDE 0 +#define TAPSIDE 1 + #define PINGF(idx) (&(FLOW(idx)->ping)) +#define TAPFSIDE(pingf) (&(pingf)->f.side[TAPSIDE]) +#define SOCKFSIDE(pingf) (&(pingf)->f.side[SOCKSIDE]) + /* Indexed by ICMP echo identifier */ static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; @@ -114,11 +121,17 @@ void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref) debug("%s: echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, pname, ref.icmp.id, seq); - if (af == AF_INET) - tap_icmp4_send(c, sr.sa4.sin_addr, tap_ip4_daddr(c), buf, n); - else if (af == AF_INET6) - tap_icmp6_send(c, &sr.sa6.sin6_addr, - tap_ip6_daddr(c, &sr.sa6.sin6_addr), buf, n); + if (af == AF_INET) { + const struct in_addr *saddr = inany_v4(&TAPFSIDE(pingf)->faddr); + const struct in_addr *daddr = inany_v4(&TAPFSIDE(pingf)->eaddr); + + ASSERT(saddr && daddr); /* Must have IPv4 addresses */ + tap_icmp4_send(c, *saddr, *daddr, buf, n); + } else if (af == AF_INET6) { + const struct in6_addr *saddr = &TAPFSIDE(pingf)->faddr.a6; + const struct in6_addr *daddr = &TAPFSIDE(pingf)->eaddr.a6; + tap_icmp6_send(c, saddr, daddr, buf, n); + } return; unexpected: @@ -134,6 +147,7 @@ static void icmp_ping_close(const struct ctx *c, struct icmp_ping_flow *pingf) { epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); + flow_hash_remove(c, FLOW_SIDX(pingf, TAPSIDE)); if (pingf->f.type == FLOW_PING4) icmp_id_map[V4][pingf->id] = NULL; @@ -147,12 +161,15 @@ static void icmp_ping_close(const struct ctx *c, struct icmp_ping_flow *pingf) * @id_map: id map entry of the sequence to open * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new sequence + * @saddr: Source address + * @daddr: Destination address * * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, struct icmp_ping_flow **id_map, - int af, uint16_t id) + int af, uint16_t id, + const void *saddr, const void *daddr) { const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6"; uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; @@ -196,6 +213,10 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, debug("%s: new socket %i for echo ID %"PRIu16, pname, s, id); + flowside_from_af(TAPFSIDE(pingf), PIF_TAP, af, daddr, id, saddr, id); + FLOW_NEW_DBG(pingf, TAPSIDE); + flow_hash_insert(c, FLOW_SIDX(pingf, TAPSIDE)); + return pingf; cancel: @@ -229,7 +250,9 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, } sa = { .sa.sa_family = af }; const socklen_t sl = af == AF_INET ? sizeof(sa.sa4) : sizeof(sa.sa6); struct icmp_ping_flow *pingf, **id_map; + union flow *flow; uint16_t id, seq; + uint8_t proto; size_t plen; void *pkt; @@ -249,6 +272,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (ih->type != ICMP_ECHO) return 1; + proto = IPPROTO_ICMP; id = ntohs(ih->un.echo.id); id_map = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); @@ -262,6 +286,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (ih->icmp6_type != ICMPV6_ECHO_REQUEST) return 1; + proto = IPPROTO_ICMPV6; id = ntohs(ih->icmp6_identifier); id_map = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); @@ -271,10 +296,15 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, ASSERT(0); } - if (!(pingf = *id_map)) - if (!(pingf = icmp_ping_new(c, id_map, af, id))) - return 1; + flow = flow_at_sidx(flow_hash_lookup(c, proto, PIF_TAP, + af, saddr, daddr, id, id)); + + if (flow) + pingf = &flow->ping; + else if (!(pingf = icmp_ping_new(c, id_map, af, id, saddr, daddr))) + return 1; + ASSERT(flow_proto[pingf->f.type] == proto); pingf->ts = now->tv_sec; if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { diff --git a/tap.c b/tap.c index 2ceda8d..31909c3 100644 --- a/tap.c +++ b/tap.c @@ -89,17 +89,6 @@ int tap_send(const struct ctx *c, const void *data, size_t len) return write(c->fd_tap, (char *)data, len); } -/** - * tap_ip4_daddr() - Normal IPv4 destination address for inbound packets - * @c: Execution context - * - * Return: IPv4 address, network order - */ -struct in_addr tap_ip4_daddr(const struct ctx *c) -{ - return c->ip4.addr_seen; -} - /** * tap_ip6_daddr() - Normal IPv6 destination address for inbound packets * @c: Execution context diff --git a/tap.h b/tap.h index 466d914..7a8619d 100644 --- a/tap.h +++ b/tap.h @@ -57,7 +57,6 @@ static inline size_t tap_iov_len(const struct ctx *c, struct tap_hdr *taph, return plen + tap_hdr_len_(c); } -struct in_addr tap_ip4_daddr(const struct ctx *c); void tap_udp4_send(const struct ctx *c, struct in_addr src, in_port_t sport, struct in_addr dst, in_port_t dport, const void *in, size_t len); -- 2.43.0

David Gibson

8:02 a.m.

New subject: [PATCH v3 12/15] icmp: Populate and use host side flow information

Complete filling out the common flow information for "ping" flows by storing the host side information for the ping socket. We can only retrieve this from the socket after we send the echo-request, because that's when the kernel will assign an ID. Signed-off-by: David Gibson --- icmp.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-) diff --git a/icmp.c b/icmp.c index 53ad087..917c810 100644 --- a/icmp.c +++ b/icmp.c @@ -310,11 +310,41 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); - } else { - debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, - pname, id, seq); + if (flow) + goto cancel; + } + + debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, + pname, id, seq); + + if (!flow) + /* Nothing more to do for an existing flow */ + return 1; + + /* We need to wait until after the sendto() to fill in the SOCKSIDE + * information, so that we can find out the host side id the kernel + * assigned. If there's no bind address specified, this will still have + * 0.0.0.0 or :: as the host side forwarding address. There's not + * really anything we can do to fill that in, which means we can never + * insert the SOCKSIDE of a ping flow into the hash table. + */ + if (flowside_from_sock(SOCKFSIDE(pingf), PIF_HOST, pingf->sock, + NULL, &sa) < 0) { + err("%s: Failed to get local name for outgoing ping socket", + pname); + goto cancel; } + /* We want the id as the "port" on both sides */ + SOCKFSIDE(pingf)->eport = SOCKFSIDE(pingf)->fport; + + FLOW_FWD_DBG(pingf, SOCKSIDE); + + return 1; +cancel: + /* Something went wrong, back out creation of the flow */ + icmp_ping_close(c, pingf); + flow_alloc_cancel(flow); return 1; } -- 2.43.0

Stefano Brivio

17 Jan 17 Jan

8:59 p.m.

New subject: [PATCH v3 12/15] icmp: Populate and use host side flow information

On Thu, 21 Dec 2023 18:02:34 +1100 David Gibson wrote:

...

Complete filling out the common flow information for "ping" flows by storing the host side information for the ping socket. We can only retrieve this from the socket after we send the echo-request, because that's when the kernel will assign an ID.

Signed-off-by: David Gibson --- icmp.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/icmp.c b/icmp.c index 53ad087..917c810 100644 --- a/icmp.c +++ b/icmp.c @@ -310,11 +310,41 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); - } else { - debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, - pname, id, seq); + if (flow) + goto cancel; + } + + debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, + pname, id, seq); + + if (!flow) + /* Nothing more to do for an existing flow */ + return 1; + + /* We need to wait until after the sendto() to fill in the SOCKSIDE + * information, so that we can find out the host side id the kernel + * assigned. If there's no bind address specified, this will still have + * 0.0.0.0 or :: as the host side forwarding address. There's not + * really anything we can do to fill that in, which means we can never + * insert the SOCKSIDE of a ping flow into the hash table. + */

Well, if it was just because of that we could argue at some point that, with wildcards or suchlike, we could insert that in the hash table as well. But that wouldn't make sense anyway, right? That is, unless I'm missing something, we would have no use for a hash table lookup of the SOCKSIDE of a ping flow anyway. Patches 13/15 to 15/15 all look good to me. -- Stefano

David Gibson

18 Jan 18 Jan

2:22 a.m.

New subject: [PATCH v3 12/15] icmp: Populate and use host side flow information

On Wed, Jan 17, 2024 at 08:59:41PM +0100, Stefano Brivio wrote:

...

On Thu, 21 Dec 2023 18:02:34 +1100 David Gibson wrote:

...
Complete filling out the common flow information for "ping" flows by storing the host side information for the ping socket. We can only retrieve this from the socket after we send the echo-request, because that's when the kernel will assign an ID.

Signed-off-by: David Gibson --- icmp.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/icmp.c b/icmp.c index 53ad087..917c810 100644 --- a/icmp.c +++ b/icmp.c @@ -310,11 +310,41 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); - } else { - debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, - pname, id, seq); + if (flow) + goto cancel; + } + + debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, + pname, id, seq); + + if (!flow) + /* Nothing more to do for an existing flow */ + return 1; + + /* We need to wait until after the sendto() to fill in the SOCKSIDE + * information, so that we can find out the host side id the kernel + * assigned. If there's no bind address specified, this will still have + * 0.0.0.0 or :: as the host side forwarding address. There's not + * really anything we can do to fill that in, which means we can never + * insert the SOCKSIDE of a ping flow into the hash table. + */

Well, if it was just because of that we could argue at some point that, with wildcards or suchlike, we could insert that in the hash table as well.

If we allow wildcards, it's probably not a hash table any more, or at least not a normal one. Hashing is pretty inherently tied to a single value lookup.

...

But that wouldn't make sense anyway, right? That is, unless I'm missing something, we would have no use for a hash table lookup of the SOCKSIDE of a ping flow anyway.

That's correct, which is why we can get away with this. Likewise SOCKSIDE of tcp flows is also never hashed. However in some cases we may want to hash the host sides of flows: I think we'll want to for UDP, for example, because we can receive multiple flows via the same bound socket. That's why I want to be very clear about when and why we're constructing a flowside that can't be hashed. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

Stefano Brivio

4:43 p.m.

New subject: [PATCH v3 12/15] icmp: Populate and use host side flow information

On Thu, 18 Jan 2024 12:22:29 +1100 David Gibson wrote:

...

On Wed, Jan 17, 2024 at 08:59:41PM +0100, Stefano Brivio wrote:

...
On Thu, 21 Dec 2023 18:02:34 +1100 David Gibson wrote:

...
Complete filling out the common flow information for "ping" flows by storing the host side information for the ping socket. We can only retrieve this from the socket after we send the echo-request, because that's when the kernel will assign an ID.

Signed-off-by: David Gibson --- icmp.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/icmp.c b/icmp.c index 53ad087..917c810 100644 --- a/icmp.c +++ b/icmp.c @@ -310,11 +310,41 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); - } else { - debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, - pname, id, seq); + if (flow) + goto cancel; + } + + debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, + pname, id, seq); + + if (!flow) + /* Nothing more to do for an existing flow */ + return 1; + + /* We need to wait until after the sendto() to fill in the SOCKSIDE + * information, so that we can find out the host side id the kernel + * assigned. If there's no bind address specified, this will still have + * 0.0.0.0 or :: as the host side forwarding address. There's not + * really anything we can do to fill that in, which means we can never + * insert the SOCKSIDE of a ping flow into the hash table. + */

Well, if it was just because of that we could argue at some point that, with wildcards or suchlike, we could insert that in the hash table as well.

If we allow wildcards, it's probably not a hash table any more, or at least not a normal one. Hashing is pretty inherently tied to a single value lookup.

Well, unless you consider 0.0.0.0 / :: as the actual value (which makes it a wildcard, but not in the hash table sense). I'm not suggesting we should, though (at least for the moment being).

...

...
But that wouldn't make sense anyway, right? That is, unless I'm missing something, we would have no use for a hash table lookup of the SOCKSIDE of a ping flow anyway.

That's correct, which is why we can get away with this. Likewise SOCKSIDE of tcp flows is also never hashed. However in some cases we may want to hash the host sides of flows: I think we'll want to for UDP, for example, because we can receive multiple flows via the same bound socket. That's why I want to be very clear about when and why we're constructing a flowside that can't be hashed.

The comment makes this sound like a compromise on functionality, or a problem we have (and that's what mostly caught my attention: can we "solve" this?). Maybe you could add something like "(not that there's any use for it)" at the end, or similar. -- Stefano

David Gibson

19 Jan 19 Jan

12:58 a.m.

New subject: [PATCH v3 12/15] icmp: Populate and use host side flow information

On Thu, Jan 18, 2024 at 04:43:05PM +0100, Stefano Brivio wrote:

...

On Thu, 18 Jan 2024 12:22:29 +1100 David Gibson wrote:

...
On Wed, Jan 17, 2024 at 08:59:41PM +0100, Stefano Brivio wrote:

...
On Thu, 21 Dec 2023 18:02:34 +1100 David Gibson wrote:

...
Complete filling out the common flow information for "ping" flows by storing the host side information for the ping socket. We can only retrieve this from the socket after we send the echo-request, because that's when the kernel will assign an ID.

Signed-off-by: David Gibson --- icmp.c | 36 +++++++++++++++++++++++++++++++++--- 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/icmp.c b/icmp.c index 53ad087..917c810 100644 --- a/icmp.c +++ b/icmp.c @@ -310,11 +310,41 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (sendto(pingf->sock, pkt, plen, MSG_NOSIGNAL, &sa.sa, sl) < 0) { debug("%s: failed to relay request to socket: %s", pname, strerror(errno)); - } else { - debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, - pname, id, seq); + if (flow) + goto cancel; + } + + debug("%s: echo request to socket, ID: %"PRIu16", seq: %"PRIu16, + pname, id, seq); + + if (!flow) + /* Nothing more to do for an existing flow */ + return 1; + + /* We need to wait until after the sendto() to fill in the SOCKSIDE + * information, so that we can find out the host side id the kernel + * assigned. If there's no bind address specified, this will still have + * 0.0.0.0 or :: as the host side forwarding address. There's not + * really anything we can do to fill that in, which means we can never + * insert the SOCKSIDE of a ping flow into the hash table. + */

Well, if it was just because of that we could argue at some point that, with wildcards or suchlike, we could insert that in the hash table as well.

If we allow wildcards, it's probably not a hash table any more, or at least not a normal one. Hashing is pretty inherently tied to a single value lookup.

Well, unless you consider 0.0.0.0 / :: as the actual value (which makes it a wildcard, but not in the hash table sense). I'm not suggesting we should, though (at least for the moment being).

Right, ok. So, I've been looking at what we need to do flow based NAT, and I am now leaning towards revising the existing series so that it's less insistent on "complete" flowside information, except in the case where we really are hashing. That should also let us remove many of the getsockname() calls.

...

...
...
But that wouldn't make sense anyway, right? That is, unless I'm missing something, we would have no use for a hash table lookup of the SOCKSIDE of a ping flow anyway.

That's correct, which is why we can get away with this. Likewise SOCKSIDE of tcp flows is also never hashed. However in some cases we may want to hash the host sides of flows: I think we'll want to for UDP, for example, because we can receive multiple flows via the same bound socket. That's why I want to be very clear about when and why we're constructing a flowside that can't be hashed.

The comment makes this sound like a compromise on functionality, or a problem we have (and that's what mostly caught my attention: can we "solve" this?).

Maybe you could add something like "(not that there's any use for it)" at the end, or similar.

Right, the revision I'm thinking of should make that clearer, I hope. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson

David Gibson

21 Dec 21 Dec

8:02 a.m.

New subject: [PATCH v3 13/15] icmp: Use 'flowside' epoll references for ping sockets

Currently ping sockets use a custom epoll reference type which includes the ICMP id. However, now that we have entries in the flow table for ping flows, finding that is sufficient to get everything else we want, including the id. Therefore remove the icmp_epoll_ref type and use the general 'flowside' field for ping sockets. Signed-off-by: David Gibson --- icmp.c | 18 ++++++++++-------- icmp.h | 11 ----------- passt.h | 1 - 3 files changed, 10 insertions(+), 20 deletions(-) diff --git a/icmp.c b/icmp.c index 917c810..a073813 100644 --- a/icmp.c +++ b/icmp.c @@ -62,17 +62,16 @@ static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; */ void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref) { - struct icmp_ping_flow *pingf = af == AF_INET - ? icmp_id_map[V4][ref.icmp.id] : icmp_id_map[V6][ref.icmp.id]; const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6"; + struct icmp_ping_flow *pingf = PINGF(ref.flowside.flow); char buf[USHRT_MAX]; union { struct sockaddr sa; struct sockaddr_in sa4; struct sockaddr_in6 sa6; } sr; + uint16_t id = TAPFSIDE(pingf)->eport, seq; socklen_t sl = sizeof(sr); - uint16_t seq; ssize_t n; if (c->no_icmp) @@ -96,7 +95,7 @@ void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref) goto unexpected; /* Adjust packet back to guest-side ID */ - ih4->un.echo.id = htons(ref.icmp.id); + ih4->un.echo.id = htons(id); seq = ntohs(ih4->un.echo.sequence); } else if (af == AF_INET6) { struct icmp6hdr *ih6 = (struct icmp6hdr *)buf; @@ -106,7 +105,7 @@ void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref) goto unexpected; /* Adjust packet back to guest-side ID */ - ih6->icmp6_identifier = htons(ref.icmp.id); + ih6->icmp6_identifier = htons(id); seq = ntohs(ih6->icmp6_sequence); } else { ASSERT(0); @@ -120,7 +119,7 @@ void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref) } debug("%s: echo reply to tap, ID: %"PRIu16", seq: %"PRIu16, pname, - ref.icmp.id, seq); + id, seq); if (af == AF_INET) { const struct in_addr *saddr = inany_v4(&TAPFSIDE(pingf)->faddr); const struct in_addr *daddr = inany_v4(&TAPFSIDE(pingf)->eaddr); @@ -173,25 +172,28 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, { const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6"; uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; - union icmp_epoll_ref iref = { .id = id }; union flow *flow = flow_alloc(); struct icmp_ping_flow *pingf; const void *bind_addr; const char *bind_if; + union epoll_ref ref; int s; if (!flow) return NULL; if (af == AF_INET) { + ref.type = EPOLL_TYPE_ICMP; bind_addr = &c->ip4.addr_out; bind_if = c->ip4.ifname_out; } else { + ref.type = EPOLL_TYPE_ICMPV6; bind_addr = &c->ip6.addr_out; bind_if = c->ip6.ifname_out; } - s = sock_l4(c, af, flow_proto[flowtype], bind_addr, bind_if, 0, iref.u32); + ref.flowside = FLOW_SIDX(flow, SOCKSIDE); + s = sock_l4(c, af, flow_proto[flowtype], bind_addr, bind_if, 0, ref.data); if (s < 0) { warn("Cannot open \"ping\" socket. You might need to:"); diff --git a/icmp.h b/icmp.h index 2897f31..50807c4 100644 --- a/icmp.h +++ b/icmp.h @@ -17,17 +17,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, const struct pool *p, const struct timespec *now); void icmp_init(void); -/** - * union icmp_epoll_ref - epoll reference portion for ICMP tracking - * @v6: Set for IPv6 sockets or connections - * @u32: Opaque u32 value of reference - * @id: Associated echo identifier, needed if bind() fails - */ -union icmp_epoll_ref { - uint16_t id; - uint32_t u32; -}; - /** * struct icmp_ctx - Execution context for ICMP routines * @timer_run: Timestamp of most recent timer run diff --git a/passt.h b/passt.h index db0cd10..d1ed716 100644 --- a/passt.h +++ b/passt.h @@ -99,7 +99,6 @@ union epoll_ref { flow_sidx_t flowside; union tcp_listen_epoll_ref tcp_listen; union udp_epoll_ref udp; - union icmp_epoll_ref icmp; uint32_t data; }; }; -- 2.43.0

David Gibson

8:02 a.m.

New subject: [PATCH v3 14/15] icmp: Merge EPOLL_TYPE_ICMP and EPOLL_TYPE_ICMPV6

We have separate epoll reference types for ICMP and ICMPv6 ping sockets. Now that ping socket references point to a flow table entry, we can easily determine whether we're dealing with ICMP or ICMPv6 from the flow type, making the two epoll types redundant. Merge both types into EPOLL_TYPE_PING. Signed-off-by: David Gibson --- icmp.c | 10 ++++------ icmp.h | 2 +- passt.c | 10 +++------- passt.h | 6 ++---- util.c | 4 +--- 5 files changed, 11 insertions(+), 21 deletions(-) diff --git a/icmp.c b/icmp.c index a073813..272edae 100644 --- a/icmp.c +++ b/icmp.c @@ -57,13 +57,13 @@ static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; /** * icmp_sock_handler() - Handle new data from ICMP or ICMPv6 socket * @c: Execution context - * @af: Address family (AF_INET or AF_INET6) * @ref: epoll reference */ -void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref) +void icmp_sock_handler(const struct ctx *c, union epoll_ref ref) { - const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6"; struct icmp_ping_flow *pingf = PINGF(ref.flowside.flow); + int af = pingf->f.type == FLOW_PING4 ? AF_INET : AF_INET6; + const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6"; char buf[USHRT_MAX]; union { struct sockaddr sa; @@ -172,22 +172,20 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, { const char *const pname = af == AF_INET ? "ICMP" : "ICMPv6"; uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; + union epoll_ref ref = { .type = EPOLL_TYPE_PING }; union flow *flow = flow_alloc(); struct icmp_ping_flow *pingf; const void *bind_addr; const char *bind_if; - union epoll_ref ref; int s; if (!flow) return NULL; if (af == AF_INET) { - ref.type = EPOLL_TYPE_ICMP; bind_addr = &c->ip4.addr_out; bind_if = c->ip4.ifname_out; } else { - ref.type = EPOLL_TYPE_ICMPV6; bind_addr = &c->ip6.addr_out; bind_if = c->ip6.ifname_out; } diff --git a/icmp.h b/icmp.h index 50807c4..5642dda 100644 --- a/icmp.h +++ b/icmp.h @@ -11,7 +11,7 @@ struct ctx; struct icmp_ping_flow; -void icmp_sock_handler(const struct ctx *c, int af, union epoll_ref ref); +void icmp_sock_handler(const struct ctx *c, union epoll_ref ref); int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now); diff --git a/passt.c b/passt.c index 9191404..eb9f6d8 100644 --- a/passt.c +++ b/passt.c @@ -66,8 +66,7 @@ char *epoll_type_str[] = { [EPOLL_TYPE_TCP_LISTEN] = "listening TCP socket", [EPOLL_TYPE_TCP_TIMER] = "TCP timer", [EPOLL_TYPE_UDP] = "UDP socket", - [EPOLL_TYPE_ICMP] = "ICMP socket", - [EPOLL_TYPE_ICMPV6] = "ICMPv6 socket", + [EPOLL_TYPE_PING] = "ICMP/ICMPv6 ping socket", [EPOLL_TYPE_NSQUIT] = "namespace inotify", [EPOLL_TYPE_TAP_PASTA] = "/dev/net/tun device", [EPOLL_TYPE_TAP_PASST] = "connected qemu socket", @@ -386,11 +385,8 @@ loop: case EPOLL_TYPE_UDP: udp_sock_handler(&c, ref, eventmask, &now); break; - case EPOLL_TYPE_ICMP: - icmp_sock_handler(&c, AF_INET, ref); - break; - case EPOLL_TYPE_ICMPV6: - icmp_sock_handler(&c, AF_INET6, ref); + case EPOLL_TYPE_PING: + icmp_sock_handler(&c, ref); break; default: /* Can't happen */ diff --git a/passt.h b/passt.h index d1ed716..372eb64 100644 --- a/passt.h +++ b/passt.h @@ -61,10 +61,8 @@ enum epoll_type { EPOLL_TYPE_TCP_TIMER, /* UDP sockets */ EPOLL_TYPE_UDP, - /* IPv4 ICMP sockets */ - EPOLL_TYPE_ICMP, - /* ICMPv6 sockets */ - EPOLL_TYPE_ICMPV6, + /* ICMP/ICMPv6 ping sockets */ + EPOLL_TYPE_PING, /* inotify fd watching for end of netns (pasta) */ EPOLL_TYPE_NSQUIT, /* tuntap character device */ diff --git a/util.c b/util.c index 9c7012c..ab08006 100644 --- a/util.c +++ b/util.c @@ -125,10 +125,8 @@ int sock_l4(const struct ctx *c, int af, uint8_t proto, ref.type = EPOLL_TYPE_UDP; break; case IPPROTO_ICMP: - ref.type = EPOLL_TYPE_ICMP; - break; case IPPROTO_ICMPV6: - ref.type = EPOLL_TYPE_ICMPV6; + ref.type = EPOLL_TYPE_PING; break; default: return -EPFNOSUPPORT; /* Not implemented. */ -- 2.43.0

David Gibson

8:02 a.m.

New subject: [PATCH v3 15/15] icmp: Eliminate icmp_id_map

With previous reworks the icmp_id_map data structure is now maintained, but never used for anything. Eliminate it. Signed-off-by: David Gibson --- icmp.c | 18 ++---------------- 1 file changed, 2 insertions(+), 16 deletions(-) diff --git a/icmp.c b/icmp.c index 272edae..4db8003 100644 --- a/icmp.c +++ b/icmp.c @@ -51,9 +51,6 @@ #define TAPFSIDE(pingf) (&(pingf)->f.side[TAPSIDE]) #define SOCKFSIDE(pingf) (&(pingf)->f.side[SOCKSIDE]) -/* Indexed by ICMP echo identifier */ -static struct icmp_ping_flow *icmp_id_map[IP_VERSIONS][ICMP_NUM_IDS]; - /** * icmp_sock_handler() - Handle new data from ICMP or ICMPv6 socket * @c: Execution context @@ -147,17 +144,11 @@ static void icmp_ping_close(const struct ctx *c, struct icmp_ping_flow *pingf) epoll_ctl(c->epollfd, EPOLL_CTL_DEL, pingf->sock, NULL); close(pingf->sock); flow_hash_remove(c, FLOW_SIDX(pingf, TAPSIDE)); - - if (pingf->f.type == FLOW_PING4) - icmp_id_map[V4][pingf->id] = NULL; - else - icmp_id_map[V6][pingf->id] = NULL; } /** * icmp_ping_new() - Prepare a new ping socket for a new id * @c: Execution context - * @id_map: id map entry of the sequence to open * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new sequence * @saddr: Source address @@ -166,7 +157,6 @@ static void icmp_ping_close(const struct ctx *c, struct icmp_ping_flow *pingf) * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, - struct icmp_ping_flow **id_map, int af, uint16_t id, const void *saddr, const void *daddr) { @@ -209,8 +199,6 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, pingf->sock = s; pingf->id = id; - *id_map = pingf; - debug("%s: new socket %i for echo ID %"PRIu16, pname, s, id); flowside_from_af(TAPFSIDE(pingf), PIF_TAP, af, daddr, id, saddr, id); @@ -249,7 +237,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, struct sockaddr_in6 sa6; } sa = { .sa.sa_family = af }; const socklen_t sl = af == AF_INET ? sizeof(sa.sa4) : sizeof(sa.sa6); - struct icmp_ping_flow *pingf, **id_map; + struct icmp_ping_flow *pingf; union flow *flow; uint16_t id, seq; uint8_t proto; @@ -274,7 +262,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, proto = IPPROTO_ICMP; id = ntohs(ih->un.echo.id); - id_map = &icmp_id_map[V4][id]; seq = ntohs(ih->un.echo.sequence); sa.sa4.sin_addr = *(struct in_addr *)daddr; } else if (af == AF_INET6) { @@ -288,7 +275,6 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, proto = IPPROTO_ICMPV6; id = ntohs(ih->icmp6_identifier); - id_map = &icmp_id_map[V6][id]; seq = ntohs(ih->icmp6_sequence); sa.sa6.sin6_addr = *(struct in6_addr *)daddr; sa.sa6.sin6_scope_id = c->ifi6; @@ -301,7 +287,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, int af, if (flow) pingf = &flow->ping; - else if (!(pingf = icmp_ping_new(c, id_map, af, id, saddr, daddr))) + else if (!(pingf = icmp_ping_new(c, af, id, saddr, daddr))) return 1; ASSERT(flow_proto[pingf->f.type] == proto); -- 2.43.0

774

Age (days ago)

802

Last active (days ago)

List overview

Download

32 comments

2 participants

participants (2)

David Gibson
Stefano Brivio

[PATCH v3 00/15] RFC: Unified flow table

tags

participants (2)