[PATCH v6 00/26] RFC: Unified flow table
This is a sixth draft of an implementation of more general "connection" tracking, as described at: https://pad.passt.top/p/NewForwardingModel This series changes the TCP connection table and hash table into a more general flow table that can track other protocols as well. Each flow uniformly keeps track of all the relevant addresses and ports, which will allow for more robust control of NAT and port forwarding. ICMP and UDP are converted to use the new flow table. This is based on the most recent series of flowtable preliminaries. NOTE: this fails the pasta UDP perf tests. This is one aspect of a curly issue I'm not sure how to deail with yet. So, this isn't ready for merge yet, but I'm posting because there have been a *lot* of changes since v5, and it could do with another round of review, Other caveats: * We roughly double the size of a connection/flow entry * UDP still has a number of forwarding bugs because we don't consider the local address when looking for a socket to use. There are at least some FIXMEs noted for this. Changes since v5: * flowside_from_af() is now static * Small fixes to state verification * Pass protocol specific types into deferred/timer callbacks * No longer require complete forwarding address info for the hash table (we won't have it for UDP) * Fix bugs with logging of flow addresses * Make sure to initialise sin_zero field sockaddr_from_inany * Added patch better typing parameters to flow type specific callbacks * Terminology change "forwarded side" to "target side" * Assorted wording and style tweaks based on Stefano's review * Fold introduction of struct flowside and populating the initiating side together * Manage outbound addresses via the flow table as well * Support for UDP * Correct type of 'b' in flowside_lookup() (was a signed int) Changes since v4: * flowside_from_af() no longer fills in unspecified addresses when passed NULL * Split and rename flow hash lookup function * Clarified flow state transitions, and enforced where practical * Made side 0 always the initiating side of a flow, rather than letting the protocol specific code decide * Separated pifs from flowside addresses to allow better structure packing Changes since v3: * Complex rebase on top of the many things that have happened upstream since v2. * Assorted other changes. * Replace TAPFSIDE() and SOCKFSIDE() macros with local variables. Changes since v2: * Cosmetic fixes based on review * Extra doc comments for enum flow_type * Rename flowside to flowaddrs which turns out to make more sense in light of future changes * Fix bug where the socket flowaddrs for tap initiated connections wasn't initialised to match the socket address we were using in the case of map-gw NAT * New flowaddrs_from_sock() helper used in most cases which is cleaner and should avoid bugs like the above * Using newer centralised workarounds for clang-tidy issue 58992 * Remove duplicate definition of FLOW_MAX as maximum flow type and maximum number of tracked flows * Rebased on newer versions of preliminary work (ICMP, flow based dispatch and allocation, bind/address cleanups) * Unified hash table as well as base flow table * Integrated ICMP Changes since v1: * Terminology changes - "Endpoint" address/port instead of "correspondent" address/port - "flowside" instead of "demiflow" * Actually move the connection table to a new flow table structure in new files * Significant rearrangement of earlier patchs on top of that new table, to reduce churn David Gibson (26): flow: Common address information for initiating side flow: Common address information for target side tcp, flow: Remove redundant information, repack connection structures tcp: Obtain guest address from flowside tcp: Manage outbound address via flow table tcp: Simplify endpoint validation using flowside information tcp_splice: Eliminate SPLICE_V6 flag tcp, flow: Replace TCP specific hash function with general flow hash flow, tcp: Generalise TCP hash table to general flow hash table tcp: Re-use flow hash for initial sequence number generation icmp: Remove redundant id field from flow table entry icmp: Obtain destination addresses from the flowsides icmp: Look up ping flows using flow hash icmp: Eliminate icmp_id_map icmp: Manage outbound socket address via flow table flow, tcp: Flow based NAT and port forwarding for TCP flow, icmp: Use general flow forwarding rules for ICMP fwd: Update flow forwarding logic for UDP udp: Create flow table entries for UDP udp: Direct traffic from tap according to flow table udp: Direct traffic from host to guest tap according to flow table udp: Direct spliced traffic according to flow table udp: Remove 'splicesrc' tracking udp: Remove tap port flags field udp: Remove rdelta port forwarding maps udp: Eliminate 'splice' flag from epoll reference Makefile | 2 +- conf.c | 14 +- flow.c | 402 +++++++++++++++++++++++++- flow.h | 45 +++ flow_table.h | 45 ++- fwd.c | 177 +++++++++++- fwd.h | 9 + icmp.c | 105 +++---- icmp_flow.h | 2 - inany.h | 31 +- passt.h | 3 + tap.c | 11 - tap.h | 1 - tcp.c | 524 +++++++++------------------------- tcp_buf.c | 6 +- tcp_conn.h | 51 ++-- tcp_internal.h | 10 +- tcp_splice.c | 98 +------ tcp_splice.h | 5 +- udp.c | 757 +++++++++++++++++++++++-------------------------- udp.h | 22 +- udp_flow.h | 25 ++ util.c | 6 +- util.h | 3 + 24 files changed, 1328 insertions(+), 1026 deletions(-) create mode 100644 udp_flow.h -- 2.45.2
Handling of each protocol needs some degree of tracking of the
addresses and ports at the end of each connection or flow. Sometimes
that's explicit (as in the guest visible addresses for TCP
connections), sometimes implicit (the bound and connected addresses of
sockets).
To allow more consistent handling across protocols we want to
uniformly track the address and port at each end of the connection.
Furthermore, because we allow port remapping, and we sometimes need to
apply NAT, the addresses and ports can be different as seen by the
guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of address and port
information related to one side of a flow. Store two of these in the
common fields of a flow to track that information for both sides.
For now we only populate the initiating side, requiring that
information be completed when a flows enter INI. Later patches will
populate the target side.
For now this leaves some information redundantly recorded in both generic
and type specific fields. We'll fix that in later patches.
Signed-off-by: David Gibson
On Fri, 14 Jun 2024 16:13:23 +1000
David Gibson
Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).
To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides.
For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side.
For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches.
Signed-off-by: David Gibson
--- flow.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++--- flow.h | 16 +++++++++ flow_table.h | 8 ++++- icmp.c | 9 +++-- passt.h | 3 ++ tcp.c | 6 ++-- 6 files changed, 127 insertions(+), 11 deletions(-) diff --git a/flow.c b/flow.c index d05aa495..1819111d 100644 --- a/flow.c +++ b/flow.c @@ -108,6 +108,31 @@ static const union flow *flow_new_entry; /* = NULL */ /* Last time the flow timers ran */ static struct timespec flow_timer_run;
+/** flowside_from_af() - Initialise flowside from addresses + * @fside: flowside to initialise + * @af: Address family (AF_INET or AF_INET6) + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + */ +static void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) +{ + if (faddr) + inany_from_af(&fside->faddr, af, faddr); + else + fside->faddr = inany_any6;
I kind of wonder if, for clarity, we should perhaps: #define inany_any inany_any6 ? And... I don't actually have in mind how IP versions are stored in the flow table for unspecified addresses, but I wonder if we're losing some information this way: if this is called with AF_INET as 'af', we're not saying anywhere in the flowside information that this is going to be an IPv4 flowside, right?
+ fside->fport = fport; + + if (eaddr) + inany_from_af(&fside->eaddr, af, eaddr); + else + fside->eaddr = inany_any6; + fside->eport = eport; +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority @@ -140,6 +165,8 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + const struct flowside *ini = &f->side[INISIDE]; uint8_t oldstate = f->state;
ASSERT(state < FLOW_NUM_STATES); @@ -150,18 +177,28 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f));
if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]), - pif_name(f->pif[TGTSIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport, + pif_name(f->pif[TGTSIDE])); else if (MAX(state, oldstate) >= FLOW_STATE_INI) - flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); }
/** - * flow_initiate() - Move flow to INI, setting INISIDE details + * flow_initiate_() - Move flow to INI, setting setting pif[INISIDE]
s/setting setting/setting/
* @flow: Flow to change state * @pif: pif of the initiating side */ -void flow_initiate(union flow *flow, uint8_t pif) +static void flow_initiate_(union flow *flow, uint8_t pif) { struct flow_common *f = &flow->f;
@@ -174,6 +211,55 @@ void flow_initiate(union flow *flow, uint8_t pif) flow_set_state(f, FLOW_STATE_INI); }
+/** + * flow_initiate_af() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @af: Address family of @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + flowside_from_af(ini, af, saddr, sport, daddr, dport); + flow_initiate_(flow, pif); + return ini; +} + +/** + * flow_initiate_sa() - Move flow to INI, setting INISIDE details + * @flow: Flow to change state + * @pif: pif of the initiating side + * @ssa: Source socket address + * @dport: Destination port + * + * Return: pointer to the initiating flowside information + */ +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport) +{ + struct flowside *ini = &flow->f.side[INISIDE]; + + inany_from_sockaddr(&ini->eaddr, &ini->eport, ssa); + if (inany_v4(&ini->eaddr)) + ini->faddr = inany_any4; + else + ini->faddr = inany_any6; + ini->fport = dport; + flow_initiate_(flow, pif); + return ini; +} + /** * flow_target() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state diff --git a/flow.h b/flow.h index 29ef9f12..b873ea7f 100644 --- a/flow.h +++ b/flow.h @@ -135,11 +135,26 @@ extern const uint8_t flow_proto[]; #define INISIDE 0 /* Initiating side */ #define TGTSIDE 1 /* Target side */
+/** + * struct flowside - Address information for one side of a flow + * @eaddr: Endpoint address (remote address from passt's PoV) + * @faddr: Forwarding address (local address from passt's PoV) + * @eport: Endpoint port + * @fport: Forwarding port + */ +struct flowside { + union inany_addr faddr; + union inany_addr eaddr; + in_port_t fport; + in_port_t eport; +}; + /** * struct flow_common - Common fields for packet flows * @state: State of the flow table entry * @type: Type of packet flow * @pif[]: Interface for each side of the flow + * @side[]: Information for each side of the flow */ struct flow_common { #ifdef __GNUC__ @@ -154,6 +169,7 @@ struct flow_common { "Not enough bits for type field"); #endif uint8_t pif[SIDES]; + struct flowside side[SIDES]; };
#define FLOW_INDEX_BITS 17 /* 128k - 1 */ diff --git a/flow_table.h b/flow_table.h index 1b163491..2e912532 100644 --- a/flow_table.h +++ b/flow_table.h @@ -107,7 +107,13 @@ static inline flow_sidx_t flow_sidx(const struct flow_common *f, union flow *flow_alloc(void); void flow_alloc_cancel(union flow *flow);
-void flow_initiate(union flow *flow, uint8_t pif); +const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport); +const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, + const union sockaddr_inany *ssa, + in_port_t dport); void flow_target(union flow *flow, uint8_t pif);
union flow *flow_set_type(union flow *flow, enum flow_type type); diff --git a/icmp.c b/icmp.c index 80330f6f..38d1deeb 100644 --- a/icmp.c +++ b/icmp.c @@ -146,12 +146,15 @@ static void icmp_ping_close(const struct ctx *c, * @id_sock: Pointer to ping flow entry slot in icmp_id_map[] to update * @af: Address family, AF_INET or AF_INET6 * @id: ICMP id for the new socket + * @saddr: Source address + * @daddr: Destination address * * Return: Newly opened ping flow, or NULL on failure */ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, struct icmp_ping_flow **id_sock, - sa_family_t af, uint16_t id) + sa_family_t af, uint16_t id, + const void *saddr, const void *daddr) { uint8_t flowtype = af == AF_INET ? FLOW_PING4 : FLOW_PING6; union epoll_ref ref = { .type = EPOLL_TYPE_PING }; @@ -163,7 +166,7 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, if (!flow) return NULL;
- flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); flow_target(flow, PIF_HOST); pingf = FLOW_SET_TYPE(flow, flowtype, ping);
@@ -269,7 +272,7 @@ int icmp_tap_handler(const struct ctx *c, uint8_t pif, sa_family_t af, }
if (!(pingf = *id_sock)) - if (!(pingf = icmp_ping_new(c, id_sock, af, id))) + if (!(pingf = icmp_ping_new(c, id_sock, af, id, saddr, daddr))) return 1;
pingf->ts = now->tv_sec; diff --git a/passt.h b/passt.h index 46d073a2..76810867 100644 --- a/passt.h +++ b/passt.h @@ -17,6 +17,9 @@ union epoll_ref;
#include "pif.h" #include "packet.h" +#include "siphash.h" +#include "ip.h" +#include "inany.h" #include "flow.h" #include "icmp.h" #include "fwd.h" diff --git a/tcp.c b/tcp.c index 68524235..07a2eb1c 100644 --- a/tcp.c +++ b/tcp.c @@ -1629,7 +1629,7 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return;
- flow_initiate(flow, PIF_TAP); + flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport);
if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || @@ -2288,7 +2288,9 @@ void tcp_listen_handler(struct ctx *c, union epoll_ref ref, if (s < 0) goto cancel;
- flow_initiate(flow, ref.tcp_listen.pif); + /* FIXME: When listening port has a specific bound address, record that + * as the forwarding address */ + flow_initiate_sa(flow, ref.tcp_listen.pif, &sa, ref.tcp_listen.port);
if (sa.sa_family == AF_INET) { const struct in_addr *addr = &sa.sa4.sin_addr;
-- Stefano
On Wed, Jun 26, 2024 at 12:23:28AM +0200, Stefano Brivio wrote:
On Fri, 14 Jun 2024 16:13:23 +1000 David Gibson
wrote: Handling of each protocol needs some degree of tracking of the addresses and ports at the end of each connection or flow. Sometimes that's explicit (as in the guest visible addresses for TCP connections), sometimes implicit (the bound and connected addresses of sockets).
To allow more consistent handling across protocols we want to uniformly track the address and port at each end of the connection. Furthermore, because we allow port remapping, and we sometimes need to apply NAT, the addresses and ports can be different as seen by the guest/namespace and as by the host.
Introduce 'struct flowside' to keep track of address and port information related to one side of a flow. Store two of these in the common fields of a flow to track that information for both sides.
For now we only populate the initiating side, requiring that information be completed when a flows enter INI. Later patches will populate the target side.
For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches.
Signed-off-by: David Gibson
--- flow.c | 96 +++++++++++++++++++++++++++++++++++++++++++++++++--- flow.h | 16 +++++++++ flow_table.h | 8 ++++- icmp.c | 9 +++-- passt.h | 3 ++ tcp.c | 6 ++-- 6 files changed, 127 insertions(+), 11 deletions(-) diff --git a/flow.c b/flow.c index d05aa495..1819111d 100644 --- a/flow.c +++ b/flow.c @@ -108,6 +108,31 @@ static const union flow *flow_new_entry; /* = NULL */ /* Last time the flow timers ran */ static struct timespec flow_timer_run;
+/** flowside_from_af() - Initialise flowside from addresses + * @fside: flowside to initialise + * @af: Address family (AF_INET or AF_INET6) + * @eaddr: Endpoint address (pointer to in_addr or in6_addr) + * @eport: Endpoint port + * @faddr: Forwarding address (pointer to in_addr or in6_addr) + * @fport: Forwarding port + */ +static void flowside_from_af(struct flowside *fside, sa_family_t af, + const void *eaddr, in_port_t eport, + const void *faddr, in_port_t fport) +{ + if (faddr) + inany_from_af(&fside->faddr, af, faddr); + else + fside->faddr = inany_any6;
I kind of wonder if, for clarity, we should perhaps:
#define inany_any inany_any6
?
That might be a good idea.
And... I don't actually have in mind how IP versions are stored in the flow table for unspecified addresses, but I wonder if we're losing some information this way: if this is called with AF_INET as 'af', we're not saying anywhere in the flowside information that this is going to be an IPv4 flowside, right?
So, I've gone back and forth on this a bit, and it's not entirely consistent in this version whether I always use :: for "blank", or whether I use 0.0.0.0 for IPv4 cases. Using :: everywhere does technically lose some information, although you can generally tell that a flowside is IPv4 from the other address. The argument for using :: everywhere is that in many of the cases this indicates "blank" or "missing" in a way that's not really dependent on the IP version, and sometimes it's simpler to do this. There are (or may be in future) also some cases where an otherwise IPv4 flowside might be routed through a dual-stack :: socket. The argument for using 0.0.0.0 where accurate is that it makes for more consistency across the flowside and makes matching up to addresses from elswhere easier in some cases. With the reworks for how I handle UDP, I'm going to need to revisit this anyway, so I'll see where I land.
+ fside->fport = fport; + + if (eaddr) + inany_from_af(&fside->eaddr, af, eaddr); + else + fside->eaddr = inany_any6; + fside->eport = eport; +} + /** flow_log_ - Log flow-related message * @f: flow the message is related to * @pri: Log priority @@ -140,6 +165,8 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { + char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + const struct flowside *ini = &f->side[INISIDE]; uint8_t oldstate = f->state;
ASSERT(state < FLOW_NUM_STATES); @@ -150,18 +177,28 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f));
if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s => %s", pif_name(f->pif[INISIDE]), - pif_name(f->pif[TGTSIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport, + pif_name(f->pif[TGTSIDE])); else if (MAX(state, oldstate) >= FLOW_STATE_INI) - flow_log_(f, LOG_DEBUG, "%s => ?", pif_name(f->pif[INISIDE])); + flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", + pif_name(f->pif[INISIDE]), + inany_ntop(&ini->eaddr, estr, sizeof(estr)), + ini->eport, + inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + ini->fport); }
/** - * flow_initiate() - Move flow to INI, setting INISIDE details + * flow_initiate_() - Move flow to INI, setting setting pif[INISIDE]
s/setting setting/setting/
Fixed. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Require the address and port information for the target (non
initiating) side to be populated when a flow enters TGT state.
Implement that for TCP and ICMP. For now this leaves some information
redundantly recorded in both generic and type specific fields. We'll
fix that in later patches.
For TCP we now use the information from the flow to construct the
destination socket address in both tcp_conn_from_tap() and
tcp_splice_connect().
Signed-off-by: David Gibson
On Fri, 14 Jun 2024 16:13:24 +1000
David Gibson
Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches.
For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect().
Signed-off-by: David Gibson
--- flow.c | 38 ++++++++++++++++++------ flow_table.h | 5 +++- icmp.c | 3 +- inany.h | 30 ++++++++++++++++++- tcp.c | 83 ++++++++++++++++++++++++++++------------------------ tcp_splice.c | 45 +++++++++++----------------- 6 files changed, 126 insertions(+), 78 deletions(-) diff --git a/flow.c b/flow.c index 1819111d..39e046bf 100644 --- a/flow.c +++ b/flow.c @@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; + char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; + const struct flowside *tgt = &f->side[TGTSIDE]; uint8_t oldstate = f->state;
ASSERT(state < FLOW_NUM_STATES); @@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f));
if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + flow_log_(f, LOG_DEBUG, + "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport, - pif_name(f->pif[TGTSIDE])); + pif_name(f->pif[TGTSIDE]), + inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)), + tgt->fport, + inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), + tgt->eport); else if (MAX(state, oldstate) >= FLOW_STATE_INI) flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport); }
@@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, }
/** - * flow_target() - Move flow to TGT, setting TGTSIDE details + * flow_target_af() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state * @pif: pif of the target side + * @af: Address family for @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the target flowside information */ -void flow_target(union flow *flow, uint8_t pif) +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) { struct flow_common *f = &flow->f; + struct flowside *tgt = &f->side[TGTSIDE];
ASSERT(pif != PIF_NONE); ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE);
+ flowside_from_af(tgt, af, daddr, dport, saddr, sport); f->pif[TGTSIDE] = pif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; }
/** diff --git a/flow_table.h b/flow_table.h index 2e912532..7af32c6a 100644 --- a/flow_table.h +++ b/flow_table.h @@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, const union sockaddr_inany *ssa, in_port_t dport); -void flow_target(union flow *flow, uint8_t pif); +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport);
union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/icmp.c b/icmp.c index 38d1deeb..c86c54c3 100644 --- a/icmp.c +++ b/icmp.c @@ -167,7 +167,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL;
flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping);
pingf->seq = -1; diff --git a/inany.h b/inany.h index 47b66fa9..2bf3becf 100644 --- a/inany.h +++ b/inany.h @@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a) * * Return: true if @a is in fe80::/10 (IPv6 link local unicast) */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_linklocal6(const union inany_addr *a) { return IN6_IS_ADDR_LINKLOCAL(&a->a6); @@ -273,4 +272,33 @@ static inline void inany_siphash_feed(struct siphash_state *state,
const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size);
+/** sockaddr_from_inany() - Construct a sockaddr from an inany + * @sa: Pointer to sockaddr to fill in + * @sl: Updated to relevant of length of initialised @sa + * @addr: IPv[46] address + * @port: Port (host byte order) + * @scope: Scope ID (ignored for IPv4 addresses)
Perhaps worth specifying that it's the interface index for IPv6 link-local addresses only (even if it's not the case as of this patch, more below): * @scope: Scope ID for link-local IPv6 addresses only: interface index
+ */ +static inline void sockaddr_from_inany(union sockaddr_inany *sa, socklen_t *sl, + const union inany_addr *addr, + in_port_t port, uint32_t scope) +{ + const struct in_addr *v4 = inany_v4(addr); + + if (v4) { + sa->sa_family = AF_INET; + sa->sa4.sin_addr = *v4; + sa->sa4.sin_port = htons(port); + memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero)); + *sl = sizeof(sa->sa4); + } else { + sa->sa_family = AF_INET6; + sa->sa6.sin6_addr = addr->a6; + sa->sa6.sin6_port = htons(port); + sa->sa6.sin6_scope_id = scope; + sa->sa6.sin6_flowinfo = 0; + *sl = sizeof(sa->sa6); + } +} + #endif /* INANY_H */ diff --git a/tcp.c b/tcp.c index 07a2eb1c..c6cd0c72 100644 --- a/tcp.c +++ b/tcp.c @@ -1610,18 +1610,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(dstport), - .sin_addr = *(struct in_addr *)daddr, - }; - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(dstport), - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct sockaddr *sa; + const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; + union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ + union sockaddr_inany sa; union flow *flow; int s = -1, mss; socklen_t sl; @@ -1629,7 +1621,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return;
- flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); + ini = flow_initiate_af(flow, PIF_TAP, + af, saddr, srcport, daddr, dstport);
if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || @@ -1661,19 +1654,28 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstport); goto cancel; } + } else { + ASSERT(0); }
if ((s = tcp_conn_sock(c, af)) < 0) goto cancel;
+ dstaddr = ini->faddr; + if (!c->no_map_gw) { - if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw)) - addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw)) - addr6.sin6_addr = in6addr_loopback; + if (inany_equals4(&dstaddr, &c->ip4.gw)) + dstaddr = inany_loopback4; + else if (inany_equals6(&dstaddr, &c->ip6.gw)) + dstaddr = inany_loopback6; }
- if (af == AF_INET6 && IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)) { + /* FIXME: Record outbound source address when known */ + tgt = flow_target_af(flow, PIF_HOST, AF_INET6, + NULL, 0, /* Kernel decides source address */ + &dstaddr, dstport); + + if (inany_is_linklocal6(&tgt->eaddr)) { struct sockaddr_in6 addr6_ll = { .sin6_family = AF_INET6, .sin6_addr = c->ip6.addr_ll, @@ -1681,9 +1683,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, }; if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) goto cancel; + } else if (!inany_is_loopback(&tgt->eaddr)) { + tcp_bind_outbound(c, s, af); }
- flow_target(flow, PIF_HOST); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; conn->timer = -1; @@ -1706,14 +1709,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
inany_from_af(&conn->faddr, af, daddr);
- if (af == AF_INET) { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } else { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } - conn->fport = dstport; conn->eport = srcport;
@@ -1726,19 +1721,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
tcp_hash_insert(c, conn);
- if (!bind(s, sa, sl)) { + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, c->ifi6);
...so, I never tried to use a non-zero scope identifier for, say, global unicast IPv6 addresses. Perhaps it's harmless. But I still think that, logically, we should pass it as zero in that case.
+ + if (!bind(s, &sa.sa, sl)) { tcp_rst(c, conn); /* Nobody is listening then */ goto cancel; } if (errno != EADDRNOTAVAIL && errno != EACCES) conn_flag(c, conn, LOCAL);
This will have a semi-trivial conflict with 54a9d3801b95 ("tcp: Don't rely on bind() to fail to decide that connection target is valid").
- if ((af == AF_INET && !IN4_IS_ADDR_LOOPBACK(&addr4.sin_addr)) || - (af == AF_INET6 && !IN6_IS_ADDR_LOOPBACK(&addr6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr))) - tcp_bind_outbound(c, s, af); - - if (connect(s, sa, sl)) { + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); goto cancel; @@ -2236,9 +2228,25 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { + union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; + in_port_t srcport; + + inany_from_sockaddr(&saddr, &srcport, sa); + tcp_snat_inbound(c, &saddr);
- flow_target(flow, PIF_TAP); + if (inany_v4(&saddr)) { + daddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&saddr)) + daddr.a6 = c->ip6.addr_ll_seen; + else + daddr.a6 = c->ip6.addr_seen; + } + dstport += c->tcp.fwd_in.delta[dstport]; + + flow_target_af(flow, PIF_TAP, AF_INET6, + &saddr, srcport, &daddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
conn->sock = s; @@ -2246,10 +2254,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); + conn->faddr = saddr; + conn->fport = srcport; + conn->eport = dstport;
tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_splice.c b/tcp_splice.c index 5a406c63..bcb42c97 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -320,31 +320,20 @@ static int tcp_splice_connect_finish(const struct ctx *c, * tcp_splice_connect() - Create and connect socket for new spliced connection * @c: Execution context * @conn: Connection pointer - * @af: Address family - * @pif: pif on which to create socket - * @port: Destination port, host order * * Return: 0 for connect() succeeded or in progress, negative value on error */ -static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, - sa_family_t af, uint8_t pif, in_port_t port) +static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(port), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(port), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - const struct sockaddr *sa; + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + sa_family_t af = inany_v4(&tgt->eaddr) ? AF_INET : AF_INET6; + uint8_t tgtpif = conn->f.pif[TGTSIDE]; + union sockaddr_inany sa; socklen_t sl;
- if (pif == PIF_HOST) + if (tgtpif == PIF_HOST) conn->s[1] = tcp_conn_sock(c, af); - else if (pif == PIF_SPLICE) + else if (tgtpif == PIF_SPLICE) conn->s[1] = tcp_conn_sock_ns(c, af); else ASSERT(0); @@ -358,15 +347,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); }
- if (CONN_V6(conn)) { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } else { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, 0);
- if (connect(conn->s[1], sa, sl)) { + if (connect(conn->s[1], &sa.sa, sl)) { if (errno != EINPROGRESS) { flow_trace(conn, "Couldn't connect socket for splice: %s", strerror(errno)); @@ -471,7 +454,13 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, return false; }
- flow_target(flow, tgtpif); + /* FIXME: Record outbound source address when known */ + if (af == AF_INET) + flow_target_af(flow, tgtpif, AF_INET, + NULL, 0, &in4addr_loopback, dstport); + else + flow_target_af(flow, tgtpif, AF_INET6, + NULL, 0, &in6addr_loopback, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice);
conn->flags = af == AF_INET ? 0 : SPLICE_V6; @@ -483,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int))) flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
- if (tcp_splice_connect(c, conn, af, tgtpif, dstport)) + if (tcp_splice_connect(c, conn)) conn_flag(c, conn, CLOSING);
FLOW_ACTIVATE(conn);
-- Stefano
On Wed, Jun 26, 2024 at 12:23:59AM +0200, Stefano Brivio wrote:
On Fri, 14 Jun 2024 16:13:24 +1000 David Gibson
wrote: Require the address and port information for the target (non initiating) side to be populated when a flow enters TGT state. Implement that for TCP and ICMP. For now this leaves some information redundantly recorded in both generic and type specific fields. We'll fix that in later patches.
For TCP we now use the information from the flow to construct the destination socket address in both tcp_conn_from_tap() and tcp_splice_connect().
Signed-off-by: David Gibson
--- flow.c | 38 ++++++++++++++++++------ flow_table.h | 5 +++- icmp.c | 3 +- inany.h | 30 ++++++++++++++++++- tcp.c | 83 ++++++++++++++++++++++++++++------------------------ tcp_splice.c | 45 +++++++++++----------------- 6 files changed, 126 insertions(+), 78 deletions(-) diff --git a/flow.c b/flow.c index 1819111d..39e046bf 100644 --- a/flow.c +++ b/flow.c @@ -165,8 +165,10 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) */ static void flow_set_state(struct flow_common *f, enum flow_state state) { - char estr[INANY_ADDRSTRLEN], fstr[INANY_ADDRSTRLEN]; + char estr0[INANY_ADDRSTRLEN], fstr0[INANY_ADDRSTRLEN]; + char estr1[INANY_ADDRSTRLEN], fstr1[INANY_ADDRSTRLEN]; const struct flowside *ini = &f->side[INISIDE]; + const struct flowside *tgt = &f->side[TGTSIDE]; uint8_t oldstate = f->state;
ASSERT(state < FLOW_NUM_STATES); @@ -177,19 +179,24 @@ static void flow_set_state(struct flow_common *f, enum flow_state state) FLOW_STATE(f));
if (MAX(state, oldstate) >= FLOW_STATE_TGT) - flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => %s", + flow_log_(f, LOG_DEBUG, + "%s [%s]:%hu -> [%s]:%hu => %s [%s]:%hu -> [%s]:%hu", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport, - pif_name(f->pif[TGTSIDE])); + pif_name(f->pif[TGTSIDE]), + inany_ntop(&tgt->faddr, fstr1, sizeof(fstr1)), + tgt->fport, + inany_ntop(&tgt->eaddr, estr1, sizeof(estr1)), + tgt->eport); else if (MAX(state, oldstate) >= FLOW_STATE_INI) flow_log_(f, LOG_DEBUG, "%s [%s]:%hu -> [%s]:%hu => ?", pif_name(f->pif[INISIDE]), - inany_ntop(&ini->eaddr, estr, sizeof(estr)), + inany_ntop(&ini->eaddr, estr0, sizeof(estr0)), ini->eport, - inany_ntop(&ini->faddr, fstr, sizeof(fstr)), + inany_ntop(&ini->faddr, fstr0, sizeof(fstr0)), ini->fport); }
@@ -261,21 +268,34 @@ const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, }
/** - * flow_target() - Move flow to TGT, setting TGTSIDE details + * flow_target_af() - Move flow to TGT, setting TGTSIDE details * @flow: Flow to change state * @pif: pif of the target side + * @af: Address family for @eaddr and @faddr + * @saddr: Source address (pointer to in_addr or in6_addr) + * @sport: Endpoint port + * @daddr: Destination address (pointer to in_addr or in6_addr) + * @dport: Destination port + * + * Return: pointer to the target flowside information */ -void flow_target(union flow *flow, uint8_t pif) +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport) { struct flow_common *f = &flow->f; + struct flowside *tgt = &f->side[TGTSIDE];
ASSERT(pif != PIF_NONE); ASSERT(flow_new_entry == flow && f->state == FLOW_STATE_INI); ASSERT(f->type == FLOW_TYPE_NONE); ASSERT(f->pif[INISIDE] != PIF_NONE && f->pif[TGTSIDE] == PIF_NONE);
+ flowside_from_af(tgt, af, daddr, dport, saddr, sport); f->pif[TGTSIDE] = pif; flow_set_state(f, FLOW_STATE_TGT); + return tgt; }
/** diff --git a/flow_table.h b/flow_table.h index 2e912532..7af32c6a 100644 --- a/flow_table.h +++ b/flow_table.h @@ -114,7 +114,10 @@ const struct flowside *flow_initiate_af(union flow *flow, uint8_t pif, const struct flowside *flow_initiate_sa(union flow *flow, uint8_t pif, const union sockaddr_inany *ssa, in_port_t dport); -void flow_target(union flow *flow, uint8_t pif); +const struct flowside *flow_target_af(union flow *flow, uint8_t pif, + sa_family_t af, + const void *saddr, in_port_t sport, + const void *daddr, in_port_t dport);
union flow *flow_set_type(union flow *flow, enum flow_type type); #define FLOW_SET_TYPE(flow_, t_, var_) (&flow_set_type((flow_), (t_))->var_) diff --git a/icmp.c b/icmp.c index 38d1deeb..c86c54c3 100644 --- a/icmp.c +++ b/icmp.c @@ -167,7 +167,8 @@ static struct icmp_ping_flow *icmp_ping_new(const struct ctx *c, return NULL;
flow_initiate_af(flow, PIF_TAP, af, saddr, id, daddr, id); - flow_target(flow, PIF_HOST); + /* FIXME: Record outbound source address when known */ + flow_target_af(flow, PIF_HOST, af, NULL, 0, daddr, 0); pingf = FLOW_SET_TYPE(flow, flowtype, ping);
pingf->seq = -1; diff --git a/inany.h b/inany.h index 47b66fa9..2bf3becf 100644 --- a/inany.h +++ b/inany.h @@ -187,7 +187,6 @@ static inline bool inany_is_unspecified(const union inany_addr *a) * * Return: true if @a is in fe80::/10 (IPv6 link local unicast) */ -/* cppcheck-suppress unusedFunction */ static inline bool inany_is_linklocal6(const union inany_addr *a) { return IN6_IS_ADDR_LINKLOCAL(&a->a6); @@ -273,4 +272,33 @@ static inline void inany_siphash_feed(struct siphash_state *state,
const char *inany_ntop(const union inany_addr *src, char *dst, socklen_t size);
+/** sockaddr_from_inany() - Construct a sockaddr from an inany + * @sa: Pointer to sockaddr to fill in + * @sl: Updated to relevant of length of initialised @sa + * @addr: IPv[46] address + * @port: Port (host byte order) + * @scope: Scope ID (ignored for IPv4 addresses)
Perhaps worth specifying that it's the interface index for IPv6 link-local addresses only (even if it's not the case as of this patch, more below):
* @scope: Scope ID for link-local IPv6 addresses only: interface index
Right.. as it happens, I'm reworking this in a way that will obsolete this parameter anyway.
+ */ +static inline void sockaddr_from_inany(union sockaddr_inany *sa, socklen_t *sl, + const union inany_addr *addr, + in_port_t port, uint32_t scope) +{ + const struct in_addr *v4 = inany_v4(addr); + + if (v4) { + sa->sa_family = AF_INET; + sa->sa4.sin_addr = *v4; + sa->sa4.sin_port = htons(port); + memset(&sa->sa4.sin_zero, 0, sizeof(sa->sa4.sin_zero)); + *sl = sizeof(sa->sa4); + } else { + sa->sa_family = AF_INET6; + sa->sa6.sin6_addr = addr->a6; + sa->sa6.sin6_port = htons(port); + sa->sa6.sin6_scope_id = scope; + sa->sa6.sin6_flowinfo = 0; + *sl = sizeof(sa->sa6); + } +} + #endif /* INANY_H */ diff --git a/tcp.c b/tcp.c index 07a2eb1c..c6cd0c72 100644 --- a/tcp.c +++ b/tcp.c @@ -1610,18 +1610,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, { in_port_t srcport = ntohs(th->source); in_port_t dstport = ntohs(th->dest); - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(dstport), - .sin_addr = *(struct in_addr *)daddr, - }; - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(dstport), - .sin6_addr = *(struct in6_addr *)daddr, - }; - const struct sockaddr *sa; + const struct flowside *ini, *tgt; struct tcp_tap_conn *conn; + union inany_addr dstaddr; /* FIXME: Avoid bulky temporary */ + union sockaddr_inany sa; union flow *flow; int s = -1, mss; socklen_t sl; @@ -1629,7 +1621,8 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(flow = flow_alloc())) return;
- flow_initiate_af(flow, PIF_TAP, af, saddr, srcport, daddr, dstport); + ini = flow_initiate_af(flow, PIF_TAP, + af, saddr, srcport, daddr, dstport);
if (af == AF_INET) { if (IN4_IS_ADDR_UNSPECIFIED(saddr) || @@ -1661,19 +1654,28 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, dstport); goto cancel; } + } else { + ASSERT(0); }
if ((s = tcp_conn_sock(c, af)) < 0) goto cancel;
+ dstaddr = ini->faddr; + if (!c->no_map_gw) { - if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw)) - addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw)) - addr6.sin6_addr = in6addr_loopback; + if (inany_equals4(&dstaddr, &c->ip4.gw)) + dstaddr = inany_loopback4; + else if (inany_equals6(&dstaddr, &c->ip6.gw)) + dstaddr = inany_loopback6; }
- if (af == AF_INET6 && IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr)) { + /* FIXME: Record outbound source address when known */ + tgt = flow_target_af(flow, PIF_HOST, AF_INET6, + NULL, 0, /* Kernel decides source address */ + &dstaddr, dstport); + + if (inany_is_linklocal6(&tgt->eaddr)) { struct sockaddr_in6 addr6_ll = { .sin6_family = AF_INET6, .sin6_addr = c->ip6.addr_ll, @@ -1681,9 +1683,10 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, }; if (bind(s, (struct sockaddr *)&addr6_ll, sizeof(addr6_ll))) goto cancel; + } else if (!inany_is_loopback(&tgt->eaddr)) { + tcp_bind_outbound(c, s, af); }
- flow_target(flow, PIF_HOST); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp); conn->sock = s; conn->timer = -1; @@ -1706,14 +1709,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
inany_from_af(&conn->faddr, af, daddr);
- if (af == AF_INET) { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } else { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } - conn->fport = dstport; conn->eport = srcport;
@@ -1726,19 +1721,16 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af,
tcp_hash_insert(c, conn);
- if (!bind(s, sa, sl)) { + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, c->ifi6);
...so, I never tried to use a non-zero scope identifier for, say, global unicast IPv6 addresses. Perhaps it's harmless. But I still think that, logically, we should pass it as zero in that case.
Right, that will be changed by the rework I have in mind anyway.
+ + if (!bind(s, &sa.sa, sl)) { tcp_rst(c, conn); /* Nobody is listening then */ goto cancel; } if (errno != EADDRNOTAVAIL && errno != EACCES) conn_flag(c, conn, LOCAL);
This will have a semi-trivial conflict with 54a9d3801b95 ("tcp: Don't rely on bind() to fail to decide that connection target is valid").
Yeah, I already did that rebase, it was surprisingly fiddly.
- if ((af == AF_INET && !IN4_IS_ADDR_LOOPBACK(&addr4.sin_addr)) || - (af == AF_INET6 && !IN6_IS_ADDR_LOOPBACK(&addr6.sin6_addr) && - !IN6_IS_ADDR_LINKLOCAL(&addr6.sin6_addr))) - tcp_bind_outbound(c, s, af); - - if (connect(s, sa, sl)) { + if (connect(s, &sa.sa, sl)) { if (errno != EINPROGRESS) { tcp_rst(c, conn); goto cancel; @@ -2236,9 +2228,25 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, const union sockaddr_inany *sa, const struct timespec *now) { + union inany_addr saddr, daddr; /* FIXME: avoid bulky temporaries */ struct tcp_tap_conn *conn; + in_port_t srcport; + + inany_from_sockaddr(&saddr, &srcport, sa); + tcp_snat_inbound(c, &saddr);
- flow_target(flow, PIF_TAP); + if (inany_v4(&saddr)) { + daddr = inany_from_v4(c->ip4.addr_seen); + } else { + if (inany_is_linklocal6(&saddr)) + daddr.a6 = c->ip6.addr_ll_seen; + else + daddr.a6 = c->ip6.addr_seen; + } + dstport += c->tcp.fwd_in.delta[dstport]; + + flow_target_af(flow, PIF_TAP, AF_INET6, + &saddr, srcport, &daddr, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP, tcp);
conn->sock = s; @@ -2246,10 +2254,9 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- inany_from_sockaddr(&conn->faddr, &conn->fport, sa); - conn->eport = dstport + c->tcp.fwd_in.delta[dstport]; - - tcp_snat_inbound(c, &conn->faddr); + conn->faddr = saddr; + conn->fport = srcport; + conn->eport = dstport;
tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_splice.c b/tcp_splice.c index 5a406c63..bcb42c97 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -320,31 +320,20 @@ static int tcp_splice_connect_finish(const struct ctx *c, * tcp_splice_connect() - Create and connect socket for new spliced connection * @c: Execution context * @conn: Connection pointer - * @af: Address family - * @pif: pif on which to create socket - * @port: Destination port, host order * * Return: 0 for connect() succeeded or in progress, negative value on error */ -static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, - sa_family_t af, uint8_t pif, in_port_t port) +static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn) { - struct sockaddr_in6 addr6 = { - .sin6_family = AF_INET6, - .sin6_port = htons(port), - .sin6_addr = IN6ADDR_LOOPBACK_INIT, - }; - struct sockaddr_in addr4 = { - .sin_family = AF_INET, - .sin_port = htons(port), - .sin_addr = IN4ADDR_LOOPBACK_INIT, - }; - const struct sockaddr *sa; + const struct flowside *tgt = &conn->f.side[TGTSIDE]; + sa_family_t af = inany_v4(&tgt->eaddr) ? AF_INET : AF_INET6; + uint8_t tgtpif = conn->f.pif[TGTSIDE]; + union sockaddr_inany sa; socklen_t sl;
- if (pif == PIF_HOST) + if (tgtpif == PIF_HOST) conn->s[1] = tcp_conn_sock(c, af); - else if (pif == PIF_SPLICE) + else if (tgtpif == PIF_SPLICE) conn->s[1] = tcp_conn_sock_ns(c, af); else ASSERT(0); @@ -358,15 +347,9 @@ static int tcp_splice_connect(const struct ctx *c, struct tcp_splice_conn *conn, conn->s[1]); }
- if (CONN_V6(conn)) { - sa = (struct sockaddr *)&addr6; - sl = sizeof(addr6); - } else { - sa = (struct sockaddr *)&addr4; - sl = sizeof(addr4); - } + sockaddr_from_inany(&sa, &sl, &tgt->eaddr, tgt->eport, 0);
- if (connect(conn->s[1], sa, sl)) { + if (connect(conn->s[1], &sa.sa, sl)) { if (errno != EINPROGRESS) { flow_trace(conn, "Couldn't connect socket for splice: %s", strerror(errno)); @@ -471,7 +454,13 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, return false; }
- flow_target(flow, tgtpif); + /* FIXME: Record outbound source address when known */ + if (af == AF_INET) + flow_target_af(flow, tgtpif, AF_INET, + NULL, 0, &in4addr_loopback, dstport); + else + flow_target_af(flow, tgtpif, AF_INET6, + NULL, 0, &in6addr_loopback, dstport); conn = FLOW_SET_TYPE(flow, FLOW_TCP_SPLICE, tcp_splice);
conn->flags = af == AF_INET ? 0 : SPLICE_V6; @@ -483,7 +472,7 @@ bool tcp_splice_conn_from_sock(const struct ctx *c, if (setsockopt(s0, SOL_TCP, TCP_QUICKACK, &((int){ 1 }), sizeof(int))) flow_trace(conn, "failed to set TCP_QUICKACK on %i", s0);
- if (tcp_splice_connect(c, conn, af, tgtpif, dstport)) + if (tcp_splice_connect(c, conn)) conn_flag(c, conn, CLOSING);
FLOW_ACTIVATE(conn);
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Some information we explicitly store in the TCP connection is now
duplicated in the common flow structure. Access it from there instead, and
remove it from the TCP specific structure. With that done we can reorder
both the "tap" and "splice" TCP structures a bit to get better packing for
the new combined flow table entries.
Signed-off-by: David Gibson
On Fri, 14 Jun 2024 16:13:25 +1000
David Gibson
Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries.
Signed-off-by: David Gibson
--- tcp.c | 52 ++++++++++++++++++++++++++------------------------ tcp_conn.h | 40 +++++++++++++++----------------------- tcp_internal.h | 6 +++++- 3 files changed, 47 insertions(+), 51 deletions(-) diff --git a/tcp.c b/tcp.c index c6cd0c72..30ad3dd4 100644 --- a/tcp.c +++ b/tcp.c @@ -333,8 +333,6 @@
#define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
-#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) - #define CONN_IS_CLOSING(conn) \ (((conn)->events & ESTABLISHED) && \ ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) @@ -635,10 +633,11 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { + const struct flowside *tapside = TAPFLOW(conn); int i;
for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return 1;
return 0; @@ -653,6 +652,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT + const struct flowside *tapside = TAPFLOW(conn); int i, hole = -1;
if (!tinfo->tcpi_min_rtt || @@ -660,7 +660,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return;
for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -672,7 +672,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return;
- low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = tapside->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -827,8 +827,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + const struct flowside *tapside = TAPFLOW(conn); + + if (inany_equals(&tapside->faddr, faddr) && + tapside->eport == eport && tapside->fport == fport) return 1;
return 0; @@ -862,7 +864,10 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + const struct flowside *tapside = TAPFLOW(conn); + + return tcp_hash(c, &tapside->faddr, tapside->eport, + tapside->fport); }
/** @@ -998,10 +1003,12 @@ void tcp_defer_handler(struct ctx *c) * @seq: Sequence number */ static void tcp_fill_header(struct tcphdr *th, - const struct tcp_tap_conn *conn, uint32_t seq) + const struct tcp_tap_conn *conn, uint32_t seq) { - th->source = htons(conn->fport); - th->dest = htons(conn->eport); + const struct flowside *tapside = TAPFLOW(conn); + + th->source = htons(tapside->fport); + th->dest = htons(tapside->eport); th->seq = htonl(seq); th->ack_seq = htonl(conn->seq_ack_to_tap); if (conn->events & ESTABLISHED) { @@ -1033,7 +1040,8 @@ static size_t tcp_fill_headers4(const struct ctx *c, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph);
@@ -1075,10 +1083,11 @@ static size_t tcp_fill_headers6(const struct ctx *c, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) { + const struct flowside *tapside = TAPFLOW(conn); size_t l4len = dlen + sizeof(*th);
ip6h->payload_len = htons(l4len); - ip6h->saddr = conn->faddr.a6; + ip6h->saddr = tapside->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) ip6h->daddr = c->ip6.addr_ll_seen; else @@ -1117,7 +1126,8 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr);
if (a4) { return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, @@ -1420,6 +1430,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); + const struct flowside *tapside = TAPFLOW(conn); union inany_addr aany; uint64_t hash; uint32_t ns; @@ -1429,10 +1440,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, else inany_from_af(&aany, AF_INET6, &c->ip6.addr);
- inany_siphash_feed(&state, &conn->faddr); + inany_siphash_feed(&state, &tapside->faddr); inany_siphash_feed(&state, &aany); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)tapside->fport << 16 | tapside->eport);
/* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1707,11 +1718,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1;
- inany_from_af(&conn->faddr, af, daddr); - - conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2254,10 +2260,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- conn->faddr = saddr; - conn->fport = srcport; - conn->eport = dstport; - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn);
diff --git a/tcp_conn.h b/tcp_conn.h index 5f8c8fb6..b741ce32 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -13,19 +13,16 @@ * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced) * @f: Generic flow information * @in_epoll: Is the connection in the epoll set? + * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT + * @ws_from_tap: Window scaling factor advertised from tap/guest + * @ws_to_tap: Window scaling factor advertised to tap/guest * @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS * @sock: Socket descriptor number * @events: Connection events, implying connection states * @timer: timerfd descriptor for timeout events * @flags: Connection flags representing internal attributes - * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT - * @ws_from_tap: Window scaling factor advertised from tap/guest - * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -49,6 +46,10 @@ struct tcp_tap_conn { unsigned int ws_from_tap :TCP_WS_BITS; unsigned int ws_to_tap :TCP_WS_BITS;
+#define TCP_MSS_BITS 14 + unsigned int tap_mss :TCP_MSS_BITS; +#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) +#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS))
int sock :FD_REF_BITS;
@@ -77,13 +78,6 @@ struct tcp_tap_conn { #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4)
- -#define TCP_MSS_BITS 14 - unsigned int tap_mss :TCP_MSS_BITS; -#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) -#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) - - #define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; #define SNDBUF_SET(conn, bytes) (conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS))) @@ -91,11 +85,6 @@ struct tcp_tap_conn {
uint8_t seq_dup_ack_approx;
- - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap;
@@ -109,22 +98,24 @@ struct tcp_tap_conn { /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information - * @in_epoll: Is the connection in the epoll set? * @s: File descriptor for sockets * @pipe: File descriptors for pipes - * @events: Events observed/actions performed on connection - * @flags: Connection flags (attributes, not events) * @read: Bytes read (not fully written to other side in one shot) * @written: Bytes written (not fully written from one other side read) -*/ + * @events: Events observed/actions performed on connection + * @flags: Connection flags (attributes, not events) + * @in_epoll: Is the connection in the epoll set? + */ struct tcp_splice_conn { /* Must be first element */ struct flow_common f;
- bool in_epoll :1; int s[SIDES]; int pipe[SIDES][2];
+ uint32_t read[SIDES]; + uint32_t written[SIDES]; + uint8_t events; #define SPLICE_CLOSED 0 #define SPLICE_CONNECT BIT(0) @@ -144,8 +135,7 @@ struct tcp_splice_conn { #define RCVLOWAT_ACT_1 BIT(4) #define CLOSING BIT(5)
- uint32_t read[SIDES]; - uint32_t written[SIDES]; + bool in_epoll :1;
Excess tab.
};
/* Socket pools */ diff --git a/tcp_internal.h b/tcp_internal.h index 51aaa169..4f61e5c3 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -39,7 +39,11 @@ #define OPT_SACKP 4 #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) + +#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) +#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) + +#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn))
/*
I reviewed up to 7/26 by the way, no further comments until that point. -- Stefano
On Wed, Jun 26, 2024 at 12:25:05AM +0200, Stefano Brivio wrote:
On Fri, 14 Jun 2024 16:13:25 +1000 David Gibson
wrote: Some information we explicitly store in the TCP connection is now duplicated in the common flow structure. Access it from there instead, and remove it from the TCP specific structure. With that done we can reorder both the "tap" and "splice" TCP structures a bit to get better packing for the new combined flow table entries.
Signed-off-by: David Gibson
--- tcp.c | 52 ++++++++++++++++++++++++++------------------------ tcp_conn.h | 40 +++++++++++++++----------------------- tcp_internal.h | 6 +++++- 3 files changed, 47 insertions(+), 51 deletions(-) diff --git a/tcp.c b/tcp.c index c6cd0c72..30ad3dd4 100644 --- a/tcp.c +++ b/tcp.c @@ -333,8 +333,6 @@
#define ACK_IF_NEEDED 0 /* See tcp_send_flag() */
-#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) - #define CONN_IS_CLOSING(conn) \ (((conn)->events & ESTABLISHED) && \ ((conn)->events & (SOCK_FIN_RCVD | TAP_FIN_RCVD))) @@ -635,10 +633,11 @@ void conn_event_do(const struct ctx *c, struct tcp_tap_conn *conn, */ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) { + const struct flowside *tapside = TAPFLOW(conn); int i;
for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return 1;
return 0; @@ -653,6 +652,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, const struct tcp_info *tinfo) { #ifdef HAS_MIN_RTT + const struct flowside *tapside = TAPFLOW(conn); int i, hole = -1;
if (!tinfo->tcpi_min_rtt || @@ -660,7 +660,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return;
for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->faddr, low_rtt_dst + i)) + if (inany_equals(&tapside->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -672,7 +672,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return;
- low_rtt_dst[hole++] = conn->faddr; + low_rtt_dst[hole++] = tapside->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -827,8 +827,10 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, const union inany_addr *faddr, in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->faddr, faddr) && - conn->eport == eport && conn->fport == fport) + const struct flowside *tapside = TAPFLOW(conn); + + if (inany_equals(&tapside->faddr, faddr) && + tapside->eport == eport && tapside->fport == fport) return 1;
return 0; @@ -862,7 +864,10 @@ static uint64_t tcp_hash(const struct ctx *c, const union inany_addr *faddr, static uint64_t tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); + const struct flowside *tapside = TAPFLOW(conn); + + return tcp_hash(c, &tapside->faddr, tapside->eport, + tapside->fport); }
/** @@ -998,10 +1003,12 @@ void tcp_defer_handler(struct ctx *c) * @seq: Sequence number */ static void tcp_fill_header(struct tcphdr *th, - const struct tcp_tap_conn *conn, uint32_t seq) + const struct tcp_tap_conn *conn, uint32_t seq) { - th->source = htons(conn->fport); - th->dest = htons(conn->eport); + const struct flowside *tapside = TAPFLOW(conn); + + th->source = htons(tapside->fport); + th->dest = htons(tapside->eport); th->seq = htonl(seq); th->ack_seq = htonl(conn->seq_ack_to_tap); if (conn->events & ESTABLISHED) { @@ -1033,7 +1040,8 @@ static size_t tcp_fill_headers4(const struct ctx *c, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr); size_t l4len = dlen + sizeof(*th); size_t l3len = l4len + sizeof(*iph);
@@ -1075,10 +1083,11 @@ static size_t tcp_fill_headers6(const struct ctx *c, struct ipv6hdr *ip6h, struct tcphdr *th, size_t dlen, uint32_t seq) { + const struct flowside *tapside = TAPFLOW(conn); size_t l4len = dlen + sizeof(*th);
ip6h->payload_len = htons(l4len); - ip6h->saddr = conn->faddr.a6; + ip6h->saddr = tapside->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&ip6h->saddr)) ip6h->daddr = c->ip6.addr_ll_seen; else @@ -1117,7 +1126,8 @@ size_t tcp_l2_buf_fill_headers(const struct ctx *c, struct iovec *iov, size_t dlen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->faddr); + const struct flowside *tapside = TAPFLOW(conn); + const struct in_addr *a4 = inany_v4(&tapside->faddr);
if (a4) { return tcp_fill_headers4(c, conn, iov[TCP_IOV_TAP].iov_base, @@ -1420,6 +1430,7 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, const struct timespec *now) { struct siphash_state state = SIPHASH_INIT(c->hash_secret); + const struct flowside *tapside = TAPFLOW(conn); union inany_addr aany; uint64_t hash; uint32_t ns; @@ -1429,10 +1440,10 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, else inany_from_af(&aany, AF_INET6, &c->ip6.addr);
- inany_siphash_feed(&state, &conn->faddr); + inany_siphash_feed(&state, &tapside->faddr); inany_siphash_feed(&state, &aany); hash = siphash_final(&state, 36, - (uint64_t)conn->fport << 16 | conn->eport); + (uint64_t)tapside->fport << 16 | tapside->eport);
/* 32ns ticks, overflows 32 bits every 137s */ ns = (now->tv_sec * 1000000000 + now->tv_nsec) >> 5; @@ -1707,11 +1718,6 @@ static void tcp_conn_from_tap(struct ctx *c, sa_family_t af, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1;
- inany_from_af(&conn->faddr, af, daddr); - - conn->fport = dstport; - conn->eport = srcport; - conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; conn->seq_ack_to_tap = conn->seq_from_tap; @@ -2254,10 +2260,6 @@ static void tcp_tap_conn_from_sock(struct ctx *c, in_port_t dstport, conn->ws_to_tap = conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED);
- conn->faddr = saddr; - conn->fport = srcport; - conn->eport = dstport; - tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn);
diff --git a/tcp_conn.h b/tcp_conn.h index 5f8c8fb6..b741ce32 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -13,19 +13,16 @@ * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced) * @f: Generic flow information * @in_epoll: Is the connection in the epoll set? + * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT + * @ws_from_tap: Window scaling factor advertised from tap/guest + * @ws_to_tap: Window scaling factor advertised to tap/guest * @tap_mss: MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS * @sock: Socket descriptor number * @events: Connection events, implying connection states * @timer: timerfd descriptor for timeout events * @flags: Connection flags representing internal attributes - * @retrans: Number of retransmissions occurred due to ACK_TIMEOUT - * @ws_from_tap: Window scaling factor advertised from tap/guest - * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @faddr: Guest side forwarding address (guest's remote address) - * @eport: Guest side endpoint port (guest's local port) - * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -49,6 +46,10 @@ struct tcp_tap_conn { unsigned int ws_from_tap :TCP_WS_BITS; unsigned int ws_to_tap :TCP_WS_BITS;
+#define TCP_MSS_BITS 14 + unsigned int tap_mss :TCP_MSS_BITS; +#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) +#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS))
int sock :FD_REF_BITS;
@@ -77,13 +78,6 @@ struct tcp_tap_conn { #define ACK_TO_TAP_DUE BIT(3) #define ACK_FROM_TAP_DUE BIT(4)
- -#define TCP_MSS_BITS 14 - unsigned int tap_mss :TCP_MSS_BITS; -#define MSS_SET(conn, mss) (conn->tap_mss = (mss >> (16 - TCP_MSS_BITS))) -#define MSS_GET(conn) (conn->tap_mss << (16 - TCP_MSS_BITS)) - - #define SNDBUF_BITS 24 unsigned int sndbuf :SNDBUF_BITS; #define SNDBUF_SET(conn, bytes) (conn->sndbuf = ((bytes) >> (32 - SNDBUF_BITS))) @@ -91,11 +85,6 @@ struct tcp_tap_conn {
uint8_t seq_dup_ack_approx;
- - union inany_addr faddr; - in_port_t eport; - in_port_t fport; - uint16_t wnd_from_tap; uint16_t wnd_to_tap;
@@ -109,22 +98,24 @@ struct tcp_tap_conn { /** * struct tcp_splice_conn - Descriptor for a spliced TCP connection * @f: Generic flow information - * @in_epoll: Is the connection in the epoll set? * @s: File descriptor for sockets * @pipe: File descriptors for pipes - * @events: Events observed/actions performed on connection - * @flags: Connection flags (attributes, not events) * @read: Bytes read (not fully written to other side in one shot) * @written: Bytes written (not fully written from one other side read) -*/ + * @events: Events observed/actions performed on connection + * @flags: Connection flags (attributes, not events) + * @in_epoll: Is the connection in the epoll set? + */ struct tcp_splice_conn { /* Must be first element */ struct flow_common f;
- bool in_epoll :1; int s[SIDES]; int pipe[SIDES][2];
+ uint32_t read[SIDES]; + uint32_t written[SIDES]; + uint8_t events; #define SPLICE_CLOSED 0 #define SPLICE_CONNECT BIT(0) @@ -144,8 +135,7 @@ struct tcp_splice_conn { #define RCVLOWAT_ACT_1 BIT(4) #define CLOSING BIT(5)
- uint32_t read[SIDES]; - uint32_t written[SIDES]; + bool in_epoll :1;
Excess tab.
Oops, fixed.
};
/* Socket pools */ diff --git a/tcp_internal.h b/tcp_internal.h index 51aaa169..4f61e5c3 100644 --- a/tcp_internal.h +++ b/tcp_internal.h @@ -39,7 +39,11 @@ #define OPT_SACKP 4 #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) + +#define TAPSIDE(conn_) ((conn_)->f.pif[1] == PIF_TAP) +#define TAPFLOW(conn_) (&((conn_)->f.side[TAPSIDE(conn_)])) + +#define CONN_V4(conn) (!!inany_v4(&TAPFLOW(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn))
/*
I reviewed up to 7/26 by the way, no further comments until that point.
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Currently we always deliver inbound TCP packets to the guest's most
recent observed IP address. This has the odd side effect that if the
guest changes its IP address with active TCP connections we might
deliver packets from old connections to the new address. That won't
work; it will probably result in an RST from the guest. Worse, if the
guest added a new address but also retains the old one, then we could
break those old connections by redirecting them to the new address.
Now that we maintain flowside information, we have a record of the correct
guest side address and can just use it.
Signed-off-by: David Gibson
For now when we forward a connection to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and port will be used by the kernel. When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.
Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there. This allows the
possibility of more complex rules for what outbound address and/or port
we use in future.
Signed-off-by: David Gibson
Now that we store all our endpoints in the flowside structure, use some
inany helpers to make validation of those endpoints simpler.
Signed-off-by: David Gibson
Since we're now constructing socket addresses based on information in the
flowside, we no longer need an explicit flag to tell if we're dealing with
an IPv4 or IPv6 connection. Hence, drop the now unused SPLICE_V6 flag.
As well as just simplifying the code, this allows for possible future
extensions where we could splice an IPv4 connection to an IPv6 connection
or vice versa.
Signed-off-by: David Gibson
Currently we match TCP packets received on the tap connection to a TCP
connection via a hash table based on the forwarding address and both
ports. We hope in future to allow for multiple guest side addresses, or
for multiple interfaces which means we may need to distinguish based on
the endpoint address and pif as well. We also want a unified hash table
to cover multiple protocols, not just TCP.
Replace the TCP specific hash function with one suitable for general flows,
or rather for one side of a general flow. This includes all the
information from struct flowside, plus the pif and the L4 protocol number.
Signed-off-by: David Gibson
Move the data structures and helper functions for the TCP hash table to
flow.c, making it a general hash table indexing sides of flows. This is
largely code motion and straightforward renames. There are two semantic
changes:
* flow_lookup_af() now needs to verify that the entry has a matching
protocol and interface as well as matching addresses and ports.
* We double the size of the hash table, because it's now at least
theoretically possible for both sides of each flow to be hashed.
Signed-off-by: David Gibson
We generate TCP initial sequence numbers, when we need them, from a
hash of the source and destination addresses and ports, plus a
timestamp. Moments later, we generate another hash of the same
information plus some more to insert the connection into the flow hash
table.
With some tweaks to the flow_hash_insert() interface and changing the
order we can re-use that hash table hash for the initial sequence
number, rather than calculating another one. It won't generate
identical results, but that doesn't matter as long as the sequence
numbers are well scattered.
Signed-off-by: David Gibson
struct icmp_ping_flow contains a field for the ICMP id of the ping, but
this is now redundant, since the id is also stored as the "port" in the
common flowsides.
Signed-off-by: David Gibson
icmp_sock_handler() obtains the guest address from it's most recently
observed IP. However, this can now be obtained from the common flowside
information.
icmp_tap_handler() builds its socket address for sendto() directly
from the destination address supplied by the incoming tap packet.
This can instead be generated from the flow.
Using the flowsides as the common source of truth here prepares us for
allowing more flexible NAT and forwarding by properly initialising
that flowside information.
Signed-off-by: David Gibson
When we receive a ping packet from the tap interface, we currently locate
the correct flow entry (if present) using an anciliary data structure, the
icmp_id_map[] tables. However, we can look this up using the flow hash
table - that's what it's for.
Signed-off-by: David Gibson
With previous reworks the icmp_id_map data structure is now maintained, but
never used for anything. Eliminate it.
Signed-off-by: David Gibson
For now when we forward a ping to the host we leave the host side
forwarding address and port blank since we don't necessarily know what
source address and id will be used by the kernel. When the outbound
address option is active, though, we do know the address at least, so we
can record it in the flowside.
Having done that, use it as the primary source of truth, binding the
outgoing socket based on the information in there. This allows the
possibility of more complex rules for what outbound address and/or id
we use in future.
To implement this we create a new helper which sets up a new socket based
on information in a flowside, which will also have future uses. It
behaves slightly differently from the existing ICMP code, in that it
doesn't bind to a specific interface if given a loopback address. This is
logically correct - the loopback address means we need to operate through
the host's loopback interface, not ifname_out. We didn't need it in ICMP
because ICMP will never generate a loopback address at this point, however
we intend to change that in future.
Signed-off-by: David Gibson
Currently the code to translate host side addresses and ports to guest side
addresses and ports, and vice versa, is scattered across the TCP code.
This includes both port redirection as controlled by the -t and -T options,
and our special case NAT controlled by the --no-map-gw option.
Gather this logic into fwd_nat_from_*() functions for each input
interface in fwd.c which take protocol and address information for the
initiating side and generates the pif and address information for the
forwarded side. This performs any NAT or port forwarding needed.
We create a flow_target() helper which applies those forwarding functions
as needed to automatically move a flow from INI to TGT state.
Signed-off-by: David Gibson
On Fri, 14 Jun 2024 16:13:38 +1000
David Gibson
Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option.
Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generates the pif and address information for the forwarded side. This performs any NAT or port forwarding needed.
We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state.
Given that you already added flow_target() in another series, I didn't really review that part of this patch as I guess it will change. The rest of the patches from 8/26 to 17/26 all look good to me: after all, changes from v5 look rather minimal for those. I didn't review patches starting from 18/26, as you mentioned they will change substantially. -- Stefano
On Thu, Jun 27, 2024 at 12:49:41AM +0200, Stefano Brivio wrote:
On Fri, 14 Jun 2024 16:13:38 +1000 David Gibson
wrote: Currently the code to translate host side addresses and ports to guest side addresses and ports, and vice versa, is scattered across the TCP code. This includes both port redirection as controlled by the -t and -T options, and our special case NAT controlled by the --no-map-gw option.
Gather this logic into fwd_nat_from_*() functions for each input interface in fwd.c which take protocol and address information for the initiating side and generates the pif and address information for the forwarded side. This performs any NAT or port forwarding needed.
We create a flow_target() helper which applies those forwarding functions as needed to automatically move a flow from INI to TGT state.
Given that you already added flow_target() in another series, I didn't really review that part of this patch as I guess it will change.
Actually, I think this version is already on top of that, but the commit message is a bit out of date. The steps here are: 1. Add flow_target() which takes an explicit target pif (already merged) 2. Replace flow_target() with variants which also take explicit target addresses (patch #2 in this series) 3. Replace flow_target_*() variants with plain flow_target() which automatically determines the target addresses based on the forwarding logic (this patch)
The rest of the patches from 8/26 to 17/26 all look good to me: after all, changes from v5 look rather minimal for those.
I didn't review patches starting from 18/26, as you mentioned they will change substantially.
18/26 itself is probably fine, but the ones after that are being more or less entirely rewritten, yes. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Current ICMP hard codes its forwarding rules, and never applies any
translations. Change it to use the flow_target() function, so that
it's translated the same as TCP (excluding TCP specific port
redirection).
This means that gw mapping now applies to ICMP so "ping <gw address>" will
now ping the host's loopback instead of the actual gw machine. This
removes the surprising behaviour that the target you ping might not be the
same as you connect to with TCP.
This removes the last user of flow_target_af(), so that's removed as well.
Signed-off-by: David Gibson
Add logic to the fwd_nat_from_*() functions to forwarding UDP packets. The
logic here doesn't exactly match our current forwarding, since our current
forwarding has some very strange and buggy edge cases. Instead it's
attempting to replicate what appears to be the intended logic behind the
current forwarding.
Signed-off-by: David Gibson
Currently UDP only has a very rudimentary (and buggy) form of connection
tracking implemented with per-port flags. Make a start on converting this
to more robust tracking via the flow table.
Start matching UDP packets to flow table entries, creating them when
necessary. We also add a timer so that the flows will expire. For now
don't actually use the information in the flow table, that will come later.
Signed-off-by: David Gibson
Although we construct flow entries for UDP packets, we don't yet actually
direct traffic according to the information in there. Start fixing that
by directing traffic originating from the tap device according to the flow
table.
Signed-off-by: David Gibson
Currently although we associate a flow with traffic coming from a socket,
we don't use that flow to determine how to forward the traffic. Fix this
for the case of traffic going from socket to tap (i.e. host to guest).
Determine first that we should be sending to tap from the pif in the flow
entry, rather than with separate logic. Also use the addresses and ports
in the flow entry to construct the headers for the tap interface.
Signed-off-by: David Gibson
Although we associate a flow with traffic coming from a socket, for spliced
traffic we don't use that flow entry to address it. Fix that, using the
flow table as the source of truth for addressing "spliced" datagrams.
As part of this we tweak how we batch datagrams. Previously we'd batch
together contiguous datagrams as long as they have the same source port
on the destination side. Now we batch together datagrams only if they
belong to the same flow.
Previously the structure required that datagrams with the same source port
would also have the same destination port, and we relied on that fact.
Although that will be true for flows at the moment it may not be true in
future, so we need to ensure that everything in the batch has the same
destination port.
Similarly, although all datagrams will have loopback as both the source and
destination address, these could be different addresses in the 127.0.0.1/8
subnet. To be sent as a single batch we need those addresses to be the
same as well. Again, future changes to how we construct flow entries may
make this a real concern.
If both ports and addresses are the same, it must be the same flow, since
that's how flows are looked up. So, we can simplify the logic simply by
checking for the same flow.
Signed-off-by: David Gibson
This field in udp_meta_t was used to work out whether and how we're
splicing datagrams. However, that role has been taken over by the flow
table, so this field is updated, but never used. Eliminate it.
Signed-off-by: David Gibson
The flags field in udp_tap_port was used to implement a rudimentary (and
buggy) form of connection tracking. This has now been taken over by the
flow table. So, eliminate the field. This in turn allows udp_tap_port
and udp_splice_port to be merged into a single type.
Signed-off-by: David Gibson
In addition to the struct fwd_ports used by both UDP and TCP to track
port forwarding, UDP also included an 'rdelta' field, which contained the
reverse mapping of the main port map. This was used so that we could
properly direct reply packets to a forwarded packet where we change the
destination port. This has now been taken over by the flow table: reply
packets will match the flow of the originating packet, and that gives the
correct ports on the originating side.
So, eliminate the rdelta field, and with it struct udp_fwd_ports, which
now has no additional information over struct fwd_ports.
Signed-off-by: David Gibson
We no longer check the 'splice' flag in the UDP epoll reference. Instead,
we determine whether a datagram needs to be spliced from the pifs in the
flow table. So, eliminate this field.
Signed-off-by: David Gibson
participants (2)
-
David Gibson
-
Stefano Brivio