[PATCH v3 0/9] Flow Table Preliminaries

David Gibson

22 Aug 2023

4:11 a.m.

I'm still working on a bunch of things to start implementing the
generalised flow table.  However, I think this set of preliminary
clean ups and fixes stands well enough on its own that it's ready for
merge now.

Changes since v2:
 * Fix a formatting error in the in_epoll patch
 * Add patch for inany.h include guards
 * Add patch to remove broken pressure estimates for
   tcp_defer_handler()
Changes since v1:
 * Add missing patch moving in_epoll flag

David Gibson (9):
  tap: Don't clobber source address in tap6_handler()
  tap: Pass source address to protocol handler functions
  tcp: More precise terms for addresses and ports
  tcp: Consistent usage of ports in tcp_seq_init()
  tcp, udp: Don't include destination address in partially precomputed
    csums
  tcp, udp: Don't pre-fill IPv4 destination address in headers
  tcp: Move in_epoll flag out of common connection structure
  inany: Add missing double include guard to inany.h
  tcp: Remove broken pressure calculations for tcp_defer_handler()

 icmp.c       |  12 ++-
 icmp.h       |   3 +-
 inany.h      |   5 ++
 passt.c      |  10 +--
 passt.h      |   4 +-
 pasta.c      |   2 +-
 tap.c        |  29 ++++----
 tcp.c        | 203 ++++++++++++++++++++++-----------------------------
 tcp.h        |   5 +-
 tcp_conn.h   |  18 +++--
 tcp_splice.c |   4 +-
 udp.c        |  37 ++++------
 udp.h        |   5 +-
 util.h       |   4 +-
 14 files changed, 156 insertions(+), 185 deletions(-)

-- 
2.41.0

David Gibson

22 Aug 2023

4:11 a.m.

New subject: [PATCH v3 1/9] tap: Don't clobber source address in tap6_handler()

In tap6_handler() saddr is initialized to the IPv6 source address from
the incoming packet.  However, part way through, but before organizing
the packet into a "sequence", we set it unconditionally to the guest's
assigned address.  We don't do anything equivalent for IPv4.

This doesn't make a lot of sense: if the guest is using a different
source address, it makes sense to consider these different sequences
of packets, and we shouldn't try to combine them together.

Signed-off-by: David Gibson
---
 tap.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tap.c b/tap.c
index 760deb7..6a14692 100644
--- a/tap.c
+++ b/tap.c
@@ -818,8 +818,6 @@ resume:
 			continue;
 		}
 
-		*saddr = c->ip6.addr;
-
 		if (proto != IPPROTO_TCP && proto != IPPROTO_UDP) {
 			tap_packet_debug(NULL, ip6h, NULL, proto, NULL, 1);
 			continue;
-- 
2.41.0

David Gibson

4:11 a.m.

New subject: [PATCH v3 2/9] tap: Pass source address to protocol handler functions

The tap code passes the IPv4 or IPv6 destination address of packets it receives to the protocol specific code. Currently that protocol code doesn't use the source address, but we want it to in future. So, in preparation, pass the IPv4/IPv6 source address of tap packets to those functions as well. Signed-off-by: David Gibson --- icmp.c | 12 ++++++++---- icmp.h | 3 ++- tap.c | 19 +++++++++++-------- tcp.c | 28 +++++++++++++++++----------- tcp.h | 2 +- udp.c | 14 ++++++++------ udp.h | 2 +- 7 files changed, 48 insertions(+), 32 deletions(-) diff --git a/icmp.c b/icmp.c index b676a1a..f2cc4d6 100644 --- a/icmp.c +++ b/icmp.c @@ -154,17 +154,21 @@ void icmpv6_sock_handler(const struct ctx *c, union epoll_ref ref) * icmp_tap_handler() - Handle packets from tap * @c: Execution context * @af: Address family, AF_INET or AF_INET6 - * @addr: Destination address + * @saddr: Source address + * @daddr: Destination address * @p: Packet pool, single packet with ICMP/ICMPv6 header * @now: Current timestamp * * Return: count of consumed packets (always 1, even if malformed) */ -int icmp_tap_handler(const struct ctx *c, int af, const void *addr, +int icmp_tap_handler(const struct ctx *c, int af, + const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now) { size_t plen; + (void)saddr; + if (af == AF_INET) { struct sockaddr_in sa = { .sin_family = AF_INET, @@ -210,7 +214,7 @@ int icmp_tap_handler(const struct ctx *c, int af, const void *addr, icmp_id_map[V4][id].ts = now->tv_sec; bitmap_set(icmp_act[V4], id); - sa.sin_addr = *(struct in_addr *)addr; + sa.sin_addr = *(struct in_addr *)daddr; if (sendto(s, ih, sizeof(*ih) + plen, MSG_NOSIGNAL, (struct sockaddr *)&sa, sizeof(sa)) < 0) { debug("ICMP: failed to relay request to socket"); @@ -264,7 +268,7 @@ int icmp_tap_handler(const struct ctx *c, int af, const void *addr, icmp_id_map[V6][id].ts = now->tv_sec; bitmap_set(icmp_act[V6], id); - sa.sin6_addr = *(struct in6_addr *)addr; + sa.sin6_addr = *(struct 
in6_addr *)daddr; if (sendto(s, ih, sizeof(*ih) + plen, MSG_NOSIGNAL, (struct sockaddr *)&sa, sizeof(sa)) < 1) { debug("ICMPv6: failed to relay request to socket"); diff --git a/icmp.h b/icmp.h index 32f0c47..00d10ea 100644 --- a/icmp.h +++ b/icmp.h @@ -12,7 +12,8 @@ struct ctx; void icmp_sock_handler(const struct ctx *c, union epoll_ref ref); void icmpv6_sock_handler(const struct ctx *c, union epoll_ref ref); -int icmp_tap_handler(const struct ctx *c, int af, const void *addr, +int icmp_tap_handler(const struct ctx *c, + int af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now); void icmp_timer(const struct ctx *c, const struct timespec *ts); void icmp_init(void); diff --git a/tap.c b/tap.c index 6a14692..4ed63f1 100644 --- a/tap.c +++ b/tap.c @@ -643,7 +643,8 @@ resume: tap_packet_debug(iph, NULL, NULL, 0, NULL, 1); packet_add(pkt, l4_len, l4h); - icmp_tap_handler(c, AF_INET, &iph->daddr, pkt, now); + icmp_tap_handler(c, AF_INET, &iph->saddr, &iph->daddr, + pkt, now); continue; } @@ -708,7 +709,6 @@ append: for (j = 0, seq = tap4_l4; j < seq_count; j++, seq++) { struct pool *p = (struct pool *)&seq->p; - struct in_addr *da = &seq->daddr; size_t n = p->count; tap_packet_debug(NULL, NULL, seq, 0, NULL, n); @@ -716,11 +716,13 @@ append: if (seq->protocol == IPPROTO_TCP) { if (c->no_tcp) continue; - while ((n -= tcp_tap_handler(c, AF_INET, da, p, now))); + while ((n -= tcp_tap_handler(c, AF_INET, &seq->saddr, + &seq->daddr, p, now))); } else if (seq->protocol == IPPROTO_UDP) { if (c->no_udp) continue; - while ((n -= udp_tap_handler(c, AF_INET, da, p, now))); + while ((n -= udp_tap_handler(c, AF_INET, &seq->saddr, + &seq->daddr, p, now))); } } @@ -801,7 +803,7 @@ resume: tap_packet_debug(NULL, ip6h, NULL, proto, NULL, 1); packet_add(pkt, l4_len, l4h); - icmp_tap_handler(c, AF_INET6, daddr, pkt, now); + icmp_tap_handler(c, AF_INET6, saddr, daddr, pkt, now); continue; } @@ -868,7 +870,6 @@ append: for (j = 0, seq = tap6_l4; j < 
seq_count; j++, seq++) { struct pool *p = (struct pool *)&seq->p; - struct in6_addr *da = &seq->daddr; size_t n = p->count; tap_packet_debug(NULL, NULL, NULL, seq->protocol, seq, n); @@ -876,11 +877,13 @@ append: if (seq->protocol == IPPROTO_TCP) { if (c->no_tcp) continue; - while ((n -= tcp_tap_handler(c, AF_INET6, da, p, now))); + while ((n -= tcp_tap_handler(c, AF_INET6, &seq->saddr, + &seq->daddr, p, now))); } else if (seq->protocol == IPPROTO_UDP) { if (c->no_udp) continue; - while ((n -= udp_tap_handler(c, AF_INET6, da, p, now))); + while ((n -= udp_tap_handler(c, AF_INET6, &seq->saddr, + &seq->daddr, p, now))); } } diff --git a/tcp.c b/tcp.c index 0322842..68141e9 100644 --- a/tcp.c +++ b/tcp.c @@ -2005,13 +2005,15 @@ static void tcp_bind_outbound(const struct ctx *c, int s, sa_family_t af) * tcp_conn_from_tap() - Handle connection request (SYN segment) from tap * @c: Execution context * @af: Address family, AF_INET or AF_INET6 - * @addr: Remote address, pointer to in_addr or in6_addr + * @saddr: Source address, pointer to in_addr or in6_addr + * @daddr: Destination address, pointer to in_addr or in6_addr * @th: TCP header from tap: caller MUST ensure it's there * @opts: Pointer to start of options * @optlen: Bytes in options: caller MUST ensure available length * @now: Current timestamp */ -static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr, +static void tcp_conn_from_tap(struct ctx *c, + int af, const void *saddr, const void *daddr, const struct tcphdr *th, const char *opts, size_t optlen, const struct timespec *now) { @@ -2019,18 +2021,20 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr, struct sockaddr_in addr4 = { .sin_family = AF_INET, .sin_port = th->dest, - .sin_addr = *(struct in_addr *)addr, + .sin_addr = *(struct in_addr *)daddr, }; struct sockaddr_in6 addr6 = { .sin6_family = AF_INET6, .sin6_port = th->dest, - .sin6_addr = *(struct in6_addr *)addr, + .sin6_addr = *(struct in6_addr *)daddr, }; const 
struct sockaddr *sa; struct tcp_tap_conn *conn; socklen_t sl; int s, mss; + (void)saddr; + if (c->tcp.conn_count >= TCP_MAX_CONNS) return; @@ -2039,9 +2043,9 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr, return; if (!c->no_map_gw) { - if (af == AF_INET && IN4_ARE_ADDR_EQUAL(addr, &c->ip4.gw)) + if (af == AF_INET && IN4_ARE_ADDR_EQUAL(daddr, &c->ip4.gw)) addr4.sin_addr.s_addr = htonl(INADDR_LOOPBACK); - if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(addr, &c->ip6.gw)) + if (af == AF_INET6 && IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw)) addr6.sin6_addr = in6addr_loopback; } @@ -2078,7 +2082,7 @@ static void tcp_conn_from_tap(struct ctx *c, int af, const void *addr, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->addr, af, addr); + inany_from_af(&conn->addr, af, daddr); if (af == AF_INET) { sa = (struct sockaddr *)&addr4; @@ -2556,13 +2560,14 @@ static void tcp_conn_from_sock_finish(struct ctx *c, struct tcp_tap_conn *conn, * tcp_tap_handler() - Handle packets from tap and state transitions * @c: Execution context * @af: Address family, AF_INET or AF_INET6 - * @addr: Destination address + * @saddr: Source address + * @daddr: Destination address * @p: Pool of TCP packets, with TCP headers * @now: Current timestamp * * Return: count of consumed packets */ -int tcp_tap_handler(struct ctx *c, int af, const void *addr, +int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now) { struct tcp_tap_conn *conn; @@ -2583,12 +2588,13 @@ int tcp_tap_handler(struct ctx *c, int af, const void *addr, optlen = MIN(optlen, ((1UL << 4) /* from doff width */ - 6) * 4UL); opts = packet_get(p, 0, sizeof(*th), optlen, NULL); - conn = tcp_hash_lookup(c, af, addr, htons(th->source), htons(th->dest)); + conn = tcp_hash_lookup(c, af, daddr, htons(th->source), htons(th->dest)); /* New connection from tap */ if (!conn) { if (opts && th->syn 
&& !th->ack) - tcp_conn_from_tap(c, af, addr, th, opts, optlen, now); + tcp_conn_from_tap(c, af, saddr, daddr, th, + opts, optlen, now); return 1; } diff --git a/tcp.h b/tcp.h index be296ec..3454d9a 100644 --- a/tcp.h +++ b/tcp.h @@ -17,7 +17,7 @@ void tcp_timer_handler(struct ctx *c, union epoll_ref ref); void tcp_listen_handler(struct ctx *c, union epoll_ref ref, const struct timespec *now); void tcp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events); -int tcp_tap_handler(struct ctx *c, int af, const void *addr, +int tcp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now); int tcp_sock_init(const struct ctx *c, sa_family_t af, const void *addr, const char *ifname, in_port_t port); diff --git a/udp.c b/udp.c index 138e7ab..21c6888 100644 --- a/udp.c +++ b/udp.c @@ -799,7 +799,8 @@ void udp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events, * udp_tap_handler() - Handle packets from tap * @c: Execution context * @af: Address family, AF_INET or AF_INET6 - * @addr: Destination address + * @saddr: Source address + * @daddr: Destination address * @p: Pool of UDP packets, with UDP headers * @now: Current timestamp * @@ -807,7 +808,7 @@ void udp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events, * * #syscalls sendmmsg */ -int udp_tap_handler(struct ctx *c, int af, const void *addr, +int udp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now) { struct mmsghdr mm[UIO_MAXIOV]; @@ -821,6 +822,7 @@ int udp_tap_handler(struct ctx *c, int af, const void *addr, socklen_t sl; (void)c; + (void)saddr; uh = packet_get(p, 0, 0, sizeof(*uh), NULL); if (!uh) @@ -836,7 +838,7 @@ int udp_tap_handler(struct ctx *c, int af, const void *addr, s_in = (struct sockaddr_in) { .sin_family = AF_INET, .sin_port = uh->dest, - .sin_addr = *(struct in_addr *)addr, + .sin_addr = *(struct in_addr *)daddr, }; sa = 
(struct sockaddr *)&s_in; @@ -881,17 +883,17 @@ int udp_tap_handler(struct ctx *c, int af, const void *addr, s_in6 = (struct sockaddr_in6) { .sin6_family = AF_INET6, .sin6_port = uh->dest, - .sin6_addr = *(struct in6_addr *)addr, + .sin6_addr = *(struct in6_addr *)daddr, }; const struct in6_addr *bind_addr = &in6addr_any; sa = (struct sockaddr *)&s_in6; sl = sizeof(s_in6); - if (IN6_ARE_ADDR_EQUAL(addr, &c->ip6.dns_match) && + if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.dns_match) && ntohs(s_in6.sin6_port) == 53) { s_in6.sin6_addr = c->ip6.dns_host; - } else if (IN6_ARE_ADDR_EQUAL(addr, &c->ip6.gw) && + } else if (IN6_ARE_ADDR_EQUAL(daddr, &c->ip6.gw) && !c->no_map_gw) { if (!(udp_tap_map[V6][dst].flags & PORT_LOCAL) || (udp_tap_map[V6][dst].flags & PORT_LOOPBACK)) diff --git a/udp.h b/udp.h index 56bcd78..f9d4459 100644 --- a/udp.h +++ b/udp.h @@ -10,7 +10,7 @@ void udp_sock_handler(struct ctx *c, union epoll_ref ref, uint32_t events, const struct timespec *now); -int udp_tap_handler(struct ctx *c, int af, const void *addr, +int udp_tap_handler(struct ctx *c, int af, const void *saddr, const void *daddr, const struct pool *p, const struct timespec *now); int udp_sock_init(const struct ctx *c, int ns, sa_family_t af, const void *addr, const char *ifname, in_port_t port); -- 2.41.0

David Gibson

4:11 a.m.

New subject: [PATCH v3 3/9] tcp: More precise terms for addresses and ports

In a number of places the comments and variable names we use to
describe addresses and ports are ambiguous.  It's not sufficient to
describe a port as "tap-facing" or "socket-facing", because on both
the tap side and the socket side there are two ports for the two ends
of the connection.  Similarly, "local" and "remote" aren't particularly
helpful, because it's not necessarily clear whether we're talking from
the point of view of the guest/namespace, the host, or passt itself.

This patch makes a number of changes to be more precise about this.  It
introduces two new terms in aid of this:

A "forwarding" address (or port) refers to an address which is local
from the point of view of passt itself.  That is a source address for
traffic sent by passt, whether it's to the guest via the tap interface
or to a host on the internet via a socket.

The "endpoint" address (or port) is the reverse: a remote address from
passt's point of view, the destination address for traffic sent by
passt.

Between them, the "side" (either tap/guest-facing or sock/host-facing)
and forwarding vs. endpoint unambiguously describe which address or
port we're talking about.

Signed-off-by: David Gibson --- tcp.c | 93 +++++++++++++++++++++++++++--------------------------- tcp_conn.h | 12 +++---- 2 files changed, 53 insertions(+), 52 deletions(-) diff --git a/tcp.c b/tcp.c index 68141e9..74bf744 100644 --- a/tcp.c +++ b/tcp.c @@ -401,7 +401,7 @@ struct tcp6_l2_head { /* For MSS6 macro: keep in sync with tcp6_l2_buf_t */ #define OPT_SACK 5 #define OPT_TS 8 -#define CONN_V4(conn) (!!inany_v4(&(conn)->addr)) +#define CONN_V4(conn) (!!inany_v4(&(conn)->faddr)) #define CONN_V6(conn) (!CONN_V4(conn)) #define CONN_IS_CLOSING(conn) \ ((conn->events & ESTABLISHED) && \ @@ -434,7 +434,9 @@ static const char *tcp_flag_str[] __attribute((__unused__)) = { static int tcp_sock_init_ext [NUM_PORTS][IP_VERSIONS]; static int tcp_sock_ns [NUM_PORTS][IP_VERSIONS]; -/* Table of destinations with very low RTT (assumed to be local), LRU */ +/* Table of guest side forwarding addresses with very low RTT (assumed + * to be local to the host), LRU + */ static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; /* Static buffers */ @@ -858,7 +860,7 @@ static int tcp_rtt_dst_low(const struct tcp_tap_conn *conn) int i; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) - if (inany_equals(&conn->addr, low_rtt_dst + i)) + if (inany_equals(&conn->faddr, low_rtt_dst + i)) return 1; return 0; @@ -880,7 +882,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, return; for (i = 0; i < LOW_RTT_TABLE_SIZE; i++) { - if (inany_equals(&conn->addr, low_rtt_dst + i)) + if (inany_equals(&conn->faddr, low_rtt_dst + i)) return; if (hole == -1 && IN6_IS_ADDR_UNSPECIFIED(low_rtt_dst + i)) hole = i; @@ -892,7 +894,7 @@ static void tcp_rtt_dst_check(const struct tcp_tap_conn *conn, if (hole == -1) return; - low_rtt_dst[hole++] = conn->addr; + low_rtt_dst[hole++] = conn->faddr; if (hole == LOW_RTT_TABLE_SIZE) hole = 0; inany_from_af(low_rtt_dst + hole, AF_INET6, &in6addr_any); @@ -1162,18 +1164,18 @@ static int tcp_opt_get(const char *opts, size_t len, uint8_t type_find, /** * 
tcp_hash_match() - Check if a connection entry matches address and ports * @conn: Connection entry to match against - * @addr: Remote address - * @tap_port: tap-facing port - * @sock_port: Socket-facing port + * @faddr: Guest side forwarding address + * @eport: Guest side endpoint port + * @fport: Guest side forwarding port * * Return: 1 on match, 0 otherwise */ static int tcp_hash_match(const struct tcp_tap_conn *conn, - const union inany_addr *addr, - in_port_t tap_port, in_port_t sock_port) + const union inany_addr *faddr, + in_port_t eport, in_port_t fport) { - if (inany_equals(&conn->addr, addr) && - conn->tap_port == tap_port && conn->sock_port == sock_port) + if (inany_equals(&conn->faddr, faddr) && + conn->eport == eport && conn->fport == fport) return 1; return 0; @@ -1182,21 +1184,21 @@ static int tcp_hash_match(const struct tcp_tap_conn *conn, /** * tcp_hash() - Calculate hash value for connection given address and ports * @c: Execution context - * @addr: Remote address - * @tap_port: tap-facing port - * @sock_port: Socket-facing port + * @faddr: Guest side forwarding address + * @eport: Guest side endpoint port + * @fport: Guest side forwarding port * * Return: hash value, already modulo size of the hash table */ -static unsigned int tcp_hash(const struct ctx *c, const union inany_addr *addr, - in_port_t tap_port, in_port_t sock_port) +static unsigned int tcp_hash(const struct ctx *c, const union inany_addr *faddr, + in_port_t eport, in_port_t fport) { struct { - union inany_addr addr; - in_port_t tap_port; - in_port_t sock_port; + union inany_addr faddr; + in_port_t eport; + in_port_t fport; } __attribute__((__packed__)) in = { - *addr, tap_port, sock_port + *faddr, eport, fport }; uint64_t b = 0; @@ -1215,7 +1217,7 @@ static unsigned int tcp_hash(const struct ctx *c, const union inany_addr *addr, static unsigned int tcp_conn_hash(const struct ctx *c, const struct tcp_tap_conn *conn) { - return tcp_hash(c, &conn->addr, conn->tap_port, conn->sock_port); 
+ return tcp_hash(c, &conn->faddr, conn->eport, conn->fport); } /** @@ -1227,7 +1229,7 @@ static void tcp_hash_insert(const struct ctx *c, struct tcp_tap_conn *conn) { int b; - b = tcp_hash(c, &conn->addr, conn->tap_port, conn->sock_port); + b = tcp_hash(c, &conn->faddr, conn->eport, conn->fport); conn->next_index = tc_hash[b] ? CONN_IDX(tc_hash[b]) : -1; tc_hash[b] = conn; @@ -1296,25 +1298,24 @@ static void tcp_tap_conn_update(struct ctx *c, struct tcp_tap_conn *old, * tcp_hash_lookup() - Look up connection given remote address and ports * @c: Execution context * @af: Address family, AF_INET or AF_INET6 - * @addr: Remote address, pointer to in_addr or in6_addr - * @tap_port: tap-facing port - * @sock_port: Socket-facing port + * @faddr: Guest side forwarding address (guest remote address) + * @eport: Guest side endpoint port (guest local port) + * @fport: Guest side forwarding port (guest remote port) * * Return: connection pointer, if found, -ENOENT otherwise */ static struct tcp_tap_conn *tcp_hash_lookup(const struct ctx *c, - int af, const void *addr, - in_port_t tap_port, - in_port_t sock_port) + int af, const void *faddr, + in_port_t eport, in_port_t fport) { union inany_addr aany; struct tcp_tap_conn *conn; int b; - inany_from_af(&aany, af, addr); - b = tcp_hash(c, &aany, tap_port, sock_port); + inany_from_af(&aany, af, faddr); + b = tcp_hash(c, &aany, eport, fport); for (conn = tc_hash[b]; conn; conn = conn_at_idx(conn->next_index)) { - if (tcp_hash_match(conn, &aany, tap_port, sock_port)) + if (tcp_hash_match(conn, &aany, eport, fport)) return conn; } @@ -1447,13 +1448,13 @@ static size_t tcp_l2_buf_fill_headers(const struct ctx *c, void *p, size_t plen, const uint16_t *check, uint32_t seq) { - const struct in_addr *a4 = inany_v4(&conn->addr); + const struct in_addr *a4 = inany_v4(&conn->faddr); size_t ip_len, tlen; #define SET_TCP_HEADER_COMMON_V4_V6(b, conn, seq) \ do { \ - b->th.source = htons(conn->sock_port); \ - b->th.dest = htons(conn->tap_port); \ 
+ b->th.source = htons(conn->fport); \ + b->th.dest = htons(conn->eport); \ b->th.seq = htonl(seq); \ b->th.ack_seq = htonl(conn->seq_ack_to_tap); \ if (conn->events & ESTABLISHED) { \ @@ -1489,7 +1490,7 @@ do { \ ip_len = plen + sizeof(struct ipv6hdr) + sizeof(struct tcphdr); b->ip6h.payload_len = htons(plen + sizeof(struct tcphdr)); - b->ip6h.saddr = conn->addr.a6; + b->ip6h.saddr = conn->faddr.a6; if (IN6_IS_ADDR_LINKLOCAL(&b->ip6h.saddr)) b->ip6h.daddr = c->ip6.addr_ll_seen; else @@ -1842,7 +1843,7 @@ static void tcp_clamp_window(const struct ctx *c, struct tcp_tap_conn *conn, /** * tcp_seq_init() - Calculate initial sequence number according to RFC 6528 * @c: Execution context - * @conn: TCP connection, with addr, sock_port and tap_port populated + * @conn: TCP connection, with faddr, fport and eport populated * @now: Current timestamp */ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, @@ -1855,9 +1856,9 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn, union inany_addr dst; in_port_t dstport; } __attribute__((__packed__)) in = { - .src = conn->addr, - .srcport = conn->tap_port, - .dstport = conn->sock_port, + .src = conn->faddr, + .srcport = conn->eport, + .dstport = conn->fport, }; uint32_t ns, seq = 0; @@ -2082,7 +2083,7 @@ static void tcp_conn_from_tap(struct ctx *c, if (!(conn->wnd_from_tap = (htons(th->window) >> conn->ws_from_tap))) conn->wnd_from_tap = 1; - inany_from_af(&conn->addr, af, daddr); + inany_from_af(&conn->faddr, af, daddr); if (af == AF_INET) { sa = (struct sockaddr *)&addr4; @@ -2092,8 +2093,8 @@ static void tcp_conn_from_tap(struct ctx *c, sl = sizeof(addr6); } - conn->sock_port = ntohs(th->dest); - conn->tap_port = ntohs(th->source); + conn->fport = ntohs(th->dest); + conn->eport = ntohs(th->source); conn->seq_init_from_tap = ntohl(th->seq); conn->seq_from_tap = conn->seq_init_from_tap + 1; @@ -2753,10 +2754,10 @@ static void tcp_tap_conn_from_sock(struct ctx *c, conn->ws_to_tap = 
conn->ws_from_tap = 0; conn_event(c, conn, SOCK_ACCEPTED); - inany_from_sockaddr(&conn->addr, &conn->sock_port, sa); - conn->tap_port = ref.port; + inany_from_sockaddr(&conn->faddr, &conn->fport, sa); + conn->eport = ref.port; - tcp_snat_inbound(c, &conn->addr); + tcp_snat_inbound(c, &conn->faddr); tcp_seq_init(c, conn, now); tcp_hash_insert(c, conn); diff --git a/tcp_conn.h b/tcp_conn.h index 0b36940..e533bd4 100644 --- a/tcp_conn.h +++ b/tcp_conn.h @@ -35,9 +35,9 @@ extern const char *tcp_common_flag_str[]; * @ws_to_tap: Window scaling factor advertised to tap/guest * @sndbuf: Sending buffer in kernel, rounded to 2 ^ SNDBUF_BITS * @seq_dup_ack_approx: Last duplicate ACK number sent to tap - * @addr: Remote address (IPv4 or IPv6) - * @tap_port: Guest-facing tap port - * @sock_port: Remote, socket-facing port + * @faddr: Guest side forwarding address (guest's remote address) + * @eport: Guest side endpoint port (guest's local port) + * @fport: Guest side forwarding port (guest's remote port) * @wnd_from_tap: Last window size from tap, unscaled (as received) * @wnd_to_tap: Sending window advertised to tap, unscaled (as sent) * @seq_to_tap: Next sequence for packets to tap @@ -105,9 +105,9 @@ struct tcp_tap_conn { uint8_t seq_dup_ack_approx; - union inany_addr addr; - in_port_t tap_port; - in_port_t sock_port; + union inany_addr faddr; + in_port_t eport; + in_port_t fport; uint16_t wnd_from_tap; uint16_t wnd_to_tap; -- 2.41.0
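
The naming scheme introduced by this patch can be sketched in isolation. What follows is a minimal illustration, not code from passt: struct conn_tuple, struct tap_hdrs, tuple_from_tap() and reply_to_tap() are hypothetical names, and addresses are reduced to IPv4 values held in a uint32_t for brevity. It assumes a connection initiated by the guest, so the destination of the guest's SYN becomes the guest side forwarding address and port, and its source port becomes the guest side endpoint port.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the renamed connection fields */
struct conn_tuple {
	uint32_t faddr;	/* guest side forwarding address (guest's remote) */
	uint16_t eport;	/* guest side endpoint port (guest's local port) */
	uint16_t fport;	/* guest side forwarding port (guest's remote port) */
};

/* Hypothetical view of the address fields of a packet on the tap side */
struct tap_hdrs {
	uint32_t saddr, daddr;
	uint16_t source, dest;
};

/* Record the tuple for a connection request received from the guest */
static struct conn_tuple tuple_from_tap(const struct tap_hdrs *syn)
{
	struct conn_tuple t;

	assert(syn != NULL);
	t.faddr = syn->daddr;	/* what the guest thinks it is talking to */
	t.fport = syn->dest;
	t.eport = syn->source;	/* the guest's own port */
	return t;
}

/* Fill in address fields for a packet sent back towards the guest:
 * forwarding address/port are the source, endpoint port the destination
 */
static struct tap_hdrs reply_to_tap(const struct conn_tuple *t,
				    uint32_t guest_addr)
{
	struct tap_hdrs h;

	assert(t != NULL);
	h.saddr = t->faddr;
	h.source = t->fport;
	h.daddr = guest_addr;
	h.dest = t->eport;
	return h;
}
```

In this sketch, a reply built with reply_to_tap() simply mirrors the SYN that created the tuple, which is the property the forwarding/endpoint wording is meant to make easy to state.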

David Gibson

4:11 a.m.

New subject: [PATCH v3 4/9] tcp: Consistent usage of ports in tcp_seq_init()

In tcp_seq_init() the meaning of "src" and "dst" isn't really clear
since it's used for connections in both directions.  However, these
values are just feeding a hash, so as long as we're consistent and
include all the information we want, it doesn't really matter.

Oddly, for the "src" side we supply the (tap side) forwarding address
but the (tap side) endpoint port.  This again doesn't really matter,
but it's confusing.  So swap this with dstport, so "src" is always
forwarding and "dst" is always endpoint.

Signed-off-by: David Gibson
---
 tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tcp.c b/tcp.c
index 74bf744..56634c9 100644
--- a/tcp.c
+++ b/tcp.c
@@ -1857,8 +1857,8 @@ static void tcp_seq_init(const struct ctx *c, struct tcp_tap_conn *conn,
 		in_port_t dstport;
 	} __attribute__((__packed__)) in = {
 		.src = conn->faddr,
-		.srcport = conn->eport,
-		.dstport = conn->fport,
+		.srcport = conn->fport,
+		.dstport = conn->eport,
 	};
 	uint32_t ns, seq = 0;
 
-- 
2.41.0
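
The "consistent hash input" idea from this patch can be sketched as a standalone snippet. This is an illustration under stated assumptions, not the passt implementation: struct seq_hash_in and seq_hash_in_init() are hypothetical names, the endpoint address parameter stands in for however the real function fills its dst field, and FNV-1a is swapped in as a simple placeholder for whatever keyed hash the real code feeds this structure to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hash input: "src" is always the forwarding side,
 * "dst" is always the endpoint side.  Packed so that no padding
 * bytes with indeterminate contents end up in the hash.
 */
struct seq_hash_in {
	uint8_t src[16];
	uint16_t srcport;
	uint8_t dst[16];
	uint16_t dstport;
} __attribute__((__packed__));

/* Fill the hash input from forwarding and endpoint address/port pairs */
static void seq_hash_in_init(struct seq_hash_in *in,
			     const uint8_t faddr[16], uint16_t fport,
			     const uint8_t eaddr[16], uint16_t eport)
{
	assert(in != NULL);
	memcpy(in->src, faddr, sizeof(in->src));
	in->srcport = fport;
	memcpy(in->dst, eaddr, sizeof(in->dst));
	in->dstport = eport;
}

/* Placeholder hash (64-bit FNV-1a), NOT the hash used by passt */
static uint64_t fnv1a64(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return h;
}
```

As the commit message notes, which side is labelled "src" does not affect correctness so long as every caller fills the structure the same way; keeping address and port from the same side together only makes the code easier to read.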

David Gibson

4:11 a.m.

New subject: [PATCH v3 5/9] tcp, udp: Don't include destination address in partially precomputed csums

We partially prepopulate IP and TCP header structures including, amongst other things the destination address, which for IPv4 is always the known address of the guest/namespace. We partially precompute both the IPv4 header checksum and the TCP checksum based on this. In future we're going to want more flexibility with controlling the destination for IPv4 (as we already do for IPv6), so this precomputed value gets in the way. Therefore remove the IPv4 destination from the precomputed checksum and fold it into the checksum update when we actually send a packet. Doing this means we no longer need to recompute those partial sums when the destination address changes ({tcp,udp}_update_l2_buf()) and instead the computation can be moved to compile time. This means while we perform slightly more computations on each packet, we slightly reduce the amount of memory we need to access. Signed-off-by: David Gibson --- tcp.c | 61 ++++++++++++++++++++-------------------------------------- udp.c | 14 +++----------- util.h | 4 +++- 3 files changed, 27 insertions(+), 52 deletions(-) diff --git a/tcp.c b/tcp.c index 56634c9..c52ea2b 100644 --- a/tcp.c +++ b/tcp.c @@ -323,10 +323,8 @@ #define MSS_DEFAULT 536 struct tcp4_l2_head { /* For MSS4 macro: keep in sync with tcp4_l2_buf_t */ - uint32_t psum; - uint32_t tsum; #ifdef __AVX2__ - uint8_t pad[18]; + uint8_t pad[26]; #else uint8_t pad[2]; #endif @@ -443,8 +441,6 @@ static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; /** * tcp4_l2_buf_t - Pre-cooked IPv4 packet buffers for tap connections - * @psum: Partial IP header checksum (excluding tot_len and saddr) - * @tsum: Partial TCP header checksum (excluding length and saddr) * @pad: Align TCP header to 32 bytes, for AVX2 checksum calculation only * @taph: Tap-level headers (partially pre-filled) * @iph: Pre-filled IP header (except for tot_len and saddr) @@ -452,17 +448,15 @@ static union inany_addr low_rtt_dst[LOW_RTT_TABLE_SIZE]; * @data: Storage for TCP payload */ static struct 
tcp4_l2_buf_t { - uint32_t psum; /* 0 */ - uint32_t tsum; /* 4 */ #ifdef __AVX2__ - uint8_t pad[18]; /* 8, align th to 32 bytes */ + uint8_t pad[26]; /* 0, align th to 32 bytes */ #else - uint8_t pad[2]; /* align iph to 4 bytes 8 */ + uint8_t pad[2]; /* align iph to 4 bytes 0 */ #endif - struct tap_hdr taph; /* 26 10 */ - struct iphdr iph; /* 44 28 */ - struct tcphdr th; /* 64 48 */ - uint8_t data[MSS4]; /* 84 68 */ + struct tap_hdr taph; /* 26 2 */ + struct iphdr iph; /* 44 20 */ + struct tcphdr th; /* 64 40 */ + uint8_t data[MSS4]; /* 84 60 */ /* 65536 65532 */ #ifdef __AVX2__ } __attribute__ ((packed, aligned(32))) @@ -517,8 +511,6 @@ static struct iovec tcp_iov [UIO_MAXIOV]; /** * tcp4_l2_flags_buf_t - IPv4 packet buffers for segments without data (flags) - * @psum: Partial IP header checksum (excluding tot_len and saddr) - * @tsum: Partial TCP header checksum (excluding length and saddr) * @pad: Align TCP header to 32 bytes, for AVX2 checksum calculation only * @taph: Tap-level headers (partially pre-filled) * @iph: Pre-filled IP header (except for tot_len and saddr) @@ -526,16 +518,14 @@ static struct iovec tcp_iov [UIO_MAXIOV]; * @opts: Headroom for TCP options */ static struct tcp4_l2_flags_buf_t { - uint32_t psum; /* 0 */ - uint32_t tsum; /* 4 */ #ifdef __AVX2__ - uint8_t pad[18]; /* 8, align th to 32 bytes */ + uint8_t pad[26]; /* 0, align th to 32 bytes */ #else - uint8_t pad[2]; /* align iph to 4 bytes 8 */ + uint8_t pad[2]; /* align iph to 4 bytes 0 */ #endif - struct tap_hdr taph; /* 26 10 */ - struct iphdr iph; /* 44 28 */ - struct tcphdr th; /* 64 48 */ + struct tap_hdr taph; /* 26 2 */ + struct iphdr iph; /* 44 20 */ + struct tcphdr th; /* 64 40 */ char opts[OPT_MSS_LEN + OPT_WS_LEN + 1]; #ifdef __AVX2__ } __attribute__ ((packed, aligned(32))) @@ -953,11 +943,13 @@ void tcp_sock_set_bufsize(const struct ctx *c, int s) */ static void tcp_update_check_ip4(struct tcp4_l2_buf_t *buf) { - uint32_t sum = buf->psum; + uint32_t sum = 
L2_BUF_IP4_PSUM(IPPROTO_TCP); sum += buf->iph.tot_len; sum += (buf->iph.saddr >> 16) & 0xffff; sum += buf->iph.saddr & 0xffff; + sum += (buf->iph.daddr >> 16) & 0xffff; + sum += buf->iph.daddr & 0xffff; buf->iph.check = (uint16_t)~csum_fold(sum); } @@ -969,10 +961,12 @@ static void tcp_update_check_ip4(struct tcp4_l2_buf_t *buf) static void tcp_update_check_tcp4(struct tcp4_l2_buf_t *buf) { uint16_t tlen = ntohs(buf->iph.tot_len) - 20; - uint32_t sum = buf->tsum; + uint32_t sum = htons(IPPROTO_TCP); sum += (buf->iph.saddr >> 16) & 0xffff; sum += buf->iph.saddr & 0xffff; + sum += (buf->iph.daddr >> 16) & 0xffff; + sum += buf->iph.daddr & 0xffff; sum += htons(ntohs(buf->iph.tot_len) - 20); buf->th.check = 0; @@ -1023,20 +1017,6 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s, if (ip_da) { b4f->iph.daddr = b4->iph.daddr = ip_da->s_addr; - if (!i) { - b4f->iph.saddr = b4->iph.saddr = 0; - b4f->iph.tot_len = b4->iph.tot_len = 0; - b4f->iph.check = b4->iph.check = 0; - b4f->psum = b4->psum = sum_16b(&b4->iph, 20); - - b4->tsum = ((ip_da->s_addr >> 16) & 0xffff) + - (ip_da->s_addr & 0xffff) + - htons(IPPROTO_TCP); - b4f->tsum = b4->tsum; - } else { - b4f->psum = b4->psum = tcp4_l2_buf[0].psum; - b4f->tsum = b4->tsum = tcp4_l2_buf[0].tsum; - } } } } @@ -1045,15 +1025,16 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s, * tcp_sock4_iov_init() - Initialise scatter-gather L2 buffers for IPv4 sockets * @c: Execution context */ -static void tcp_sock4_iov_init(const struct ctx *c) +static void tcp_sock4_iov_init(struct ctx *c) { + struct iphdr iph = L2_BUF_IP4_INIT(IPPROTO_TCP); struct iovec *iov; int i; for (i = 0; i < ARRAY_SIZE(tcp4_l2_buf); i++) { tcp4_l2_buf[i] = (struct tcp4_l2_buf_t) { .taph = TAP_HDR_INIT(ETH_P_IP), - .iph = L2_BUF_IP4_INIT(IPPROTO_TCP), + .iph = iph, .th = { .doff = sizeof(struct tcphdr) / 4, .ack = 1 } }; } diff --git a/udp.c b/udp.c index 21c6888..6bb515a 100644 --- a/udp.c +++ b/udp.c 
@@ -168,7 +168,6 @@ static uint8_t udp_act[IP_VERSIONS][UDP_ACT_TYPE_MAX][DIV_ROUND_UP(NUM_PORTS, 8) /** * udp4_l2_buf_t - Pre-cooked IPv4 packet buffers for tap connections * @s_in: Source socket address, filled in by recvmmsg() - * @psum: Partial IP header checksum (excluding tot_len and saddr) * @taph: Tap-level headers (partially pre-filled) * @iph: Pre-filled IP header (except for tot_len and saddr) * @uh: Headroom for UDP header @@ -176,7 +175,6 @@ static uint8_t udp_act[IP_VERSIONS][UDP_ACT_TYPE_MAX][DIV_ROUND_UP(NUM_PORTS, 8) */ static struct udp4_l2_buf_t { struct sockaddr_in s_in; - uint32_t psum; struct tap_hdr taph; struct iphdr iph; @@ -263,11 +261,13 @@ static void udp_invert_portmap(struct udp_port_fwd *fwd) */ static void udp_update_check4(struct udp4_l2_buf_t *buf) { - uint32_t sum = buf->psum; + uint32_t sum = L2_BUF_IP4_PSUM(IPPROTO_UDP); sum += buf->iph.tot_len; sum += (buf->iph.saddr >> 16) & 0xffff; sum += buf->iph.saddr & 0xffff; + sum += (buf->iph.daddr >> 16) & 0xffff; + sum += buf->iph.daddr & 0xffff; buf->iph.check = (uint16_t)~csum_fold(sum); } @@ -292,14 +292,6 @@ void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s, if (ip_da) { b4->iph.daddr = ip_da->s_addr; - if (!i) { - b4->iph.saddr = 0; - b4->iph.tot_len = 0; - b4->iph.check = 0; - b4->psum = sum_16b(&b4->iph, 20); - } else { - b4->psum = udp4_l2_buf[0].psum; - } } } } diff --git a/util.h b/util.h index 23dcad5..e4db33a 100644 --- a/util.h +++ b/util.h @@ -141,11 +141,13 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags, .tot_len = 0, \ .id = 0, \ .frag_off = 0, \ - .ttl = 255, \ + .ttl = 0xff, \ .protocol = (proto), \ .saddr = 0, \ .daddr = 0, \ } +#define L2_BUF_IP4_PSUM(proto) ((uint32_t)htons_constant(0x4500) + \ + (uint32_t)htons_constant(0xff00 | (proto))) #define L2_BUF_IP6_INIT(proto) \ { \ -- 2.41.0
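
The checksum change in this patch relies on the Internet checksum being a plain one's complement sum: the constant header words can be summed once at compile time, and the per-packet words (total length, source and destination address) added when the packet is sent. The sketch below illustrates that equivalence under simplifying assumptions. All names (fold16(), IP4_PSUM(), ip4_check_incremental(), ip4_check_full()) are hypothetical, and unlike the patch, which sums 16-bit words as they sit in memory in network order, this version works on host-order values to stay independent of endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's complement accumulator down to 16 bits */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Reference: sum a buffer as big-endian 16-bit words (even length only) */
static uint32_t sum_words_be(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 2 == 0);
	for (i = 0; i < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	return sum;
}

/* Compile-time part: version/IHL/TOS word plus TTL (255)/protocol word.
 * ID, fragment offset and the checksum field itself are all zero.
 */
#define IP4_PSUM(proto)	(0x4500u + (0xff00u | (uint32_t)(proto)))

/* Per-packet part: add total length and both addresses, then fold */
static uint16_t ip4_check_incremental(uint8_t proto, uint16_t tot_len,
				      uint32_t saddr, uint32_t daddr)
{
	uint32_t sum = IP4_PSUM(proto);

	sum += tot_len;
	sum += (saddr >> 16) & 0xffff;
	sum += saddr & 0xffff;
	sum += (daddr >> 16) & 0xffff;
	sum += daddr & 0xffff;
	return (uint16_t)~fold16(sum);
}

/* Reference: build the whole 20-byte header and checksum all of it */
static uint16_t ip4_check_full(uint8_t proto, uint16_t tot_len,
			       uint32_t saddr, uint32_t daddr)
{
	uint8_t h[20] = {
		0x45, 0x00, (uint8_t)(tot_len >> 8), (uint8_t)tot_len,
		0, 0, 0, 0,			/* id, frag_off */
		0xff, proto, 0, 0,		/* ttl, protocol, check */
		(uint8_t)(saddr >> 24), (uint8_t)(saddr >> 16),
		(uint8_t)(saddr >> 8), (uint8_t)saddr,
		(uint8_t)(daddr >> 24), (uint8_t)(daddr >> 16),
		(uint8_t)(daddr >> 8), (uint8_t)daddr,
	};

	return (uint16_t)~fold16(sum_words_be(h, sizeof(h)));
}
```

Both functions return the same value for any input, which is why dropping the stored partial sums costs only a few extra additions per packet.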

David Gibson

4:11 a.m.

New subject: [PATCH v3 6/9] tcp, udp: Don't pre-fill IPv4 destination address in headers

Because packets sent on the tap interface will always be going to the
guest/namespace, we more-or-less know what address they'll be going to.
So we pre-fill this destination address in our header buffers for IPv4.
We can't do the same for IPv6 because we could need either the global or
link-local address for the guest.

In future we're going to want more flexibility for the destination
address, so this pre-filling will get in the way. Change the flow so we
always fill in the IPv4 destination address for each packet, rather than
prefilling it from proto_update_l2_buf().

In fact for TCP we already redundantly filled the destination for each
packet anyway.

Signed-off-by: David Gibson
---
 passt.c | 10 ++++------
 passt.h |  4 ++--
 pasta.c |  2 +-
 tap.c   |  8 +++-----
 tcp.c   |  8 +-------
 tcp.h   |  3 +--
 udp.c   |  9 ++-------
 udp.h   |  3 +--
 8 files changed, 15 insertions(+), 32 deletions(-)

diff --git a/passt.c b/passt.c
index ca0acc6..8ddd9b3 100644
--- a/passt.c
+++ b/passt.c
@@ -117,13 +117,11 @@ static void timer_init(struct ctx *c, const struct timespec *now)
  * proto_update_l2_buf() - Update scatter-gather L2 buffers in protocol handlers
  * @eth_d:	Ethernet destination address, NULL if unchanged
  * @eth_s:	Ethernet source address, NULL if unchanged
- * @ip_da:	Pointer to IPv4 destination address, NULL if unchanged
  */
-void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
-			 const struct in_addr *ip_da)
+void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
 {
-	tcp_update_l2_buf(eth_d, eth_s, ip_da);
-	udp_update_l2_buf(eth_d, eth_s, ip_da);
+	tcp_update_l2_buf(eth_d, eth_s);
+	udp_update_l2_buf(eth_d, eth_s);
 }

 /**
@@ -247,7 +245,7 @@ int main(int argc, char **argv)
 	if (!c.no_icmp)
 		icmp_init();

-	proto_update_l2_buf(c.mac_guest, c.mac, &c.ip4.addr);
+	proto_update_l2_buf(c.mac_guest, c.mac);

 	if (c.ifi4 && !c.no_dhcp)
 		dhcp_init();
diff --git a/passt.h b/passt.h
index 0500ff0..282bd1a 100644
--- a/passt.h
+++ b/passt.h
@@ -303,7 +303,7 @@ struct ctx {
 	int low_rmem;
 };

-void proto_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
-			 const struct in_addr *ip_da);
+void proto_update_l2_buf(const unsigned char *eth_d,
+			 const unsigned char *eth_s);

 #endif /* PASST_H */
diff --git a/pasta.c b/pasta.c
index dbe8e8c..9bc26d5 100644
--- a/pasta.c
+++ b/pasta.c
@@ -353,7 +353,7 @@ void pasta_ns_conf(struct ctx *c)
 		}
 	}

-	proto_update_l2_buf(c->mac_guest, NULL, NULL);
+	proto_update_l2_buf(c->mac_guest, NULL);
 }

 /**
diff --git a/tap.c b/tap.c
index 4ed63f1..ee79be0 100644
--- a/tap.c
+++ b/tap.c
@@ -625,10 +625,8 @@ resume:

 		l4_len = l3_len - hlen;

-		if (iph->saddr && c->ip4.addr_seen.s_addr != iph->saddr) {
+		if (iph->saddr && c->ip4.addr_seen.s_addr != iph->saddr)
 			c->ip4.addr_seen.s_addr = iph->saddr;
-			proto_update_l2_buf(NULL, NULL, &c->ip4.addr_seen);
-		}

 		l4h = packet_get(in, i, sizeof(*eh) + hlen, l4_len, NULL);
 		if (!l4h)
@@ -969,7 +967,7 @@ redo:

 		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
 			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
-			proto_update_l2_buf(c->mac_guest, NULL, NULL);
+			proto_update_l2_buf(c->mac_guest, NULL);
 		}

 		switch (ntohs(eh->h_proto)) {
@@ -1030,7 +1028,7 @@ restart:

 		if (memcmp(c->mac_guest, eh->h_source, ETH_ALEN)) {
 			memcpy(c->mac_guest, eh->h_source, ETH_ALEN);
-			proto_update_l2_buf(c->mac_guest, NULL, NULL);
+			proto_update_l2_buf(c->mac_guest, NULL);
 		}

 		switch (ntohs(eh->h_proto)) {
diff --git a/tcp.c b/tcp.c
index c52ea2b..87f443a 100644
--- a/tcp.c
+++ b/tcp.c
@@ -997,10 +997,8 @@ static void tcp_update_check_tcp6(struct tcp6_l2_buf_t *buf)
  * tcp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
  * @eth_d:	Ethernet destination address, NULL if unchanged
  * @eth_s:	Ethernet source address, NULL if unchanged
- * @ip_da:	Pointer to IPv4 destination address, NULL if unchanged
  */
-void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
-		       const struct in_addr *ip_da)
+void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
 {
 	int i;

@@ -1014,10 +1012,6 @@ void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
 		tap_update_mac(&b6->taph, eth_d, eth_s);
 		tap_update_mac(&b4f->taph, eth_d, eth_s);
 		tap_update_mac(&b6f->taph, eth_d, eth_s);
-
-		if (ip_da) {
-			b4f->iph.daddr = b4->iph.daddr = ip_da->s_addr;
-		}
 	}
 }

diff --git a/tcp.h b/tcp.h
index 3454d9a..1608d58 100644
--- a/tcp.h
+++ b/tcp.h
@@ -26,8 +26,7 @@ void tcp_timer(struct ctx *c, const struct timespec *ts);
 void tcp_defer_handler(struct ctx *c);

 void tcp_sock_set_bufsize(const struct ctx *c, int s);
-void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
-		       const struct in_addr *ip_da);
+void tcp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);

 /**
  * union tcp_epoll_ref - epoll reference portion for TCP connections
diff --git a/udp.c b/udp.c
index 6bb515a..f4ed660 100644
--- a/udp.c
+++ b/udp.c
@@ -276,10 +276,8 @@ static void udp_update_check4(struct udp4_l2_buf_t *buf)
  * udp_update_l2_buf() - Update L2 buffers with Ethernet and IPv4 addresses
  * @eth_d:	Ethernet destination address, NULL if unchanged
  * @eth_s:	Ethernet source address, NULL if unchanged
- * @ip_da:	Pointer to IPv4 destination address, NULL if unchanged
  */
-void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
-		       const struct in_addr *ip_da)
+void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s)
 {
 	int i;

@@ -289,10 +287,6 @@ void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,

 		tap_update_mac(&b4->taph, eth_d, eth_s);
 		tap_update_mac(&b6->taph, eth_d, eth_s);
-
-		if (ip_da) {
-			b4->iph.daddr = ip_da->s_addr;
-		}
 	}
 }

@@ -578,6 +572,7 @@ static size_t udp_update_hdr4(const struct ctx *c, int n, in_port_t dstport,
 	ip_len = udp4_l2_mh_sock[n].msg_len + sizeof(b->iph) + sizeof(b->uh);

 	b->iph.tot_len = htons(ip_len);
+	b->iph.daddr = c->ip4.addr_seen.s_addr;

 	src_port = ntohs(b->s_in.sin_port);

diff --git a/udp.h b/udp.h
index f9d4459..f553de2 100644
--- a/udp.h
+++ b/udp.h
@@ -16,8 +16,7 @@ int udp_sock_init(const struct ctx *c, int ns, sa_family_t af,
 		  const void *addr, const char *ifname, in_port_t port);
 int udp_init(struct ctx *c);
 void udp_timer(struct ctx *c, const struct timespec *ts);
-void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s,
-		       const struct in_addr *ip_da);
+void udp_update_l2_buf(const unsigned char *eth_d, const unsigned char *eth_s);

 /**
  * union udp_epoll_ref - epoll reference portion for TCP connections
-- 
2.41.0
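The change of flow described in this patch (fill the destination in for every packet instead of pre-filling it whenever the guest address changes) can be reduced to the following sketch. The types and names here (struct demo_ctx, struct demo_hdr4, demo_update_hdr4()) are hypothetical and much simpler than passt's own; only the idea of reading the address from the context at send time is taken from the patch.

```c
#include <stdint.h>
#include <arpa/inet.h>

/* Hypothetical, reduced context: only the guest address last seen,
 * stored in network byte order.
 */
struct demo_ctx {
	uint32_t addr_seen;
};

/* Hypothetical, reduced IPv4 header: only the per-packet fields */
struct demo_hdr4 {
	uint16_t tot_len;
	uint32_t saddr;
	uint32_t daddr;
};

/* Fill in the per-packet fields. The destination is read from the
 * context on every call, so no separate "update all buffers" step is
 * needed when the guest address changes.
 */
static void demo_update_hdr4(const struct demo_ctx *c, struct demo_hdr4 *h,
			     uint32_t saddr, uint16_t ip_len)
{
	h->tot_len = htons(ip_len);
	h->saddr = saddr;
	h->daddr = c->addr_seen;
}
```

With this shape, a change to addr_seen between two calls is picked up by the second call without touching any buffer in between, which is the flexibility the commit message is after.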

David Gibson

4:11 a.m.

New subject: [PATCH v3 7/9] tcp: Move in_epoll flag out of common connection structure

The in_epoll boolean is one of only two fields (currently) in the common
structure shared between tap and spliced connections. It seems like it
belongs there, because both tap and spliced connections use it, and it
has roughly the same meaning.

Roughly, however, isn't exactly: which fds this flag says are in the
epoll set varies between the two connection types, and those fds are
stored in type specific fields. So, it's only possible to meaningfully
use this value locally in type specific code anyway.

This common field is going to get in the way of more widespread
generalisation of connection / flow tracking, so move it to separate
fields in the tap and splice specific structures.

Signed-off-by: David Gibson
---
 tcp.c        | 6 +++---
 tcp_conn.h   | 6 ++++--
 tcp_splice.c | 4 ++--
 3 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/tcp.c b/tcp.c
index 87f443a..f396ede 100644
--- a/tcp.c
+++ b/tcp.c
@@ -634,13 +634,13 @@ static void conn_flag_do(const struct ctx *c, struct tcp_tap_conn *conn,
  */
 static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 {
-	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref = { .type = EPOLL_TYPE_TCP, .fd = conn->sock,
 				.tcp.index = CONN_IDX(conn) };
 	struct epoll_event ev = { .data.u64 = ref.u64 };

 	if (conn->events == CLOSED) {
-		if (conn->c.in_epoll)
+		if (conn->in_epoll)
 			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->sock, &ev);
 		if (conn->timer != -1)
 			epoll_ctl(c->epollfd, EPOLL_CTL_DEL, conn->timer, &ev);
@@ -652,7 +652,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 	if (epoll_ctl(c->epollfd, m, conn->sock, &ev))
 		return -errno;

-	conn->c.in_epoll = true;
+	conn->in_epoll = true;

 	if (conn->timer != -1) {
 		union epoll_ref ref_t = { .type = EPOLL_TYPE_TCP_TIMER,
diff --git a/tcp_conn.h b/tcp_conn.h
index e533bd4..d67ea62 100644
--- a/tcp_conn.h
+++ b/tcp_conn.h
@@ -12,11 +12,9 @@
 /**
  * struct tcp_conn_common - Common fields for spliced and non-spliced
  * @spliced:		Is this a spliced connection?
- * @in_epoll:		Is the connection in the epoll set?
  */
 struct tcp_conn_common {
 	bool spliced	:1;
-	bool in_epoll	:1;
 };

 extern const char *tcp_common_flag_str[];
@@ -24,6 +22,7 @@ extern const char *tcp_common_flag_str[];
 /**
  * struct tcp_tap_conn - Descriptor for a TCP connection (not spliced)
  * @c:			Fields common with tcp_splice_conn
+ * @in_epoll:		Is the connection in the epoll set?
  * @next_index:		Connection index of next item in hash chain, -1 for none
  * @tap_mss:		MSS advertised by tap/guest, rounded to 2 ^ TCP_MSS_BITS
  * @sock:		Socket descriptor number
@@ -50,6 +49,7 @@ struct tcp_tap_conn {
 	/* Must be first element to match tcp_splice_conn */
 	struct tcp_conn_common c;

+	bool		in_epoll	:1;
 	int	 	next_index	:TCP_CONN_INDEX_BITS + 2;

 #define TCP_RETRANS_BITS		3
@@ -122,6 +122,7 @@ struct tcp_tap_conn {
 /**
  * struct tcp_splice_conn - Descriptor for a spliced TCP connection
  * @c:			Fields common with tcp_tap_conn
+ * @in_epoll:		Is the connection in the epoll set?
  * @a:			File descriptor number of socket for accepted connection
  * @pipe_a_b:		Pipe ends for splice() from @a to @b
  * @b:			File descriptor number of peer connected socket
@@ -137,6 +138,7 @@ struct tcp_splice_conn {
 	/* Must be first element to match tcp_tap_conn */
 	struct tcp_conn_common c;

+	bool in_epoll	:1;
 	int a;
 	int pipe_a_b[2];
 	int b;
diff --git a/tcp_splice.c b/tcp_splice.c
index 64c1263..1f89d6a 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -172,7 +172,7 @@ static void conn_flag_do(const struct ctx *c, struct tcp_splice_conn *conn,
 static int tcp_splice_epoll_ctl(const struct ctx *c,
 				struct tcp_splice_conn *conn)
 {
-	int m = conn->c.in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
+	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
 	union epoll_ref ref_a = { .type = EPOLL_TYPE_TCP, .fd = conn->a,
 				  .tcp.index = CONN_IDX(conn) };
 	union epoll_ref ref_b = { .type = EPOLL_TYPE_TCP, .fd = conn->b,
@@ -192,7 +192,7 @@ static int tcp_splice_epoll_ctl(const struct ctx *c,
 	    epoll_ctl(c->epollfd, m, conn->b, &ev_b))
 		goto delete;

-	conn->c.in_epoll = true;
+	conn->in_epoll = true;

 	return 0;

-- 
2.41.0
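For readers unfamiliar with the pattern the in_epoll flag supports, here is a minimal Linux-only sketch of it. struct demo_conn and demo_epoll_ctl() are hypothetical names, and the structure is far simpler than tcp_tap_conn or tcp_splice_conn; the sketch only shows a per-connection flag selecting between EPOLL_CTL_ADD and EPOLL_CTL_MOD, as the two *_epoll_ctl() functions in the diff do.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical, simplified connection descriptor (not passt's) */
struct demo_conn {
	bool in_epoll	:1;	/* Is @sock already in the epoll set? */
	int sock;		/* File descriptor to watch */
};

/* Register @conn->sock with @epollfd, or update its event mask if it
 * is already registered. Returns 0 on success, negative errno on failure.
 */
static int demo_epoll_ctl(int epollfd, struct demo_conn *conn,
			  uint32_t events)
{
	int m = conn->in_epoll ? EPOLL_CTL_MOD : EPOLL_CTL_ADD;
	struct epoll_event ev = { .events = events, .data.fd = conn->sock };

	if (epoll_ctl(epollfd, m, conn->sock, &ev))
		return -errno;

	/* Only meaningful for the fd(s) this connection type owns */
	conn->in_epoll = true;

	return 0;
}
```

The flag only describes the descriptors that this particular structure owns. A spliced connection tracks two sockets where a tap connection tracks one socket (plus a timer), which is why the commit message argues the field can only be interpreted in type specific code.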

David Gibson

4:11 a.m.

New subject: [PATCH v3 8/9] inany: Add missing double include guard to inany.h

This was overlooked when the file was created.

Signed-off-by: David Gibson
---
 inany.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/inany.h b/inany.h
index 8f0897b..aadb20b 100644
--- a/inany.h
+++ b/inany.h
@@ -6,6 +6,9 @@
  * IPv6 or IPv4 (encoded as IPv4-mapped IPv6 addresses)
  */

+#ifndef INANY_H
+#define INANY_H
+
 /** union inany_addr - Represents either an IPv4 or IPv6 address
  * @a6:			Address as an IPv6 address, may be IPv4-mapped
  * @v4mapped.zero:	All zero-bits for an IPv4 address
@@ -90,3 +93,5 @@ static inline void inany_from_sockaddr(union inany_addr *aa, in_port_t *port,
 		ASSERT(0);
 	}
 }
+
+#endif /* INANY_H */
-- 
2.41.0

David Gibson

4:11 a.m.

New subject: [PATCH v3 9/9] tcp: Remove broken pressure calculations for tcp_defer_handler()

tcp_defer_handler() performs a potentially expensive linear scan of the
connection table. So, to mitigate the cost of that we skip it if we're
not under at least moderate pressure: either 30% of available
connections or 30% (estimated) of available fds used.

But, the calculation for this has been broken since it was introduced:
we calculate "max_conns" based on c->tcp.conn_count, not TCP_MAX_CONNS,
meaning we only exit early if conn_count is less than 30% of itself,
i.e. never.

If that calculation is "corrected" to be based on TCP_MAX_CONNS, it
completely tanks the TCP CRR times for passt - from ~60ms to >1000ms on
my laptop. My guess is that this is because in the case of many short
lived connections, we're letting the table become much fuller before
compacting it. That means that other places which perform a table scan
now have to do much, much more.

For the time being, simply remove the tests, since they're not doing
anything useful. We can reintroduce them more carefully if we see a need
for them.

Signed-off-by: David Gibson
---
 tcp.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/tcp.c b/tcp.c
index f396ede..c89e6e4 100644
--- a/tcp.c
+++ b/tcp.c
@@ -309,9 +309,6 @@
 #define TCP_FRAMES							\
 	(c->mode == MODE_PASST ? TCP_FRAMES_MEM : 1)

-#define TCP_FILE_PRESSURE		30 /* % of c->nofile */
-#define TCP_CONN_PRESSURE		30 /* % of c->tcp.conn_count */
-
 #define TCP_HASH_TABLE_LOAD		70		/* % */
 #define TCP_HASH_TABLE_SIZE		(TCP_MAX_CONNS * 100 /	\
 					 TCP_HASH_TABLE_LOAD)
@@ -1385,17 +1382,11 @@ static void tcp_l2_data_buf_flush(struct ctx *c)
  */
 void tcp_defer_handler(struct ctx *c)
 {
-	int max_conns = c->tcp.conn_count / 100 * TCP_CONN_PRESSURE;
-	int max_files = c->nofile / 100 * TCP_FILE_PRESSURE;
 	union tcp_conn *conn;

 	tcp_l2_flags_buf_flush(c);
 	tcp_l2_data_buf_flush(c);

-	if ((c->tcp.conn_count < MIN(max_files, max_conns)) &&
-	    (c->tcp.splice_conn_count < MIN(max_files / 6, max_conns)))
-		return;
-
 	for (conn = tc + c->tcp.conn_count - 1; conn >= tc; conn--) {
 		if (conn->c.spliced) {
 			if (conn->splice.flags & CLOSING)
-- 
2.41.0
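The "less than 30% of itself" observation can be checked with a reduced sketch of the removed test. The function names and the table_size parameter below are hypothetical, and only the connection-count term of the original condition is modelled. Since MIN(max_files, max_conns) can never exceed max_conns, the removed early return required conn_count < max_conns, which is the comparison shown here.

```c
#include <stdbool.h>

#define DEMO_CONN_PRESSURE	30	/* %, same value as the removed macro */

/* Threshold derived from the current count itself, as in the removed
 * code: conn_count / 100 * 30 is at most 30% of conn_count, so for any
 * non-negative count the comparison is false.
 */
static bool below_pressure_broken(int conn_count)
{
	int max_conns = conn_count / 100 * DEMO_CONN_PRESSURE;

	return conn_count < max_conns;
}

/* The "corrected" variant the commit message mentions: threshold derived
 * from a fixed table size (hypothetical parameter standing in for
 * TCP_MAX_CONNS).
 */
static bool below_pressure_fixed(int conn_count, int table_size)
{
	int max_conns = table_size / 100 * DEMO_CONN_PRESSURE;

	return conn_count < max_conns;
}
```

below_pressure_broken() returns false for every non-negative count, so the early return it guarded was dead code and removing it does not change behaviour. below_pressure_fixed() does return true for small counts, which is the variant the commit message reports as hurting CRR latency.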
