[PATCH v3 0/9] Flow Table Preliminaries
I'm still working on bunch of things to start implementing the generalised flow table. However, I think this set of preliminary clean ups and fixes stand well enough on their own that they're ready for merge now. Changes since v2: * Fix a formatting error in the in_epoll patch * Add patch for inany.h include guards * Add patch to remove broken pressure estimates for tcp_defer_handler() Changes since v1: * Add missing patch moving in_epoll flag David Gibson (9): tap: Don't clobber source address in tap6_handler() tap: Pass source address to protocol handler functions tcp: More precise terms for addresses and ports tcp: Consistent usage of ports in tcp_seq_init() tcp, udp: Don't include destination address in partially precomputed csums tcp, udp: Don't pre-fill IPv4 destination address in headers tcp: Move in_epoll flag out of common connection structure inany: Add missing double include guard to inany.h tcp: Remove broken pressure calculations for tcp_defer_handler() icmp.c | 12 ++- icmp.h | 3 +- inany.h | 5 ++ passt.c | 10 +-- passt.h | 4 +- pasta.c | 2 +- tap.c | 29 ++++---- tcp.c | 203 ++++++++++++++++++++++----------------------------- tcp.h | 5 +- tcp_conn.h | 18 +++-- tcp_splice.c | 4 +- udp.c | 37 ++++------ udp.h | 5 +- util.h | 4 +- 14 files changed, 156 insertions(+), 185 deletions(-) -- 2.41.0
In tap6_handler() saddr is initialized to the IPv6 source address from the
incoming packet. However part way through, but before organizing the
packet into a "sequence" we set it unconditionally to the guest's assigned
address. We don't do anything equivalent for IPv4.
This doesn't make a lot of sense: if the guest is using a different source
address it makes sense to consider these different sequences of packets and
we shouldn't try to combine them together.
Signed-off-by: David Gibson
The tap code passes the IPv4 or IPv6 destination address of packets it
receives to the protocol specific code. Currently that protocol code
doesn't use the source address, but we want it to in future. So, in
preparation, pass the IPv4/IPv6 source address of tap packets to those
functions as well.
Signed-off-by: David Gibson
In a number of places the comments and variable names we use to describe
addresses and ports are ambiguous. It's not sufficient to describe a port
as "tap-facing" or "socket-facing", because on both the tap side and the
socket side there are two ports for the two ends of the connection.
Similarly, "local" and "remote" aren't particularly helpful, because it's
not necessarily clear whether we're talking from the point of view of the
guest/namespace, the host, or passt itself.
This patch makes a number of changes to be more precise about this. It
introduces two new terms in aid of this:
A "forwarding" address (or port) refers to an address which is local
from the point of view of passt itself. That is a source address for
traffic sent by passt, whether it's to the guest via the tap interface
or to a host on the internet via a socket.
The "endpoint" address (or port) is the reverse: a remote address
from passt's point of view, the destination address for traffic sent
by passt.
Between them the "side" (either tap/guest-facing or sock/host-facing)
and forwarding vs. endpoint unambiguously describes which address or
port we're talking about.
Signed-off-by: David Gibson
In tcp_seq_init() the meaning of "src" and "dst" isn't really clear since
it's used for connections in both directions. However, these values are
just feeding a hash, so as long as we're consistent and include all the
information we want, it doesn't really matter.
Oddly, for the "src" side we supply the (tap side) forwarding address but
the (tap side) endpoint port. This again doesn't really matter, but it's
confusing. So swap this with dstport, so "src" is always forwarding
and "dst" is always endpoint.
Signed-off-by: David Gibson
We partially prepopulate IP and TCP header structures including, amongst
other things the destination address, which for IPv4 is always the known
address of the guest/namespace. We partially precompute both the IPv4
header checksum and the TCP checksum based on this.
In future we're going to want more flexibility with controlling the
destination for IPv4 (as we already do for IPv6), so this precomputed value
gets in the way. Therefore remove the IPv4 destination from the
precomputed checksum and fold it into the checksum update when we actually
send a packet.
Doing this means we no longer need to recompute those partial sums when
the destination address changes ({tcp,udp}_update_l2_buf()) and instead
the computation can be moved to compile time. This means while we perform
slightly more computations on each packet, we slightly reduce the amount of
memory we need to access.
Signed-off-by: David Gibson
Because packets sent on the tap interface will always be going to the
guest/namespace, we more-or-less know what address they'll be going to. So
we pre-fill this destination address in our header buffers for IPv4. We
can't do the same for IPv6 because we could need either the global or
link-local address for the guest. In future we're going to want more
flexibility for the destination address, so this pre-filling will get in
the way.
Change the flow so we always fill in the IPv4 destination address for each
packet, rather than prefilling it from proto_update_l2_buf(). In fact for
TCP we already redundantly filled the destination for each packet anyway.
Signed-off-by: David Gibson
The in_epoll boolean is one of only two fields (currently) in the common
structure shared between tap and spliced connections. It seems like it
belongs there, because both tap and spliced connections use it, and it has
roughly the same meaning.
Roughly, however, isn't exactly: which fds this flag says are in the epoll
varies between the two connection types, and are in type specific fields.
So, it's only possible to meaningfully use this value locally in type
specific code anyway.
This common field is going to get in the way of more widespread
generalisation of connection / flow tracking, so move it to separate fields
in the tap and splice specific structures.
Signed-off-by: David Gibson
This was overlooked when the file was created.
Signed-off-by: David Gibson
tcp_defer_handler() performs a potentially expensive linear scan of the
connection table. So, to mitigate the cost of that we skip if if we're not
under at least moderate pressure: either 30% of available connections or
30% (estimated) of available fds used.
But, the calculation for this has been broken since it was introduced: we
calculate "max_conns" based on c->tcp.conn_count, not TCP_MAX_CONNS,
meaning we only exit early if conn_count is less than 30% of itself, i.e.
never.
If that calculation is "corrected" to be based on TCP_MAX_CONNS, it
completely tanks the TCP CRR times for passt - from ~60ms to >1000ms on my
laptop. My guess is that this is because in the case of many short lived
connections, we're letting the table become much fuller before compacting
it. That means that other places which perform a table scan now have to
do much, much more.
For the time being, simply remove the tests, since they're not doing
anything useful. We can reintroduce them more carefully if we see a need
for them.
Signed-off-by: David Gibson
participants (1)
-
David Gibson