[PATCH 0/6] RFC: Possible fixes for bug 94 and bug 95
As discussed on IRC through the last day, here's a more polished version of possible fixes for bug 94 (EPOLLRDHUP storm) and bug 95 (EPOLLERR storm). Both of those were sub-problems arising while investigating podman bug 23686. We're pretty confident about the EPOLLRDHUP fix (bug 94, patch 1/6), based on Stefano's testing. I ended up rewriting Stefano's draft patch for the EPOLLERR case (bug 95, remaining patches), because I thought of a possibility we hadn't discussed yet: we weren't getting an error from the socket error queue, but we might be able to get one with the SO_ERROR getsockopt(). My examination of the kernel code suggests that's plausible, and that if that's the case using SO_ERROR should also clear that error condition. Link: https://bugs.passt.top/show_bug.cgi?id=94 Link: https://bugs.passt.top/show_bug.cgi?id=95 Link: https://github.com/containers/podman/issues/23686 David Gibson (6): flow: Fix incorrect hash probe in flowside_lookup() udp: Allow UDP flows to be prematurely closed flow: Helpers to log details of a flow udp: Split socket error handling out from udp_sock_recv() udp: Treat errors getting errors as unrecoverable udp: Handle more error conditions in udp_sock_errs() flow.c | 53 ++++++++++++++++++++------------ flow.h | 7 +++++ udp.c | 88 +++++++++++++++++++++++++++++++++++++++++++++--------- udp_flow.c | 18 ++++++++++- udp_flow.h | 4 +++ 5 files changed, 136 insertions(+), 34 deletions(-) -- 2.46.0
Our flow hash table uses linear probing in which we step backwards through
clusters of adjacent hash entries when we have near collisions. Usually
that's implemented by flow_hash_probe(). However, due to some details we
need a second implementation in flowside_lookup(). An embarrassing
oversight in rebasing from earlier versions has mean that version is
incorrect, trying to step forward through clusters rather than backward.
In situations with the right sorts of has near-collisions this can lead to
us not associating an ACK from the tap device with the right flow, leaving
it in a not-quite-established state. If the remote peer does a shutdown()
at the right time, this can lead to a storm of EPOLLRDHUP events causing
high CPU load.
Fixes: acca4235c ("flow, tcp: Generalise TCP hash table to general...")
Link: https://bugs.passt.top/show_bug.cgi?id=94
Suggested-by: Stefano Brivio
Unlike TCP, UDP has no in-band signalling for the end of a flow. So the
only way we remove flows is on a timer if they have no activity for 180s.
However, we've started to investigate some error conditions in which we
want to prematurely abort / abandon a UDP flow. We can call
udp_flow_close(), which will make the flow inert (sockets closed, no epoll
events, can't be looked up in hash). However it will still wait 3 minutes
to clear away the stale entry.
Clean this up by adding an explicit 'closed' flag which will cause a flow
to be more promptly cleaned up. We also publish udp_flow_close() so it
can be called from other places to abort UDP flows().
Signed-off-by: David Gibson
The details of a flow - endpoints, interfaces etc. - can be pretty
important for debugging. We log this on flow state transitions, but it can
also be useful to log this when we report specific conditions. Add some
helper functions and macros to make it easy to do that.
Signed-off-by: David Gibson
Currently udp_sock_recv() both attempts to clear socket errors and read
a batch of datagrams for forwarding. That made sense initially, since
both listening and reply sockets need to do this. However, we have certain
error cases which will add additional complexity to the error processing.
Furthermore, if we ever wanted to more thoroughly handle errors received
here - e.g. by synthesising ICMP messages on the tap device - it will
likely require different handling for the listening and reply socket cases.
So, split handling of error events into its own udp_sock_errs() function.
While we're there, allow it to report "unrecoverable errors". We don't
have any of these so far, but some cases we're working on might require it.
Signed-off-by: David Gibson
We can get network errors, usually transient, reported via the socket error
queue. However, at least theoretically, we could get errors trying to
read the queue itself. Since we have no idea how to clear an error
condition in that case, treat it as unrecoverable.
Signed-off-by: David Gibson
udp_sock_errs() reads out everything in the socket error queue. However
we've seen some cases[0] where an EPOLLERR event is active, but there isn't
anything in the queue.
One possibility is that the error is reported instead by the SO_ERROR
sockopt. Check for that case and report it as best we can. If we still
get an EPOLLERR without visible error, we have no way to clear the error
state, so treat it as an unrecoverable error.
[0] https://github.com/containers/podman/issues/23686#issuecomment-2324945010
Link: https://bugs.passt.top/show_bug.cgi?id=95
Signed-off-by: David Gibson
On Fri, 6 Sep 2024 15:17:04 +1000
David Gibson
As discussed on IRC through the last day, here's a more polished version of possible fixes for bug 94 (EPOLLRDHUP storm) and bug 95 (EPOLLERR storm). Both of those were sub-problems arising while investigating podman bug 23686.
We're pretty confident about the EPOLLRDHUP fix (bug 94, patch 1/6), based on Stefano's testing. I ended up rewriting Stefano's draft patch for the EPOLLERR case (bug 95, remaining patches), because I thought of a possibility we hadn't discussed yet: we weren't getting an error from the socket error queue, but we might be able to get one with the SO_ERROR getsockopt(). My examination of the kernel code suggests that's plausible, and that if that's the case using SO_ERROR should also clear that error condition.
Link: https://bugs.passt.top/show_bug.cgi?id=94 Link: https://bugs.passt.top/show_bug.cgi?id=95 Link: https://github.com/containers/podman/issues/23686
Applied. -- Stefano
participants (2)
-
David Gibson
-
Stefano Brivio