[PATCH 0/7] flow: Introduce flow_epoll_set() to centralize epoll operations
Currently, each protocol handler (TCP, TCP splice, ICMP/ping, UDP) has its own code to add or modify file descriptors in epoll. This leads to duplicated boilerplate across icmp.c, tcp.c, tcp_splice.c, and udp_flow.c, with each setting up epoll_ref unions and calling epoll_ctl() with flow-type-specific details. This series introduces flow_epoll_set() in flow.c to handle epoll operations for all flow types in a unified way. The function evolves across the series, progressively absorbing more responsibilities: 1. Initial version takes explicit fd, events, and ctx parameters 2. Events computation is moved inside the function by relocating tcp_conn_epoll_events() and tcp_splice_conn_epoll_events() into flow.c 3. The ctx parameter is removed by using epoll_id_to_fd[EPOLLFD_ID_DEFAULT] directly for new flows 4. The fd parameter is removed by having the function extract the file descriptor from the flow structure based on the epoll type The final API is simply: int flow_epoll_set(enum epoll_type type, const struct flow_common *f, unsigned int sidei); This centralized approach will be essential for queue pair migration in the upcoming multithreading work, where flows need to be moved between different epoll instances owned by different threads. Preparatory patches: - Patch 1 fixes EPOLL_TYPE_TCP_TIMER to use the actual timer fd - Patch 2 refactors udp_flow_sock() to assign the socket internally - Patch 3 refactors tcp_splice_conn_epoll_events() to per-side computation Core series: - Patch 4 introduces the initial flow_epoll_set() function - Patch 5 removes the ctx parameter - Patch 6 moves event computation inside flow_epoll_set() - Patch 7 removes the fd parameter, completing the refactoring Laurent Vivier (7): tcp: Update EPOLL_TYPE_TCP_TIMER fd udp_flow: Assign socket to flow inside udp_flow_sock() tcp_splice: Refactor tcp_splice_conn_epoll_events() to per-side computation flow: Introduce flow_epoll_set() to centralize epoll operations flow: Use epoll_id_to_fd[EPOLLFD_ID_DEFAULT] in flow_epollid_set() flow: Compute epoll events inside flow_epoll_set() flow: Have flow_epoll_set() retrieve file descriptor from flow structure flow.c | 158 +++++++++++++++++++++++++++++++++++++++++++++++++++ flow.h | 4 ++ icmp.c | 8 +-- tcp.c | 68 +++------------------- tcp_splice.c | 77 +++++-------------------- udp_flow.c | 17 ++---- 6 files changed, 190 insertions(+), 142 deletions(-) -- 2.51.1
For consistency with other epoll events, set the fd field to the file
descriptor actually added by epoll_ctl() (conn->timer), rather than
conn->sock.
This is a no-op change as ref.fd is not currently used in
tcp_timer_handler().
Signed-off-by: Laurent Vivier
Move the assignment of uflow->s[sidei] from the caller (udp_flow_new())
into udp_flow_sock() itself, placing it after the successful connect().
This is a pure refactoring with no functional change. The socket fd is
now assigned within udp_flow_sock() where the socket is created, rather
than requiring the caller to capture the return value. On error paths,
uflow->s[sidei] remains at its initialized value of -1 rather than being
set to the negative error code, which is semantically cleaner (though
functionally equivalent given the >= 0 check in udp_flow_close()).
Signed-off-by: Laurent Vivier
The function tcp_splice_conn_epoll_events() currently takes an array of
struct epoll_event and fills in the .events field for both sides using
flow_foreach_sidei() loops.
This works, but the function is doing two conceptually separate things
at once: computing events for side 0 and computing events for side 1.
The OUT_WAIT handling is particularly subtle, as it has cross-side
effects: when OUT_WAIT(sidei) is set, we add EPOLLOUT to ev[sidei] but
also remove EPOLLIN from ev[!sidei].
Refactor to make the function compute events for a single side at a
time, taking sidei as a parameter and returning uint32_t. This makes
the logic more focused and easier to follow. The cross-side effects of
OUT_WAIT are preserved by checking both OUT_WAIT(sidei) and
OUT_WAIT(!sidei) within each call.
The caller tcp_splice_epoll_ctl() now invokes the function twice, once
for each side, making the two-sided nature of the operation explicit.
No functional change.
Signed-off-by: Laurent Vivier
Currently, each flow type (TCP, TCP_SPLICE, PING, UDP) has its own
code to add or modify file descriptors in epoll. This leads to
duplicated boilerplate code across icmp.c, tcp.c, tcp_splice.c, and
udp_flow.c, each setting up epoll_ref unions and calling epoll_ctl()
with flow-type-specific details.
Introduce flow_epoll_set() in flow.c to handle epoll operations for
all flow types in a unified way. The function takes an explicit epoll
type parameter, allowing it to handle not only flow socket types but
also the TCP timer (EPOLL_TYPE_TCP_TIMER).
This will be needed to migrate queue pair from an epollfd to another.
Signed-off-by: Laurent Vivier
In tcp_epoll_ctl() and tcp_splice_epoll_ctl(), flow_epollid_set() is
called after flow_epoll_set() to set the epollid to EPOLLFD_ID_DEFAULT.
When updating an existing flow (EPOLL_CTL_MOD), flow_epoll_set() already
retrieves the correct epoll fd via flow_epollfd(). For new additions
(EPOLL_CTL_ADD), we can use epoll_id_to_fd[EPOLLFD_ID_DEFAULT] directly
instead of c->epollfd, since that's the epollid that will be set.
This removes the need for the ctx parameter from flow_epoll_set() and
simplifies the API. The change cascades through all callers: tcp.c,
icmp.c, udp_flow.c, and tcp_splice.c.
In tcp_splice.c, removing ctx from flow_epoll_set() calls allows us to
also remove the ctx parameter from tcp_splice_epoll_ctl() and
conn_event_do(), further simplifying the splice connection handling.
Signed-off-by: Laurent Vivier
Rather than having each caller compute the appropriate epoll events and
pass them to flow_epoll_set(), move the event computation into
flow_epoll_set() itself. The function already switches on epoll_type to
determine other parameters, so computing events there is a natural fit.
Move tcp_conn_epoll_events() from tcp.c and tcp_splice_conn_epoll_events()
from tcp_splice.c into flow.c where they can be called internally. For
simpler cases (PING, UDP, TCP_TIMER), the events are constant values that
are now set directly in the switch cases.
This centralizes all epoll event logic in one place, making it easier to
understand and maintain the relationship between flow types and their
epoll configurations.
Signed-off-by: Laurent Vivier
Currently, callers of flow_epoll_set() pass the file descriptor as an
explicit parameter. However, the function already has access to the flow
structure and knows the epoll type, so it can retrieve the appropriate fd
itself.
Remove the fd parameter and have flow_epoll_set() extract the file
descriptor directly from the flow based on the type:
- EPOLL_TYPE_TCP_SPLICE: conn->s[sidei]
- EPOLL_TYPE_TCP: conn->sock
- EPOLL_TYPE_TCP_TIMER: conn->timer
- EPOLL_TYPE_PING: pingf->sock
- EPOLL_TYPE_UDP: uflow->s[sidei]
For TCP timers, the previous logic determined ADD vs MOD by checking
whether conn->timer was -1. Since the timer fd must now be assigned to the
flow before calling flow_epoll_set(), this check would no longer work.
Instead, introduce FLOW_EPOLL_TIMER_ADD and FLOW_EPOLL_TIMER_MOD constants
and have callers specify the operation explicitly via the sidei parameter.
This requires callers to assign the socket or timer fd to the flow
structure before calling flow_epoll_set(), with appropriate cleanup
(resetting to -1) if the epoll registration fails.
Signed-off-by: Laurent Vivier
On Fri, Dec 19, 2025 at 05:45:12PM +0100, Laurent Vivier wrote:
For consistency with other epoll events, set the fd field to the file descriptor actually added by epoll_ctl() (conn->timer), rather than conn->sock.
This is a no-op change as ref.fd is not currently used in tcp_timer_handler().
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
--- tcp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tcp.c b/tcp.c index b179e3991236..6e6156098c12 100644 --- a/tcp.c +++ b/tcp.c @@ -554,7 +554,7 @@ static int tcp_epoll_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
if (conn->timer != -1) { union epoll_ref ref_t = { .type = EPOLL_TYPE_TCP_TIMER, - .fd = conn->sock, + .fd = conn->timer, .flow = FLOW_IDX(conn) }; struct epoll_event ev_t = { .data.u64 = ref_t.u64, .events = EPOLLIN | EPOLLET }; @@ -582,7 +582,6 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
if (conn->timer == -1) { union epoll_ref ref = { .type = EPOLL_TYPE_TCP_TIMER, - .fd = conn->sock, .flow = FLOW_IDX(conn) }; struct epoll_event ev = { .data.u64 = ref.u64, .events = EPOLLIN | EPOLLET }; @@ -598,6 +597,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn) return; } conn->timer = fd; + ref.fd = conn->timer;
if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) { flow_dbg_perror(conn, "failed to add timer"); -- 2.51.1
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Dec 19, 2025 at 05:45:13PM +0100, Laurent Vivier wrote:
Move the assignment of uflow->s[sidei] from the caller (udp_flow_new()) into udp_flow_sock() itself, placing it after the successful connect().
This is a pure refactoring with no functional change. The socket fd is now assigned within udp_flow_sock() where the socket is created, rather than requiring the caller to capture the return value. On error paths, uflow->s[sidei] remains at its initialized value of -1 rather than being set to the negative error code, which is semantically cleaner (though functionally equivalent given the >= 0 check in udp_flow_close()).
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
--- udp_flow.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/udp_flow.c b/udp_flow.c index 8907f2f72741..33f29f21e69e 100644 --- a/udp_flow.c +++ b/udp_flow.c @@ -109,6 +109,7 @@ static int udp_flow_sock(const struct ctx *c, flow_dbg_perror(uflow, "Couldn't connect flow socket"); return rc; } + uflow->s[sidei] = s;
/* It's possible, if unlikely, that we could receive some packets in * between the bind() and connect() which may or may not be for this @@ -163,7 +164,7 @@ static flow_sidx_t udp_flow_new(const struct ctx *c, union flow *flow,
flow_foreach_sidei(sidei) { if (pif_is_socket(uflow->f.pif[sidei])) - if ((uflow->s[sidei] = udp_flow_sock(c, uflow, sidei)) < 0) + if (udp_flow_sock(c, uflow, sidei) < 0) goto cancel; }
-- 2.51.1
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Fri, Dec 19, 2025 at 05:45:14PM +0100, Laurent Vivier wrote:
The function tcp_splice_conn_epoll_events() currently takes an array of struct epoll_event and fills in the .events field for both sides using flow_foreach_sidei() loops.
This works, but the function is doing two conceptually separate things at once: computing events for side 0 and computing events for side 1. The OUT_WAIT handling is particularly subtle, as it has cross-side effects: when OUT_WAIT(sidei) is set, we add EPOLLOUT to ev[sidei] but also remove EPOLLIN from ev[!sidei].
Refactor to make the function compute events for a single side at a time, taking sidei as a parameter and returning uint32_t. This makes the logic more focused and easier to follow. The cross-side effects of OUT_WAIT are preserved by checking both OUT_WAIT(sidei) and OUT_WAIT(!sidei) within each call.
The caller tcp_splice_epoll_ctl() now invokes the function twice, once for each side, making the two-sided nature of the operation explicit.
No functional change.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
--- tcp_splice.c | 33 ++++++++++++++------------------- 1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/tcp_splice.c b/tcp_splice.c index 440522449c13..bf4ff466de07 100644 --- a/tcp_splice.c +++ b/tcp_splice.c @@ -114,29 +114,23 @@ static struct tcp_splice_conn *conn_at_sidx(flow_sidx_t sidx) * @events: Connection event flags * @ev: Events to fill in, 0 is accepted socket, 1 is connecting socket */ -static void tcp_splice_conn_epoll_events(uint16_t events, - struct epoll_event ev[]) +static uint32_t tcp_splice_conn_epoll_events(uint16_t events, unsigned sidei) { - unsigned sidei; - - flow_foreach_sidei(sidei) - ev[sidei].events = 0; + uint32_t e = 0;
if (events & SPLICE_ESTABLISHED) { - flow_foreach_sidei(sidei) { - if (!(events & FIN_SENT(!sidei))) - ev[sidei].events = EPOLLIN | EPOLLRDHUP; - } - } else if (events & SPLICE_CONNECT) { - ev[1].events = EPOLLOUT; + if (!(events & FIN_SENT(!sidei))) + e = EPOLLIN | EPOLLRDHUP; + } else if (sidei == 1 && events & SPLICE_CONNECT) { + e = EPOLLOUT; }
- flow_foreach_sidei(sidei) { - if (events & OUT_WAIT(sidei)) { - ev[sidei].events |= EPOLLOUT; - ev[!sidei].events &= ~EPOLLIN; - } - } + if (events & OUT_WAIT(sidei)) + e |= EPOLLOUT; + if (events & OUT_WAIT(!sidei)) + e &= ~EPOLLIN; + + return e; }
/** @@ -161,7 +155,8 @@ static int tcp_splice_epoll_ctl(const struct ctx *c, struct epoll_event ev[SIDES] = { { .data.u64 = ref[0].u64 }, { .data.u64 = ref[1].u64 } };
- tcp_splice_conn_epoll_events(conn->events, ev); + ev[0].events = tcp_splice_conn_epoll_events(conn->events, 0); + ev[1].events = tcp_splice_conn_epoll_events(conn->events, 1);
if (epoll_ctl(epollfd, m, conn->s[0], &ev[0]) || -- 2.51.1
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
participants (2)
-
David Gibson
-
Laurent Vivier