[PATCH v5 0/7] Refactor epoll handling in preparation for multithreading
This series refactors how epoll file descriptors are managed throughout
the codebase in preparation for introducing multithreading support.
Currently, passt uses a single global epollfd accessed through the
context structure. With multithreading, each thread will need its own
epollfd managing its subset of flows.

The key changes are:

1. Centralize epoll management by extracting helper functions into a
   new epoll_ctl.c/h module and moving union epoll_ref from passt.h to
   its more logical location in epoll_ctl.h.

2. Simplify epoll_del() to take the epollfd directly rather than
   extracting it from the context structure, reducing coupling between
   epoll operations and the global context.

3. Move epoll registration out of sock_l4_sa() into protocol-specific
   code, giving callers explicit control over which epoll instance
   manages each socket.

4. Replace the boolean in_epoll flag in TCP connections with an epollid
   (epoll identifier) field in flow_common. This serves dual purposes:
   tracking registration status (EPOLLFD_ID_INVALID = not registered)
   and identifying which epoll instance manages the flow. An epoll ID
   to epoll fd mapping allows retrieving the actual epoll file
   descriptor. The epollid field is 8 bits, limiting values to 0-254
   (255 = EPOLLFD_ID_INVALID).

5. Apply this pattern consistently across all protocol handlers (TCP,
   ICMP, UDP), storing the managing epoll ID in each flow's common
   structure.

6. Extract the event loop processing logic into a separate
   passt_worker() function, preparing the structure for future
   threading where this will become a worker thread callback.

These changes make epoll instance ownership explicit in the flow
tracking system, enabling flows to be managed by different epoll
instances, a prerequisite for per-thread epollfd design in upcoming
multithreading work.

Changes since v1:
- New patch: "epoll_ctl: Extract epoll operations" - centralizes epoll
  helpers into a dedicated module and relocates union epoll_ref from
  passt.h to epoll_ctl.h
- Changed epollfd type in flow_common from int to 8-bit bitfield to
  avoid exceeding cacheline size threshold
- Added flow_epollfd_valid() helper to check epoll registration status
- Added flow_set_epollfd() helper to set the epoll instance for a flow

Changes since v2:
- Renamed field in flow_common from epollfd to threadnb (thread number)
  to better reflect that flows are now associated with threads rather
  than directly with epoll file descriptors
- Introduced thread-to-epollfd mapping via threadnb_to_epollfd[] array
  in flow.c, providing an indirection layer between threads and their
  epoll instances
- Renamed/refactored helper functions:
  - flow_set_epollfd() -> flow_epollfd_set() - now takes thread number
    and epollfd parameters
  - Added flow_epollfd_get() - retrieves the epoll fd for a flow's
    thread
  - flow_epollfd_valid() - now checks threadnb instead of epollfd
- Updated constants: replaced EPOLLFD_* with FLOW_THREADNB_*
  equivalents (e.g., EPOLLFD_INVALID -> FLOW_THREADNB_INVALID)
- Applied thread-based pattern consistently across TCP, ICMP, and UDP

Changes since v3:
- Replaced warn() with err() in epoll_add()
- Renamed flow_epollfd_valid() to flow_in_epoll() and
  flow_epollfd_get() to flow_epollfd()
- Added flow_thread_register() to register an epollfd to a thread
  number, and called it from main() to register c->epollfd for thread 0
  (main loop)
- Replaced flow_epollfd_set() with flow_thread_set() to set the thread
  number of a flow
- Added new patch: "passt: Move main event loop processing into
  passt_worker()"

Changes since v4:
- Renamed field in flow_common from threadnb to epollid (epoll
  identifier) to better reflect the abstraction - flows are associated
  with an epoll instance identifier rather than specifically a thread
  number
- Renamed mapping array from threadnb_to_epollfd[] to epoll_id_to_fd[]
- Renamed/refactored helper functions:
  - flow_thread_set() -> flow_epollid_set() - sets the epoll ID
  - flow_thread_register() -> flow_epollid_register() - registers the
    epollfd for a given epoll ID
  - Added flow_epollid_clear() - explicitly clears the epoll ID
- Updated constants: replaced FLOW_THREADNB_* with EPOLLFD_ID_*
  equivalents (e.g., FLOW_THREADNB_INVALID -> EPOLLFD_ID_INVALID,
  FLOW_THREADNB_MAX -> EPOLLFD_ID_MAX)
- Added EPOLLFD_ID_DEFAULT constant for the default (main loop) epoll
  instance
- Clarified that EPOLLFD_ID_BITS limits the number of epoll instances,
  not threads (though initially there will be one epoll instance per
  thread)

Laurent Vivier (7):
  util: Simplify epoll_del() interface to take epollfd directly
  epoll_ctl: Extract epoll operations
  util: Move epoll registration out of sock_l4_sa()
  tcp, flow: Replace per-connection in_epoll flag with an epollid in
    flow_common
  icmp: Use epoll instance management for ICMP flows
  udp: Use epoll instance management for UDP flows
  passt: Move main event loop processing into passt_worker()

 Makefile     |  22 +++----
 epoll_ctl.c  |  45 ++++++++++++++
 epoll_ctl.h  |  51 ++++++++++++++++
 flow.c       |  71 +++++++++++++++++++---
 flow.h       |  15 ++++-
 icmp.c       |  23 +++++---
 passt.c      | 163 ++++++++++++++++++++++++++++-----------------------
 passt.h      |  34 -----------
 pasta.c      |   7 +--
 pif.c        |  32 +++++++---
 repair.c     |  18 +++---
 tap.c        |  15 ++---
 tcp.c        |  46 +++++++++------
 tcp_conn.h   |   8 +--
 tcp_splice.c |  26 ++++----
 udp.c        |   2 +-
 udp_flow.c   |  23 ++++++--
 util.c       |  28 +--------
 util.h       |   6 +-
 vhost_user.c |  14 ++---
 vu_common.c  |   2 +-
 21 files changed, 403 insertions(+), 248 deletions(-)
 create mode 100644 epoll_ctl.c
 create mode 100644 epoll_ctl.h

--
2.51.0

Change epoll_del() to accept the epoll file descriptor directly instead
of the full context structure. This simplifies the interface and aligns
with the threading refactoring by reducing dependency on the context
structure for basic epoll operations as we will manage an epollfd per
thread.
Signed-off-by: Laurent Vivier
Centralize epoll_add() and epoll_del() helper functions into new
epoll_ctl.c/h files.
This also moves the union epoll_ref definition from passt.h to
epoll_ctl.h where it's more logically placed.
The new epoll_add() helper simplifies adding file descriptors to epoll
by taking an epoll_ref and events, handling error reporting
consistently across all call sites.
Signed-off-by: Laurent Vivier
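
For readers following along without the tree at hand, here is a minimal standalone sketch of the helper pattern described above. It is not the code from the patch: union demo_ref is a simplified, hypothetical stand-in for union epoll_ref (8-bit type, 24-bit fd, 32 bits of data), demo_epoll_add() and demo_epoll_del() are illustrative names, and error reporting uses plain fprintf() instead of passt's logging helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical, simplified stand-in for union epoll_ref: the whole
 * reference must fit in the 64 bits epoll hands back with each event.
 */
union demo_ref {
	struct {
		uint32_t type:8;	/* what kind of fd this is */
		int32_t fd:24;		/* the fd itself */
		uint32_t data;		/* per-type payload */
	};
	uint64_t u64;
};
static_assert(sizeof(union demo_ref) == sizeof(uint64_t),
	      "demo_ref must fit in epoll_data.u64");

/* Add ref.fd to epollfd, storing the packed reference as event data.
 * Returns 0 on success, negative errno on failure.
 */
int demo_epoll_add(int epollfd, uint32_t events, union demo_ref ref)
{
	struct epoll_event ev = { .events = events, .data.u64 = ref.u64 };

	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, ref.fd, &ev) == -1) {
		int ret = -errno;

		fprintf(stderr, "Failed to add fd to epoll: %s\n",
			strerror(-ret));
		return ret;
	}

	return 0;
}

/* Remove fd from epollfd; errors are deliberately ignored */
void demo_epoll_del(int epollfd, int fd)
{
	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
}
```

A caller fills in the reference once and passes it by value; the same 64 bits come back in ev.data.u64 from epoll_wait(), so the event loop can recover type, fd and data without a lookup table.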
Move epoll_add() calls from sock_l4_sa() to the protocol-specific code
(icmp.c, pif.c, udp_flow.c) to give callers more control over epoll
registration. This allows sock_l4_sa() to focus solely on socket
creation and binding, while epoll management happens at a higher level.
Remove the data parameter from sock_l4_sa() and flowside_sock_l4() as
it's no longer needed - callers now construct the full epoll_ref and
register the socket themselves after creation.
Signed-off-by: Laurent Vivier
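
The split between socket creation and epoll registration can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: demo_sock_udp() and demo_flow_sock() are hypothetical names, the socket is a plain IPv4 UDP socket bound to an ephemeral port, and the event data packs a flow index and the fd by hand rather than using union epoll_ref.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create and bind a UDP socket, and nothing else.
 * Returns the socket fd, or negative errno on failure.
 */
int demo_sock_udp(void)
{
	struct sockaddr_in sa = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_ANY),
		.sin_port = 0,	/* let the kernel pick a port */
	};
	int fd = socket(AF_INET, SOCK_DGRAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0);

	if (fd < 0)
		return -errno;

	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		int ret = -errno;

		close(fd);
		return ret;
	}

	return fd;
}

/* Hypothetical caller: it decides which epoll instance owns the socket
 * and registers it, closing the socket again if registration fails.
 */
int demo_flow_sock(int epollfd, uint32_t flow_idx)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int fd = demo_sock_udp();

	if (fd < 0)
		return fd;

	ev.data.u64 = ((uint64_t)flow_idx << 32) | (uint32_t)fd;
	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
		int ret = -errno;

		close(fd);
		return ret;
	}

	return fd;
}
```

With this shape, the creation helper never needs to know about any epoll instance, which is what lets a later change pick a per-thread epollfd at the call site.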
The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked
whether a connection was registered with epoll, not which epoll instance.
This limited flexibility for future multi-epoll support.
Replace the boolean with an epollid field in flow_common that identifies
which epoll instance the flow is registered with.
Use EPOLLFD_ID_INVALID to indicate when a flow is not registered with
any epoll instance. An epoll_id_to_fd[] mapping table translates
epoll ids to their corresponding epoll file descriptors.
Add helper functions:
- flow_in_epoll() to check if a flow is registered with epoll
- flow_epollfd() to retrieve the epoll fd for a flow's epoll instance
- flow_epollid_register() to register an epoll fd with an epollid
- flow_epollid_set() to set the epollid of a flow
- flow_epollid_clear() to reset the epoll id of a flow
This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing
the need to pass the context 'c', since the epoll fd is now directly
accessible from the flow structure via flow_epollfd().
Add a defensive check at the beginning of tcp_flow_repair_queue() to
avoid a false positive with "make clang-tidy":
error: The 1st argument to 'send' is < 0 but should be >= 0
3230 | ssize_t rc = send(conn->sock, p, MIN(len, chunk), 0);
Signed-off-by: Laurent Vivier
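
The epollid indirection can be pictured with the standalone sketch below. Only the identifier names are taken from this series; the constant values (derived from the "8 bits, 0-254, 255 = invalid" description in the cover letter), the one-field struct flow_common and all function bodies are assumptions made for illustration and may differ from the real flow.c and flow.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values: 8-bit id, top value reserved as "not registered" */
#define EPOLLFD_ID_BITS		8
#define EPOLLFD_ID_INVALID	((1 << EPOLLFD_ID_BITS) - 1)	/* 255 */
#define EPOLLFD_ID_MAX		(EPOLLFD_ID_INVALID - 1)	/* 254 */
#define EPOLLFD_ID_DEFAULT	0	/* main loop epoll instance */

/* Reduced to the single field of interest for this sketch */
struct flow_common {
	unsigned int epollid:EPOLLFD_ID_BITS;
};

/* Maps each epoll id to the epoll fd registered for it */
static int epoll_id_to_fd[EPOLLFD_ID_MAX + 1];

/* Associate an epoll fd with an epoll id (e.g. once, at startup) */
void flow_epollid_register(int epollid, int epollfd)
{
	assert(epollid >= 0 && epollid <= EPOLLFD_ID_MAX);
	epoll_id_to_fd[epollid] = epollfd;
}

/* Is this flow currently registered with some epoll instance? */
bool flow_in_epoll(const struct flow_common *f)
{
	return f->epollid != EPOLLFD_ID_INVALID;
}

/* Epoll fd of the instance managing this flow */
int flow_epollfd(const struct flow_common *f)
{
	assert(flow_in_epoll(f));
	return epoll_id_to_fd[f->epollid];
}

/* Record which epoll instance manages this flow */
void flow_epollid_set(struct flow_common *f, int epollid)
{
	assert(epollid >= 0 && epollid <= EPOLLFD_ID_MAX);
	f->epollid = epollid;
}

/* Mark the flow as not registered with any epoll instance */
void flow_epollid_clear(struct flow_common *f)
{
	f->epollid = EPOLLFD_ID_INVALID;
}
```

Storing a small id instead of the fd keeps the field at 8 bits, and the single field answers both questions the old in_epoll flag could not: whether the flow is registered, and with which instance.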
Store the epoll id in the flow_common structure for ICMP ping flows
using flow_epollid_set() and retrieve the corresponding epoll
file descriptor with flow_epollfd() instead of passing c->epollfd
directly. This makes ICMP consistent with the recent TCP changes and
follows the pattern established in the previous commit.
Signed-off-by: Laurent Vivier
Store the epoll id in the flow_common structure for UDP flows using
flow_epollid_set() and retrieve the corresponding epoll file descriptor
with flow_epollfd() instead of passing c->epollfd directly. This makes
UDP consistent with the recent TCP and ICMP changes.
Signed-off-by: Laurent Vivier
Extract the epoll event processing logic from main() into a separate
passt_worker() function. This refactoring prepares the code for future
threading support where passt_worker() will be called as a worker thread
callback.
The new function handles:
- Processing epoll events and dispatching to protocol handlers
- Event statistics tracking and printing
- Post-handler periodic tasks (timers, deferred work)
- Migration handling
No functional changes, purely a code restructuring.
Signed-off-by: Laurent Vivier
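
The shape of this extraction can be sketched in isolation as follows. Everything here is hypothetical (struct demo_ctx, the DEMO_* names, the single "counter" fd type backed by an eventfd); it only illustrates the split between a loop that calls epoll_wait() and a worker that receives the ready events plus an opaque context pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define DEMO_NUM_EVENTS	8

enum demo_type { DEMO_TYPE_NONE, DEMO_TYPE_COUNTER, DEMO_TYPE_NUM };

/* Hypothetical execution context, playing the role of struct ctx */
struct demo_ctx {
	int epollfd;
	uint64_t counter_total;
	unsigned long events[DEMO_TYPE_NUM];	/* per-type statistics */
};

/* Worker: dispatch a batch of ready events. Event data carries the fd
 * type in the upper 32 bits and the fd in the lower 32 bits.
 */
void demo_worker(void *opaque, int nfds, struct epoll_event *events)
{
	struct demo_ctx *c = opaque;
	int i;

	for (i = 0; i < nfds; i++) {
		uint32_t type = (uint32_t)(events[i].data.u64 >> 32);
		int fd = (int)(events[i].data.u64 & 0xffffffff);
		uint64_t n;

		switch (type) {
		case DEMO_TYPE_COUNTER:
			if (read(fd, &n, sizeof(n)) == (ssize_t)sizeof(n))
				c->counter_total += n;
			break;
		default:
			/* Can't happen */
			assert(0);
			continue;
		}
		c->events[type]++;
	}
}

/* One iteration of the loop that remains in the caller */
int demo_loop_once(struct demo_ctx *c, int timeout_ms)
{
	struct epoll_event events[DEMO_NUM_EVENTS];
	int nfds = epoll_wait(c->epollfd, events, DEMO_NUM_EVENTS, timeout_ms);

	if (nfds > 0)
		demo_worker(c, nfds, events);

	return nfds;
}
```

Because the worker only depends on its arguments, the same function could later be handed to a thread that owns its own epollfd, which is the direction the commit message describes.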
On Tue, Oct 21, 2025 at 11:01:13PM +0200, Laurent Vivier wrote:
The in_epoll boolean flag in tcp_tap_conn and tcp_splice_conn only tracked whether a connection was registered with epoll, not which epoll instance. This limited flexibility for future multi-epoll support.
Replace the boolean with an epollid field in flow_common that identifies which epoll instance the flow is registered with. Use EPOLLFD_ID_INVALID to indicate when a flow is not registered with any epoll instance. An epoll_id_to_fd[] mapping table translates epoll ids to their corresponding epoll file descriptors.
Add helper functions:
- flow_in_epoll() to check if a flow is registered with epoll
- flow_epollfd() to retrieve the epoll fd for a flow's epoll instance
- flow_epollid_register() to register an epoll fd with an epollid
- flow_epollid_set() to set the epollid of a flow
- flow_epollid_clear() to reset the epoll id of a flow
This change also simplifies tcp_timer_ctl() and conn_flag_do() by removing the need to pass the context 'c', since the epoll fd is now directly accessible from the flow structure via flow_epollfd().
Add a defensive check at the beginning of tcp_flow_repair_queue() to avoid a false positive with "make clang-tidy":
error: The 1st argument to 'send' is < 0 but should be >= 0
3230 | ssize_t rc = send(conn->sock, p, MIN(len, chunk), 0);
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
@@ -570,7 +573,7 @@ static void tcp_timer_ctl(const struct ctx *c, struct tcp_tap_conn *conn)
 		}
 		conn->timer = fd;

-		if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {
+		if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn->timer, &ev)) {

I still think it's a bug that the timer's ref.fd field isn't the timer
fd, but I recognize that fixing that is not in scope for this series.

--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
On Tue, Oct 21, 2025 at 11:01:11PM +0200, Laurent Vivier wrote:
Centralize epoll_add() and epoll_del() helper functions into new epoll_ctl.c/h files.
This also moves the union epoll_ref definition from passt.h to epoll_ctl.h where it's more logically placed.
The new epoll_add() helper simplifies adding file descriptors to epoll by taking an epoll_ref and events, handling error reporting consistently across all call sites.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
---
 Makefile     | 22 +++++++++++-----------
 epoll_ctl.c  | 45 +++++++++++++++++++++++++++++++++++++++++++++
 epoll_ctl.h  | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 icmp.c       |  4 +---
 passt.c      |  2 +-
 passt.h      | 34 ----------------------------------
 pasta.c      |  7 +++----
 repair.c     | 18 +++++++-----------
 tap.c        | 13 ++++---------
 tcp.c        |  2 +-
 tcp_splice.c |  2 +-
 udp.c        |  2 +-
 udp_flow.c   |  1 +
 util.c       | 22 +++-------------------
 util.h       |  4 +++-
 vhost_user.c |  8 ++------
 vu_common.c  |  2 +-
 17 files changed, 136 insertions(+), 103 deletions(-)
 create mode 100644 epoll_ctl.c
 create mode 100644 epoll_ctl.h

diff --git a/Makefile b/Makefile
index 3328f8324140..91e037b8fd3c 100644
--- a/Makefile
+++ b/Makefile
@@ -37,23 +37,23 @@ FLAGS += -DPAGE_SIZE=$(shell getconf PAGE_SIZE)
 FLAGS += -DVERSION=\"$(VERSION)\"
 FLAGS += -DDUAL_STACK_SOCKETS=$(DUAL_STACK_SOCKETS)

-PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c flow.c fwd.c \
-	icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c log.c mld.c \
-	ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c pif.c \
-	repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c udp_flow.c \
-	udp_vu.c util.c vhost_user.c virtio.c vu_common.c
+PASST_SRCS = arch.c arp.c checksum.c conf.c dhcp.c dhcpv6.c epoll_ctl.c \
+	flow.c fwd.c icmp.c igmp.c inany.c iov.c ip.c isolation.c lineread.c \
+	log.c mld.c ndp.c netlink.c migrate.c packet.c passt.c pasta.c pcap.c \
+	pif.c repair.c tap.c tcp.c tcp_buf.c tcp_splice.c tcp_vu.c udp.c \
+	udp_flow.c udp_vu.c util.c vhost_user.c virtio.c vu_common.c
 QRAP_SRCS = qrap.c
 PASST_REPAIR_SRCS = passt-repair.c
 SRCS = $(PASST_SRCS) $(QRAP_SRCS) $(PASST_REPAIR_SRCS)

 MANPAGES = passt.1 pasta.1 qrap.1 passt-repair.1

-PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h flow.h fwd.h \
-	flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h isolation.h \
-	lineread.h log.h migrate.h ndp.h netlink.h packet.h passt.h pasta.h \
-	pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h tcp_conn.h \
-	tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h udp_internal.h \
-	udp_vu.h util.h vhost_user.h virtio.h vu_common.h
+PASST_HEADERS = arch.h arp.h checksum.h conf.h dhcp.h dhcpv6.h epoll_ctl.h \
+	flow.h fwd.h flow_table.h icmp.h icmp_flow.h inany.h iov.h ip.h \
+	isolation.h lineread.h log.h migrate.h ndp.h netlink.h packet.h \
+	passt.h pasta.h pcap.h pif.h repair.h siphash.h tap.h tcp.h tcp_buf.h \
+	tcp_conn.h tcp_internal.h tcp_splice.h tcp_vu.h udp.h udp_flow.h \
+	udp_internal.h udp_vu.h util.h vhost_user.h virtio.h vu_common.h
 HEADERS = $(PASST_HEADERS) seccomp.h
 C := \#include \nint main(){int a=getrandom(0, 0, 0);}
diff --git a/epoll_ctl.c b/epoll_ctl.c
new file mode 100644
index 000000000000..728a2afe1f6b
--- /dev/null
+++ b/epoll_ctl.c
@@ -0,0 +1,45 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/* epoll_ctl.c - epoll manipulation helpers
+ *
+ * Copyright Red Hat
+ * Author: Laurent Vivier
+ */
+
+#include
+
+#include "epoll_ctl.h"
+
+/**
+ * epoll_add() - Add a file descriptor to an epollfd
+ * @epollfd:	epoll file descriptor to add to
+ * @events:	epoll events
+ * @ref:	epoll reference for the file descriptor (includes fd and metadata)
+ *
+ * Return: 0 on success, negative errno on failure
+ */
+int epoll_add(int epollfd, uint32_t events, union epoll_ref ref)
+{
+	struct epoll_event ev;
+	int ret;
+
+	ev.events = events;
+	ev.data.u64 = ref.u64;
+
+	ret = epoll_ctl(epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
+	if (ret == -1) {
+		ret = -errno;
+		err("Failed to add fd to epoll: %s", strerror_(-ret));
+	}
+
+	return ret;
+}
+
+/**
+ * epoll_del() - Remove a file descriptor from an epollfd
+ * @epollfd:	epoll file descriptor to remove from
+ * @fd:		File descriptor to remove
+ */
+void epoll_del(int epollfd, int fd)
+{
+	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
+}
diff --git a/epoll_ctl.h b/epoll_ctl.h
new file mode 100644
index 000000000000..2d7e7123ae9d
--- /dev/null
+++ b/epoll_ctl.h
@@ -0,0 +1,51 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later
+ * Copyright Red Hat
+ * Author: Laurent Vivier
+ */
+
+#ifndef EPOLL_CTL_H
+#define EPOLL_CTL_H
+
+#include
+
+#include "util.h"
+#include "passt.h"
+#include "epoll_type.h"
+#include "flow.h"
+#include "tcp.h"
+#include "udp.h"
+
+/**
+ * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
+ * @type:	Type of fd (tells us what to do with events)
+ * @fd:		File descriptor number (implies < 2^24 total descriptors)
+ * @flow:	Index of the flow this fd is linked to
+ * @tcp_listen:	TCP-specific reference part for listening sockets
+ * @udp:	UDP-specific reference part
+ * @data:	Data handled by protocol handlers
+ * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
+ * @queue:	vhost-user queue index for this fd
+ * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
+ */
+union epoll_ref {
+	struct {
+		enum epoll_type type:8;
+		int32_t fd:FD_REF_BITS;
+		union {
+			uint32_t flow;
+			flow_sidx_t flowside;
+			union tcp_listen_epoll_ref tcp_listen;
+			union udp_listen_epoll_ref udp;
+			uint32_t data;
+			int nsdir_fd;
+			int queue;
+		};
+	};
+	uint64_t u64;
+};
+static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
+	      "epoll_ref must have same size as epoll_data");
+
+int epoll_add(int epollfd, uint32_t events, union epoll_ref ref);
+void epoll_del(int epollfd, int fd);
+#endif /* EPOLL_CTL_H */
diff --git a/icmp.c b/icmp.c
index bd3108a21675..c26561da80bf 100644
--- a/icmp.c
+++ b/icmp.c
@@ -15,7 +15,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -23,10 +22,8 @@
 #include
 #include
 #include
-#include
 #include
 #include
-#include
 #include
 #include
@@ -41,6 +38,7 @@
 #include "inany.h"
 #include "icmp.h"
 #include "flow_table.h"
+#include "epoll_ctl.h"

 #define ICMP_ECHO_TIMEOUT	60 /* s, timeout for ICMP socket activity */
 #define ICMP_NUM_IDS		(1U << 16)
diff --git a/passt.c b/passt.c
index bdb7b6935f0c..af928111786b 100644
--- a/passt.c
+++ b/passt.c
@@ -19,7 +19,6 @@
  * created in a separate network namespace).
  */

-#include
 #include
 #include
 #include
@@ -53,6 +52,7 @@
 #include "vu_common.h"
 #include "migrate.h"
 #include "repair.h"
+#include "epoll_ctl.h"

 #define NUM_EPOLL_EVENTS	8

diff --git a/passt.h b/passt.h
index 0075eb4b3b16..befe56bb167b 100644
--- a/passt.h
+++ b/passt.h
@@ -35,40 +35,6 @@ union epoll_ref;
 #define MAC_OUR_LAA	\
 	((uint8_t [ETH_ALEN]){0x9a, 0x55, 0x9a, 0x55, 0x9a, 0x55})

-/**
- * union epoll_ref - Breakdown of reference for epoll fd bookkeeping
- * @type:	Type of fd (tells us what to do with events)
- * @fd:		File descriptor number (implies < 2^24 total descriptors)
- * @flow:	Index of the flow this fd is linked to
- * @tcp_listen:	TCP-specific reference part for listening sockets
- * @udp:	UDP-specific reference part
- * @icmp:	ICMP-specific reference part
- * @data:	Data handled by protocol handlers
- * @nsdir_fd:	netns dirfd for fallback timer checking if namespace is gone
- * @queue:	vhost-user queue index for this fd
- * @u64:	Opaque reference for epoll_ctl() and epoll_wait()
- */
-union epoll_ref {
-	struct {
-		enum epoll_type type:8;
-#define FD_REF_BITS		24
-#define FD_REF_MAX		((int)MAX_FROM_BITS(FD_REF_BITS))
-		int32_t fd:FD_REF_BITS;
-		union {
-			uint32_t flow;
-			flow_sidx_t flowside;
-			union tcp_listen_epoll_ref tcp_listen;
-			union udp_listen_epoll_ref udp;
-			uint32_t data;
-			int nsdir_fd;
-			int queue;
-		};
-	};
-	uint64_t u64;
-};
-static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
-	      "epoll_ref must have same size as epoll_data");
-
 /* Large enough for ~128 maximum size frames */
 #define PKT_BUF_BYTES		(8UL << 20)

diff --git a/pasta.c b/pasta.c
index 687406b6e736..e8636f45df2f 100644
--- a/pasta.c
+++ b/pasta.c
@@ -27,7 +27,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -49,6 +48,7 @@
 #include "isolation.h"
 #include "netlink.h"
 #include "log.h"
+#include "epoll_ctl.h"

 #define HOSTNAME_PREFIX		"pasta-"

@@ -444,7 +444,6 @@ static int pasta_netns_quit_timer(void)
  */
 void pasta_netns_quit_init(const struct ctx *c)
 {
-	struct epoll_event ev = { .events = EPOLLIN };
 	int flags = O_NONBLOCK | O_CLOEXEC;
 	struct statfs s = { 0 };
 	bool try_inotify = true;
@@ -487,8 +486,8 @@ void pasta_netns_quit_init(const struct ctx *c)
 		die("netns monitor file number %i too big, exiting", fd);

 	ref.fd = fd;
-	ev.data.u64 = ref.u64;
-	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev);
+
+	epoll_add(c->epollfd, EPOLLIN, ref);
 }

 /**
diff --git a/repair.c b/repair.c
index f6b1bf36479c..69c530773173 100644
--- a/repair.c
+++ b/repair.c
@@ -22,6 +22,7 @@
 #include "inany.h"
 #include "flow.h"
 #include "flow_table.h"
+#include "epoll_ctl.h"

 #include "repair.h"

@@ -47,7 +48,6 @@ static int repair_nfds;
 void repair_sock_init(const struct ctx *c)
 {
 	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR_LISTEN };
-	struct epoll_event ev = { 0 };

 	if (c->fd_repair_listen == -1)
 		return;
@@ -58,10 +58,8 @@ void repair_sock_init(const struct ctx *c)
 	}

 	ref.fd = c->fd_repair_listen;
-	ev.events = EPOLLIN | EPOLLHUP | EPOLLET;
-	ev.data.u64 = ref.u64;
-	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair_listen, &ev))
-		err_perror("repair helper socket epoll_ctl(), won't migrate");
+	if (epoll_add(c->epollfd, EPOLLIN | EPOLLHUP | EPOLLET, ref))
+		err("repair helper socket epoll_ctl(), won't migrate");
 }

 /**
@@ -74,7 +72,6 @@ void repair_sock_init(const struct ctx *c)
 int repair_listen_handler(struct ctx *c, uint32_t events)
 {
 	union epoll_ref ref = { .type = EPOLL_TYPE_REPAIR };
-	struct epoll_event ev = { 0 };
 	struct ucred ucred;
 	socklen_t len;
 	int rc;
@@ -112,11 +109,10 @@ int repair_listen_handler(struct ctx *c, uint32_t events)
 	info("Accepted TCP_REPAIR helper, PID %i", ucred.pid);

 	ref.fd = c->fd_repair;
-	ev.events = EPOLLHUP | EPOLLET;
-	ev.data.u64 = ref.u64;
-	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_repair, &ev)) {
-		rc = errno;
-		debug_perror("epoll_ctl() on TCP_REPAIR helper socket");
+
+	rc = epoll_add(c->epollfd, EPOLLHUP | EPOLLET, ref);
+	if (rc < 0) {
+		debug("epoll_ctl() on TCP_REPAIR helper socket");
 		close(c->fd_repair);
 		c->fd_repair = -1;
 		return rc;
diff --git a/tap.c b/tap.c
index 9812f120d426..314c2aebd39d 100644
--- a/tap.c
+++ b/tap.c
@@ -26,7 +26,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -61,6 +60,7 @@
 #include "log.h"
 #include "vhost_user.h"
 #include "vu_common.h"
+#include "epoll_ctl.h"

 /* Maximum allowed frame lengths (including L2 header) */

@@ -1327,14 +1327,12 @@ static void tap_backend_show_hints(struct ctx *c)
 static void tap_sock_unix_init(const struct ctx *c)
 {
 	union epoll_ref ref = { .type = EPOLL_TYPE_TAP_LISTEN };
-	struct epoll_event ev = { 0 };

 	listen(c->fd_tap_listen, 0);

 	ref.fd = c->fd_tap_listen;
-	ev.events = EPOLLIN | EPOLLET;
-	ev.data.u64 = ref.u64;
-	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap_listen, &ev);
+
+	epoll_add(c->epollfd, EPOLLIN | EPOLLET, ref);
 }

 /**
@@ -1343,7 +1341,6 @@ static void tap_sock_unix_init(const struct ctx *c)
  */
 static void tap_start_connection(const struct ctx *c)
 {
-	struct epoll_event ev = { 0 };
 	union epoll_ref ref = { 0 };

 	ref.fd = c->fd_tap;
@@ -1359,9 +1356,7 @@ static void tap_start_connection(const struct ctx *c)
 		break;
 	}

-	ev.events = EPOLLIN | EPOLLRDHUP;
-	ev.data.u64 = ref.u64;
-	epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev);
+	epoll_add(c->epollfd, EPOLLIN | EPOLLRDHUP, ref);

 	if (c->ifi4)
 		arp_send_init_req(c);
diff --git a/tcp.c b/tcp.c
index 745353f782f5..db9f17c0622f 100644
--- a/tcp.c
+++ b/tcp.c
@@ -279,7 +279,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -309,6 +308,7 @@
 #include "tcp_internal.h"
 #include "tcp_buf.h"
 #include "tcp_vu.h"
+#include "epoll_ctl.h"

 /*
  * The size of TCP header (including options) is given by doff (Data Offset)
diff --git a/tcp_splice.c b/tcp_splice.c
index 666ee62b738f..6f21184bdc55 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -44,7 +44,6 @@
 #include
 #include
 #include
-#include
 #include
 #include

@@ -56,6 +55,7 @@
 #include "siphash.h"
 #include "inany.h"
 #include "flow.h"
+#include "epoll_ctl.h"

 #include "flow_table.h"

diff --git a/udp.c b/udp.c
index 86585b7e0942..3812d5c2336f 100644
--- a/udp.c
+++ b/udp.c
@@ -94,7 +94,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -115,6 +114,7 @@
 #include "flow_table.h"
 #include "udp_internal.h"
 #include "udp_vu.h"
+#include "epoll_ctl.h"

 #define UDP_MAX_FRAMES		32  /* max # of frames to receive at once */

diff --git a/udp_flow.c b/udp_flow.c
index 84973f807167..d9c75f1bb1d8 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -15,6 +15,7 @@
 #include "passt.h"
 #include "flow_table.h"
 #include "udp_internal.h"
+#include "epoll_ctl.h"

 #define UDP_CONN_TIMEOUT	180 /* s, timeout for ephemeral or local bind */

diff --git a/util.c b/util.c
index 1067486be414..e3f24f7b7e47 100644
--- a/util.c
+++ b/util.c
@@ -18,7 +18,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -35,6 +34,7 @@
 #include "packet.h"
 #include "log.h"
 #include "pcap.h"
+#include "epoll_ctl.h"
 #ifdef HAS_GETRANDOM
 #include
 #endif
@@ -58,7 +58,6 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 	sa_family_t af = ((const struct sockaddr *)sa)->sa_family;
 	union epoll_ref ref = { .type = type, .data = data };
 	bool freebind = false;
-	struct epoll_event ev;
 	int fd, y = 1, ret;
 	uint8_t proto;
 	int socktype;
@@ -172,13 +171,9 @@ int sock_l4_sa(const struct ctx *c, enum epoll_type type,
 		return ret;
 	}

-	ev.events = EPOLLIN;
-	ev.data.u64 = ref.u64;
-	if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
-		ret = -errno;
-		warn("L4 epoll_ctl: %s", strerror_(-ret));
+	ret = epoll_add(c->epollfd, EPOLLIN, ref);
+	if (ret < 0)
 		return ret;
-	}

 	return fd;
 }
@@ -994,17 +989,6 @@ void raw_random(void *buf, size_t buflen)
 		die("Unexpected EOF on random data source");
 }

-/**
- * epoll_del() - Remove a file descriptor from our passt epoll
- * @epollfd:	epoll file descriptor to remove from
- * @fd:		File descriptor to remove
- */
-void epoll_del(int epollfd, int fd)
-{
-	epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
-
-}
-
 /**
  * encode_domain_name() - Encode domain name according to RFC 1035, section 3.1
  * @buf:		Buffer to fill in with encoded domain name
diff --git a/util.h b/util.h
index c61cbef357aa..8e4b4c5c6032 100644
--- a/util.h
+++ b/util.h
@@ -193,6 +193,9 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
 #define SNDBUF_BIG		(4ULL * 1024 * 1024)
 #define SNDBUF_SMALL		(128ULL * 1024)

+#define FD_REF_BITS		24
+#define FD_REF_MAX		((int)MAX_FROM_BITS(FD_REF_BITS))
+
 #include
 #include
 #include
@@ -300,7 +303,6 @@ static inline bool mod_between(unsigned x, unsigned i, unsigned j, unsigned m)
 #define FPRINTF(f, ...)	(void)fprintf(f, __VA_ARGS__)

 void raw_random(void *buf, size_t buflen);
-void epoll_del(int epollfd, int fd);

 /*
  * Starting from glibc 2.40.9000 and commit 25a5eb4010df ("string: strerror,
diff --git a/vhost_user.c b/vhost_user.c
index f8324c59cc6c..aa7c869d9e56 100644
--- a/vhost_user.c
+++ b/vhost_user.c
@@ -32,8 +32,6 @@
 #include
 #include
 #include
-#include
-#include
 #include
 #include
 #include
@@ -45,6 +43,7 @@
 #include "vhost_user.h"
 #include "pcap.h"
 #include "migrate.h"
+#include "epoll_ctl.h"

 /* vhost-user version we are compatible with */
 #define VHOST_USER_VERSION 1
@@ -753,11 +752,8 @@ static void vu_set_watch(const struct vu_dev *vdev, int idx)
 		.fd = vdev->vq[idx].kick_fd,
 		.queue = idx
 	 };
-	struct epoll_event ev = { 0 };

-	ev.data.u64 = ref.u64;
-	ev.events = EPOLLIN;
-	epoll_ctl(vdev->context->epollfd, EPOLL_CTL_ADD, ref.fd, &ev);
+	epoll_add(vdev->context->epollfd, EPOLLIN, ref);
 }

 /**
diff --git a/vu_common.c b/vu_common.c
index b716070ea3c3..b13b7c308fd8 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -6,7 +6,6 @@
  */

 #include
-#include
 #include
 #include
 #include
@@ -19,6 +18,7 @@
 #include "pcap.h"
 #include "vu_common.h"
 #include "migrate.h"
+#include "epoll_ctl.h"

 #define VU_MAX_TX_BUFFER_NB	2

-- 2.51.0
On Tue, Oct 21, 2025 at 11:01:15PM +0200, Laurent Vivier wrote:
Store the epoll id in the flow_common structure for UDP flows using flow_epollid_set() and retrieve the corresponding epoll file descriptor with flow_epollfd() instead of passing c->epollfd directly. This makes UDP consistent with the recent TCP and ICMP changes.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
---
 udp_flow.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/udp_flow.c b/udp_flow.c
index 00e231fe1b01..8907f2f72741 100644
--- a/udp_flow.c
+++ b/udp_flow.c
@@ -52,7 +52,7 @@ void udp_flow_close(const struct ctx *c, struct udp_flow *uflow)
 	flow_foreach_sidei(sidei) {
 		flow_hash_remove(c, FLOW_SIDX(uflow, sidei));
 		if (uflow->s[sidei] >= 0) {
-			epoll_del(c->epollfd, uflow->s[sidei]);
+			epoll_del(flow_epollfd(&uflow->f), uflow->s[sidei]);
 			close(uflow->s[sidei]);
 			uflow->s[sidei] = -1;
 		}
@@ -92,7 +92,9 @@ static int udp_flow_sock(const struct ctx *c,
 	ref.data = fref.data;
 	ref.fd = s;

-	rc = epoll_add(c->epollfd, EPOLLIN, ref);
+	flow_epollid_set(&uflow->f, EPOLLFD_ID_DEFAULT);
+
+	rc = epoll_add(flow_epollfd(&uflow->f), EPOLLIN, ref);
 	if (rc < 0) {
 		close(s);
 		return rc;
@@ -101,7 +103,7 @@ static int udp_flow_sock(const struct ctx *c,
 	if (flowside_connect(c, s, pif, side) < 0) {
 		rc = -errno;

-		epoll_del(c->epollfd, s);
+		epoll_del(flow_epollfd(&uflow->f), s);
 		close(s);

 		flow_dbg_perror(uflow, "Couldn't connect flow socket");
--
2.51.0
On Tue, Oct 21, 2025 at 11:01:16PM +0200, Laurent Vivier wrote:
Extract the epoll event processing logic from main() into a separate passt_worker() function. This refactoring prepares the code for future threading support where passt_worker() will be called as a worker thread callback.
The new function handles:
- Processing epoll events and dispatching to protocol handlers
- Event statistics tracking and printing
- Post-handler periodic tasks (timers, deferred work)
- Migration handling
No functional changes, purely a code restructuring.
Signed-off-by: Laurent Vivier
Reviewed-by: David Gibson
---
 passt.c | 160 +++++++++++++++++++++++++++++++-------------------------
 1 file changed, 88 insertions(+), 72 deletions(-)

diff --git a/passt.c b/passt.c
index f1d78b79e2c3..5eabfa8f0c51 100644
--- a/passt.c
+++ b/passt.c
@@ -229,6 +229,92 @@ static void print_stats(const struct ctx *c, const struct passt_stats *stats,
 	lines_printed++;
 }

+/**
+ * passt_worker() - Process epoll events and handle protocol operations
+ * @opaque:	Pointer to execution context (struct ctx)
+ * @nfds:	Number of file descriptors ready (epoll_wait return value)
+ * @events:	epoll_event array of ready file descriptors
+ */
+static void passt_worker(void *opaque, int nfds, struct epoll_event *events)
+{
+	static struct passt_stats stats = { 0 };
+	struct ctx *c = opaque;
+	struct timespec now;
+	int i;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &now))
+		err_perror("Failed to get CLOCK_MONOTONIC time");
+
+	for (i = 0; i < nfds; i++) {
+		union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64);
+		uint32_t eventmask = events[i].events;
+
+		trace("%s: epoll event on %s %i (events: 0x%08x)",
+		      c->mode == MODE_PASTA ? "pasta" : "passt",
+		      EPOLL_TYPE_STR(ref.type), ref.fd, eventmask);
+
+		switch (ref.type) {
+		case EPOLL_TYPE_TAP_PASTA:
+			tap_handler_pasta(c, eventmask, &now);
+			break;
+		case EPOLL_TYPE_TAP_PASST:
+			tap_handler_passt(c, eventmask, &now);
+			break;
+		case EPOLL_TYPE_TAP_LISTEN:
+			tap_listen_handler(c, eventmask);
+			break;
+		case EPOLL_TYPE_NSQUIT_INOTIFY:
+			pasta_netns_quit_inotify_handler(c, ref.fd);
+			break;
+		case EPOLL_TYPE_NSQUIT_TIMER:
+			pasta_netns_quit_timer_handler(c, ref);
+			break;
+		case EPOLL_TYPE_TCP:
+			tcp_sock_handler(c, ref, eventmask);
+			break;
+		case EPOLL_TYPE_TCP_SPLICE:
+			tcp_splice_sock_handler(c, ref, eventmask);
+			break;
+		case EPOLL_TYPE_TCP_LISTEN:
+			tcp_listen_handler(c, ref, &now);
+			break;
+		case EPOLL_TYPE_TCP_TIMER:
+			tcp_timer_handler(c, ref);
+			break;
+		case EPOLL_TYPE_UDP_LISTEN:
+			udp_listen_sock_handler(c, ref, eventmask, &now);
+			break;
+		case EPOLL_TYPE_UDP:
+			udp_sock_handler(c, ref, eventmask, &now);
+			break;
+		case EPOLL_TYPE_PING:
+			icmp_sock_handler(c, ref);
+			break;
+		case EPOLL_TYPE_VHOST_CMD:
+			vu_control_handler(c->vdev, c->fd_tap, eventmask);
+			break;
+		case EPOLL_TYPE_VHOST_KICK:
+			vu_kick_cb(c->vdev, ref, &now);
+			break;
+		case EPOLL_TYPE_REPAIR_LISTEN:
+			repair_listen_handler(c, eventmask);
+			break;
+		case EPOLL_TYPE_REPAIR:
+			repair_handler(c, eventmask);
+			break;
+		default:
+			/* Can't happen */
+			ASSERT(0);
+		}
+		stats.events[ref.type]++;
+		print_stats(c, &stats, &now);
+	}
+
+	post_handler(c, &now);
+
+	migrate_handler(c);
+}
+
 /**
  * main() - Entry point and main loop
  * @argc:	Argument count
@@ -246,8 +332,7 @@ static void print_stats(const struct ctx *c, const struct passt_stats *stats,
 int main(int argc, char **argv)
 {
 	struct epoll_event events[NUM_EPOLL_EVENTS];
-	struct passt_stats stats = { 0 };
-	int nfds, i, devnull_fd = -1;
+	int nfds, devnull_fd = -1;
 	struct ctx c = { 0 };
 	struct rlimit limit;
 	struct timespec now;
@@ -355,77 +440,8 @@ loop:
 	if (nfds == -1 && errno != EINTR)
 		die_perror("epoll_wait() failed in main loop");

- if (clock_gettime(CLOCK_MONOTONIC, &now)) - err_perror("Failed to get CLOCK_MONOTONIC time"); - - for (i = 0; i < nfds; i++) { - union epoll_ref ref = *((union epoll_ref *)&events[i].data.u64); - uint32_t eventmask = events[i].events; - - trace("%s: epoll event on %s %i (events: 0x%08x)", - c.mode == MODE_PASTA ? "pasta" : "passt", - EPOLL_TYPE_STR(ref.type), ref.fd, eventmask); - - switch (ref.type) { - case EPOLL_TYPE_TAP_PASTA: - tap_handler_pasta(&c, eventmask, &now); - break; - case EPOLL_TYPE_TAP_PASST: - tap_handler_passt(&c, eventmask, &now); - break; - case EPOLL_TYPE_TAP_LISTEN: - tap_listen_handler(&c, eventmask); - break; - case EPOLL_TYPE_NSQUIT_INOTIFY: - pasta_netns_quit_inotify_handler(&c, ref.fd); - break; - case EPOLL_TYPE_NSQUIT_TIMER: - pasta_netns_quit_timer_handler(&c, ref); - break; - case EPOLL_TYPE_TCP: - tcp_sock_handler(&c, ref, eventmask); - break; - case EPOLL_TYPE_TCP_SPLICE: - tcp_splice_sock_handler(&c, ref, eventmask); - break; - case EPOLL_TYPE_TCP_LISTEN: - tcp_listen_handler(&c, ref, &now); - break; - case EPOLL_TYPE_TCP_TIMER: - tcp_timer_handler(&c, ref); - break; - case EPOLL_TYPE_UDP_LISTEN: - udp_listen_sock_handler(&c, ref, eventmask, &now); - break; - case EPOLL_TYPE_UDP: - udp_sock_handler(&c, ref, eventmask, &now); - break; - case EPOLL_TYPE_PING: - icmp_sock_handler(&c, ref); - break; - case EPOLL_TYPE_VHOST_CMD: - vu_control_handler(c.vdev, c.fd_tap, eventmask); - break; - case EPOLL_TYPE_VHOST_KICK: - vu_kick_cb(c.vdev, ref, &now); - break; - case EPOLL_TYPE_REPAIR_LISTEN: - repair_listen_handler(&c, eventmask); - break; - case EPOLL_TYPE_REPAIR: - repair_handler(&c, eventmask); - break; - default: - /* Can't happen */ - ASSERT(0); - } - stats.events[ref.type]++; - print_stats(&c, &stats, &now); - } - - post_handler(&c, &now);
- migrate_handler(&c); + passt_worker(&c, nfds, events);
goto loop; } -- 2.51.0
-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
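
For readers following the refactor, the split between the wait loop and
passt_worker() can be sketched as below. This is a minimal, self-contained
illustration and not code from passt: struct demo_ctx, union demo_ref,
demo_add(), demo_worker() and demo_loop_once() are hypothetical names, and
an eventfd stands in for the real sockets and tap descriptors. Only the
shape mirrors the patch: the caller runs epoll_wait() and hands nfds plus
the event array to a worker that takes an opaque context pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define DEMO_MAX_EVENTS	8

/* Hypothetical event types, standing in for passt's EPOLL_TYPE_* values */
enum demo_type { DEMO_TYPE_COUNTER = 1, DEMO_TYPE_MAX };

/* Hypothetical reference packed into epoll_event.data.u64 */
union demo_ref {
	struct {
		uint32_t type;
		int32_t fd;
	};
	uint64_t u64;
};

/* Hypothetical execution context, standing in for struct ctx */
struct demo_ctx {
	int epollfd;
	unsigned long handled[DEMO_TYPE_MAX];	/* per-type event counters */
	uint64_t total;				/* sum of eventfd values read */
};

/* Worker: dispatch each ready event by type, like passt_worker() does */
static void demo_worker(void *opaque, int nfds, struct epoll_event *events)
{
	struct demo_ctx *c = opaque;
	int i;

	for (i = 0; i < nfds; i++) {
		union demo_ref ref = { .u64 = events[i].data.u64 };
		uint64_t count;

		switch (ref.type) {
		case DEMO_TYPE_COUNTER:
			if (read(ref.fd, &count, sizeof(count)) ==
			    (ssize_t)sizeof(count))
				c->total += count;
			break;
		default:
			/* Can't happen */
			assert(0);
			continue;
		}
		c->handled[ref.type]++;
	}
}

/* Register fd for EPOLLIN on the context's epoll instance */
static int demo_add(struct demo_ctx *c, enum demo_type type, int fd)
{
	struct epoll_event ev = { .events = EPOLLIN };
	union demo_ref ref;

	ref.u64 = 0;
	ref.type = type;
	ref.fd = fd;
	ev.data.u64 = ref.u64;

	return epoll_ctl(c->epollfd, EPOLL_CTL_ADD, fd, &ev);
}

/* One loop iteration: wait, then delegate all processing to the worker */
static int demo_loop_once(struct demo_ctx *c, int timeout_ms)
{
	struct epoll_event events[DEMO_MAX_EVENTS];
	int nfds = epoll_wait(c->epollfd, events, DEMO_MAX_EVENTS, timeout_ms);

	if (nfds > 0)
		demo_worker(c, nfds, events);

	return nfds;
}
```

Because the worker only sees an opaque pointer, a count and an event array,
the same function could later be driven by any thread that owns its own
epoll instance, which appears to be the point of the extraction.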
On Tue, 21 Oct 2025 23:01:13 +0200
Laurent Vivier wrote:
diff --git a/flow.h b/flow.h
index ef138b83add8..2c58b30ffc6a 100644
--- a/flow.h
+++ b/flow.h
@@ -177,6 +177,7 @@ int flowside_connect(const struct ctx *c, int s,
  * @type:	Type of packet flow
  * @pif[]:	Interface for each side of the flow
  * @side[]:	Information for each side of the flow
+ * @epollid:	epollfd identifier, or EPOLLFD_ID_INVALID
  */
 struct flow_common {
 #ifdef __GNUC__
@@ -192,8 +193,15 @@ struct flow_common {
 #endif
 	uint8_t		pif[SIDES];
 	struct flowside	side[SIDES];
+#define EPOLLFD_ID_BITS 8
+	unsigned int	epollid:EPOLLFD_ID_BITS;
 };
Just to confirm, on top of Jon's series (adding tap_omac[6] before this):

struct tcp_tap_conn {
	struct flow_common         f;                    /*     0    84 */
	/* --- cacheline 1 boundary (64 bytes) was 20 bytes ago --- */

[...]

	/* size: 128, cachelines: 2, members: 19 */
	/* sum members: 115 */
	/* sum bitfield members: 97 bits, bit holes: 1, sum bit holes: 7 bits */
};

...perfect. Tight but we still have 7 bits (in a single chunk) should we
ever need something else.

-- 
Stefano
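
The 8-bit epollid and the ID to epoll fd mapping described in the cover
letter could look roughly like the sketch below. Assumptions are stated
plainly: only EPOLLFD_ID_BITS, EPOLLFD_ID_INVALID (255) and the epollid
field name come from the series. Deriving EPOLLFD_ID_INVALID from the bit
width, EPOLLFD_ID_MAX, struct demo_flow, the lookup table and every demo_*
helper are hypothetical and are not the helpers added by the patches.

```c
#include <assert.h>
#include <stdbool.h>

#define EPOLLFD_ID_BITS		8
/* 255 per the cover letter; deriving it from the width is an assumption */
#define EPOLLFD_ID_INVALID	((1U << EPOLLFD_ID_BITS) - 1)
/* Hypothetical: highest usable ID (254) */
#define EPOLLFD_ID_MAX		(EPOLLFD_ID_INVALID - 1)

/* Hypothetical, heavily trimmed stand-in for struct flow_common */
struct demo_flow {
	unsigned int	type:8;
	unsigned int	epollid:EPOLLFD_ID_BITS;
};

/* Hypothetical ID -> epoll fd table; -1 marks an unused slot */
static int demo_epollfd_by_id[EPOLLFD_ID_MAX + 1];

/* Mark every slot as unused; call once before registering anything */
static void demo_epoll_table_init(void)
{
	unsigned int i;

	for (i = 0; i <= EPOLLFD_ID_MAX; i++)
		demo_epollfd_by_id[i] = -1;
}

/* Associate an epoll fd with an ID; rejects the reserved invalid ID */
static int demo_epoll_register(unsigned int id, int fd)
{
	if (id > EPOLLFD_ID_MAX || fd < 0)
		return -1;

	demo_epollfd_by_id[id] = fd;
	return 0;
}

/* A new flow starts out not registered with any epoll instance */
static void demo_flow_init(struct demo_flow *f)
{
	f->type = 0;
	f->epollid = EPOLLFD_ID_INVALID;
}

/* The ID doubles as the "is this flow in an epoll set?" flag */
static bool demo_flow_in_epoll(const struct demo_flow *f)
{
	return f->epollid != EPOLLFD_ID_INVALID;
}

/* Record which epoll instance manages the flow */
static void demo_flow_set_epollid(struct demo_flow *f, unsigned int id)
{
	assert(id <= EPOLLFD_ID_MAX);
	f->epollid = id;
}

/* Resolve the flow's ID to an epoll fd, or -1 if not registered */
static int demo_flow_epollfd(const struct demo_flow *f)
{
	if (!demo_flow_in_epoll(f))
		return -1;

	return demo_epollfd_by_id[f->epollid];
}
```

Storing an ID rather than the descriptor itself is what keeps the field to
a single byte: an int would cost 32 bits per flow, while the table lookup
adds one array access on the paths that need the actual epoll fd.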
On Tue, 21 Oct 2025 23:01:09 +0200
Laurent Vivier wrote:
This series refactors how epoll file descriptors are managed throughout the codebase in preparation for introducing multithreading support. Currently, passt uses a single global epollfd accessed through the context structure. With multithreading, each thread will need its own epollfd managing its subset of flows.
Applied, with trivial adaptations to merge on top of Jon's "Use true MAC
address of LAN local remote hosts", apologies for the delay.

-- 
Stefano
participants (3)
- David Gibson
- Laurent Vivier
- Stefano Brivio