[PATCH 0/6] RFC: Clean up tap-side event handling
This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime. David Gibson (6): tap: Split out handling of EPOLLIN events tap: Improve handling of EINTR in tap_passt_input() tap: Restructure in tap_pasta_input() tap: Don't risk truncating frames on full buffer in tap_pasta_input() tap: Re-introduce EPOLLET for tap connections tap: Stub EPOLLOUT handling tap.c | 131 +++++++++++++++++++++++++++++++++++++++------------------- 1 file changed, 88 insertions(+), 43 deletions(-) -- 2.46.0
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and
EPOLLERR events, then assume anything left is EPOLLIN. We have some future
cases that may want to also handle EPOLLOUT, so in preparation explicitly
handle EPOLLIN, moving the logic to a subfunction.
Signed-off-by: David Gibson
On Tue, 3 Sep 2024 22:02:30 +1000
David Gibson
Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction.
Signed-off-by: David Gibson
--- tap.c | 50 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 14 deletions(-) diff --git a/tap.c b/tap.c index 852d8376..14c88871 100644 --- a/tap.c +++ b/tap.c @@ -982,24 +982,17 @@ static void tap_sock_reset(struct ctx *c) }
/** - * tap_handler_passt() - Packet handler for AF_UNIX file descriptor + * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context - * @events: epoll events * @now: Current timestamp */ -void tap_handler_passt(struct ctx *c, uint32_t events, - const struct timespec *now) +static void tap_passt_input(struct ctx *c, const struct timespec *now) { static const char *partial_frame; static ssize_t partial_len = 0; ssize_t n; char *p;
- if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) { - tap_sock_reset(c); - return; - } - tap_flush_pools();
if (partial_len) { @@ -1052,20 +1045,33 @@ void tap_handler_passt(struct ctx *c, uint32_t events, }
/** - * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor + * tap_handler_passt() - Event handler for AF_UNIX file descriptor * @c: Execution context * @events: epoll events * @now: Current timestamp */ -void tap_handler_pasta(struct ctx *c, uint32_t events, +void tap_handler_passt(struct ctx *c, uint32_t events, const struct timespec *now) +{ + if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) { + tap_sock_reset(c); + return; + } + + if (events & EPOLLIN) + tap_passt_input(c, now); +} + +/** + * tap_passt_input() - Handler for new data on the socket to qemu
tap_pasta_input(), QEMU (we could just call it hypervisor, given that libkrun/krun also use this path).
+ * @c: Execution context + * @now: Current timestamp + */ +static void tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; int ret;
- if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) - die("Disconnect event on /dev/net/tun device, exiting"); - redo: n = 0;
@@ -1102,6 +1108,22 @@ restart: die("Error on tap device, exiting"); }
+/** + * tap_handler_pasta() - Packet handler for /dev/net/tun file descriptor + * @c: Execution context + * @events: epoll events + * @now: Current timestamp + */ +void tap_handler_pasta(struct ctx *c, uint32_t events, + const struct timespec *now) +{ + if (events & (EPOLLRDHUP | EPOLLHUP | EPOLLERR)) + die("Disconnect event on /dev/net/tun device, exiting"); + + if (events & EPOLLIN) + tap_pasta_input(c, now); +} + /** * tap_sock_unix_open() - Create and bind AF_UNIX socket * @sock_path: Socket path. If empty, set on return (UNIX_SOCK_PATH as prefix)
-- Stefano
On Tue, Sep 03, 2024 at 09:25:35PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:30 +1000 David Gibson
wrote: Currently, tap_handler_pas{st,ta}() check for EPOLLRDHUP, EPOLLHUP and EPOLLERR events, then assume anything left is EPOLLIN. We have some future cases that may want to also handle EPOLLOUT, so in preparation explicitly handle EPOLLIN, moving the logic to a subfunction.
Signed-off-by: David Gibson
[snip] +/** + * tap_passt_input() - Handler for new data on the socket to qemu tap_pasta_input(), QEMU (we could just call it hypervisor, given that libkrun/krun also use this path).
Updated. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
When tap_passt_input() gets an error from recv() it (correctly) does not
print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all
three cases it returns from the function. That makes sense for EAGAIN and
EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before
trying again. For EINTR, however, it makes more sense to retry immediately
- as it stands we're likely to get a renewer EPOLLIN event immediately in
that case, since we're using level triggered signalling.
So, handle EINTR separately by immediately retrying until we succeed or
get a different type of error.
Signed-off-by: David Gibson
On Tue, 3 Sep 2024 22:02:31 +1000
David Gibson
When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewer EPOLLIN event immediately in that case, since we're using level triggered signalling.
So, handle EINTR separately by immediately retrying until we succeed or get a different type of error.
I don't see an actual improvement: we don't know why we would get EINTR (because of signals) so repeating the recv() right away isn't necessarily a better choice. I'd say whatever way we have of carrying on on EINTR is fine. If this patch helps with brevity for the next ones, it makes sense, but otherwise I don't see a real advantage. Well, it's more consistent with the way we handle EINTR on other calls, but that's about it.
Signed-off-by: David Gibson
--- tap.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/tap.c b/tap.c index 14c88871..9ee59faa 100644 --- a/tap.c +++ b/tap.c @@ -1003,10 +1003,13 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) memmove(pkt_buf, partial_frame, partial_len); }
- n = recv(c->fd_tap, pkt_buf + partial_len, TAP_BUF_BYTES - partial_len, - MSG_DONTWAIT); + do { + n = recv(c->fd_tap, pkt_buf + partial_len, + TAP_BUF_BYTES - partial_len, MSG_DONTWAIT); + } while ((n < 0) && errno == EINTR); + if (n < 0) { - if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) { + if (errno != EAGAIN && errno != EWOULDBLOCK) { err_perror("Receive error on guest connection, reset"); tap_sock_reset(c); }
-- Stefano
On Tue, Sep 03, 2024 at 09:25:39PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:31 +1000 David Gibson
wrote: When tap_passt_input() gets an error from recv() it (correctly) does not print any error message for EINTR, EAGAIN or EWOULDBLOCK. However in all three cases it returns from the function. That makes sense for EAGAIN and EWOULDBLOCK, since we then want to wait for the next EPOLLIN event before trying again. For EINTR, however, it makes more sense to retry immediately - as it stands we're likely to get a renewer EPOLLIN event immediately in that case, since we're using level triggered signalling.
So, handle EINTR separately by immediately retrying until we succeed or get a different type of error.
I don't see an actual improvement: we don't know why we would get EINTR (because of signals) so repeating the recv() right away isn't necessarily a better choice.
My understanding is that EINTR means the system call was aborted because a signal handler happened during it. It usually doesn't matter why we got the signal - we still need to redo the system call and we might as well do so immediately.
I'd say whatever way we have of carrying on on EINTR is fine. If this patch helps with brevity for the next ones, it makes sense, but otherwise I don't see a real advantage.
Right. Specifically in order to use EPOLLET, I need to _not_ treat EINTR the same way as EAGAIN and EWOULDBLOCK.
Well, it's more consistent with the way we handle EINTR on other calls, but that's about it.
That too. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
tap_pasta_input() has a rather confusing structure, using two gotos.
Remove these by restructuring the function to have the main loop condition
based on filling our buffer space, with errors or running out of data
treated as the exception, rather than the other way around. This allows
us to handle the EINTR which triggered the 'restart' goto with a continue.
The outer 'redo' was triggered if we completely filled our buffer, to flush
it and do another pass. This one is unnecessary since we don't (yet) use
EPOLLET on the tap device: if there's still more data we'll get another
event and re-enter the loop.
Along the way handle a couple of extra edge cases:
- Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future
ports where those might not have the same value
- Detect EOF on the tap device and exit in that case
Signed-off-by: David Gibson
On Tue, 3 Sep 2024 22:02:32 +1000
David Gibson
tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue.
The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop.
Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case
Signed-off-by: David Gibson
--- tap.c | 42 +++++++++++++++++------------------------- 1 file changed, 17 insertions(+), 25 deletions(-) diff --git a/tap.c b/tap.c index 9ee59faa..71742748 100644 --- a/tap.c +++ b/tap.c @@ -1073,42 +1073,34 @@ void tap_handler_passt(struct ctx *c, uint32_t events, static void tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; - int ret; - -redo: - n = 0;
tap_flush_pools(); -restart: - while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
- if (len < (ssize_t)sizeof(struct ethhdr) || - len > (ssize_t)ETH_MAX_MTU) { - n += len; - continue; + for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { + len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + + if (len <= 0) { + if (errno == EINTR) {
You might be checking errno when read() returns 0, in which case it's not set. I guess you should zero out errno before read() if you want to keep this, or, alternatively, directly handle len == 0 here.
+ len = 0; + continue; + } + break; }
+ /* Ignore frames of bad length */ + if (len < (ssize_t)sizeof(struct ethhdr) || + len > (ssize_t)ETH_MAX_MTU) + continue;
tap_add_packet(c, len, pkt_buf + n); - - if ((n += len) == TAP_BUF_BYTES) - break; }
- if (len < 0 && errno == EINTR) - goto restart; - - ret = errno; + if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK) + die("Error on tap device, exiting"); + else if (len == 0) + die("EOF on tap device, exiting");
tap_handler(c, now); - - if (len > 0 || ret == EAGAIN) - return; - - if (n == TAP_BUF_BYTES) - goto redo; - - die("Error on tap device, exiting"); }
/**
-- Stefano
On Tue, Sep 03, 2024 at 09:25:42PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:32 +1000 David Gibson
wrote: tap_pasta_input() has a rather confusing structure, using two gotos. Remove these by restructuring the function to have the main loop condition based on filling our buffer space, with errors or running out of data treated as the exception, rather than the other way around. This allows us to handle the EINTR which triggered the 'restart' goto with a continue.
The outer 'redo' was triggered if we completely filled our buffer, to flush it and do another pass. This one is unnecessary since we don't (yet) use EPOLLET on the tap device: if there's still more data we'll get another event and re-enter the loop.
Along the way handle a couple of extra edge cases: - Check for EWOULDBLOCK as well as EAGAIN for the benefit of any future ports where those might not have the same value - Detect EOF on the tap device and exit in that case
Signed-off-by: David Gibson
--- tap.c | 42 +++++++++++++++++------------------------- 1 file changed, 17 insertions(+), 25 deletions(-) diff --git a/tap.c b/tap.c index 9ee59faa..71742748 100644 --- a/tap.c +++ b/tap.c @@ -1073,42 +1073,34 @@ void tap_handler_passt(struct ctx *c, uint32_t events, static void tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len; - int ret; - -redo: - n = 0;
tap_flush_pools(); -restart: - while ((len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n)) > 0) {
- if (len < (ssize_t)sizeof(struct ethhdr) || - len > (ssize_t)ETH_MAX_MTU) { - n += len; - continue; + for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { + len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + + if (len <= 0) { + if (errno == EINTR) {
You might be checking errno when read() returns 0, in which case it's not set. I guess you should zero out errno before read() if you want to keep this, or, alternatively, directly handle len == 0 here.
Good catch. I've re-organised to avoid that.
+ len = 0; + continue; + } + break; }
+ /* Ignore frames of bad length */ + if (len < (ssize_t)sizeof(struct ethhdr) || + len > (ssize_t)ETH_MAX_MTU) + continue;
tap_add_packet(c, len, pkt_buf + n); - - if ((n += len) == TAP_BUF_BYTES) - break; }
- if (len < 0 && errno == EINTR) - goto restart; - - ret = errno; + if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK) + die("Error on tap device, exiting"); + else if (len == 0) + die("EOF on tap device, exiting");
tap_handler(c, now); - - if (len > 0 || ret == EAGAIN) - return; - - if (n == TAP_BUF_BYTES) - goto redo; - - die("Error on tap device, exiting"); }
/**
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
tap_pasta_input() keeps reading frames from the tap device until the
buffer is full. However, this has an ugly edge case, when we get close
to buffer full, we will provide just the remaining space as a read()
buffer. If this is shorter than the next frame to read, the tap device
will truncate the frame and discard the remainder.
Adjust the code to make sure we always have room for a maximum size frame.
Signed-off-by: David Gibson
On Tue, 3 Sep 2024 22:02:33 +1000
David Gibson
tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder.
Adjust the code to make sure we always have room for a maximum size frame.
Signed-off-by: David Gibson
--- tap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tap.c b/tap.c index 71742748..2fbcef04 100644 --- a/tap.c +++ b/tap.c @@ -1076,8 +1076,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
tap_flush_pools();
- for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { - len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + for (n = 0; n < (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
Logically speaking, this should be TAP_BUF_BYTES - ETH_MAX_MTU + 1, I guess, because if we have ETH_MAX_MTU bytes left, we can read another frame. Only if we have strictly less than that we might truncate one. In any case, not a strong preference, and perhaps this version is actually more readable.
+ len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
if (len <= 0) { if (errno == EINTR) {
-- Stefano
On Tue, Sep 03, 2024 at 09:25:46PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:33 +1000 David Gibson
wrote: tap_pasta_input() keeps reading frames from the tap device until the buffer is full. However, this has an ugly edge case, when we get close to buffer full, we will provide just the remaining space as a read() buffer. If this is shorter than the next frame to read, the tap device will truncate the frame and discard the remainder.
Adjust the code to make sure we always have room for a maximum size frame.
Signed-off-by: David Gibson
--- tap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tap.c b/tap.c index 71742748..2fbcef04 100644 --- a/tap.c +++ b/tap.c @@ -1076,8 +1076,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now)
tap_flush_pools();
- for (n = 0; n < (ssize_t)TAP_BUF_BYTES; n += len) { - len = read(c->fd_tap, pkt_buf + n, TAP_BUF_BYTES - n); + for (n = 0; n < (ssize_t)TAP_BUF_BYTES - ETH_MAX_MTU; n += len) {
Logically speaking, this should be TAP_BUF_BYTES - ETH_MAX_MTU + 1, I guess, because if we have ETH_MAX_MTU bytes left, we can read another frame. Only if we have strictly less than that we might truncate one.
Good point. I've changed the < to <= to correct.
In any case, not a strong preference, and perhaps this version is actually more readable.
+ len = read(c->fd_tap, pkt_buf + n, ETH_MAX_MTU);
if (len <= 0) { if (errno == EINTR) {
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only
used level-triggered events for the tap device. Prior to that we used it
inconsistently which caused some problems.
We want to add support for EPOLLOUT events on the tap connection, and
without EPOLLET that would require toggling EPOLLOUT on and off, which is
awkward. So, re-introduce EPOLLET, but now use it uniformly for all tap
modes. The main change this requires is making sure on EPOLLIN we loop
until all there's no more data to process.
Signed-off-by: David Gibson
On Tue, 3 Sep 2024 22:02:34 +1000
David Gibson
Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only used level-triggered events for the tap device. Prior to that we used it inconsistently which caused some problems.
It didn't actually cause any issue according to your commit message for 4684f603446b itself.
We want to add support for EPOLLOUT events on the tap connection, and without EPOLLET that would require toggling EPOLLOUT on and off, which is awkward. So, re-introduce EPOLLET, but now use it uniformly for all tap modes. The main change this requires is making sure on EPOLLIN we loop until all there's no more data to process.
Signed-off-by: David Gibson
--- tap.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/tap.c b/tap.c index 2fbcef04..d7d3fc19 100644 --- a/tap.c +++ b/tap.c @@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c) * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_passt_input(struct ctx *c, const struct timespec *now) +static bool tap_passt_input(struct ctx *c, const struct timespec *now) { static const char *partial_frame; static ssize_t partial_len = 0; @@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) err_perror("Receive error on guest connection, reset"); tap_sock_reset(c); } - return; + return false; }
p = pkt_buf; @@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) { err("Bad frame size from guest, resetting connection"); tap_sock_reset(c); - return; + return false; }
if (l2len + sizeof(uint32_t) > (size_t)n) @@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) partial_frame = p;
tap_handler(c, now); + + return true; }
/** @@ -1062,15 +1066,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events, }
if (events & EPOLLIN) - tap_passt_input(c, now); + while (tap_passt_input(c, now)) + ;
Nit (same below): use curly brackets for multi-line block. Or just use: while (tap_passt_input(c, now));
}
/** * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_pasta_input(struct ctx *c, const struct timespec *now) +static bool tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len;
@@ -1101,6 +1108,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now) die("EOF on tap device, exiting");
tap_handler(c, now); + + return len > 0; }
/** @@ -1116,7 +1125,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, die("Disconnect event on /dev/net/tun device, exiting");
if (events & EPOLLIN) - tap_pasta_input(c, now); + while (tap_pasta_input(c, now)) + ; }
/** @@ -1250,7 +1260,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events) trace("tap: failed to set SO_SNDBUF to %i", v);
ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1306,7 +1316,7 @@ static void tap_sock_tun_init(struct ctx *c) pasta_ns_conf(c);
ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1339,7 +1349,7 @@ void tap_sock_init(struct ctx *c) else ref.type = EPOLL_TYPE_TAP_PASTA;
- ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); return;
-- Stefano
On Tue, Sep 03, 2024 at 09:25:50PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:34 +1000 David Gibson
wrote: Since 4684f603446b ("tap: Don't use EPOLLET on Qemu sockets") we've only used level-triggered events for the tap device. Prior to that we used it inconsistently which caused some problems.
It didn't actually cause any issue according to your commit message for 4684f603446b itself.
I've reworded this.
We want to add support for EPOLLOUT events on the tap connection, and without EPOLLET that would require toggling EPOLLOUT on and off, which is awkward. So, re-introduce EPOLLET, but now use it uniformly for all tap modes. The main change this requires is making sure on EPOLLIN we loop until all there's no more data to process.
Signed-off-by: David Gibson
--- tap.c | 28 +++++++++++++++++++--------- 1 file changed, 19 insertions(+), 9 deletions(-) diff --git a/tap.c b/tap.c index 2fbcef04..d7d3fc19 100644 --- a/tap.c +++ b/tap.c @@ -985,8 +985,10 @@ static void tap_sock_reset(struct ctx *c) * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_passt_input(struct ctx *c, const struct timespec *now) +static bool tap_passt_input(struct ctx *c, const struct timespec *now) { static const char *partial_frame; static ssize_t partial_len = 0; @@ -1013,7 +1015,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) err_perror("Receive error on guest connection, reset"); tap_sock_reset(c); } - return; + return false; }
p = pkt_buf; @@ -1025,7 +1027,7 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) { err("Bad frame size from guest, resetting connection"); tap_sock_reset(c); - return; + return false; }
if (l2len + sizeof(uint32_t) > (size_t)n) @@ -1045,6 +1047,8 @@ static void tap_passt_input(struct ctx *c, const struct timespec *now) partial_frame = p;
tap_handler(c, now); + + return true; }
/** @@ -1062,15 +1066,18 @@ void tap_handler_passt(struct ctx *c, uint32_t events, }
if (events & EPOLLIN) - tap_passt_input(c, now); + while (tap_passt_input(c, now)) + ;
Nit (same below): use curly brackets for multi-line block. Or just use:
while (tap_passt_input(c, now));
Fixed.
}
/** * tap_passt_input() - Handler for new data on the socket to qemu * @c: Execution context * @now: Current timestamp + * + * Return: true if there may be additional data to read, false otherwise */ -static void tap_pasta_input(struct ctx *c, const struct timespec *now) +static bool tap_pasta_input(struct ctx *c, const struct timespec *now) { ssize_t n, len;
@@ -1101,6 +1108,8 @@ static void tap_pasta_input(struct ctx *c, const struct timespec *now) die("EOF on tap device, exiting");
tap_handler(c, now); + + return len > 0; }
/** @@ -1116,7 +1125,8 @@ void tap_handler_pasta(struct ctx *c, uint32_t events, die("Disconnect event on /dev/net/tun device, exiting");
if (events & EPOLLIN) - tap_pasta_input(c, now); + while (tap_pasta_input(c, now)) + ; }
/** @@ -1250,7 +1260,7 @@ void tap_listen_handler(struct ctx *c, uint32_t events) trace("tap: failed to set SO_SNDBUF to %i", v);
ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1306,7 +1316,7 @@ static void tap_sock_tun_init(struct ctx *c) pasta_ns_conf(c);
ref.fd = c->fd_tap; - ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); } @@ -1339,7 +1349,7 @@ void tap_sock_init(struct ctx *c) else ref.type = EPOLL_TYPE_TAP_PASTA;
- ev.events = EPOLLIN | EPOLLRDHUP; + ev.events = EPOLLIN | EPOLLRDHUP | EPOLLET; ev.data.u64 = ref.u64; epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev); return;
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Process EPOLLOUT events. For now this is just a stub, and needs to be
extended to call into the various protocols to actually do something
useful.
Signed-off-by: David Gibson
On Tue, 3 Sep 2024 22:02:29 +1000
David Gibson
This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.
Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement. I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything. That is, once/if we come up with fixes for those, as they might involve setting different event masks, I'd rather have those in *before* this series, to avoid further noise in case we manage to break something else with those hypothetical fixes. -- Stefano
On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson
wrote: This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.
Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.
I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.
I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP. I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.
That is, once/if we come up with fixes for those, as they might involve setting different event masks, I'd rather have those in *before* this series, to avoid further noise in case we manage to break something else with those hypothetical fixes.
Right, I understand the impetus. Although as I said I find the current TCP event handling nigh-incomprehensible so I'm not as yet confident we can find a small fix without cleaning up the event handling more generally. That said, these changes to tap side event handling are a prerequisite / preliminary and shouldn't as yet really alter the TCP event flow. So I don't think this series will of itself make bisection harder, although follow on things based on it might. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Wed, 4 Sep 2024 13:17:53 +1000
David Gibson
On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson
wrote: This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.
Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.
I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.
I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP.
I don't see at the moment anything indicating TCP issues other than the one you addressed with your tentative debug patch at: https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc Given that, with that patch, we had at least another report of event storms, this time on UDP, that is, the one from: https://github.com/containers/podman/issues/23686#issuecomment-2324945010 I shared this other one on top: https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae
I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.
Sure, it's going to be simpler and more robust, but on the other hand we wouldn't notice these kind of issues.
That is, once/if we come up with fixes for those, as they might involve setting different event masks, I'd rather have those in *before* this series, to avoid further noise in case we manage to break something else with those hypothetical fixes.
Right, I understand the impetus. Although as I said I find the current TCP event handling nigh-incomprehensible so I'm not as yet confident we can find a small fix without cleaning up the event handling more generally.
I'm not sure either, but I don't think we have any indication, at the moment, that any of the issues from those two tickets have anything to do with TCP event handling (minus the one you tentatively fixed).
That said, these changes to tap side event handling are a prerequisite / preliminary and shouldn't as yet really alter the TCP event flow. So I don't think this series will of itself make bisection harder, although follow on things based on it might.
I understand that they shouldn't alter it, but if we missed something subtle and they actually do, they'll make bisection more complicated. If this series is only needed for switching TCP sockets to EPOLLET (well, minus 4/6, which is a fix on its own), maybe we could wait until you have the whole thing ready (and, hopefully, we manage to fix those two tickets meanwhile)? -- Stefano
On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:
On Wed, 4 Sep 2024 13:17:53 +1000 David Gibson
wrote: On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson
wrote: This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.
Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.
I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.
I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP.
I don't see at the moment anything indicating TCP issues other than the one you addressed with your tentative debug patch at:
https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc
Given that, with that patch, we had at least another report of event storms, this time on UDP, that is, the one from:
https://github.com/containers/podman/issues/23686#issuecomment-2324945010
I shared this other one on top:
https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae
Ah, nice.
I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.
Sure, it's going to be simpler and more robust, but on the other hand we wouldn't notice these kind of issues.
Uh.. I'm confused. In what way would we not notice issues, other than the issues not existing which.. would be good, right?
That is, once/if we come up with fixes for those, as they might involve setting different event masks, I'd rather have those in *before* this series, to avoid further noise in case we manage to break something else with those hypothetical fixes.
Right, I understand the impetus. Although as I said I find the current TCP event handling nigh-incomprehensible so I'm not as yet confident we can find a small fix without cleaning up the event handling more generally.
I'm not sure either, but I don't think we have any indication, at the moment, that any of the issues from those two tickets have anything to do with TCP event handling (minus the one you tentatively fixed).
Right, this reasoning is pretty much specific to the EPOLLRDHUP storm. I may have written some of the descriptions before registering that the EPOLLERR storm was UDP and therefore unrelated.
That said, these changes to tap side event handling are a prerequisite / preliminary and shouldn't as yet really alter the TCP event flow. So I don't think this series will of itself make bisection harder, although follow on things based on it might.
I understand that they shouldn't alter it, but if we missed something subtle and they actually do, they'll make bisection more complicated.
I guess. Seems pretty unlikely to me given this doesn't touch the TCP events themselves.
If this series is only needed for switching TCP sockets to EPOLLET (well, minus 4/6, which is a fix on its own), maybe we could wait until you have the whole thing ready (and, hopefully, we manage to fix those two tickets meanwhile)?
Right, I'm ok to wait on this until I have the whole picture including TCP event masks as well. That's kind of why it's an RFC. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 5 Sep 2024 10:35:14 +1000
David Gibson
On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:
On Wed, 4 Sep 2024 13:17:53 +1000 David Gibson
wrote: On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson
wrote: This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.
Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.
I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.
I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP.
I don't see at the moment anything indicating TCP issues other than the one you addressed with your tentative debug patch at:
https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc
Given that, with that patch, we had at least another report of event storms, this time on UDP, that is, the one from:
https://github.com/containers/podman/issues/23686#issuecomment-2324945010
I shared this other one on top:
https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae
Ah, nice.
I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.
Sure, it's going to be simpler and more robust, but on the other hand we wouldn't notice these kind of issues.
Uh.. I'm confused. In what way would we not notice issues, other than the issues not existing which.. would be good, right?
Right now, we have a condition where we fail to handle EPOLLRDHUP before an outbound connection is established, see https://github.com/containers/podman/issues/23686#issuecomment-233023742, and we end up in a tight event processing loop. I guess what we're missing in tcp_sock_handler() is clear (we should orderly close the connection), but the tight loop didn't happen on 2024_06_24.1ee2eca (I'm bisecting right now) and we don't know why it didn't. If we set EPOLLET, we won't see that anymore, because the EPOLLRDHUP event is reported just once, but that doesn't mean we solved this. -- Stefano
On Thu, 5 Sep 2024 10:32:38 +0200
Stefano Brivio
On Thu, 5 Sep 2024 10:35:14 +1000 David Gibson
wrote: On Wed, Sep 04, 2024 at 07:19:22PM +0200, Stefano Brivio wrote:
On Wed, 4 Sep 2024 13:17:53 +1000 David Gibson
wrote: On Tue, Sep 03, 2024 at 09:25:54PM +0200, Stefano Brivio wrote:
On Tue, 3 Sep 2024 22:02:29 +1000 David Gibson
wrote: This is a draft patch working towards adding EPOLLOUT handling to the tap code, which could then be used to "unstick" flows which have unsent data from the socket side. For now that's just a stub, but makes what I think are some worthwhile cleanups to the tap side event handling in the meantime.
Except for the issue in 3/6 and nits elsewhere, it all makes sense and tap-side EPOLLOUT handling is definitely going to be an improvement.
I wonder if it's the right moment for this kind of series, though, in terms of future bisections, as long as we're grappling with https://github.com/containers/podman/issues/23686 and https://bugs.passt.top/show_bug.cgi?id=94. Assuming, of course, that this series doesn't fix anything.
I don't think this series will fix anything as it stands. It is, indirectly, aimed at addressing bug 94. I'm struggling to figure out what to do with bug 94, because I find it almost impossible to reason about the current event masks in TCP.
I don't see at the moment anything indicating TCP issues other than the one you addressed with your tentative debug patch at:
https://passt.top/passt/commit/?h=podman23686&id=026fb71d1dde60135d95741552906fd5320384bc
Given that, with that patch, we had at least another report of event storms, this time on UDP, that is, the one from:
https://github.com/containers/podman/issues/23686#issuecomment-2324945010
I shared this other one on top:
https://passt.top/passt/commit/?h=podman23686&id=0c6c20dee5c24bd324834a99f409ad43c50812ae
Ah, nice.
I'd really like to simplify them so it's clearer what's correct and not and I think the most obvious path to doing so is using EPOLLET all the time. That requires some sort of kick when the tap is ready to accept more data, hence this series as a prerequisite.
Sure, it's going to be simpler and more robust, but on the other hand we wouldn't notice these kind of issues.
Uh.. I'm confused. In what way would we not notice issues, other than the issues not existing which.. would be good, right?
Right now, we have a condition where we fail to handle EPOLLRDHUP before an outbound connection is established, see https://github.com/containers/podman/issues/23686#issuecomment-233023742,
Sorry, it's: https://github.com/containers/podman/issues/23686#issuecomment-2330237424
and we end up in a tight event processing loop.
I guess what we're missing in tcp_sock_handler() is clear (we should orderly close the connection), but the tight loop didn't happen on 2024_06_24.1ee2eca (I'm bisecting right now) and we don't know why it didn't.
If we set EPOLLET, we won't see that anymore, because the EPOLLRDHUP event is reported just once, but that doesn't mean we solved this.
-- Stefano
participants (2)
-
David Gibson
-
Stefano Brivio