[PATCH 0/5] Fix assorted errors in the Qemu socket tap receive handler
A long time ago Matej Hrica pointed out a possible buffer overrun when receiving data from the qemu socket. Stefano recently proposed a fix for this, but I don't think it's quite right. This series is a different approach to fixing that problem and a few adjacent ones.

David Gibson (5):
  tap: Better report errors receiving from QEMU socket
  tap: Don't attempt to carry on if we get a bad frame length from qemu
  tap: Don't use EPOLLET on Qemu sockets
  tap: Correctly handle frames of odd length
  tap: Improve handling of partially received frames on qemu socket

 passt.h |  1 -
 tap.c   | 72 ++++++++++++++++++++++++++++++---------------------------
 util.h  | 16 +++++++++++++
 3 files changed, 54 insertions(+), 35 deletions(-)

--
2.45.2
If we get an error on recv() from the QEMU socket, we currently don't
print any kind of error. Although this can happen in a non-fatal situation
such as a guest restarting, it's unusual enough that we really should report
something for debuggability.
Add an error message in this case. Also always report when the qemu
connection closes for any reason, not just when it will cause us to exit
(--one-off).
Signed-off-by: David Gibson
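
For reference, a minimal sketch of how the receive path looks with this change, pieced together from the diff quoted below; it relies on the existing tap.c context, and the reset on a hard error is implied by the follow-up discussion rather than shown in the quoted hunk, so treat that call as an assumption:

        /* Sketch only: error reporting in the tap receive path.  The real
         * code is in tap.c; tap_sock_reset() on a hard error is assumed from
         * the discussion below, not from the quoted diff.
         */
        n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);

        if (n < 0) {
                if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
                        err_perror("Error receiving from QEMU socket");
                        tap_sock_reset(c);      /* assumed: drop the connection */
                }
                return;
        }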
On Fri, 26 Jul 2024 17:20:27 +1000
David Gibson
If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we really should report something for debuggability.
Add an error message in this case. Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off).
Signed-off-by: David Gibson
---
 tap.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c
index 44bd4445..a2ae7c2a 100644
--- a/tap.c
+++ b/tap.c
@@ -969,10 +969,10 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
  */
 static void tap_sock_reset(struct ctx *c)
 {
-        if (c->one_off) {
-                info("Client closed connection, exiting");
+        info("Client connection closed%s", c->one_off ? ", exiting" : "");
+
+        if (c->one_off)
                 exit(EXIT_SUCCESS);
-        }

         /* Close the connected socket, wait for a new connection */
         epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
@@ -1005,8 +1005,10 @@ redo:
         n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);

         if (n < 0) {
-                if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
+                if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
+                        err_perror("Error receiving from QEMU socket");
Nit I'm fixing up on merge: it's not necessarily QEMU, because libkrun and perhaps others also use this path (in fact, the whole problem was reported as part of the libkrun integration).

Maybe it's obvious to users anyway, but this might seriously confuse somebody who's using e.g. krun on Asahi Linux (is QEMU running, one might wonder):

  https://github.com/slp/krun#motivation
  https://github.com/slp/krun/blob/main/crates/krun/src/net.rs

On top of that, now that we have an error message, I guess it would be nice to state we're resetting the connection, because it's not really obvious: the subsequent message from tap_sock_reset() makes it look like the client decided to close the connection on its own.

So I'm changing this to:

        err("Error receiving from guest, resetting connection");

...if you see an issue with it, I'll send another patch.

--
Stefano
On Fri, 26 Jul 2024 13:25:08 +0200
Stefano Brivio
On Fri, 26 Jul 2024 17:20:27 +1000 David Gibson
wrote: If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we really should report something for debuggability.
Add an error message in this case. Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off).
Signed-off-by: David Gibson
---
 tap.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c
index 44bd4445..a2ae7c2a 100644
--- a/tap.c
+++ b/tap.c
@@ -969,10 +969,10 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
  */
 static void tap_sock_reset(struct ctx *c)
 {
-        if (c->one_off) {
-                info("Client closed connection, exiting");
+        info("Client connection closed%s", c->one_off ? ", exiting" : "");
+
+        if (c->one_off)
                 exit(EXIT_SUCCESS);
-        }

         /* Close the connected socket, wait for a new connection */
         epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
@@ -1005,8 +1005,10 @@ redo:
         n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);

         if (n < 0) {
-                if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
+                if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
+                        err_perror("Error receiving from QEMU socket");
Nit I'm fixing up on merge: it's not necessarily QEMU, because libkrun and perhaps others also use this path (in fact, the whole problem was reported as part of the libkrun integration).
Maybe it's obvious to users anyway, but this might seriously confuse somebody who's using e.g. krun on Asahi Linux (is QEMU running, one might wonder):

  https://github.com/slp/krun#motivation
  https://github.com/slp/krun/blob/main/crates/krun/src/net.rs
On top of that, now that we have an error message, I guess it would be nice to state we're resetting the connection, because it's not really obvious: the subsequent message from tap_sock_reset() makes it look like the client decided to close the connection on its own.
So I'm changing this to:
err("Error receiving from guest, resetting connection");
...or rather:

        err_perror("Receive error on guest connection, reset");

--
Stefano
On Fri, Jul 26, 2024 at 01:25:08PM +0200, Stefano Brivio wrote:
On Fri, 26 Jul 2024 17:20:27 +1000 David Gibson
wrote: If we get an error on recv() from the QEMU socket, we currently don't print any kind of error. Although this can happen in a non-fatal situation such as a guest restarting, it's unusual enough that we really should report something for debuggability.
Add an error message in this case. Also always report when the qemu connection closes for any reason, not just when it will cause us to exit (--one-off).
Signed-off-by: David Gibson
---
 tap.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c
index 44bd4445..a2ae7c2a 100644
--- a/tap.c
+++ b/tap.c
@@ -969,10 +969,10 @@ void tap_add_packet(struct ctx *c, ssize_t l2len, char *p)
  */
 static void tap_sock_reset(struct ctx *c)
 {
-        if (c->one_off) {
-                info("Client closed connection, exiting");
+        info("Client connection closed%s", c->one_off ? ", exiting" : "");
+
+        if (c->one_off)
                 exit(EXIT_SUCCESS);
-        }

         /* Close the connected socket, wait for a new connection */
         epoll_ctl(c->epollfd, EPOLL_CTL_DEL, c->fd_tap, NULL);
@@ -1005,8 +1005,10 @@ redo:
         n = recv(c->fd_tap, p, TAP_BUF_FILL, MSG_DONTWAIT);

         if (n < 0) {
-                if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK)
+                if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
+                        err_perror("Error receiving from QEMU socket");
Nit I'm fixing up on merge: it's not necessarily QEMU, because libkrun and perhaps others also use this path (in fact, the whole problem was reported as part of the libkrun integration).
Fair point. I'm not sure what term to use to describe specifically a socket using this qemu-originated protocol, though.
Maybe it's obvious to users anyway, but this might seriously confuse somebody who's using e.g. krun on Asahi Linux (is QEMU running, one might wonder):

  https://github.com/slp/krun#motivation
  https://github.com/slp/krun/blob/main/crates/krun/src/net.rs
On top of that, now that we have an error message, I guess it would be nice to state we're resetting the connection, because it's not really obvious: the subsequent message from tap_sock_reset() makes it look like the client decided to close the connection on its own.
That's why I changed the message in tap_sock_reset() from "client closed connection" to "client connection closed".
So I'm changing this to:
err("Error receiving from guest, resetting connection");
...if you see an issue with it, I'll send another patch.
Ok.

--
David Gibson
If we receive a too-short or too-long frame from the QEMU socket, currently
we try to skip it and carry on. That sounds sensible at first blush, but
probably isn't wise in practice. If this happens, either (a) qemu has done
something seriously unexpected, or (b) we've received corrupt data over a
Unix socket. Or more likely (c), we have a bug elsewhere which has put us
out of sync with the stream, so we're trying to read something that's not a
frame length as a frame length.
Neither (b) nor (c) is really salvageable with the same stream. Case (a)
might be ok, but we can no longer be confident qemu won't do something else
we can't cope with.
So, instead of just skipping the frame and trying to carry on, log an error
and close the socket. As a bonus, establishing firm bounds on l2len early
will allow simplifications to how we deal with the case where a partial
frame is recv()ed.
Signed-off-by: David Gibson
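
A minimal sketch of the check this patch adds, following the diff quoted below; whether the failure path calls tap_sock_reset() directly is not visible in the truncated quote, so that part is an assumption based on the commit message:

        /* Sketch: sanity-check the length prefix before trusting it.  Bounds
         * and message as in the diff below; the reset is assumed from "log an
         * error and close the socket" above.  Note the *(uint32_t *)p cast is
         * exactly what patch 4/5 later replaces with an unaligned-safe load.
         */
        uint32_t l2len = ntohl(*(uint32_t *)p);

        if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
                err("Invalid frame size from QEMU (corrupt stream?)");
                tap_sock_reset(c);      /* assumed: give up on the stream */
                return;
        }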
On Fri, 26 Jul 2024 17:20:28 +1000
David Gibson
If we receive a too-short or too-long frame from the QEMU socket, currently we try to skip it and carry on. That sounds sensible at first blush, but probably isn't wise in practice. If this happens, either (a) qemu has done something seriously unexpected, or (b) we've received corrupt data over a Unix socket. Or more likely (c), we have a bug elsewhere which has put us out of sync with the stream, so we're trying to read something that's not a frame length as a frame length.
Neither (b) nor (c) is really salvageable with the same stream. Case (a) might be ok, but we can no longer be confident qemu won't do something else we can't cope with.
So, instead of just skipping the frame and trying to carry on, log an error and close the socket. As a bonus, establishing firm bounds on l2len early will allow simplifications to how we deal with the case where a partial frame is recv()ed.
Signed-off-by: David Gibson
---
 tap.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/tap.c b/tap.c
index a2ae7c2a..08700f65 100644
--- a/tap.c
+++ b/tap.c
@@ -1013,7 +1013,13 @@ redo:
         }

         while (n > (ssize_t)sizeof(uint32_t)) {
-                ssize_t l2len = ntohl(*(uint32_t *)p);
+                uint32_t l2len = ntohl(*(uint32_t *)p);
+
+                if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) {
+                        err("Invalid frame size from QEMU (corrupt stream?)");
Same as my comment to 1/5, let me change this to:

        err("Bad frame size from guest, resetting connection");

--
Stefano
Currently we set EPOLLET (edge trigger) on the epoll flags for the
connected Qemu Unix socket. It's not clear that there's a reason for
doing this: for TCP sockets we need to use EPOLLET, because we leave data
in the socket buffers for our flow control handling. That consideration
doesn't apply to the way we handle the qemu socket however.
Furthermore, using EPOLLET causes additional complications:
1) We don't set EPOLLET when opening /dev/net/tun for pasta mode; however,
we *do* set it when using pasta mode with --fd. This inconsistency
doesn't seem to have broken anything, but it's odd.
2) EPOLLET requires that tap_handler_passt() loop until all data available
is read (otherwise we may have data in the buffer but never get an event
causing us to read it). We do that with a rather ugly goto.
Worse, our condition for that goto appears to be incorrect. We'll only
loop if rem is non-zero, which will only happen if we perform a blocking
recv() for a partially received frame. We'll only perform that second
recv() if the original recv() resulted in a partially read frame. As
far as I can tell the original recv() could end on a frame boundary
(never triggering the second recv()) even if there is additional data in
the socket buffer. In that circumstance we wouldn't goto redo and could
leave unprocessed frames in the qemu socket buffer indefinitely.
This doesn't seem to have caused any problems in practice, but since
there's no obvious reason to use EPOLLET here anyway, we might as well
get rid of it.
Signed-off-by: David Gibson
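
As a rough illustration of what dropping EPOLLET means for the socket registration (a sketch under assumed names: the real tap.c code stores an epoll reference union in ev.data and may use a different flag set):

        /* Sketch: arm the connected qemu socket level-triggered.  EPOLLET is
         * simply dropped; the remaining flags here are illustrative only.
         */
        struct epoll_event ev = { 0 };

        ev.data.fd = c->fd_tap;
        ev.events = EPOLLIN | EPOLLRDHUP;
        if (epoll_ctl(c->epollfd, EPOLL_CTL_ADD, c->fd_tap, &ev) < 0)
                err_perror("epoll_ctl() on tap socket");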
On Fri, 26 Jul 2024 17:20:29 +1000
David Gibson
Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however.
It significantly decreases epoll_wait() overhead on sustained data transfers, because we can read multiple TAP_BUF_SIZE buffers at a time instead of just one.

I can check that now again with current QEMU and kernel versions, plus several fundamental changes in buffer handling, but I don't see a real reason why this shouldn't have changed meanwhile.

--
Stefano
On Fri, 26 Jul 2024 10:00:56 +0200
Stefano Brivio
On Fri, 26 Jul 2024 17:20:29 +1000 David Gibson
wrote: Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however.
It significantly decreases epoll_wait() overhead on sustained data transfers, because we can read multiple TAP_BUF_SIZE buffers at a time instead of just one.
I can check that now again with current QEMU and kernel versions, plus several fundamental changes in buffer handling, but I don't see a real reason why this shouldn't have changed meanwhile.
Surprisingly, this doesn't affect throughput at all on my setup, no matter the packet size. I didn't check with perf(1), though, and we probably should give all the recent substantial changes a pass with it (it's been a while since I last checked where overhead typically is...).

--
Stefano
On Fri, Jul 26, 2024 at 10:00:56AM +0200, Stefano Brivio wrote:
On Fri, 26 Jul 2024 17:20:29 +1000 David Gibson
wrote: Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however.
It significantly decreases epoll_wait() overhead on sustained data transfers, because we can read multiple TAP_BUF_SIZE buffers at a time instead of just one.
That's a reason to keep the loop, but not EPOLLET itself, AFAICT. I'd be happy enough to put the loop back in as an optimization (although, I'd prefer to avoid the goto).
I can check that now again with current QEMU and kernel versions, plus several fundamental changes in buffer handling, but I don't see a real reason why this shouldn't have changed meanwhile.
--
David Gibson
On Fri, 26 Jul 2024 22:12:27 +1000
David Gibson
On Fri, Jul 26, 2024 at 10:00:56AM +0200, Stefano Brivio wrote:
On Fri, 26 Jul 2024 17:20:29 +1000 David Gibson
wrote: Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however.
It significantly decreases epoll_wait() overhead on sustained data transfers, because we can read multiple TAP_BUF_SIZE buffers at a time instead of just one.
That's a reason to keep the loop, but not EPOLLET itself, AFAICT. I'd be happy enough to put the loop back in as an optimization (although, I'd prefer to avoid the goto).
True, we could actually have that loop back without EPOLLET.

But the reason why I added EPOLLET, despite the resulting complexity, was surely increased overhead without it.

I can't remember (and unfortunately I didn't write this in that commit message from 2021) exactly what that looked like, if we had spurious or more frequent wake-ups or what else. Maybe that was a side effect of something that's fixed or otherwise changed now, but still we should give this a pass with perf(1) before we try to optimise this again (if it even needs to be optimised, that is).

--
Stefano
On Fri, Jul 26, 2024 at 03:25:38PM +0200, Stefano Brivio wrote:
On Fri, 26 Jul 2024 22:12:27 +1000 David Gibson
wrote: On Fri, Jul 26, 2024 at 10:00:56AM +0200, Stefano Brivio wrote:
On Fri, 26 Jul 2024 17:20:29 +1000 David Gibson
wrote: Currently we set EPOLLET (edge trigger) on the epoll flags for the connected Qemu Unix socket. It's not clear that there's a reason for doing this: for TCP sockets we need to use EPOLLET, because we leave data in the socket buffers for our flow control handling. That consideration doesn't apply to the way we handle the qemu socket however.
It significantly decreases epoll_wait() overhead on sustained data transfers, because we can read multiple TAP_BUF_SIZE buffers at a time instead of just one.
That's a reason to keep the loop, but not EPOLLET itself, AFAICT. I'd be happy enough to put the loop back in as an optimization (although, I'd prefer to avoid the goto).
True, we could actually have that loop back without EPOLLET.
But the reason why I added EPOLLET, despite the resulting complexity, was surely increased overhead without it.
I can't remember (and unfortunately I didn't write this in that commit message from 2021) exactly what that looked like, if we had spurious or more frequent wake-ups or what else. Maybe that was a side effect of something that's fixed or otherwise changed now, but still we should give this a pass with perf(1) before we try to optimise this again (if it even needs to be optimised, that is).
I think we need to understand if and why that's still the case before putting this back in. I can see an obvious reason why the loop might reduce overhead, but not why the EPOLLET flag itself would. If anything I'd expect level-triggered events to more accurately give us wakeups only exactly when we need them.

Note also that even looping without EPOLLET does have its own cost here: it potentially allows a heavy burst of traffic from qemu to starve processing of events on other sockets.

--
David Gibson
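
If the multiple-reads-per-wakeup behaviour does turn out to matter, a bounded drain loop would recover it without EPOLLET and without the goto; a sketch, with the cap and the frame-processing helper purely hypothetical:

        /* Sketch only (not from this series): drain up to a few buffers per
         * level-triggered wakeup.  The cap of 8 and tap_passt_frames() are
         * hypothetical; c->fd_tap, pkt_buf and TAP_BUF_FILL are existing names.
         */
        static void tap_drain_sketch(struct ctx *c)
        {
                int i;

                for (i = 0; i < 8 /* hypothetical cap */; i++) {
                        ssize_t n = recv(c->fd_tap, pkt_buf, TAP_BUF_FILL,
                                         MSG_DONTWAIT);

                        if (n <= 0)
                                break;  /* EAGAIN/EOF: epoll will report again */
                        tap_passt_frames(c, pkt_buf, n);        /* hypothetical */
                }
        }

The cap trades a little extra epoll_wait() overhead for a bound on how long one guest can monopolise the event loop.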
The Qemu socket protocol consists of a 32-bit frame length in network (BE)
order, followed by the Ethernet frame itself. As far as I can tell,
frames can be any length, with no particular alignment requirement. This
means that although pkt_buf itself is aligned, if we have a frame of odd
length, frames after it will have their frame length at an unaligned
address.
Currently we load the frame length by just casting a char pointer to
(uint32_t *) and loading. Some platforms will generate a fatal trap on
such an unaligned load. Even if they don't, casting an incorrectly aligned
pointer to (uint32_t *) is undefined behaviour, strictly speaking.
Introduce a new helper to safely load a possibly unaligned value here. We
assume that the compiler is smart enough to optimize this into nothing on
platforms that provide performant unaligned loads. If that turns out not
to be the case, we can look at improvements then.
Signed-off-by: David Gibson
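
The helper itself isn't quoted in this thread; the usual way to write it is a memcpy()-based load, roughly as below (the name is a placeholder for whatever util.h actually gained, per the diffstat in the cover letter):

        /* Sketch of an unaligned-safe big-endian 32-bit load; needs
         * <stdint.h>, <string.h> and <arpa/inet.h>.  Placeholder name.
         */
        static inline uint32_t read_be32_unaligned(const void *p)
        {
                uint32_t v;

                memcpy(&v, p, sizeof(v));       /* no alignment requirement */
                return ntohl(v);
        }

On targets with cheap unaligned access, compilers typically reduce this to a single load plus byte swap, which is the "optimize this into nothing" case the commit message relies on.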
Because the Unix socket to qemu is a stream socket, we have no guarantee
of where the boundaries between recv() calls will lie. Typically they
will lie on frame boundaries, because that's how qemu will send them, but
we can't rely on it.
Currently we handle this case by detecting when we have received a partial
frame and performing a blocking recv() to get the remainder, and only then
processing the frames. Change it so instead we save the partial frame
persistently and include it as the first thing processed next time we
receive data from the socket. This handles a number of (unlikely) cases
which previously would not be dealt with correctly:
* If qemu sent a partial frame then waited some time before sending the
remainder, previously we could block here for an unacceptably long time
* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without
doing the partial frame handling, which would put us out of sync with
the stream from qemu
* If the blocking recv() only received some of the remainder of the
frame, not all of it, we'd return leaving us out of sync with the
stream again
Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This
is probably acceptable because it's an unlikely case in practice. If
necessary we could mitigate this by using a true ring buffer.
Signed-off-by: David Gibson
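
A rough sketch of the carry-over logic described above (buffer and variable names are illustrative, the bounds check from 2/5 is elided, and the unaligned-load placeholder from the 4/5 sketch is reused; tap_add_packet()'s signature is taken from the hunk context in the earlier diffs):

        /* Sketch only: keep a trailing partial frame for the next wakeup.
         * frame_buf/partial_len are illustrative names, not the tap.c ones.
         */
        static char frame_buf[256 * 1024];
        static size_t partial_len;      /* incomplete frame bytes from last recv() */

        static void tap_qemu_recv_sketch(struct ctx *c)
        {
                ssize_t n = recv(c->fd_tap, frame_buf + partial_len,
                                 sizeof(frame_buf) - partial_len, MSG_DONTWAIT);
                size_t avail, pos = 0;

                if (n <= 0)
                        return;                 /* errors/EOF handled as in 1/5 */
                avail = partial_len + n;

                while (avail - pos >= sizeof(uint32_t)) {
                        uint32_t l2len = read_be32_unaligned(frame_buf + pos);

                        /* frame length sanity check from 2/5 elided here */
                        if (avail - pos < sizeof(uint32_t) + l2len)
                                break;          /* partial frame: keep for later */
                        tap_add_packet(c, l2len, frame_buf + pos + sizeof(uint32_t));
                        pos += sizeof(uint32_t) + l2len;
                }

                partial_len = avail - pos;      /* at most one frame's worth */
                memmove(frame_buf, frame_buf + pos, partial_len);
        }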
On Fri, 26 Jul 2024 17:20:31 +1000
David Gibson
Because the Unix socket to qemu is a stream socket, we have no guarantee of where the boundaries between recv() calls will lie. Typically they will lie on frame boundaries, because that's how qemu will send them, but we can't rely on it.
Currently we handle this case by detecting when we have received a partial frame and performing a blocking recv() to get the remainder, and only then processing the frames. Change it so instead we save the partial frame persistently and include it as the first thing processed next time we receive data from the socket. This handles a number of (unlikely) cases which previously would not be dealt with correctly:
* If qemu sent a partial frame then waited some time before sending the remainder, previously we could block here for an unacceptably long time

* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without doing the partial frame handling, which would put us out of sync with the stream from qemu

* If the blocking recv() only received some of the remainder of the frame, not all of it, we'd return leaving us out of sync with the stream again
Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This is probably acceptable because it's an unlikely case in practice. If necessary we could mitigate this by using a true ring buffer.
I don't think that that memmove() is a problem if we have a single recv(), even if it happens to be one memmove() for every recv() (guest filling up the buffer, common in throughput tests and bulk transfers), because it's very small in relative terms anyway.

I think the ringbuffer would be worth it with multiple recv() calls per epoll wakeup, with EPOLLET.

--
Stefano
On Fri, Jul 26, 2024 at 01:39:13PM +0200, Stefano Brivio wrote:
On Fri, 26 Jul 2024 17:20:31 +1000 David Gibson
wrote: Because the Unix socket to qemu is a stream socket, we have no guarantee of where the boundaries between recv() calls will lie. Typically they will lie on frame boundaries, because that's how qemu will send them, but we can't rely on it.
Currently we handle this case by detecting when we have received a partial frame and performing a blocking recv() to get the remainder, and only then processing the frames. Change it so instead we save the partial frame persistently and include it as the first thing processed next time we receive data from the socket. This handles a number of (unlikely) cases which previously would not be dealt with correctly:
* If qemu sent a partial frame then waited some time before sending the remainder, previously we could block here for an unacceptably long time

* If qemu sent a tiny partial frame (< 4 bytes) we'd leave the loop without doing the partial frame handling, which would put us out of sync with the stream from qemu

* If the blocking recv() only received some of the remainder of the frame, not all of it, we'd return leaving us out of sync with the stream again
Caveat: This could memmove() a moderate amount of data (ETH_MAX_MTU). This is probably acceptable because it's an unlikely case in practice. If necessary we could mitigate this by using a true ring buffer.
I don't think that that memmove() is a problem if we have a single recv(), even if it happens to be one memmove() for every recv() (guest filling up the buffer, common in throughput tests and bulk transfers), because it's very small in relative terms anyway.
I think the ringbuffer would be worth it with multiple recv() calls per epoll wakeup, with EPOLLET.
So first, as noted on the earlier patch, I don't think multiple recv()s per wakeup requires EPOLLET, though the reverse is true.

In any case, AFAICT the proportion of memmove()s to data received would not vary regardless of whether we do multiple recv()s per wakeup or the same number of recv()s split across multiple wakeups.

Of course, if the recv()s line up with frame boundaries, as we expect, then it doesn't matter anyway, since we won't get partial frames and we won't need memmove()s.

--
David Gibson
On Fri, 26 Jul 2024 17:20:26 +1000
David Gibson
A long time ago Matej Hrica pointed out a possible buffer overrun when receiving data from the qemu socket. Stefano recently proposed a fix for this, but I don't think it's quite right. This series is a different approach to fixing that problem and a few adjacent ones.
David Gibson (5):
  tap: Better report errors receiving from QEMU socket
  tap: Don't attempt to carry on if we get a bad frame length from qemu
  tap: Don't use EPOLLET on Qemu sockets
  tap: Correctly handle frames of odd length
  tap: Improve handling of partially received frames on qemu socket
Applied. I'm a bit nervous about 3/5 but anyway the series as a whole is clearly better than the alternative.

--
Stefano