[PATCH 00/11] Minor assorted fixes, mostly logging and tests
Here are a bunch of mostly unrelated assorted fixes for small issues I found in the past few weeks but didn't really find time to report. Stefano Brivio (11): tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down tcp: Change SO_PEEK_OFF support message to debug() log: Drop newlines in the middle of the perror()-like messages log: Fix sub-second part in relative log time calculation log: Initialise timestamp for relative log time also if we use a log file test: Fix memory/passt tests, --netns-only is not a valid option for passt test: Update names of symbols and slabinfo entries test: iperf3 3.16 introduces multiple threads, drop our own implementation of that tap: Exit if we fail to bind a UNIX domain socket with explicit path tap: Discard guest data on length descriptor mismatch conf: Accept addresses enclosed by square brackets in port forwarding specifiers conf.c | 24 ++++++++---- flow.c | 2 +- log.c | 76 +++++++++++++++++++++----------------- log.h | 19 +++++----- passt.c | 2 + tap.c | 18 ++++++--- tcp.c | 2 +- test/lib/perf_report | 2 +- test/lib/test | 31 +++++----------- test/memory/passt | 35 ++++++++---------- test/passt.mem.mbuto | 9 +---- test/perf/passt_tcp | 39 ++++++++++---------- test/perf/passt_udp | 55 ++++++++++++++-------------- test/perf/pasta_tcp | 58 ++++++++++++++--------------- test/perf/pasta_udp | 87 ++++++++++++++++++++++---------------------- 15 files changed, 229 insertions(+), 230 deletions(-) -- 2.43.0
If we start pasta with some ports forwarded, but no --config-net, say:
$ ./pasta -u 10001
and then use a local, non-loopback address to send traffic to that
port, say:
$ socat -u FILE:test UDP4:192.0.2.1:10001
pasta writes to the tap file descriptor, but if the interface is down,
we get EIO and terminate.
By itself, what I'm doing in this case is not very useful (I simply
forgot to pass --config-net), but if we happen to have a DHCP client
in the network namespace, the interface might still be down while
somebody tries to send traffic to it, and exiting in that case is not
really helpful.
Signed-off-by: Stefano Brivio
On Wed, Jul 24, 2024 at 11:50:07PM +0200, Stefano Brivio wrote:
If we start pasta with some ports forwarded, but no --config-net, say:
$ ./pasta -u 10001
and then use a local, non-loopback address to send traffic to that port, say:
$ socat -u FILE:test UDP4:192.0.2.1:10001
pasta writes to the tap file descriptor, but if the interface is down, we get EIO and terminate.
By itself, what I'm doing in this case is not very useful (I simply forgot to pass --config-net), but if we happen to have a DHCP client in the network namespace, the interface might still be down while somebody tries to send traffic to it, and exiting in that case is not really helpful.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- tap.c | 1 + 1 file changed, 1 insertion(+)
diff --git a/tap.c b/tap.c index 32a7b09..6930ad8 100644 --- a/tap.c +++ b/tap.c @@ -324,6 +324,7 @@ static size_t tap_send_frames_pasta(const struct ctx *c, case EINTR: case ENOBUFS: case ENOSPC: + case EIO: /* interface down? */ break; default: die("Write error on tap device, exiting");
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
This:
$ ./pasta
SO_PEEK_OFF not supported
#
is a bit annoying, and might trick users who face other issues into
thinking that SO_PEEK_OFF not being supported on a given kernel is
an actual issue.
Even if SO_PEEK_OFF is supported by the kernel, that would be the
only message displayed there, with default options, which looks a bit
out of context.
Switch that to debug(): now that Podman users can pass --debug too, we
can find out quickly if it's supported or not, if SO_PEEK_OFF usage is
suspected of causing any issue.
Signed-off-by: Stefano Brivio
On Wed, Jul 24, 2024 at 11:50:08PM +0200, Stefano Brivio wrote:
This:
$ ./pasta SO_PEEK_OFF not supported #
is a bit annoying, and might trick users who face other issues into thinking that SO_PEEK_OFF not being supported on a given kernel is an actual issue.
Even if SO_PEEK_OFF is supported by the kernel, that would be the only message displayed there, with default options, which looks a bit out of context.
Switch that to debug(): now that Podman users can pass --debug too, we can find out quickly if it's supported or not, if SO_PEEK_OFF usage is suspected of causing any issue.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tcp.c b/tcp.c index c031f13..c0820ce 100644 --- a/tcp.c +++ b/tcp.c @@ -2524,7 +2524,7 @@ int tcp_init(struct ctx *c)
peek_offset_cap = (!c->ifi4 || tcp_probe_peek_offset_cap(AF_INET)) && (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6)); - info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); + debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");
return 0; }
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Calling vlogmsg() twice from logmsg_perror() results in this beauty:
$ ./pasta -i foo
Invalid interface name foo
: No such device
because the first part of the message, corresponding to the first
call, doesn't end with a newline, and vlogmsg() adds it.
Given that we can't easily append an argument (error description) to
a variadic list, add a 'newline' parameter to all the functions that
currently add a newline if missing, and disable that on the first call
to vlogmsg() from logmsg_perror(). Not very pretty but I can't think
of any solution that's less messy than this.
Signed-off-by: Stefano Brivio
On Wed, Jul 24, 2024 at 11:50:09PM +0200, Stefano Brivio wrote:
Calling vlogmsg() twice from logmsg_perror() results in this beauty:
$ ./pasta -i foo Invalid interface name foo : No such device
because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it.
Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this.
Signed-off-by: Stefano Brivio
I think my personal inclination would be to rename all the
lowest-level functions slightly and remove the newline adding logic
unconditionally. The create wrappers under the old name which add the
"\n". I think that can be done in an easy macro, since the "\n" can
be constant string appended to the format string. Just the special
paths that need to suppress the newline would call the low level "no
newline" variants.
But, I don't actually care that much so
Reviewed-by: David Gibson
--- flow.c | 2 +- log.c | 34 ++++++++++++++++++++++------------ log.h | 18 +++++++++--------- 3 files changed, 32 insertions(+), 22 deletions(-)
diff --git a/flow.c b/flow.c index d7d548d..bd5fa2b 100644 --- a/flow.c +++ b/flow.c @@ -279,7 +279,7 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) else type_or_state = FLOW_TYPE(f);
- logmsg(pri, "Flow %u (%s): %s", flow_idx(f), type_or_state, msg); + logmsg(true, pri, "Flow %u (%s): %s", flow_idx(f), type_or_state, msg); }
/** diff --git a/log.c b/log.c index 9ed6bc1..54483e7 100644 --- a/log.c +++ b/log.c @@ -46,7 +46,14 @@ int log_trace; /* --trace mode enabled */ bool log_conf_parsed; /* Logging options already parsed */ bool log_runtime; /* Daemonised, or ready in foreground */
-void vlogmsg(int pri, const char *format, va_list ap) +/** + * vlogmsg() - Print or send messages to log or output files as configured + * @newline: Append newline at the end of the message, if missing + * @pri: Facility and level map, same as priority for vsyslog() + * @format: Message + * @ap: Variable argument list + */ +void vlogmsg(bool newline, int pri, const char *format, va_list ap) { bool debug_print = (log_mask & LOG_MASK(LOG_DEBUG)) && log_file == -1; struct timespec tp; @@ -63,9 +70,9 @@ void vlogmsg(int pri, const char *format, va_list ap)
va_copy(ap2, ap); /* Don't clobber ap, we need it again */ if (log_file != -1) - logfile_write(pri, format, ap2); + logfile_write(newline, pri, format, ap2); else if (!(log_mask & LOG_MASK(LOG_DEBUG))) - passt_vsyslog(pri, format, ap2); + passt_vsyslog(newline, pri, format, ap2);
va_end(ap2); } @@ -73,22 +80,23 @@ void vlogmsg(int pri, const char *format, va_list ap) if (debug_print || !log_conf_parsed || (!log_runtime && (log_mask & LOG_MASK(LOG_PRI(pri))))) { (void)vfprintf(stderr, format, ap); - if (format[strlen(format)] != '\n') + if (newline && format[strlen(format)] != '\n') fprintf(stderr, "\n"); } }
/** * logmsg() - vlogmsg() wrapper for variable argument lists + * @newline: Append newline at the end of the message, if missing * @pri: Facility and level map, same as priority for vsyslog() * @format: Message */ -void logmsg(int pri, const char *format, ...) +void logmsg(bool newline, int pri, const char *format, ...) { va_list ap;
va_start(ap, format); - vlogmsg(pri, format, ap); + vlogmsg(newline, pri, format, ap); va_end(ap); }
@@ -103,10 +111,10 @@ void logmsg_perror(int pri, const char *format, ...) va_list ap;
va_start(ap, format); - vlogmsg(pri, format, ap); + vlogmsg(false, pri, format, ap); va_end(ap);
- logmsg(pri, ": %s", strerror(errno_copy)); + logmsg(true, pri, ": %s", strerror(errno_copy)); }
/* Prefixes for log file messages, indexed by priority */ @@ -174,11 +182,12 @@ void __setlogmask(int mask)
/** * passt_vsyslog() - vsyslog() implementation not using heap memory + * @newline: Append newline at the end of the message, if missing * @pri: Facility and level map, same as priority for vsyslog() * @format: Same as vsyslog() format * @ap: Same as vsyslog() ap */ -void passt_vsyslog(int pri, const char *format, va_list ap) +void passt_vsyslog(bool newline, int pri, const char *format, va_list ap) { char buf[BUFSIZ]; int n; @@ -188,7 +197,7 @@ void passt_vsyslog(int pri, const char *format, va_list ap)
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
- if (format[strlen(format)] != '\n') + if (newline && format[strlen(format)] != '\n') n += snprintf(buf + n, BUFSIZ - n, "\n");
if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && !log_runtime) @@ -360,11 +369,12 @@ static int logfile_rotate(int fd, const struct timespec *now)
/** * logfile_write() - Write entry to log file, trigger rotation if full + * @newline: Append newline at the end of the message, if missing * @pri: Facility and level map, same as priority for vsyslog() * @format: Same as vsyslog() format * @ap: Same as vsyslog() ap */ -void logfile_write(int pri, const char *format, va_list ap) +void logfile_write(bool newline, int pri, const char *format, va_list ap) { struct timespec now; char buf[BUFSIZ]; @@ -380,7 +390,7 @@ void logfile_write(int pri, const char *format, va_list ap)
n += vsnprintf(buf + n, BUFSIZ - n, format, ap);
- if (format[strlen(format)] != '\n') + if (newline && format[strlen(format)] != '\n') n += snprintf(buf + n, BUFSIZ - n, "\n");
if ((log_written + n >= log_size) && logfile_rotate(log_file, &now)) diff --git a/log.h b/log.h index 6f34e8d..51ddafa 100644 --- a/log.h +++ b/log.h @@ -13,16 +13,16 @@ #define LOGFILE_CUT_RATIO 30 /* When full, cut ~30% size */ #define LOGFILE_SIZE_MIN (5UL * MAX(BUFSIZ, PAGE_SIZE))
-void vlogmsg(int pri, const char *format, va_list ap); -void logmsg(int pri, const char *format, ...) - __attribute__((format(printf, 2, 3))); +void vlogmsg(bool newline, int pri, const char *format, va_list ap); +void logmsg(bool newline, int pri, const char *format, ...) + __attribute__((format(printf, 3, 4))); void logmsg_perror(int pri, const char *format, ...) __attribute__((format(printf, 2, 3)));
-#define err(...) logmsg( LOG_ERR, __VA_ARGS__) -#define warn(...) logmsg( LOG_WARNING, __VA_ARGS__) -#define info(...) logmsg( LOG_INFO, __VA_ARGS__) -#define debug(...) logmsg( LOG_DEBUG, __VA_ARGS__) +#define err(...) logmsg(true, LOG_ERR, __VA_ARGS__) +#define warn(...) logmsg(true, LOG_WARNING, __VA_ARGS__) +#define info(...) logmsg(true, LOG_INFO, __VA_ARGS__) +#define debug(...) logmsg(true, LOG_DEBUG, __VA_ARGS__)
#define err_perror(...) logmsg_perror( LOG_ERR, __VA_ARGS__) #define warn_perror(...) logmsg_perror( LOG_WARNING, __VA_ARGS__) @@ -54,8 +54,8 @@ void trace_init(int enable);
void __openlog(const char *ident, int option, int facility); void logfile_init(const char *name, const char *path, size_t size); -void passt_vsyslog(int pri, const char *format, va_list ap); -void logfile_write(int pri, const char *format, va_list ap); +void passt_vsyslog(bool newline, int pri, const char *format, va_list ap); +void logfile_write(bool newline, int pri, const char *format, va_list ap); void __setlogmask(int mask);
#endif /* LOG_H */
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 25 Jul 2024 13:26:47 +1000
David Gibson
On Wed, Jul 24, 2024 at 11:50:09PM +0200, Stefano Brivio wrote:
Calling vlogmsg() twice from logmsg_perror() results in this beauty:
$ ./pasta -i foo Invalid interface name foo : No such device
because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it.
Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this.
Signed-off-by: Stefano Brivio
I think my personal inclination would be to rename all the lowest-level functions slightly and remove the newline adding logic unconditionally. The create wrappers under the old name which add the "\n". I think that can be done in an easy macro, since the "\n" can be constant string appended to the format string. Just the special paths that need to suppress the newline would call the low level "no newline" variants.
I gave it a try, but the problem is that the "easy macro" needs to be conditional, depending on whether the newline is there or not: we don't want to add a newline in case somebody already added it by mistake (or habit). That could probably be done as well with an intermediate function, but it's getting a bit too complicated (at least for me right now).
But, I don't actually care that much so
Reviewed-by: David Gibson
-- Stefano
On Thu, Jul 25, 2024 at 01:27:09PM +0200, Stefano Brivio wrote:
On Thu, 25 Jul 2024 13:26:47 +1000 David Gibson
wrote: On Wed, Jul 24, 2024 at 11:50:09PM +0200, Stefano Brivio wrote:
Calling vlogmsg() twice from logmsg_perror() results in this beauty:
$ ./pasta -i foo Invalid interface name foo : No such device
because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it.
Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this.
Signed-off-by: Stefano Brivio
I think my personal inclination would be to rename all the lowest-level functions slightly and remove the newline adding logic unconditionally. The create wrappers under the old name which add the "\n". I think that can be done in an easy macro, since the "\n" can be constant string appended to the format string. Just the special paths that need to suppress the newline would call the low level "no newline" variants.
I gave it a try, but the problem is that the "easy macro" needs to be conditional, depending on whether the newline is there or not: we don't want to add a newline in case somebody already added it by mistake (or habit).
Well, I mean we could just add the newline and consider it a bug if someone puts one in themselves.
That could probably be done as well with an intermediate function, but it's getting a bit too complicated (at least for me right now).
But, I don't actually care that much so
Reviewed-by: David Gibson
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
For some reason, in commit 01efc71ddd25 ("log, conf: Add support for
logging to file"), I added calculations for relative logging
timestamps using the difference for the seconds part only, not for
accounting for the fractional part.
Fix that by storing the initial timestamp, log_start, as a timespec
struct, and by calculating differences of both seconds and nanoseconds
from the starting time. Do this in a macro as we need the same
calculation and format in a few places.
Signed-off-by: Stefano Brivio
On Wed, Jul 24, 2024 at 11:50:10PM +0200, Stefano Brivio wrote:
For some reason, in commit 01efc71ddd25 ("log, conf: Add support for logging to file"), I added calculations for relative logging timestamps using the difference for the seconds part only, not for accounting for the fractional part.
Fix that by storing the initial timestamp, log_start, as a timespec struct, and by calculating differences of both seconds and nanoseconds from the starting time. Do this in a macro as we need the same calculation and format in a few places.
Signed-off-by: Stefano Brivio
--- log.c | 42 ++++++++++++++++++++++-------------------- 1 file changed, 22 insertions(+), 20 deletions(-) diff --git a/log.c b/log.c index 54483e7..f57a54f 100644 --- a/log.c +++ b/log.c @@ -40,12 +40,21 @@ static size_t log_written; /* Currently used bytes in log file */ static size_t log_cut_size; /* Bytes to cut at start on rotation */ static char log_header[BUFSIZ]; /* File header, written back on cuts */
-static time_t log_start; /* Start timestamp */ +static struct timespec log_start; /* Start timestamp */
int log_trace; /* --trace mode enabled */ bool log_conf_parsed; /* Logging options already parsed */ bool log_runtime; /* Daemonised, or ready in foreground */
+/** + * logtime_fmt_and_arg() - Build format and arguments to print relative log time + * @x: Current timestamp + */ +#define logtime_fmt_and_arg(x) \ + "%lli.%04lli", \ + ((long long int)(x)->tv_sec - log_start.tv_sec), \ + (((long long int)(x)->tv_nsec - log_start.tv_nsec) / (100L * 1000))
This doesn't look right. If x->tv_nsec - log_start.tv_nsec is negative it will produce weird results. Instead you need more complex logic to carry a sufficient difference in the nsec over into the seconds difference. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 25 Jul 2024 13:32:52 +1000
David Gibson
On Wed, Jul 24, 2024 at 11:50:10PM +0200, Stefano Brivio wrote:
[...]
+/** + * logtime_fmt_and_arg() - Build format and arguments to print relative log time + * @x: Current timestamp + */ +#define logtime_fmt_and_arg(x) \ + "%lli.%04lli", \ + ((long long int)(x)->tv_sec - log_start.tv_sec), \ + (((long long int)(x)->tv_nsec - log_start.tv_nsec) / (100L * 1000))
This doesn't look right. If x->tv_nsec - log_start.tv_nsec is negative it will produce weird results. Instead you need more complex logic to carry a sufficient difference in the nsec over into the seconds difference.
Oh, oops, of course. I was hoping to recycle this from the Linux kernel, where we have timespec_sub(), but that implementation looks a bit too complicated for my taste. I'm adding something simpler now. -- Stefano
...not just for debug messages. Otherwise, timestamps in the log file
are consistent but the starting point is not zero.
Do this right away as we enter main(), so that the resulting
timestamps are as closely as possible relative to when we start.
Signed-off-by: Stefano Brivio
On Wed, Jul 24, 2024 at 11:50:11PM +0200, Stefano Brivio wrote:
...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero.
Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- log.c | 4 +--- log.h | 1 + passt.c | 2 ++ 3 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/log.c b/log.c index f57a54f..5f24c16 100644 --- a/log.c +++ b/log.c @@ -40,7 +40,7 @@ static size_t log_written; /* Currently used bytes in log file */ static size_t log_cut_size; /* Bytes to cut at start on rotation */ static char log_header[BUFSIZ]; /* File header, written back on cuts */
-static struct timespec log_start; /* Start timestamp */ +struct timespec log_start; /* Start timestamp */
int log_trace; /* --trace mode enabled */ bool log_conf_parsed; /* Logging options already parsed */ @@ -154,8 +154,6 @@ void __openlog(const char *ident, int option, int facility) { (void)option;
- clock_gettime(CLOCK_REALTIME, &log_start); - if (log_sock < 0) { struct sockaddr_un a = { .sun_family = AF_UNIX, };
diff --git a/log.h b/log.h index 51ddafa..e03199c 100644 --- a/log.h +++ b/log.h @@ -44,6 +44,7 @@ void logmsg_perror(int pri, const char *format, ...) extern int log_trace; extern bool log_conf_parsed; extern bool log_runtime; +extern struct timespec log_start;
void trace_init(int enable); #define trace(...) \ diff --git a/passt.c b/passt.c index eed74ec..72ad704 100644 --- a/passt.c +++ b/passt.c @@ -207,6 +207,8 @@ int main(int argc, char **argv) struct timespec now; struct sigaction sa;
+ clock_gettime(CLOCK_REALTIME, &log_start); + arch_avx2_exec(argv);
isolate_initial();
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 25 Jul 2024 13:35:25 +1000
David Gibson
On Wed, Jul 24, 2024 at 11:50:11PM +0200, Stefano Brivio wrote:
...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero.
Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
It does occur to me, though... when we're using relative timestamps, should we be using the monotonic clock instead of the realtime clock?
Right, yes, I was under the impression that we could risk mixing those with packet capture timestamps (where we want the realtime clock), but actually it's never the case. I'll post another patch for that. -- Stefano
This used to work on my setup as I kept reusing an old mbuto
(initramfs) image, but since commit 65923ba79877 ("conf: Accept
duplicate and conflicting options, the last one wins"), --netns-only
is, as originally intended, a pasta-only option.
I had used --netns-only, here, to prevent passt from trying to detach
its own user namespace, which is not permitted as we're in a chroot,
see unshare(2). In turn, we need the chroot because passt can't pivot
root directly into its own empty filesystem using an initramfs.
Use switch_root into the tmpfs mountpoint instead of chroot, so that
we can still detach user namespaces.
Note that in the mbuto images, we can't switch to nobody as we have
no password entries at all, so we need to detach a further user
namespace before starting passt, to trick passt into running as UID
0.
Given the new sequence, it's now more convenient to directly switch
to a detached network namespace as well, which means we need to move
the initialisation of the dummy network from the init script into the
test script.
Reported-by: David Gibson
On Wed, Jul 24, 2024 at 11:50:12PM +0200, Stefano Brivio wrote:
This used to work on my setup as I kept reusing an old mbuto (initramfs) image, but since commit 65923ba79877 ("conf: Accept duplicate and conflicting options, the last one wins"), --netns-only is, as originally intended, a pasta-only option.
I had used --netns-only, here, to prevent passt from trying to detach its own user namespace, which is not permitted as we're in a chroot, see unshare(2). In turn, we need the chroot because passt can't pivot root directly into its own empty filesystem using an initramfs.
Use switch_root into the tmpfs mountpoint instead of chroot, so that we can still detach user namespaces.
Note that in the mbuto images, we can't switch to nobody as we have no password entries at all, so we need to detach a further user namespace before starting passt, to trick passt into running as UID 0.
Given the new sequence, it's now more convenient to directly switch to a detached network namespace as well, which means we need to move the initialisation of the dummy network from the init script into the test script.
Reported-by: David Gibson
Signed-off-by: Stefano Brivio
Excellent, I can run these tests again.
Tested-by: David Gibson
--- test/memory/passt | 13 ++++++++++--- test/passt.mem.mbuto | 9 +-------- 2 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/test/memory/passt b/test/memory/passt index 1193af8..bf78c8f 100644 --- a/test/memory/passt +++ b/test/memory/passt @@ -44,7 +44,7 @@ endef def start_stop_diff guest sed /proc/slabinfo -ne 's/^\([^ ]* *[^ ]* *[^ ]* *[^ ]*\).*/\\\1/p' > /tmp/slabinfo.before guest cat /proc/meminfo > /tmp/meminfo.before -guest /bin/passt.avx2 -l /tmp/log -s /tmp/sock -P /tmp/pid __OPTS__ --netns-only +guest /bin/passt.avx2 -l /tmp/log -s /tmp/sock -P /tmp/pid __OPTS__ sleep 2 guest cat /proc/meminfo > /tmp/meminfo.after guest sed /proc/slabinfo -ne 's/^\([^ ]* *[^ ]* *[^ ]* *[^ ]*\).*/\\\1/p' > /tmp/slabinfo.after @@ -78,9 +78,16 @@ guest mount -o bind /proc /test/proc guest mount -o bind /dev /test/dev guest cp -Lr /bin /lib /lib64 /usr /sbin /test/
+guest exec switch_root /test /bin/sh + guest ulimit -Hn 300000 -guest unshare -rUm -R /test -guest chroot . +guest unshare -rUn +guest ip link add eth0 type dummy +guest ip link set eth0 up +guest ip address add 192.0.2.2/24 dev eth0 +guest ip address add 2001:db8::2/64 dev eth0 +guest ip route add default via 192.0.2.1 +guest ip -6 route add default via 2001:db8::1 dev eth0
guest meminfo_size() { grep "^$2:" $1 | tr -s ' ' | cut -f2 -d ' '; } guest meminfo_diff() { echo $(( $(meminfo_size $2 $3) - $(meminfo_size $1 $3) )); } diff --git a/test/passt.mem.mbuto b/test/passt.mem.mbuto index 56f5139..532eae0 100755 --- a/test/passt.mem.mbuto +++ b/test/passt.mem.mbuto @@ -12,7 +12,7 @@
PROGS="${PROGS:-ash,dash,bash chmod ip mount insmod mkdir ln cat chmod modprobe grep mknod sed chown sleep bc ls ps mount unshare chroot cp kill diff - head tail sort tr tee cut nm which}" + head tail sort tr tee cut nm which switch_root}"
KMODS="${KMODS:- dummy}"
@@ -29,13 +29,6 @@ COPIES="${COPIES} ../passt.avx2,/bin/passt.avx2" FIXUP="${FIXUP}"' ln -s /bin /usr/bin chmod 777 /tmp -ip link add eth0 type dummy -ip link set eth0 up -ip address add 192.0.2.2/24 dev eth0 -ip address add 2001:db8::2/64 dev eth0 -ip route add default via 192.0.2.1 -ip -6 route add default via 2001:db8::1 dev eth0 -sleep 2 sh +m '
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Differences in allocated Acpi-Parse entries are gone (at least) since
the 6.1 Linux kernel series. I should run this on a 6.10 kernel,
eventually, and adjust things further, as needed.
Userspace symbols are also fairly different now: show whatever is more
than 1 MiB at the moment.
Signed-off-by: Stefano Brivio
On Wed, Jul 24, 2024 at 11:50:13PM +0200, Stefano Brivio wrote:
Differences in allocated Acpi-Parse entries are gone (at least) since the 6.1 Linux kernel series. I should run this on a 6.10 kernel, eventually, and adjust things further, as needed.
Userspace symbols are also fairly different now: show whatever is more than 1 MiB at the moment.
Signed-off-by: Stefano Brivio
And now I get more useful results here too.
Tested-by: David Gibson
--- test/memory/passt | 22 +++++----------------- 1 file changed, 5 insertions(+), 17 deletions(-)
diff --git a/test/memory/passt b/test/memory/passt index bf78c8f..7e45724 100644 --- a/test/memory/passt +++ b/test/memory/passt @@ -110,27 +110,17 @@ info th symbol MiB set WHAT tcp_buf_discard nm_row -set WHAT tcp6_l2_buf +set WHAT flowtab nm_row -set WHAT tcp4_l2_buf +set WHAT tcp6_payload nm_row -set WHAT tc +set WHAT tcp4_payload nm_row set WHAT pkt_buf nm_row -set WHAT udp_splice_map +set WHAT udp_payload nm_row -set WHAT udp6_l2_buf -nm_row -set WHAT udp4_l2_buf -nm_row -set WHAT udp_tap_map -nm_row -set WHAT icmp_id_map -nm_row -set WHAT udp_splice_buf -nm_row -set WHAT tc_hash +set WHAT flow_hashtab nm_row set WHAT pool_tap6_storage nm_row @@ -149,8 +139,6 @@ set WHAT pid slab_row set WHAT dentry slab_row -set WHAT Acpi-Parse -slab_row set WHAT kmalloc-64 slab_row set WHAT kmalloc-32
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Starting from iperf3 version 3.16, -P / --parallel spawns multiple
clients as separate threads, instead of multiple streams serviced by
the same thread.
So we can drop our lib/test implementation to spawn several iperf3
client and server processes and finally simplify things quite a bit.
Adjust number of threads and UDP sending bandwidth to values that seem
to be more or less matching previous throughput tests on my setup.
Signed-off-by: Stefano Brivio
On Wed, Jul 24, 2024 at 11:50:14PM +0200, Stefano Brivio wrote:
Starting from iperf3 version 3.16, -P / --parallel spawns multiple clients as separate threads, instead of multiple streams serviced by the same thread.
So we can drop our lib/test implementation to spawn several iperf3 client and server processes and finally simplify things quite a bit.
Adjust number of threads and UDP sending bandwidth to values that seem to be more or less matching previous throughput tests on my setup.
Signed-off-by: Stefano Brivio
Lovely.
Reviewed-by: David Gibson
In tap_sock_unix_open(), if we have a given path for the socket from
configuration, we don't need to loop over possible paths, so we exit
the loop on the first iteration, unconditionally.
But if we failed to bind() the socket to that explicit path, we should
exit, instead of continuing. Otherwise we'll pretend we're up and
running, but nobody can contact us, and this might be mildly confusing
for users.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474
Signed-off-by: Stefano Brivio
On Wed, Jul 24, 2024 at 11:50:15PM +0200, Stefano Brivio wrote:
In tap_sock_unix_open(), if we have a given path for the socket from configuration, we don't need to loop over possible paths, so we exit the loop on the first iteration, unconditionally.
But if we failed to bind() the socket to that explicit path, we should exit, instead of continuing. Otherwise we'll pretend we're up and running, but nobody can contact us, and this might be mildly confusing for users.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474 Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- tap.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/tap.c b/tap.c index 6930ad8..44bd444 100644 --- a/tap.c +++ b/tap.c @@ -1139,8 +1139,11 @@ int tap_sock_unix_open(char *sock_path) close(ex);
unlink(path); - if (!bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) || - *sock_path) + ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr)); + if (*sock_path && ret) + die_perror("Failed to bind UNIX domain socket"); + + if (!ret) break; }
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
This was reported by Matej a while ago, but we forgot to fix it. Even
if the hypervisor is necessarily trusted by passt, as it can in any
case terminate the guest or disrupt guest connectivity, it's a good
idea to be robust against possible issues.
Instead of resetting the connection to the hypervisor, just discard
the data we read with a single recv(), as we had a few cases where
QEMU would get the length descriptor wrong, in the past.
While at it, change l2len in tap_handler_passt() to uint32_t, as the
length descriptor is logically unsigned and 32-bit wide.
Reported-by: Matej Hrica
On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:
This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.
Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.
While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.
Reported-by: Matej Hrica
Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }
while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);
p += sizeof(uint32_t); n -= sizeof(uint32_t);
+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;
Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space. But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful. Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries. If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice. If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.
/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0); if ((n += rem) != l2len) return;
Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().
@@ -1028,8 +1031,7 @@ redo: /* Complete the partial read above before discarding a malformed * frame, otherwise the stream will be inconsistent. */ - if (l2len < (ssize_t)sizeof(struct ethhdr) || - l2len > (ssize_t)ETH_MAX_MTU) + if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) goto next; tap_add_packet(c, l2len, p);
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 25 Jul 2024 14:37:43 +1000
David Gibson
On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:
This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.
Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.
While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.
Reported-by: Matej Hrica
Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }
while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);
p += sizeof(uint32_t); n -= sizeof(uint32_t);
+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;
Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space.
Not really, we're just checking that the length descriptor fits the remaining buffer space. We're using this in the second recv() below, that's why it matters here.
But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful.
...that is, it has nothing to do with exceeding or not exceeding the buffer on recv(), that's already taken care of by the recv() call, implicitly.
Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries.
That might happen, of course, but it might also happen that the hypervisor sent us *one* corrupted buffer, and the next recv() will read consistent data.
If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice.
That's already checked below (l2len > ETH_MAX_MTU), and...
If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.
that's exactly what we do in that case (goto next).
/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0);
^^^^^^^^^^^^^^^^
This the reason why the check above is relevant.
if ((n += rem) != l2len) return;
Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().
Same as above, it depends on what failure you're assuming. If it's just one botched recv(), instead, we recv() again the next time and we recover. But yes, the first attempt should probably be to recv() the rest of the frame. I didn't implement this because it adds complexity and I think that, eventually, we should turn this into a proper ringbuffer anyway.
@@ -1028,8 +1031,7 @@ redo: /* Complete the partial read above before discarding a malformed * frame, otherwise the stream will be inconsistent. */ - if (l2len < (ssize_t)sizeof(struct ethhdr) || - l2len > (ssize_t)ETH_MAX_MTU) + if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) goto next; tap_add_packet(c, l2len, p);
-- Stefano
On Thu, Jul 25, 2024 at 11:15:03AM +0200, Stefano Brivio wrote:
On Thu, 25 Jul 2024 14:37:43 +1000 David Gibson
wrote: On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:
This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.
Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.
While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.
Reported-by: Matej Hrica
Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }
while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);
p += sizeof(uint32_t); n -= sizeof(uint32_t);
+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;
Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space.
Not really, we're just checking that the length descriptor fits the remaining buffer space. We're using this in the second recv() below, that's why it matters here.
But AFAICT, what we need to know is if the remainder of the frame fits in the buffer. That could be less than the length descriptor if we've already recv()ed part of a frame.
But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful.
...that is, it has nothing to do with exceeding or not exceeding the buffer on recv(), that's already taken care of by the recv() call, implicitly.
Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries.
That might happen, of course, but it might also happen that the hypervisor sent us *one* corrupted buffer, and the next recv() will read consistent data.
Well, sure, it's possible, but it doesn't seem particularly likely to me. AFAICT this is a stream which we need every length field to interpret properly. If we lose one, or it's corrupted, I think we're done for.
If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice.
That's already checked below (l2len > ETH_MAX_MTU), and...
Right. I wonder if it would make sense to do that earlier.
If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.
that's exactly what we do in that case (goto next).
Only for the case that the length is too long, but not *too* long. In particular it needs to fit in the buffer to even get there. If we sanity checked the frame length earlier we could use MSG_TRUNC to discard even a ludicrously large frame and still continue on to the next one.
/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0);
^^^^^^^^^^^^^^^^
This the reason why the check above is relevant.
Relevant, sure, but I still don't think it's right. Actually (TAP_BUF_BYTES - n) is an even stranger quantity than I initially thought. It's the total space of the buffer minus the current partial frame - counting *both* the stuff before our partial frame and after it. I think instead we need to check for (p + l2len > pkt_buf + TAP_BUF_BYTES).
if ((n += rem) != l2len) return;
Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().
Same as above, it depends on what failure you're assuming. If it's just one botched recv(), instead, we recv() again the next time and we recover.
Even if it's just one bad recv(), we still have no idea where we are w.r.t. frame boundaries, so I can't see any way we could recover.
But yes, the first attempt should probably be to recv() the rest of the frame. I didn't implement this because it adds complexity and I think that, eventually, we should turn this into a proper ringbuffer anyway.
@@ -1028,8 +1031,7 @@ redo: /* Complete the partial read above before discarding a malformed * frame, otherwise the stream will be inconsistent. */ - if (l2len < (ssize_t)sizeof(struct ethhdr) || - l2len > (ssize_t)ETH_MAX_MTU) + if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) goto next; tap_add_packet(c, l2len, p);
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 25 Jul 2024 20:23:55 +1000
David Gibson
On Thu, Jul 25, 2024 at 11:15:03AM +0200, Stefano Brivio wrote:
On Thu, 25 Jul 2024 14:37:43 +1000 David Gibson
wrote: On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:
This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.
Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.
While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.
Reported-by: Matej Hrica
Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }
while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);
p += sizeof(uint32_t); n -= sizeof(uint32_t);
+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;
Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space.
Not really, we're just checking that the length descriptor fits the remaining buffer space. We're using this in the second recv() below, that's why it matters here.
But AFAICT, what we need to know is if the remainder of the frame fits in the buffer.
It always does because of how TAP_BUF_BYTES and TAP_BUF_FILL are defined.
That could be less than the length descriptor if we've already recv()ed part of a frame.
But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful.
...that is, it has nothing to do with exceeding or not exceeding the buffer on recv(), that's already taken care of by the recv() call, implicitly.
Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries.
That might happen, of course, but it might also happen that the hypervisor sent us *one* corrupted buffer, and the next recv() will read consistent data.
Well, sure, it's possible, but it doesn't seem particularly likely to me. AFAICT this is a stream which we need every length field to interpret properly. If we lose one, or it's corrupted, I think we're done for.
In most cases, the content of one recv() corresponds to a given number of whole frames, so what we're doing by returning here is, practically speaking, similar to what you're suggesting below with MSG_TRUNC.
If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice.
That's already checked below (l2len > ETH_MAX_MTU), and...
Right. I wonder if it would make sense to do that earlier.
We can. But right now what I want to fix here is just that missing check, because it actually causes passt to crash (even though we assume a trusted hypervisor, so it's not really security relevant, but not nice either).
If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.
that's exactly what we do in that case (goto next).
Only for the case that the length is too long, but not *too* long. In particular it needs to fit in the buffer to even get there. If we sanity checked the frame length earlier we could use MSG_TRUNC to discard even a ludicrously large frame and still continue on to the next one.
We don't know if the length descriptor actually matches the length of the frame, though. If you have a way you think is more robust, would you mind sending a patch?
/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0);
^^^^^^^^^^^^^^^^
This the reason why the check above is relevant.
Relevant, sure, but I still don't think it's right. Actually (TAP_BUF_BYTES - n) is an even stranger quantity than I initially thought. It's the total space of the buffer minus the current partial frame - counting *both* the stuff before our partial frame and after it.
That should be enough to make sure we don't have bad side effects on this second recv(), but yes:
I think instead we need to check for (p + l2len > pkt_buf + TAP_BUF_BYTES).
...that is actually more accurate. I can fix this up in another version, unless you can think of a more comprehensive change that also gives us better possibilities to recover from a corrupted stream.
if ((n += rem) != l2len) return;
Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().
Same as above, it depends on what failure you're assuming. If it's just one botched recv(), instead, we recv() again the next time and we recover.
Even if it's just one bad recv(), we still have no idea where we are w.r.t. frame boundaries, so I can't see any way we could recover.
The idea is that the recv() is _usually_ big enough as to flush the buffer anyway.
But yes, the first attempt should probably be to recv() the rest of the frame. I didn't implement this because it adds complexity and I think that, eventually, we should turn this into a proper ringbuffer anyway.
@@ -1028,8 +1031,7 @@ redo: /* Complete the partial read above before discarding a malformed * frame, otherwise the stream will be inconsistent. */ - if (l2len < (ssize_t)sizeof(struct ethhdr) || - l2len > (ssize_t)ETH_MAX_MTU) + if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) goto next; tap_add_packet(c, l2len, p);
-- Stefano
On Thu, Jul 25, 2024 at 01:09:19PM +0200, Stefano Brivio wrote:
On Thu, 25 Jul 2024 20:23:55 +1000 David Gibson
wrote: On Thu, Jul 25, 2024 at 11:15:03AM +0200, Stefano Brivio wrote:
On Thu, 25 Jul 2024 14:37:43 +1000 David Gibson
wrote: On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:
This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.
Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.
While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.
Reported-by: Matej Hrica
Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }
while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);
p += sizeof(uint32_t); n -= sizeof(uint32_t);
+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;
Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space.
Not really, we're just checking that the length descriptor fits the remaining buffer space. We're using this in the second recv() below, that's why it matters here.
But AFAICT, what we need to know is if the remainder of the frame fits in the buffer.
It always does because of how TAP_BUF_BYTES and TAP_BUF_FILL are defined.
Assuming the frame is <= ETH_MAX_MTU bytes, which we haven't yet verified. But that's not really the point I was making. The key word above is *remainder*: we may have already read part of the frame, so comparing the *total* frame length against the remaining buffer space is incorrect.
That could be less than the length descriptor if we've already recv()ed part of a frame.
But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful.
...that is, it has nothing to do with exceeding or not exceeding the buffer on recv(), that's already taken care of by the recv() call, implicitly.
Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries.
That might happen, of course, but it might also happen that the hypervisor sent us *one* corrupted buffer, and the next recv() will read consistent data.
Well, sure, it's possible, but it doesn't seem particularly likely to me. AFAICT this is a stream which we need every length field to interpret properly. If we lose one, or it's corrupted, I think we're done for.
In most cases, the content of one recv() corresponds to a given number of whole frames, so what we're doing by returning here is, practically speaking, similar to what you're suggesting below with MSG_TRUNC.
Or, I guess more technically, the recv() boundaries will probably correspond to qemu's send() boundaries(), which are likely to correspond to the frame boundaries. But, I think I now see the root of our disagreement here. If the frame field lengths don't seem to make sense relative to the recv() boundaries, do we trust the former (my position) or the latter (your position, I think). Thinking about it like that, I don't think either position makes a lot of sense. If these mismatch something has already gone horribly wrong and we should probably just die(), or at least reset the tap socket.
If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice.
That's already checked below (l2len > ETH_MAX_MTU), and...
Right. I wonder if it would make sense to do that earlier.
We can. But right now what I want to fix here is just that missing check, because it actually causes passt to crash (even though we assume a trusted hypervisor, so it's not really security relevant, but not nice either).
Ok, fair enough.
If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.
that's exactly what we do in that case (goto next).
Only for the case that the length is too long, but not *too* long. In particular it needs to fit in the buffer to even get there. If we sanity checked the frame length earlier we could use MSG_TRUNC to discard even a ludicrously large frame and still continue on to the next one.
We don't know if the length descriptor actually matches the length of the frame, though. If you have a way you think is more robust, would you mind sending a patch?
It's more that if the length descriptor doesn't match the actual length of the frame, I think we're beyond saving. The recv() boundaries probably correspond to to frame boundaries, but that's not guaranteed by SOCK_STREAM semantics, so I don't think we should ever trust them _over_ the length fields we receive.
/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0);
^^^^^^^^^^^^^^^^
This the reason why the check above is relevant.
Relevant, sure, but I still don't think it's right. Actually (TAP_BUF_BYTES - n) is an even stranger quantity than I initially thought. It's the total space of the buffer minus the current partial frame - counting *both* the stuff before our partial frame and after it.
That should be enough to make sure we don't have bad side effects on this second recv(), but yes:
No, I don't think it is. Suppose we get exactly TAP_BUF_FILL bytes in the first recv(). (TAP_BUF_FILL-4) bytes of that is perfectly ordinary, valid frames, which we process. We've read the remaining 4 bytes as a field length, so: p == pkt_buf + TAP_BUF_FILL n == 0 Now suppose that last frame length field is bad, say l2len == 2*ETH_MAX_MTU The test you've suggested: if (l2len > (ssize_t)TAP_BUF_BYTES - n) will let us continue, but the recv() rem = recv(c->fd_tap, p + n, l2len - n, 0); will overrun pkt_buf + TAP_BUF_BYTES. That _probably_ won't happen, since the recv() is likely to end at a frame boundary instead, but again SOCK_STREAM semantics don't guarantee that. And even if the kernel does always preserve send()/recv() boundaries, that qemu might not transmit the frame length as a separate send() (maybe on a debugging / slow path).
I think instead we need to check for (p + l2len > pkt_buf + TAP_BUF_BYTES).
...that is actually more accurate. I can fix this up in another version, unless you can think of a more comprehensive change that also gives us better possibilities to recover from a corrupted stream.
So, as noted above, I don't think we should even attempt to recover from a corrupted stream. One could certainly design a stream protocol that allows for recovery from corrupted portions, but the qemu -net stream protocol isn't so designed.
if ((n += rem) != l2len) return;
Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().
Same as above, it depends on what failure you're assuming. If it's just one botched recv(), instead, we recv() again the next time and we recover.
Even if it's just one bad recv(), we still have no idea where we are w.r.t. frame boundaries, so I can't see any way we could recover.
The idea is that the recv() is _usually_ big enough as to flush the buffer anyway.
Hrm, not something I'd really like to rely on. So, I just realised there's another pre-existing problem here. AFAICT there's no guarantee that frame lengths are aligned - which means after an odd sized frame, we'd be making an unaligned u32 read for the frame length, which will crash on some platforms. I'll have a look at all the above today and see what I can come up with. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
Even though we don't use : as delimiter for the port, making square
brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6
literals. We want IPv6 addresses there, but some users might still
specify them out of habit.
Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for
IPv4 literals, but I had reports of users actually trying to use them
(they're accepted by many tools).
Allow square brackets for both IPv4 and IPv6 addresses, correct or
not, they're harmless anyway.
Signed-off-by: Stefano Brivio
On Wed, Jul 24, 2024 at 11:50:17PM +0200, Stefano Brivio wrote:
Even though we don't use : as delimiter for the port, making square brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6 literals. We want IPv6 addresses there, but some users might still specify them out of habit.
Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for IPv4 literals, but I had reports of users actually trying to use them (they're accepted by many tools).
Allow square brackets for both IPv4 and IPv6 addresses, correct or not, they're harmless anyway.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- conf.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/conf.c b/conf.c index 3cf9ed8..338742e 100644 --- a/conf.c +++ b/conf.c @@ -209,14 +209,24 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
}
- if (ifname == buf + 1) /* Interface without address */ + if (ifname == buf + 1) { /* Interface without address */ addr = NULL; - else if (inet_pton(AF_INET, buf, addr)) - af = AF_INET; - else if (inet_pton(AF_INET6, buf, addr)) - af = AF_INET6; - else - goto bad; + } else { + p = buf; + + /* Allow square brackets for IPv4 too for convenience */ + if (*p == '[' && p[strlen(p) - 1] == ']') { + p[strlen(p) - 1] = 0;
Nit: I think = '\0' would make the intention here clearer.
+ p++; + } + + if (inet_pton(AF_INET, p, addr)) + af = AF_INET; + else if (inet_pton(AF_INET6, p, addr)) + af = AF_INET6; + else + goto bad; + } } else { spec = buf;
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On Thu, 25 Jul 2024 14:39:49 +1000
David Gibson
On Wed, Jul 24, 2024 at 11:50:17PM +0200, Stefano Brivio wrote:
Even though we don't use : as delimiter for the port, making square brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6 literals. We want IPv6 addresses there, but some users might still specify them out of habit.
Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for IPv4 literals, but I had reports of users actually trying to use them (they're accepted by many tools).
Allow square brackets for both IPv4 and IPv6 addresses, correct or not, they're harmless anyway.
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- conf.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-)
diff --git a/conf.c b/conf.c index 3cf9ed8..338742e 100644 --- a/conf.c +++ b/conf.c @@ -209,14 +209,24 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
}
- if (ifname == buf + 1) /* Interface without address */ + if (ifname == buf + 1) { /* Interface without address */ addr = NULL; - else if (inet_pton(AF_INET, buf, addr)) - af = AF_INET; - else if (inet_pton(AF_INET6, buf, addr)) - af = AF_INET6; - else - goto bad; + } else { + p = buf; + + /* Allow square brackets for IPv4 too for convenience */ + if (*p == '[' && p[strlen(p) - 1] == ']') { + p[strlen(p) - 1] = 0;
Nit: I think = '\0' would make the intention here clearer.
Right, fixed on merge. -- Stefano
On Wed, 24 Jul 2024 23:50:06 +0200
Stefano Brivio
Here are a bunch of mostly unrelated assorted fixes for small issues I found in the past few weeks but didn't really find time to report.
Stefano Brivio (11): tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down tcp: Change SO_PEEK_OFF support message to debug() log: Drop newlines in the middle of the perror()-like messages log: Fix sub-second part in relative log time calculation log: Initialise timestamp for relative log time also if we use a log file test: Fix memory/passt tests, --netns-only is not a valid option for passt test: Update names of symbols and slabinfo entries test: iperf3 3.16 introduces multiple threads, drop our own implementation of that tap: Exit if we fail to bind a UNIX domain socket with explicit path tap: Discard guest data on length descriptor mismatch conf: Accept addresses enclosed by square brackets in port forwarding specifiers
I applied these except for 4/11, 5/11, and 10/11. I'll send a new version at least for 4/11 and 5/11 in a bit. -- Stefano
participants (2)
-
David Gibson
-
Stefano Brivio