[PATCH 00/11] Minor assorted fixes, mostly logging and tests

older
[PATCH] test: Fix spurious test...

Stefano Brivio

24 Jul 2024 24 Jul '24

11:50 p.m.

Here are a bunch of mostly unrelated assorted fixes for small issues I found in the past few weeks but didn't really find time to report. Stefano Brivio (11): tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down tcp: Change SO_PEEK_OFF support message to debug() log: Drop newlines in the middle of the perror()-like messages log: Fix sub-second part in relative log time calculation log: Initialise timestamp for relative log time also if we use a log file test: Fix memory/passt tests, --netns-only is not a valid option for passt test: Update names of symbols and slabinfo entries test: iperf3 3.16 introduces multiple threads, drop our own implementation of that tap: Exit if we fail to bind a UNIX domain socket with explicit path tap: Discard guest data on length descriptor mismatch conf: Accept addresses enclosed by square brackets in port forwarding specifiers conf.c | 24 ++++++++---- flow.c | 2 +- log.c | 76 +++++++++++++++++++++----------------- log.h | 19 +++++----- passt.c | 2 + tap.c | 18 ++++++--- tcp.c | 2 +- test/lib/perf_report | 2 +- test/lib/test | 31 +++++----------- test/memory/passt | 35 ++++++++---------- test/passt.mem.mbuto | 9 +---- test/perf/passt_tcp | 39 ++++++++++---------- test/perf/passt_udp | 55 ++++++++++++++-------------- test/perf/pasta_tcp | 58 ++++++++++++++--------------- test/perf/pasta_udp | 87 ++++++++++++++++++++++---------------------- 15 files changed, 229 insertions(+), 230 deletions(-) -- 2.43.0

Show replies by date

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 01/11] tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down

If we start pasta with some ports forwarded, but no --config-net, say: $ ./pasta -u 10001 and then use a local, non-loopback address to send traffic to that port, say: $ socat -u FILE:test UDP4:192.0.2.1:10001 pasta writes to the tap file descriptor, but if the interface is down, we get EIO and terminate. By itself, what I'm doing in this case is not very useful (I simply forgot to pass --config-net), but if we happen to have a DHCP client in the network namespace, the interface might still be down while somebody tries to send traffic to it, and exiting in that case is not really helpful. Signed-off-by: Stefano Brivio --- tap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/tap.c b/tap.c index 32a7b09..6930ad8 100644 --- a/tap.c +++ b/tap.c @@ -324,6 +324,7 @@ static size_t tap_send_frames_pasta(const struct ctx *c, case EINTR: case ENOBUFS: case ENOSPC: + case EIO: /* interface down? */ break; default: die("Write error on tap device, exiting"); -- 2.43.0

David Gibson

25 Jul 25 Jul

5:21 a.m.

New subject: [PATCH 01/11] tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down

On Wed, Jul 24, 2024 at 11:50:07PM +0200, Stefano Brivio wrote:

...

If we start pasta with some ports forwarded, but no --config-net, say:

$ ./pasta -u 10001

and then use a local, non-loopback address to send traffic to that port, say:

$ socat -u FILE:test UDP4:192.0.2.1:10001

pasta writes to the tap file descriptor, but if the interface is down, we get EIO and terminate.

By itself, what I'm doing in this case is not very useful (I simply forgot to pass --config-net), but if we happen to have a DHCP client in the network namespace, the interface might still be down while somebody tries to send traffic to it, and exiting in that case is not really helpful.

Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

...

--- tap.c | 1 + 1 file changed, 1 insertion(+)

diff --git a/tap.c b/tap.c index 32a7b09..6930ad8 100644 --- a/tap.c +++ b/tap.c @@ -324,6 +324,7 @@ static size_t tap_send_frames_pasta(const struct ctx *c, case EINTR: case ENOBUFS: case ENOSPC: + case EIO: /* interface down? */ break; default: die("Write error on tap device, exiting");

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 02/11] tcp: Change SO_PEEK_OFF support message to debug()

This: $ ./pasta SO_PEEK_OFF not supported # is a bit annoying, and might trick users who face other issues into thinking that SO_PEEK_OFF not being supported on a given kernel is an actual issue. Even if SO_PEEK_OFF is supported by the kernel, that would be the only message displayed there, with default options, which looks a bit out of context. Switch that to debug(): now that Podman users can pass --debug too, we can find out quickly if it's supported or not, if SO_PEEK_OFF usage is suspected of causing any issue. Signed-off-by: Stefano Brivio --- tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tcp.c b/tcp.c index c031f13..c0820ce 100644 --- a/tcp.c +++ b/tcp.c @@ -2524,7 +2524,7 @@ int tcp_init(struct ctx *c) peek_offset_cap = (!c->ifi4 || tcp_probe_peek_offset_cap(AF_INET)) && (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6)); - info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); + debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); return 0; } -- 2.43.0

David Gibson

25 Jul 25 Jul

5:22 a.m.

New subject: [PATCH 02/11] tcp: Change SO_PEEK_OFF support message to debug()

On Wed, Jul 24, 2024 at 11:50:08PM +0200, Stefano Brivio wrote:

...

This:

$ ./pasta SO_PEEK_OFF not supported #

is a bit annoying, and might trick users who face other issues into thinking that SO_PEEK_OFF not being supported on a given kernel is an actual issue.

Even if SO_PEEK_OFF is supported by the kernel, that would be the only message displayed there, with default options, which looks a bit out of context.

Switch that to debug(): now that Podman users can pass --debug too, we can find out quickly if it's supported or not, if SO_PEEK_OFF usage is suspected of causing any issue.

Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

...

--- tcp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tcp.c b/tcp.c index c031f13..c0820ce 100644 --- a/tcp.c +++ b/tcp.c @@ -2524,7 +2524,7 @@ int tcp_init(struct ctx *c)

peek_offset_cap = (!c->ifi4 || tcp_probe_peek_offset_cap(AF_INET)) && (!c->ifi6 || tcp_probe_peek_offset_cap(AF_INET6)); - info("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not "); + debug("SO_PEEK_OFF%ssupported", peek_offset_cap ? " " : " not ");

return 0; }

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 03/11] log: Drop newlines in the middle of the perror()-like messages

Calling vlogmsg() twice from logmsg_perror() results in this beauty: $ ./pasta -i foo Invalid interface name foo : No such device because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it. Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this. Signed-off-by: Stefano Brivio --- flow.c | 2 +- log.c | 34 ++++++++++++++++++++++------------ log.h | 18 +++++++++--------- 3 files changed, 32 insertions(+), 22 deletions(-) diff --git a/flow.c b/flow.c index d7d548d..bd5fa2b 100644 --- a/flow.c +++ b/flow.c @@ -279,7 +279,7 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) else type_or_state = FLOW_TYPE(f); - logmsg(pri, "Flow %u (%s): %s", flow_idx(f), type_or_state, msg); + logmsg(true, pri, "Flow %u (%s): %s", flow_idx(f), type_or_state, msg); } /** diff --git a/log.c b/log.c index 9ed6bc1..54483e7 100644 --- a/log.c +++ b/log.c @@ -46,7 +46,14 @@ int log_trace; /* --trace mode enabled */ bool log_conf_parsed; /* Logging options already parsed */ bool log_runtime; /* Daemonised, or ready in foreground */ -void vlogmsg(int pri, const char *format, va_list ap) +/** + * vlogmsg() - Print or send messages to log or output files as configured + * @newline: Append newline at the end of the message, if missing + * @pri: Facility and level map, same as priority for vsyslog() + * @format: Message + * @ap: Variable argument list + */ +void vlogmsg(bool newline, int pri, const char *format, va_list ap) { bool debug_print = (log_mask & LOG_MASK(LOG_DEBUG)) && log_file == -1; struct timespec tp; @@ -63,9 +70,9 @@ void vlogmsg(int pri, const char *format, va_list ap) va_copy(ap2, ap); /* Don't clobber ap, we need it again */ if (log_file != -1) - logfile_write(pri, format, ap2); + logfile_write(newline, pri, format, ap2); else if (!(log_mask & LOG_MASK(LOG_DEBUG))) - passt_vsyslog(pri, format, ap2); + passt_vsyslog(newline, pri, format, ap2); va_end(ap2); } @@ -73,22 +80,23 @@ void vlogmsg(int pri, const char *format, va_list ap) if (debug_print || !log_conf_parsed || (!log_runtime && (log_mask & LOG_MASK(LOG_PRI(pri))))) { (void)vfprintf(stderr, format, ap); - if (format[strlen(format)] != '\n') + if (newline && format[strlen(format)] != '\n') fprintf(stderr, "\n"); } } /** * logmsg() - vlogmsg() wrapper for variable argument lists + * @newline: Append newline at the end of the message, if missing * @pri: Facility and level map, same as priority for vsyslog() * @format: Message */ -void logmsg(int pri, const char *format, ...) +void logmsg(bool newline, int pri, const char *format, ...) { va_list ap; va_start(ap, format); - vlogmsg(pri, format, ap); + vlogmsg(newline, pri, format, ap); va_end(ap); } @@ -103,10 +111,10 @@ void logmsg_perror(int pri, const char *format, ...) va_list ap; va_start(ap, format); - vlogmsg(pri, format, ap); + vlogmsg(false, pri, format, ap); va_end(ap); - logmsg(pri, ": %s", strerror(errno_copy)); + logmsg(true, pri, ": %s", strerror(errno_copy)); } /* Prefixes for log file messages, indexed by priority */ @@ -174,11 +182,12 @@ void __setlogmask(int mask) /** * passt_vsyslog() - vsyslog() implementation not using heap memory + * @newline: Append newline at the end of the message, if missing * @pri: Facility and level map, same as priority for vsyslog() * @format: Same as vsyslog() format * @ap: Same as vsyslog() ap */ -void passt_vsyslog(int pri, const char *format, va_list ap) +void passt_vsyslog(bool newline, int pri, const char *format, va_list ap) { char buf[BUFSIZ]; int n; @@ -188,7 +197,7 @@ void passt_vsyslog(int pri, const char *format, va_list ap) n += vsnprintf(buf + n, BUFSIZ - n, format, ap); - if (format[strlen(format)] != '\n') + if (newline && format[strlen(format)] != '\n') n += snprintf(buf + n, BUFSIZ - n, "\n"); if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && !log_runtime) @@ -360,11 +369,12 @@ static int logfile_rotate(int fd, const struct timespec *now) /** * logfile_write() - Write entry to log file, trigger rotation if full + * @newline: Append newline at the end of the message, if missing * @pri: Facility and level map, same as priority for vsyslog() * @format: Same as vsyslog() format * @ap: Same as vsyslog() ap */ -void logfile_write(int pri, const char *format, va_list ap) +void logfile_write(bool newline, int pri, const char *format, va_list ap) { struct timespec now; char buf[BUFSIZ]; @@ -380,7 +390,7 @@ void logfile_write(int pri, const char *format, va_list ap) n += vsnprintf(buf + n, BUFSIZ - n, format, ap); - if (format[strlen(format)] != '\n') + if (newline && format[strlen(format)] != '\n') n += snprintf(buf + n, BUFSIZ - n, "\n"); if ((log_written + n >= log_size) && logfile_rotate(log_file, &now)) diff --git a/log.h b/log.h index 6f34e8d..51ddafa 100644 --- a/log.h +++ b/log.h @@ -13,16 +13,16 @@ #define LOGFILE_CUT_RATIO 30 /* When full, cut ~30% size */ #define LOGFILE_SIZE_MIN (5UL * MAX(BUFSIZ, PAGE_SIZE)) -void vlogmsg(int pri, const char *format, va_list ap); -void logmsg(int pri, const char *format, ...) - __attribute__((format(printf, 2, 3))); +void vlogmsg(bool newline, int pri, const char *format, va_list ap); +void logmsg(bool newline, int pri, const char *format, ...) + __attribute__((format(printf, 3, 4))); void logmsg_perror(int pri, const char *format, ...) __attribute__((format(printf, 2, 3))); -#define err(...) logmsg( LOG_ERR, __VA_ARGS__) -#define warn(...) logmsg( LOG_WARNING, __VA_ARGS__) -#define info(...) logmsg( LOG_INFO, __VA_ARGS__) -#define debug(...) logmsg( LOG_DEBUG, __VA_ARGS__) +#define err(...) logmsg(true, LOG_ERR, __VA_ARGS__) +#define warn(...) logmsg(true, LOG_WARNING, __VA_ARGS__) +#define info(...) logmsg(true, LOG_INFO, __VA_ARGS__) +#define debug(...) logmsg(true, LOG_DEBUG, __VA_ARGS__) #define err_perror(...) logmsg_perror( LOG_ERR, __VA_ARGS__) #define warn_perror(...) logmsg_perror( LOG_WARNING, __VA_ARGS__) @@ -54,8 +54,8 @@ void trace_init(int enable); void __openlog(const char *ident, int option, int facility); void logfile_init(const char *name, const char *path, size_t size); -void passt_vsyslog(int pri, const char *format, va_list ap); -void logfile_write(int pri, const char *format, va_list ap); +void passt_vsyslog(bool newline, int pri, const char *format, va_list ap); +void logfile_write(bool newline, int pri, const char *format, va_list ap); void __setlogmask(int mask); #endif /* LOG_H */ -- 2.43.0

David Gibson

25 Jul 25 Jul

5:26 a.m.

New subject: [PATCH 03/11] log: Drop newlines in the middle of the perror()-like messages

On Wed, Jul 24, 2024 at 11:50:09PM +0200, Stefano Brivio wrote:

...

Calling vlogmsg() twice from logmsg_perror() results in this beauty:

$ ./pasta -i foo Invalid interface name foo : No such device

because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it.

Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this.

Signed-off-by: Stefano Brivio

I think my personal inclination would be to rename all the lowest-level functions slightly and remove the newline adding logic unconditionally. The create wrappers under the old name which add the "\n". I think that can be done in an easy macro, since the "\n" can be constant string appended to the format string. Just the special paths that need to suppress the newline would call the low level "no newline" variants. But, I don't actually care that much so Reviewed-by: David Gibson

...

--- flow.c | 2 +- log.c | 34 ++++++++++++++++++++++------------ log.h | 18 +++++++++--------- 3 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/flow.c b/flow.c index d7d548d..bd5fa2b 100644 --- a/flow.c +++ b/flow.c @@ -279,7 +279,7 @@ void flow_log_(const struct flow_common *f, int pri, const char *fmt, ...) else type_or_state = FLOW_TYPE(f);

- logmsg(pri, "Flow %u (%s): %s", flow_idx(f), type_or_state, msg); + logmsg(true, pri, "Flow %u (%s): %s", flow_idx(f), type_or_state, msg); }

/** diff --git a/log.c b/log.c index 9ed6bc1..54483e7 100644 --- a/log.c +++ b/log.c @@ -46,7 +46,14 @@ int log_trace; /* --trace mode enabled */ bool log_conf_parsed; /* Logging options already parsed */ bool log_runtime; /* Daemonised, or ready in foreground */

-void vlogmsg(int pri, const char *format, va_list ap) +/** + * vlogmsg() - Print or send messages to log or output files as configured + * @newline: Append newline at the end of the message, if missing + * @pri: Facility and level map, same as priority for vsyslog() + * @format: Message + * @ap: Variable argument list + */ +void vlogmsg(bool newline, int pri, const char *format, va_list ap) { bool debug_print = (log_mask & LOG_MASK(LOG_DEBUG)) && log_file == -1; struct timespec tp; @@ -63,9 +70,9 @@ void vlogmsg(int pri, const char *format, va_list ap)

va_copy(ap2, ap); /* Don't clobber ap, we need it again */ if (log_file != -1) - logfile_write(pri, format, ap2); + logfile_write(newline, pri, format, ap2); else if (!(log_mask & LOG_MASK(LOG_DEBUG))) - passt_vsyslog(pri, format, ap2); + passt_vsyslog(newline, pri, format, ap2);

va_end(ap2); } @@ -73,22 +80,23 @@ void vlogmsg(int pri, const char *format, va_list ap) if (debug_print || !log_conf_parsed || (!log_runtime && (log_mask & LOG_MASK(LOG_PRI(pri))))) { (void)vfprintf(stderr, format, ap); - if (format[strlen(format)] != '\n') + if (newline && format[strlen(format)] != '\n') fprintf(stderr, "\n"); } }

/** * logmsg() - vlogmsg() wrapper for variable argument lists + * @newline: Append newline at the end of the message, if missing * @pri: Facility and level map, same as priority for vsyslog() * @format: Message */ -void logmsg(int pri, const char *format, ...) +void logmsg(bool newline, int pri, const char *format, ...) { va_list ap;

va_start(ap, format); - vlogmsg(pri, format, ap); + vlogmsg(newline, pri, format, ap); va_end(ap); }

@@ -103,10 +111,10 @@ void logmsg_perror(int pri, const char *format, ...) va_list ap;

va_start(ap, format); - vlogmsg(pri, format, ap); + vlogmsg(false, pri, format, ap); va_end(ap);

- logmsg(pri, ": %s", strerror(errno_copy)); + logmsg(true, pri, ": %s", strerror(errno_copy)); }

/* Prefixes for log file messages, indexed by priority */ @@ -174,11 +182,12 @@ void __setlogmask(int mask)

/** * passt_vsyslog() - vsyslog() implementation not using heap memory + * @newline: Append newline at the end of the message, if missing * @pri: Facility and level map, same as priority for vsyslog() * @format: Same as vsyslog() format * @ap: Same as vsyslog() ap */ -void passt_vsyslog(int pri, const char *format, va_list ap) +void passt_vsyslog(bool newline, int pri, const char *format, va_list ap) { char buf[BUFSIZ]; int n; @@ -188,7 +197,7 @@ void passt_vsyslog(int pri, const char *format, va_list ap)

n += vsnprintf(buf + n, BUFSIZ - n, format, ap);

- if (format[strlen(format)] != '\n') + if (newline && format[strlen(format)] != '\n') n += snprintf(buf + n, BUFSIZ - n, "\n");

if (log_sock >= 0 && send(log_sock, buf, n, 0) != n && !log_runtime) @@ -360,11 +369,12 @@ static int logfile_rotate(int fd, const struct timespec *now)

/** * logfile_write() - Write entry to log file, trigger rotation if full + * @newline: Append newline at the end of the message, if missing * @pri: Facility and level map, same as priority for vsyslog() * @format: Same as vsyslog() format * @ap: Same as vsyslog() ap */ -void logfile_write(int pri, const char *format, va_list ap) +void logfile_write(bool newline, int pri, const char *format, va_list ap) { struct timespec now; char buf[BUFSIZ]; @@ -380,7 +390,7 @@ void logfile_write(int pri, const char *format, va_list ap)

n += vsnprintf(buf + n, BUFSIZ - n, format, ap);

- if (format[strlen(format)] != '\n') + if (newline && format[strlen(format)] != '\n') n += snprintf(buf + n, BUFSIZ - n, "\n");

if ((log_written + n >= log_size) && logfile_rotate(log_file, &now)) diff --git a/log.h b/log.h index 6f34e8d..51ddafa 100644 --- a/log.h +++ b/log.h @@ -13,16 +13,16 @@ #define LOGFILE_CUT_RATIO 30 /* When full, cut ~30% size */ #define LOGFILE_SIZE_MIN (5UL * MAX(BUFSIZ, PAGE_SIZE))

-void vlogmsg(int pri, const char *format, va_list ap); -void logmsg(int pri, const char *format, ...) - __attribute__((format(printf, 2, 3))); +void vlogmsg(bool newline, int pri, const char *format, va_list ap); +void logmsg(bool newline, int pri, const char *format, ...) + __attribute__((format(printf, 3, 4))); void logmsg_perror(int pri, const char *format, ...) __attribute__((format(printf, 2, 3)));

-#define err(...) logmsg( LOG_ERR, __VA_ARGS__) -#define warn(...) logmsg( LOG_WARNING, __VA_ARGS__) -#define info(...) logmsg( LOG_INFO, __VA_ARGS__) -#define debug(...) logmsg( LOG_DEBUG, __VA_ARGS__) +#define err(...) logmsg(true, LOG_ERR, __VA_ARGS__) +#define warn(...) logmsg(true, LOG_WARNING, __VA_ARGS__) +#define info(...) logmsg(true, LOG_INFO, __VA_ARGS__) +#define debug(...) logmsg(true, LOG_DEBUG, __VA_ARGS__)

#define err_perror(...) logmsg_perror( LOG_ERR, __VA_ARGS__) #define warn_perror(...) logmsg_perror( LOG_WARNING, __VA_ARGS__) @@ -54,8 +54,8 @@ void trace_init(int enable);

void __openlog(const char *ident, int option, int facility); void logfile_init(const char *name, const char *path, size_t size); -void passt_vsyslog(int pri, const char *format, va_list ap); -void logfile_write(int pri, const char *format, va_list ap); +void passt_vsyslog(bool newline, int pri, const char *format, va_list ap); +void logfile_write(bool newline, int pri, const char *format, va_list ap); void __setlogmask(int mask);

#endif /* LOG_H */

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

1:27 p.m.

New subject: [PATCH 03/11] log: Drop newlines in the middle of the perror()-like messages

On Thu, 25 Jul 2024 13:26:47 +1000 David Gibson wrote:

...

On Wed, Jul 24, 2024 at 11:50:09PM +0200, Stefano Brivio wrote:

...
Calling vlogmsg() twice from logmsg_perror() results in this beauty:

$ ./pasta -i foo Invalid interface name foo : No such device

because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it.

Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this.

Signed-off-by: Stefano Brivio

I think my personal inclination would be to rename all the lowest-level functions slightly and remove the newline adding logic unconditionally. The create wrappers under the old name which add the "\n". I think that can be done in an easy macro, since the "\n" can be constant string appended to the format string. Just the special paths that need to suppress the newline would call the low level "no newline" variants.

I gave it a try, but the problem is that the "easy macro" needs to be conditional, depending on whether the newline is there or not: we don't want to add a newline in case somebody already added it by mistake (or habit). That could probably be done as well with an intermediate function, but it's getting a bit too complicated (at least for me right now).

...

But, I don't actually care that much so

Reviewed-by: David Gibson

-- Stefano

David Gibson

26 Jul 26 Jul

2:33 a.m.

New subject: [PATCH 03/11] log: Drop newlines in the middle of the perror()-like messages

On Thu, Jul 25, 2024 at 01:27:09PM +0200, Stefano Brivio wrote:

...

On Thu, 25 Jul 2024 13:26:47 +1000 David Gibson wrote:

...
On Wed, Jul 24, 2024 at 11:50:09PM +0200, Stefano Brivio wrote:

...
Calling vlogmsg() twice from logmsg_perror() results in this beauty:

$ ./pasta -i foo Invalid interface name foo : No such device

because the first part of the message, corresponding to the first call, doesn't end with a newline, and vlogmsg() adds it.

Given that we can't easily append an argument (error description) to a variadic list, add a 'newline' parameter to all the functions that currently add a newline if missing, and disable that on the first call to vlogmsg() from logmsg_perror(). Not very pretty but I can't think of any solution that's less messy than this.

Signed-off-by: Stefano Brivio

I think my personal inclination would be to rename all the lowest-level functions slightly and remove the newline adding logic unconditionally. The create wrappers under the old name which add the "\n". I think that can be done in an easy macro, since the "\n" can be constant string appended to the format string. Just the special paths that need to suppress the newline would call the low level "no newline" variants.

I gave it a try, but the problem is that the "easy macro" needs to be conditional, depending on whether the newline is there or not: we don't want to add a newline in case somebody already added it by mistake (or habit).

Well, I mean we could just add the newline and consider it a bug if someone puts one in themselves.

...

That could probably be done as well with an intermediate function, but it's getting a bit too complicated (at least for me right now).

...
But, I don't actually care that much so

Reviewed-by: David Gibson

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 04/11] log: Fix sub-second part in relative log time calculation

For some reason, in commit 01efc71ddd25 ("log, conf: Add support for logging to file"), I added calculations for relative logging timestamps using the difference for the seconds part only, not for accounting for the fractional part. Fix that by storing the initial timestamp, log_start, as a timespec struct, and by calculating differences of both seconds and nanoseconds from the starting time. Do this in a macro as we need the same calculation and format in a few places. Signed-off-by: Stefano Brivio --- log.c | 42 ++++++++++++++++++++++-------------------- 1 file changed, 22 insertions(+), 20 deletions(-) diff --git a/log.c b/log.c index 54483e7..f57a54f 100644 --- a/log.c +++ b/log.c @@ -40,12 +40,21 @@ static size_t log_written; /* Currently used bytes in log file */ static size_t log_cut_size; /* Bytes to cut at start on rotation */ static char log_header[BUFSIZ]; /* File header, written back on cuts */ -static time_t log_start; /* Start timestamp */ +static struct timespec log_start; /* Start timestamp */ int log_trace; /* --trace mode enabled */ bool log_conf_parsed; /* Logging options already parsed */ bool log_runtime; /* Daemonised, or ready in foreground */ +/** + * logtime_fmt_and_arg() - Build format and arguments to print relative log time + * @x: Current timestamp + */ +#define logtime_fmt_and_arg(x) \ + "%lli.%04lli", \ + ((long long int)(x)->tv_sec - log_start.tv_sec), \ + (((long long int)(x)->tv_nsec - log_start.tv_nsec) / (100L * 1000)) + /** * vlogmsg() - Print or send messages to log or output files as configured * @newline: Append newline at the end of the message, if missing @@ -60,9 +69,8 @@ void vlogmsg(bool newline, int pri, const char *format, va_list ap) if (debug_print) { clock_gettime(CLOCK_REALTIME, &tp); - fprintf(stderr, "%lli.%04lli: ", - (long long int)tp.tv_sec - log_start, - (long long int)tp.tv_nsec / (100L * 1000)); + fprintf(stderr, logtime_fmt_and_arg(&tp)); + fprintf(stderr, ": "); } if ((log_mask & LOG_MASK(LOG_PRI(pri))) || !log_conf_parsed) { @@ -144,12 +152,9 @@ void trace_init(int enable) */ void __openlog(const char *ident, int option, int facility) { - struct timespec tp; - (void)option; - clock_gettime(CLOCK_REALTIME, &tp); - log_start = tp.tv_sec; + clock_gettime(CLOCK_REALTIME, &log_start); if (log_sock < 0) { struct sockaddr_un a = { .sun_family = AF_UNIX, }; @@ -255,10 +260,8 @@ static void logfile_rotate_fallocate(int fd, const struct timespec *now) if (read(fd, buf, BUFSIZ) == -1) return; - n = snprintf(buf, BUFSIZ, - "%s - log truncated at %lli.%04lli", log_header, - (long long int)(now->tv_sec - log_start), - (long long int)(now->tv_nsec / (100L * 1000))); + n = snprintf(buf, BUFSIZ, "%s - log truncated at ", log_header); + n += snprintf(buf + n, BUFSIZ - n, logtime_fmt_and_arg(now)); /* Avoid partial lines by padding the header with spaces */ nl = memchr(buf + n + 1, '\n', BUFSIZ - n - 1); @@ -288,10 +291,11 @@ static void logfile_rotate_move(int fd, const struct timespec *now) char buf[BUFSIZ]; const char *nl; - header_len = snprintf(buf, BUFSIZ, - "%s - log truncated at %lli.%04lli\n", log_header, - (long long int)(now->tv_sec - log_start), - (long long int)(now->tv_nsec / (100L * 1000))); + header_len = snprintf(buf, BUFSIZ, "%s - log truncated at ", + log_header); + header_len += snprintf(buf + header_len, BUFSIZ - header_len, + logtime_fmt_and_arg(now)); + if (lseek(fd, 0, SEEK_SET) == -1) return; if (write(fd, buf, header_len) == -1) @@ -383,10 +387,8 @@ void logfile_write(bool newline, int pri, const char *format, va_list ap) if (clock_gettime(CLOCK_REALTIME, &now)) return; - n = snprintf(buf, BUFSIZ, "%lli.%04lli: %s", - (long long int)(now.tv_sec - log_start), - (long long int)(now.tv_nsec / (100L * 1000)), - logfile_prefix[pri]); + n = snprintf(buf, BUFSIZ, logtime_fmt_and_arg(&now)); + n += snprintf(buf + n, BUFSIZ - n, ": %s", logfile_prefix[pri]); n += vsnprintf(buf + n, BUFSIZ - n, format, ap); -- 2.43.0

David Gibson

25 Jul 25 Jul

5:32 a.m.

New subject: [PATCH 04/11] log: Fix sub-second part in relative log time calculation

On Wed, Jul 24, 2024 at 11:50:10PM +0200, Stefano Brivio wrote:

...

For some reason, in commit 01efc71ddd25 ("log, conf: Add support for logging to file"), I added calculations for relative logging timestamps using the difference for the seconds part only, not for accounting for the fractional part.

Fix that by storing the initial timestamp, log_start, as a timespec struct, and by calculating differences of both seconds and nanoseconds from the starting time. Do this in a macro as we need the same calculation and format in a few places.

Signed-off-by: Stefano Brivio --- log.c | 42 ++++++++++++++++++++++-------------------- 1 file changed, 22 insertions(+), 20 deletions(-)

diff --git a/log.c b/log.c index 54483e7..f57a54f 100644 --- a/log.c +++ b/log.c @@ -40,12 +40,21 @@ static size_t log_written; /* Currently used bytes in log file */ static size_t log_cut_size; /* Bytes to cut at start on rotation */ static char log_header[BUFSIZ]; /* File header, written back on cuts */

-static time_t log_start; /* Start timestamp */ +static struct timespec log_start; /* Start timestamp */

int log_trace; /* --trace mode enabled */ bool log_conf_parsed; /* Logging options already parsed */ bool log_runtime; /* Daemonised, or ready in foreground */

+/** + * logtime_fmt_and_arg() - Build format and arguments to print relative log time + * @x: Current timestamp + */ +#define logtime_fmt_and_arg(x) \ + "%lli.%04lli", \ + ((long long int)(x)->tv_sec - log_start.tv_sec), \ + (((long long int)(x)->tv_nsec - log_start.tv_nsec) / (100L * 1000))

This doesn't look right. If x->tv_nsec - log_start.tv_nsec is negative it will produce weird results. Instead you need more complex logic to carry a sufficient difference in the nsec over into the seconds difference. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

9:51 a.m.

New subject: [PATCH 04/11] log: Fix sub-second part in relative log time calculation

On Thu, 25 Jul 2024 13:32:52 +1000 David Gibson wrote:

...

On Wed, Jul 24, 2024 at 11:50:10PM +0200, Stefano Brivio wrote:

...
[...]

+/** + * logtime_fmt_and_arg() - Build format and arguments to print relative log time + * @x: Current timestamp + */ +#define logtime_fmt_and_arg(x) \ + "%lli.%04lli", \ + ((long long int)(x)->tv_sec - log_start.tv_sec), \ + (((long long int)(x)->tv_nsec - log_start.tv_nsec) / (100L * 1000))

This doesn't look right. If x->tv_nsec - log_start.tv_nsec is negative it will produce weird results. Instead you need more complex logic to carry a sufficient difference in the nsec over into the seconds difference.

Oh, oops, of course. I was hoping to recycle this from the Linux kernel, where we have timespec_sub(), but that implementation looks a bit too complicated for my taste. I'm adding something simpler now. -- Stefano

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 05/11] log: Initialise timestamp for relative log time also if we use a log file

...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero. Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start. Signed-off-by: Stefano Brivio --- log.c | 4 +--- log.h | 1 + passt.c | 2 ++ 3 files changed, 4 insertions(+), 3 deletions(-) diff --git a/log.c b/log.c index f57a54f..5f24c16 100644 --- a/log.c +++ b/log.c @@ -40,7 +40,7 @@ static size_t log_written; /* Currently used bytes in log file */ static size_t log_cut_size; /* Bytes to cut at start on rotation */ static char log_header[BUFSIZ]; /* File header, written back on cuts */ -static struct timespec log_start; /* Start timestamp */ +struct timespec log_start; /* Start timestamp */ int log_trace; /* --trace mode enabled */ bool log_conf_parsed; /* Logging options already parsed */ @@ -154,8 +154,6 @@ void __openlog(const char *ident, int option, int facility) { (void)option; - clock_gettime(CLOCK_REALTIME, &log_start); - if (log_sock < 0) { struct sockaddr_un a = { .sun_family = AF_UNIX, }; diff --git a/log.h b/log.h index 51ddafa..e03199c 100644 --- a/log.h +++ b/log.h @@ -44,6 +44,7 @@ void logmsg_perror(int pri, const char *format, ...) extern int log_trace; extern bool log_conf_parsed; extern bool log_runtime; +extern struct timespec log_start; void trace_init(int enable); #define trace(...) \ diff --git a/passt.c b/passt.c index eed74ec..72ad704 100644 --- a/passt.c +++ b/passt.c @@ -207,6 +207,8 @@ int main(int argc, char **argv) struct timespec now; struct sigaction sa; + clock_gettime(CLOCK_REALTIME, &log_start); + arch_avx2_exec(argv); isolate_initial(); -- 2.43.0

David Gibson

25 Jul 25 Jul

5:35 a.m.

New subject: [PATCH 05/11] log: Initialise timestamp for relative log time also if we use a log file

On Wed, Jul 24, 2024 at 11:50:11PM +0200, Stefano Brivio wrote:

...

...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero.

Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start.

Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson It does occur to me, though... when we're using relative timestamps, should we be using the monotonic clock instead of the realtime clock?

...

--- log.c | 4 +--- log.h | 1 + passt.c | 2 ++ 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/log.c b/log.c index f57a54f..5f24c16 100644 --- a/log.c +++ b/log.c @@ -40,7 +40,7 @@ static size_t log_written; /* Currently used bytes in log file */ static size_t log_cut_size; /* Bytes to cut at start on rotation */ static char log_header[BUFSIZ]; /* File header, written back on cuts */

-static struct timespec log_start; /* Start timestamp */ +struct timespec log_start; /* Start timestamp */

int log_trace; /* --trace mode enabled */ bool log_conf_parsed; /* Logging options already parsed */ @@ -154,8 +154,6 @@ void __openlog(const char *ident, int option, int facility) { (void)option;

- clock_gettime(CLOCK_REALTIME, &log_start); - if (log_sock < 0) { struct sockaddr_un a = { .sun_family = AF_UNIX, };

diff --git a/log.h b/log.h index 51ddafa..e03199c 100644 --- a/log.h +++ b/log.h @@ -44,6 +44,7 @@ void logmsg_perror(int pri, const char *format, ...) extern int log_trace; extern bool log_conf_parsed; extern bool log_runtime; +extern struct timespec log_start;

void trace_init(int enable); #define trace(...) \ diff --git a/passt.c b/passt.c index eed74ec..72ad704 100644 --- a/passt.c +++ b/passt.c @@ -207,6 +207,8 @@ int main(int argc, char **argv) struct timespec now; struct sigaction sa;

+ clock_gettime(CLOCK_REALTIME, &log_start); + arch_avx2_exec(argv);

isolate_initial();

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

9:51 a.m.

New subject: [PATCH 05/11] log: Initialise timestamp for relative log time also if we use a log file

On Thu, 25 Jul 2024 13:35:25 +1000 David Gibson wrote:

...

On Wed, Jul 24, 2024 at 11:50:11PM +0200, Stefano Brivio wrote:

...
...not just for debug messages. Otherwise, timestamps in the log file are consistent but the starting point is not zero.

Do this right away as we enter main(), so that the resulting timestamps are as closely as possible relative to when we start.

Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

It does occur to me, though... when we're using relative timestamps, should we be using the monotonic clock instead of the realtime clock?

Right, yes, I was under the impression that we could risk mixing those with packet capture timestamps (where we want the realtime clock), but actually it's never the case. I'll post another patch for that. -- Stefano

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 06/11] test: Fix memory/passt tests, --netns-only is not a valid option for passt

This used to work on my setup as I kept reusing an old mbuto (initramfs) image, but since commit 65923ba79877 ("conf: Accept duplicate and conflicting options, the last one wins"), --netns-only is, as originally intended, a pasta-only option. I had used --netns-only, here, to prevent passt from trying to detach its own user namespace, which is not permitted as we're in a chroot, see unshare(2). In turn, we need the chroot because passt can't pivot root directly into its own empty filesystem using an initramfs. Use switch_root into the tmpfs mountpoint instead of chroot, so that we can still detach user namespaces. Note that in the mbuto images, we can't switch to nobody as we have no password entries at all, so we need to detach a further user namespace before starting passt, to trick passt into running as UID 0. Given the new sequence, it's now more convenient to directly switch to a detached network namespace as well, which means we need to move the initialisation of the dummy network from the init script into the test script. Reported-by: David Gibson Signed-off-by: Stefano Brivio --- test/memory/passt | 13 ++++++++++--- test/passt.mem.mbuto | 9 +-------- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/test/memory/passt b/test/memory/passt index 1193af8..bf78c8f 100644 --- a/test/memory/passt +++ b/test/memory/passt @@ -44,7 +44,7 @@ endef def start_stop_diff guest sed /proc/slabinfo -ne 's/^$[^ ]* *[^ ]* *[^ ]* *[^ ]*$.*/\\\1/p' > /tmp/slabinfo.before guest cat /proc/meminfo > /tmp/meminfo.before -guest /bin/passt.avx2 -l /tmp/log -s /tmp/sock -P /tmp/pid __OPTS__ --netns-only +guest /bin/passt.avx2 -l /tmp/log -s /tmp/sock -P /tmp/pid __OPTS__ sleep 2 guest cat /proc/meminfo > /tmp/meminfo.after guest sed /proc/slabinfo -ne 's/^$[^ ]* *[^ ]* *[^ ]* *[^ ]*$.*/\\\1/p' > /tmp/slabinfo.after @@ -78,9 +78,16 @@ guest mount -o bind /proc /test/proc guest mount -o bind /dev /test/dev guest cp -Lr /bin /lib /lib64 /usr /sbin /test/ +guest exec switch_root /test /bin/sh + guest ulimit -Hn 300000 -guest unshare -rUm -R /test -guest chroot . +guest unshare -rUn +guest ip link add eth0 type dummy +guest ip link set eth0 up +guest ip address add 192.0.2.2/24 dev eth0 +guest ip address add 2001:db8::2/64 dev eth0 +guest ip route add default via 192.0.2.1 +guest ip -6 route add default via 2001:db8::1 dev eth0 guest meminfo_size() { grep "^$2:" $1 | tr -s ' ' | cut -f2 -d ' '; } guest meminfo_diff() { echo $(( $(meminfo_size $2 $3) - $(meminfo_size $1 $3) )); } diff --git a/test/passt.mem.mbuto b/test/passt.mem.mbuto index 56f5139..532eae0 100755 --- a/test/passt.mem.mbuto +++ b/test/passt.mem.mbuto @@ -12,7 +12,7 @@ PROGS="${PROGS:-ash,dash,bash chmod ip mount insmod mkdir ln cat chmod modprobe grep mknod sed chown sleep bc ls ps mount unshare chroot cp kill diff - head tail sort tr tee cut nm which}" + head tail sort tr tee cut nm which switch_root}" KMODS="${KMODS:- dummy}" @@ -29,13 +29,6 @@ COPIES="${COPIES} ../passt.avx2,/bin/passt.avx2" FIXUP="${FIXUP}"' ln -s /bin /usr/bin chmod 777 /tmp -ip link add eth0 type dummy -ip link set eth0 up -ip address add 192.0.2.2/24 dev eth0 -ip address add 2001:db8::2/64 dev eth0 -ip route add default via 192.0.2.1 -ip -6 route add default via 2001:db8::1 dev eth0 -sleep 2 sh +m ' -- 2.43.0

David Gibson

25 Jul 25 Jul

5:50 a.m.

New subject: [PATCH 06/11] test: Fix memory/passt tests, --netns-only is not a valid option for passt

On Wed, Jul 24, 2024 at 11:50:12PM +0200, Stefano Brivio wrote:

...

This used to work on my setup as I kept reusing an old mbuto (initramfs) image, but since commit 65923ba79877 ("conf: Accept duplicate and conflicting options, the last one wins"), --netns-only is, as originally intended, a pasta-only option.

I had used --netns-only, here, to prevent passt from trying to detach its own user namespace, which is not permitted as we're in a chroot, see unshare(2). In turn, we need the chroot because passt can't pivot root directly into its own empty filesystem using an initramfs.

Use switch_root into the tmpfs mountpoint instead of chroot, so that we can still detach user namespaces.

Note that in the mbuto images, we can't switch to nobody as we have no password entries at all, so we need to detach a further user namespace before starting passt, to trick passt into running as UID 0.

Given the new sequence, it's now more convenient to directly switch to a detached network namespace as well, which means we need to move the initialisation of the dummy network from the init script into the test script.

Reported-by: David Gibson Signed-off-by: Stefano Brivio

Excellent, I can run these tests again. Tested-by: David Gibson Reviewed-by: David Gibson

...

--- test/memory/passt | 13 ++++++++++--- test/passt.mem.mbuto | 9 +-------- 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/test/memory/passt b/test/memory/passt index 1193af8..bf78c8f 100644 --- a/test/memory/passt +++ b/test/memory/passt @@ -44,7 +44,7 @@ endef def start_stop_diff guest sed /proc/slabinfo -ne 's/^$[^ ]* *[^ ]* *[^ ]* *[^ ]*$.*/\\\1/p' > /tmp/slabinfo.before guest cat /proc/meminfo > /tmp/meminfo.before -guest /bin/passt.avx2 -l /tmp/log -s /tmp/sock -P /tmp/pid __OPTS__ --netns-only +guest /bin/passt.avx2 -l /tmp/log -s /tmp/sock -P /tmp/pid __OPTS__ sleep 2 guest cat /proc/meminfo > /tmp/meminfo.after guest sed /proc/slabinfo -ne 's/^$[^ ]* *[^ ]* *[^ ]* *[^ ]*$.*/\\\1/p' > /tmp/slabinfo.after @@ -78,9 +78,16 @@ guest mount -o bind /proc /test/proc guest mount -o bind /dev /test/dev guest cp -Lr /bin /lib /lib64 /usr /sbin /test/

+guest exec switch_root /test /bin/sh + guest ulimit -Hn 300000 -guest unshare -rUm -R /test -guest chroot . +guest unshare -rUn +guest ip link add eth0 type dummy +guest ip link set eth0 up +guest ip address add 192.0.2.2/24 dev eth0 +guest ip address add 2001:db8::2/64 dev eth0 +guest ip route add default via 192.0.2.1 +guest ip -6 route add default via 2001:db8::1 dev eth0

guest meminfo_size() { grep "^$2:" $1 | tr -s ' ' | cut -f2 -d ' '; } guest meminfo_diff() { echo $(( $(meminfo_size $2 $3) - $(meminfo_size $1 $3) )); } diff --git a/test/passt.mem.mbuto b/test/passt.mem.mbuto index 56f5139..532eae0 100755 --- a/test/passt.mem.mbuto +++ b/test/passt.mem.mbuto @@ -12,7 +12,7 @@

PROGS="${PROGS:-ash,dash,bash chmod ip mount insmod mkdir ln cat chmod modprobe grep mknod sed chown sleep bc ls ps mount unshare chroot cp kill diff - head tail sort tr tee cut nm which}" + head tail sort tr tee cut nm which switch_root}"

KMODS="${KMODS:- dummy}"

@@ -29,13 +29,6 @@ COPIES="${COPIES} ../passt.avx2,/bin/passt.avx2" FIXUP="${FIXUP}"' ln -s /bin /usr/bin chmod 777 /tmp -ip link add eth0 type dummy -ip link set eth0 up -ip address add 192.0.2.2/24 dev eth0 -ip address add 2001:db8::2/64 dev eth0 -ip route add default via 192.0.2.1 -ip -6 route add default via 2001:db8::1 dev eth0 -sleep 2 sh +m '

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 07/11] test: Update names of symbols and slabinfo entries

Differences in allocated Acpi-Parse entries are gone (at least) since the 6.1 Linux kernel series. I should run this on a 6.10 kernel, eventually, and adjust things further, as needed. Userspace symbols are also fairly different now: show whatever is more than 1 MiB at the moment. Signed-off-by: Stefano Brivio --- test/memory/passt | 22 +++++----------------- 1 file changed, 5 insertions(+), 17 deletions(-) diff --git a/test/memory/passt b/test/memory/passt index bf78c8f..7e45724 100644 --- a/test/memory/passt +++ b/test/memory/passt @@ -110,27 +110,17 @@ info th symbol MiB set WHAT tcp_buf_discard nm_row -set WHAT tcp6_l2_buf +set WHAT flowtab nm_row -set WHAT tcp4_l2_buf +set WHAT tcp6_payload nm_row -set WHAT tc +set WHAT tcp4_payload nm_row set WHAT pkt_buf nm_row -set WHAT udp_splice_map +set WHAT udp_payload nm_row -set WHAT udp6_l2_buf -nm_row -set WHAT udp4_l2_buf -nm_row -set WHAT udp_tap_map -nm_row -set WHAT icmp_id_map -nm_row -set WHAT udp_splice_buf -nm_row -set WHAT tc_hash +set WHAT flow_hashtab nm_row set WHAT pool_tap6_storage nm_row @@ -149,8 +139,6 @@ set WHAT pid slab_row set WHAT dentry slab_row -set WHAT Acpi-Parse -slab_row set WHAT kmalloc-64 slab_row set WHAT kmalloc-32 -- 2.43.0

David Gibson

25 Jul 25 Jul

5:54 a.m.

New subject: [PATCH 07/11] test: Update names of symbols and slabinfo entries

On Wed, Jul 24, 2024 at 11:50:13PM +0200, Stefano Brivio wrote:

...

Differences in allocated Acpi-Parse entries are gone (at least) since the 6.1 Linux kernel series. I should run this on a 6.10 kernel, eventually, and adjust things further, as needed.

Userspace symbols are also fairly different now: show whatever is more than 1 MiB at the moment.

Signed-off-by: Stefano Brivio

And now I get more useful results here too. Tested-by: David Gibson Reviewed-by: David Gibson

...

--- test/memory/passt | 22 +++++----------------- 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/test/memory/passt b/test/memory/passt index bf78c8f..7e45724 100644 --- a/test/memory/passt +++ b/test/memory/passt @@ -110,27 +110,17 @@ info th symbol MiB set WHAT tcp_buf_discard nm_row -set WHAT tcp6_l2_buf +set WHAT flowtab nm_row -set WHAT tcp4_l2_buf +set WHAT tcp6_payload nm_row -set WHAT tc +set WHAT tcp4_payload nm_row set WHAT pkt_buf nm_row -set WHAT udp_splice_map +set WHAT udp_payload nm_row -set WHAT udp6_l2_buf -nm_row -set WHAT udp4_l2_buf -nm_row -set WHAT udp_tap_map -nm_row -set WHAT icmp_id_map -nm_row -set WHAT udp_splice_buf -nm_row -set WHAT tc_hash +set WHAT flow_hashtab nm_row set WHAT pool_tap6_storage nm_row @@ -149,8 +139,6 @@ set WHAT pid slab_row set WHAT dentry slab_row -set WHAT Acpi-Parse -slab_row set WHAT kmalloc-64 slab_row set WHAT kmalloc-32

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 08/11] test: iperf3 3.16 introduces multiple threads, drop our own implementation of that

Starting from iperf3 version 3.16, -P / --parallel spawns multiple clients as separate threads, instead of multiple streams serviced by the same thread. So we can drop our lib/test implementation to spawn several iperf3 client and server processes and finally simplify things quite a bit. Adjust number of threads and UDP sending bandwidth to values that seem to be more or less matching previous throughput tests on my setup. Signed-off-by: Stefano Brivio --- test/lib/perf_report | 2 +- test/lib/test | 31 +++++----------- test/perf/passt_tcp | 39 ++++++++++---------- test/perf/passt_udp | 55 ++++++++++++++-------------- test/perf/pasta_tcp | 58 ++++++++++++++--------------- test/perf/pasta_udp | 87 ++++++++++++++++++++++---------------------- 6 files changed, 127 insertions(+), 145 deletions(-) diff --git a/test/lib/perf_report b/test/lib/perf_report index 67f9f4e..4d79fa2 100755 --- a/test/lib/perf_report +++ b/test/lib/perf_report @@ -18,7 +18,7 @@ PERF_LINK_COUNT=0 PERF_JS="${LOGDIR}/web/perf.js" PERF_TEMPLATE_HTML="document.write('"' -Throughput in Gbps, latency in µs. Threads are iperf3 processes, passt and pasta are currently single-threaded. +Throughput in Gbps, latency in µs. Threads are iperf3 threads, passt and pasta are currently single-threaded. Click on numbers to show test execution. Measured at head, commit __commit__. <style type="text/CSS"> diff --git a/test/lib/test b/test/lib/test index 1d571c3..c525f8e 100755 --- a/test/lib/test +++ b/test/lib/test @@ -15,18 +15,13 @@ # test_iperf3s() - Start iperf3 server # $1: Destination/server context -# $2: Port number, ${i} is translated to process index -# $3: Number of processes to run in parallel +# $2: Port number test_iperf3s() { __sctx="${1}" __port="${2}" - __procs="$((${3} - 1))" pane_or_context_run_bg "${__sctx}" \ - 'for i in $(seq 0 '${__procs}'); do' \ - ' iperf3 -s -p'${__port}' &' \ - ' echo $! > s${i}.pid; ' \ - 'done' \ + 'iperf3 -s -p'${__port}' & echo $! > s.pid' \ sleep 1 # Wait for server to be ready } @@ -36,7 +31,7 @@ test_iperf3s() { test_iperf3k() { __sctx="${1}" - pane_or_context_run "${__sctx}" 'kill -INT $(cat s*.pid); rm s*.pid' + pane_or_context_run "${__sctx}" 'kill -INT $(cat s.pid); rm s.pid' sleep 3 # Wait for kernel to free up ports } @@ -46,37 +41,29 @@ test_iperf3k() { # $2: Source/client context # $3: Destination name or address for client # $4: Port number, ${i} is translated to process index -# $5: Number of processes to run in parallel -# $6: Run time, in seconds +# $5: Run time, in seconds # $@: Client options test_iperf3() { __var="${1}"; shift __cctx="${1}"; shift __dest="${1}"; shift __port="${1}"; shift - __procs="$((${1} - 1))"; shift __time="${1}"; shift - pane_or_context_run "${__cctx}" 'rm -f c*.json' + pane_or_context_run "${__cctx}" 'rm -f c.json' # A 1s wait for connection on what's basically a local link # indicates something is pretty wrong __timeout=1000 pane_or_context_run "${__cctx}" \ - '(' \ - ' for i in $(seq 0 '${__procs}'); do' \ - ' iperf3 -J -c '${__dest}' -p '${__port} \ - ' --connect-timeout '${__timeout} \ - ' -t'${__time}' -i0 -T c${i} '"${@}" \ - ' > c${i}.json &' \ - ' done;' \ - ' wait' \ - ')' + 'iperf3 -J -c '${__dest}' -p '${__port} \ + ' --connect-timeout '${__timeout} \ + ' -t'${__time}' -i0 '"${@}"' > c.json' \ __jval=".end.sum_received.bits_per_second" __bw=$(pane_or_context_output "${__cctx}" \ - 'cat c*.json | jq -rMs "map('${__jval}') | add"') + 'cat c.json | jq -rMs "map('${__jval}') | add"') TEST_ONE_subs="$(list_add_pair "${TEST_ONE_subs}" "__${__var}__" "${__bw}" )" } diff --git a/test/perf/passt_tcp b/test/perf/passt_tcp index 631a407..14343cb 100644 --- a/test/perf/passt_tcp +++ b/test/perf/passt_tcp @@ -37,34 +37,33 @@ hout FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: $[0-9]*$\..*$/(\1+10^2\ hout FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l hout FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__ -set THREADS 1 -set STREAMS 8 +set THREADS 4 set TIME 10 set OMIT 0.1 -set OPTS -Z -P __STREAMS__ -l 1M -O__OMIT__ +set OPTS -Z -P __THREADS__ -l 1M -O__OMIT__ -info Throughput in Gbps, latency in µs, one thread at __FREQ__ GHz, __STREAMS__ streams +info Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz report passt tcp __THREADS__ __FREQ__ th MTU 256B 576B 1280B 1500B 9000B 65520B tr TCP throughput over IPv6: guest to host -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 bw - bw - guest ip link set dev __IFNAME__ mtu 1280 -iperf3 BW guest __GW6__%__IFNAME__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 4M +iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -w 4M bw __BW__ 1.2 1.5 guest ip link set dev __IFNAME__ mtu 1500 -iperf3 BW guest __GW6__%__IFNAME__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 4M +iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -w 4M bw __BW__ 1.6 1.8 guest ip link set dev __IFNAME__ mtu 9000 -iperf3 BW guest __GW6__%__IFNAME__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 8M +iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -w 8M bw __BW__ 4.0 5.0 guest ip link set dev __IFNAME__ mtu 65520 -iperf3 BW guest __GW6__%__IFNAME__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 16M +iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -w 16M bw __BW__ 7.0 8.0 iperf3k ns @@ -90,25 +89,25 @@ gout LAT tcp_crr --nolog -6 -c -H __GW6__%__IFNAME__ | sed -n 's/^throughput=\(. lat __LAT__ 500 400 tr TCP throughput over IPv4: guest to host -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 guest ip link set dev __IFNAME__ mtu 256 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 1M +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 1M bw __BW__ 0.2 0.3 guest ip link set dev __IFNAME__ mtu 576 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 1M +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 1M bw __BW__ 0.5 0.8 guest ip link set dev __IFNAME__ mtu 1280 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 4M +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 4M bw __BW__ 1.2 1.5 guest ip link set dev __IFNAME__ mtu 1500 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 4M +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 4M bw __BW__ 1.6 1.8 guest ip link set dev __IFNAME__ mtu 9000 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 8M +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 8M bw __BW__ 4.0 5.0 guest ip link set dev __IFNAME__ mtu 65520 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -w 16M +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -w 16M bw __BW__ 7.0 8.0 iperf3k ns @@ -134,14 +133,14 @@ gout LAT tcp_crr --nolog -4 -c -H __GW__ | sed -n 's/^throughput=$.*$/\1/p' lat __LAT__ 500 400 tr TCP throughput over IPv6: host to guest -iperf3s guest 100${i}1 __THREADS__ +iperf3s guest 10001 bw - bw - bw - bw - bw - -iperf3 BW ns ::1 100${i}1 __THREADS__ __TIME__ __OPTS__ +iperf3 BW ns ::1 10001 __TIME__ __OPTS__ bw __BW__ 6.0 6.8 iperf3k guest @@ -170,14 +169,14 @@ lat __LAT__ 500 350 tr TCP throughput over IPv4: host to guest -iperf3s guest 100${i}1 __THREADS__ +iperf3s guest 10001 bw - bw - bw - bw - bw - -iperf3 BW ns 127.0.0.1 100${i}1 __THREADS__ __TIME__ __OPTS__ +iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ bw __BW__ 6.0 6.8 iperf3k guest diff --git a/test/perf/passt_udp b/test/perf/passt_udp index 10f638f..8919280 100644 --- a/test/perf/passt_udp +++ b/test/perf/passt_udp @@ -30,30 +30,29 @@ hout FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: $[0-9]*$\..*$/(\1+10^2\ hout FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l hout FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__ -set THREADS 4 -set STREAMS 1 +set THREADS 2 set TIME 10 -set OPTS -u -P __STREAMS__ --pacing-timer 1000 +set OPTS -u -P __THREADS__ --pacing-timer 1000 -info Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz, one stream each +info Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz report passt udp __THREADS__ __FREQ__ th pktlen 256B 576B 1280B 1500B 9000B 65520B tr UDP throughput over IPv6: guest to host -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 # (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header bw - bw - -iperf3 BW guest __GW6__%__IFNAME__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 2G -l 1232 +iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -b 3G -l 1232 bw __BW__ 0.8 1.2 -iperf3 BW guest __GW6__%__IFNAME__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 3G -l 1452 +iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -b 4G -l 1452 bw __BW__ 1.0 1.5 -iperf3 BW guest __GW6__%__IFNAME__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 5G -l 8952 +iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -b 8G -l 8952 bw __BW__ 4.0 5.0 -iperf3 BW guest __GW6__%__IFNAME__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 7G -l 64372 +iperf3 BW guest __GW6__%__IFNAME__ 10002 __TIME__ __OPTS__ -b 15G -l 64372 bw __BW__ 4.0 5.0 iperf3k ns @@ -70,20 +69,20 @@ lat __LAT__ 200 150 tr UDP throughput over IPv4: guest to host -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 # (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 500M -l 228 +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 1G -l 228 bw __BW__ 0.0 0.0 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 1G -l 548 +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 2G -l 548 bw __BW__ 0.4 0.6 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 2G -l 1252 +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 3G -l 1252 bw __BW__ 0.8 1.2 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 3G -l 1472 +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 4G -l 1472 bw __BW__ 1.0 1.5 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 6G -l 8972 +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 8G -l 8972 bw __BW__ 4.0 5.0 -iperf3 BW guest __GW__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 7G -l 65492 +iperf3 BW guest __GW__ 10002 __TIME__ __OPTS__ -b 15G -l 65492 bw __BW__ 4.0 5.0 iperf3k ns @@ -100,18 +99,18 @@ lat __LAT__ 200 150 tr UDP throughput over IPv6: host to guest -iperf3s guest 100${i}1 __THREADS__ +iperf3s guest 10001 # (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header bw - bw - -iperf3 BW ns ::1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 2G -l 1232 +iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -b 3G -l 1232 bw __BW__ 0.8 1.2 -iperf3 BW ns ::1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 2G -l 1452 +iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -b 4G -l 1452 bw __BW__ 1.0 1.5 -iperf3 BW ns ::1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 3G -l 8952 +iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -b 8G -l 8952 bw __BW__ 3.0 4.0 -iperf3 BW ns ::1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 3G -l 64372 +iperf3 BW ns ::1 10001 __TIME__ __OPTS__ -b 15G -l 64372 bw __BW__ 3.0 4.0 iperf3k guest @@ -129,20 +128,20 @@ lat __LAT__ 200 150 tr UDP throughput over IPv4: host to guest -iperf3s guest 100${i}1 __THREADS__ +iperf3s guest 10001 # (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header -iperf3 BW ns 127.0.0.1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 1G -l 228 +iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 1G -l 228 bw __BW__ 0.0 0.0 -iperf3 BW ns 127.0.0.1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 1G -l 548 +iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 2G -l 548 bw __BW__ 0.4 0.6 -iperf3 BW ns 127.0.0.1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 3G -l 1252 +iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 3G -l 1252 bw __BW__ 0.8 1.2 -iperf3 BW ns 127.0.0.1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 3G -l 1472 +iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 4G -l 1472 bw __BW__ 1.0 1.5 -iperf3 BW ns 127.0.0.1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 3G -l 8972 +iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 8G -l 8972 bw __BW__ 3.0 4.0 -iperf3 BW ns 127.0.0.1 100${i}1 __THREADS__ __TIME__ __OPTS__ -b 3G -l 65492 +iperf3 BW ns 127.0.0.1 10001 __TIME__ __OPTS__ -b 15G -l 65492 bw __BW__ 3.0 4.0 iperf3k guest diff --git a/test/perf/pasta_tcp b/test/perf/pasta_tcp index 7777532..8d2f911 100644 --- a/test/perf/pasta_tcp +++ b/test/perf/pasta_tcp @@ -21,26 +21,25 @@ ns /sbin/sysctl -w net.ipv4.tcp_wmem="131072 524288 134217728" ns /sbin/sysctl -w net.ipv4.tcp_timestamps=0 -set THREADS 2 -set STREAMS 2 +set THREADS 4 set TIME 10 set OMIT 0.1 -set OPTS -Z -w 4M -l 1M -P __STREAMS__ -O__OMIT__ +set OPTS -Z -w 4M -l 1M -P __THREADS__ -O__OMIT__ hout FREQ_PROCFS (echo "scale=1"; sed -n 's/cpu MHz.*: $[0-9]*$\..*$/(\1+10^2\/2)\/10^3/p' /proc/cpuinfo) | bc -l | head -n1 hout FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq) ) | bc -l hout FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__ -info Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz, __STREAMS__ streams each +info Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz report pasta lo_tcp __THREADS__ __FREQ__ th MTU 65535B tr TCP throughput over IPv6: ns to host -iperf3s host 100${i}3 __THREADS__ +iperf3s host 10003 -iperf3 BW ns ::1 100${i}3 __THREADS__ __TIME__ __OPTS__ +iperf3 BW ns ::1 10003 __THREADS__ __TIME__ __OPTS__ bw __BW__ 15.0 20.0 iperf3k host @@ -59,9 +58,9 @@ lat __LAT__ 500 350 tr TCP throughput over IPv4: ns to host -iperf3s host 100${i}3 __THREADS__ +iperf3s host 10003 -iperf3 BW ns 127.0.0.1 100${i}3 __THREADS__ __TIME__ __OPTS__ +iperf3 BW ns 127.0.0.1 10003 __THREADS__ __TIME__ __OPTS__ bw __BW__ 15.0 20.0 iperf3k host @@ -79,9 +78,9 @@ hostw lat __LAT__ 500 350 tr TCP throughput over IPv6: host to ns -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 -iperf3 BW host ::1 100${i}2 __THREADS__ __TIME__ __OPTS__ +iperf3 BW host ::1 10002 __TIME__ __OPTS__ bw __BW__ 15.0 20.0 iperf3k ns @@ -100,9 +99,9 @@ lat __LAT__ 1000 700 tr TCP throughput over IPv4: host to ns -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 -iperf3 BW host 127.0.0.1 100${i}2 __THREADS__ __TIME__ __OPTS__ +iperf3 BW host 127.0.0.1 10002 __TIME__ __OPTS__ bw __BW__ 15.0 20.0 iperf3k ns @@ -126,29 +125,28 @@ test pasta: throughput and latency (connections via tap) nsout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' nsout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' -set THREADS 1 -set STREAMS 2 -set OPTS -Z -P __STREAMS__ -i1 -O__OMIT__ +set THREADS 2 +set OPTS -Z -P __THREADS__ -i1 -O__OMIT__ -info Throughput in Gbps, latency in µs, one thread at __FREQ__ GHz, __STREAMS__ streams +info Throughput in Gbps, latency in µs, __THREADS__ threads at __FREQ__ GHz report pasta tap_tcp __THREADS__ __FREQ__ th MTU 1500B 4000B 16384B 65520B tr TCP throughput over IPv6: ns to host -iperf3s host 100${i}3 __THREADS__ +iperf3s host 10003 ns ip link set dev __IFNAME__ mtu 1500 -iperf3 BW ns __GW6__%__IFNAME__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -w 512k +iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -w 512k bw __BW__ 0.2 0.4 ns ip link set dev __IFNAME__ mtu 4000 -iperf3 BW ns __GW6__%__IFNAME__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -w 1M +iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -w 1M bw __BW__ 0.3 0.5 ns ip link set dev __IFNAME__ mtu 16384 -iperf3 BW ns __GW6__%__IFNAME__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -w 8M +iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -w 8M bw __BW__ 1.5 2.0 ns ip link set dev __IFNAME__ mtu 65520 -iperf3 BW ns __GW6__%__IFNAME__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -w 8M +iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -w 8M bw __BW__ 2.0 2.5 iperf3k host @@ -173,19 +171,19 @@ lat __LAT__ 1500 500 tr TCP throughput over IPv4: ns to host -iperf3s host 100${i}3 __THREADS__ +iperf3s host 10003 ns ip link set dev __IFNAME__ mtu 1500 -iperf3 BW ns __GW__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -w 512k +iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -w 512k bw __BW__ 0.2 0.4 ns ip link set dev __IFNAME__ mtu 4000 -iperf3 BW ns __GW__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -w 1M +iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -w 1M bw __BW__ 0.3 0.5 ns ip link set dev __IFNAME__ mtu 16384 -iperf3 BW ns __GW__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -w 8M +iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -w 8M bw __BW__ 1.5 2.0 ns ip link set dev __IFNAME__ mtu 65520 -iperf3 BW ns __GW__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -w 8M +iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -w 8M bw __BW__ 2.0 2.5 iperf3k host @@ -209,14 +207,14 @@ hostw lat __LAT__ 1500 500 tr TCP throughput over IPv6: host to ns -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local' bw - bw - bw - -iperf3 BW host __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ +iperf3 BW host __ADDR6__ 10002 __TIME__ __OPTS__ bw __BW__ 8.0 10.0 iperf3k ns @@ -242,13 +240,13 @@ lat __LAT__ 5000 10000 tr TCP throughput over IPv4: host to ns -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 nsout ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[0].local' bw - bw - bw - -iperf3 BW host __ADDR__ 100${i}2 __THREADS__ __TIME__ __OPTS__ +iperf3 BW host __ADDR__ 10002 __TIME__ __OPTS__ bw __BW__ 8.0 10.0 iperf3k ns diff --git a/test/perf/pasta_udp b/test/perf/pasta_udp index 5e3db1e..6acbfd3 100644 --- a/test/perf/pasta_udp +++ b/test/perf/pasta_udp @@ -21,11 +21,10 @@ hout FREQ_CPUFREQ (echo "scale=1"; printf '( %i + 10^5 / 2 ) / 10^6\n' $(cat /sy hout FREQ [ -n "__FREQ_CPUFREQ__" ] && echo __FREQ_CPUFREQ__ || echo __FREQ_PROCFS__ set THREADS 1 -set STREAMS 4 set TIME 10 -set OPTS -u -P __STREAMS__ +set OPTS -u -P __THREADS__ -info Throughput in Gbps, latency in µs, one thread at __FREQ__ GHz, __STREAMS__ streams +info Throughput in Gbps, latency in µs, one thread at __FREQ__ GHz report pasta lo_udp 1 __FREQ__ @@ -33,16 +32,16 @@ th pktlen 1500B 4000B 16384B 65535B tr UDP throughput over IPv6: ns to host -iperf3s host 100${i}3 __THREADS__ +iperf3s host 10003 # (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header -iperf3 BW ns ::1 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 3G -l 1452 +iperf3 BW ns ::1 10003 __TIME__ __OPTS__ -b 5G -l 1452 bw __BW__ 1.0 1.5 -iperf3 BW ns ::1 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 3G -l 3972 +iperf3 BW ns ::1 10003 __TIME__ __OPTS__ -b 10G -l 3972 bw __BW__ 1.2 1.8 -iperf3 BW ns ::1 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 10G -l 16336 +iperf3 BW ns ::1 10003 __TIME__ __OPTS__ -b 30G -l 16336 bw __BW__ 5.0 6.0 -iperf3 BW ns ::1 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 15G -l 65487 +iperf3 BW ns ::1 10003 __TIME__ __OPTS__ -b 40G -l 65487 bw __BW__ 7.0 9.0 iperf3k host @@ -58,16 +57,16 @@ lat __LAT__ 200 150 tr UDP throughput over IPv4: ns to host -iperf3s host 100${i}3 __THREADS__ +iperf3s host 10003 # (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header -iperf3 BW ns 127.0.0.1 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 3G -l 1372 +iperf3 BW ns 127.0.0.1 10003 __TIME__ __OPTS__ -b 5G -l 1372 bw __BW__ 1.0 1.5 -iperf3 BW ns 127.0.0.1 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 3G -l 3972 +iperf3 BW ns 127.0.0.1 10003 __TIME__ __OPTS__ -b 10G -l 3972 bw __BW__ 1.2 1.8 -iperf3 BW ns 127.0.0.1 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 10G -l 16356 +iperf3 BW ns 127.0.0.1 10003 __TIME__ __OPTS__ -b 30G -l 16356 bw __BW__ 5.0 6.0 -iperf3 BW ns 127.0.0.1 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 15G -l 65507 +iperf3 BW ns 127.0.0.1 10003 __TIME__ __OPTS__ -b 40G -l 65507 bw __BW__ 7.0 9.0 iperf3k host @@ -83,15 +82,15 @@ lat __LAT__ 200 150 tr UDP throughput over IPv6: host to ns -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 -iperf3 BW host ::1 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 3G -l 1452 +iperf3 BW host ::1 10002 __TIME__ __OPTS__ -b 5G -l 1452 bw __BW__ 1.0 1.5 -iperf3 BW host ::1 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 3G -l 3972 +iperf3 BW host ::1 10002 __TIME__ __OPTS__ -b 10G -l 3972 bw __BW__ 1.2 1.8 -iperf3 BW host ::1 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 10G -l 16336 +iperf3 BW host ::1 10002 __TIME__ __OPTS__ -b 30G -l 16336 bw __BW__ 5.0 6.0 -iperf3 BW host ::1 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G -l 16336 +iperf3 BW host ::1 10002 __TIME__ __OPTS__ -b 40G -l 65487 bw __BW__ 7.0 9.0 iperf3k ns @@ -107,14 +106,14 @@ lat __LAT__ 200 150 tr UDP throughput over IPv4: host to ns -iperf3s ns 100${i}2 __THREADS__ -iperf3 BW host 127.0.0.1 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 3G -l 1372 +iperf3s ns 10002 +iperf3 BW host 127.0.0.1 10002 __TIME__ __OPTS__ -b 5G -l 1372 bw __BW__ 1.0 1.5 -iperf3 BW host 127.0.0.1 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 3G -l 3972 +iperf3 BW host 127.0.0.1 10002 __TIME__ __OPTS__ -b 10G -l 3972 bw __BW__ 1.2 1.8 -iperf3 BW host 127.0.0.1 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 10G -l 16356 +iperf3 BW host 127.0.0.1 10002 __TIME__ __OPTS__ -b 30G -l 16356 bw __BW__ 5.0 6.0 -iperf3 BW host 127.0.0.1 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G -l 65507 +iperf3 BW host 127.0.0.1 10002 __TIME__ __OPTS__ -b 40G -l 65507 bw __BW__ 7.0 9.0 iperf3k ns @@ -138,22 +137,22 @@ nsout GW ip -j -4 route show|jq -rM '.[] | select(.dst == "default").gateway' nsout GW6 ip -j -6 route show|jq -rM '.[] | select(.dst == "default").gateway' nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' -info Throughput in Gbps, latency in µs, one thread at __FREQ__ GHz, __STREAMS__ streams +info Throughput in Gbps, latency in µs, one thread at __FREQ__ GHz report pasta tap_udp 1 __FREQ__ th pktlen 1500B 4000B 16384B 65520B tr UDP throughput over IPv6: ns to host -iperf3s host 100${i}3 __THREADS__ +iperf3s host 10003 # (datagram size) = (packet size) - 48: 40 bytes of IPv6 header, 8 of UDP header -iperf3 BW ns __GW6__%__IFNAME__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 2G -l 1472 +iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -b 8G -l 1472 bw __BW__ 0.3 0.5 -iperf3 BW ns __GW6__%__IFNAME__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 3G -l 3972 +iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -b 12G -l 3972 bw __BW__ 0.5 0.8 -iperf3 BW ns __GW6__%__IFNAME__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 4G -l 16356 +iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -b 20G -l 16356 bw __BW__ 3.0 4.0 -iperf3 BW ns __GW6__%__IFNAME__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 6G -l 65472 +iperf3 BW ns __GW6__%__IFNAME__ 10003 __TIME__ __OPTS__ -b 30G -l 65472 bw __BW__ 6.0 7.0 iperf3k host @@ -169,16 +168,16 @@ lat __LAT__ 200 150 tr UDP throughput over IPv4: ns to host -iperf3s host 100${i}3 __THREADS__ +iperf3s host 10003 # (datagram size) = (packet size) - 28: 20 bytes of IPv4 header, 8 of UDP header -iperf3 BW ns __GW__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 2G -l 1472 +iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -b 8G -l 1472 bw __BW__ 0.3 0.5 -iperf3 BW ns __GW__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 3G -l 3972 +iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -b 12G -l 3972 bw __BW__ 0.5 0.8 -iperf3 BW ns __GW__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 4G -l 16356 +iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -b 20G -l 16356 bw __BW__ 3.0 4.0 -iperf3 BW ns __GW__ 100${i}3 __THREADS__ __TIME__ __OPTS__ -b 6G -l 65492 +iperf3 BW ns __GW__ 10003 __TIME__ __OPTS__ -b 30G -l 65492 bw __BW__ 6.0 7.0 iperf3k host @@ -193,17 +192,17 @@ hostw lat __LAT__ 200 150 tr UDP throughput over IPv6: host to ns -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 nsout IFNAME ip -j link show | jq -rM '.[] | select(.link_type == "ether").ifname' nsout ADDR6 ip -j -6 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[] | select(.scope == "global" and .prefixlen == 64).local' -iperf3 BW host __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 2G -l 1472 +iperf3 BW host __ADDR6__ 10002 __TIME__ __OPTS__ -b 8G -l 1472 bw __BW__ 0.3 0.5 -iperf3 BW host __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 3G -l 3972 +iperf3 BW host __ADDR6__ 10002 __TIME__ __OPTS__ -b 12G -l 3972 bw __BW__ 0.5 0.8 -iperf3 BW host __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 4G -l 16356 +iperf3 BW host __ADDR6__ 10002 __TIME__ __OPTS__ -b 20G -l 16356 bw __BW__ 3.0 4.0 -iperf3 BW host __ADDR6__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G -l 65472 +iperf3 BW host __ADDR6__ 10002 __TIME__ __OPTS__ -b 30G -l 65472 bw __BW__ 7.0 9.0 iperf3k ns @@ -219,16 +218,16 @@ lat __LAT__ 200 150 tr UDP throughput over IPv4: host to ns -iperf3s ns 100${i}2 __THREADS__ +iperf3s ns 10002 nsout ADDR ip -j -4 addr show|jq -rM '.[] | select(.ifname == "__IFNAME__").addr_info[0].local' -iperf3 BW host __ADDR__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 2G -l 1472 +iperf3 BW host __ADDR__ 10002 __TIME__ __OPTS__ -b 8G -l 1472 bw __BW__ 0.3 0.5 -iperf3 BW host __ADDR__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 3G -l 3972 +iperf3 BW host __ADDR__ 10002 __TIME__ __OPTS__ -b 12G -l 3972 bw __BW__ 0.5 0.8 -iperf3 BW host __ADDR__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 4G -l 16356 +iperf3 BW host __ADDR__ 10002 __TIME__ __OPTS__ -b 20G -l 16356 bw __BW__ 3.0 4.0 -iperf3 BW host __ADDR__ 100${i}2 __THREADS__ __TIME__ __OPTS__ -b 15G -l 65492 +iperf3 BW host __ADDR__ 10002 __TIME__ __OPTS__ -b 30G -l 65492 bw __BW__ 7.0 9.0 iperf3k ns -- 2.43.0

David Gibson

25 Jul 25 Jul

6:23 a.m.

New subject: [PATCH 08/11] test: iperf3 3.16 introduces multiple threads, drop our own implementation of that

On Wed, Jul 24, 2024 at 11:50:14PM +0200, Stefano Brivio wrote:

...

Starting from iperf3 version 3.16, -P / --parallel spawns multiple clients as separate threads, instead of multiple streams serviced by the same thread.

So we can drop our lib/test implementation to spawn several iperf3 client and server processes and finally simplify things quite a bit.

Adjust number of threads and UDP sending bandwidth to values that seem to be more or less matching previous throughput tests on my setup.

Signed-off-by: Stefano Brivio

Lovely. Reviewed-by: David Gibson Tested-by: David Gibson -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 09/11] tap: Exit if we fail to bind a UNIX domain socket with explicit path

In tap_sock_unix_open(), if we have a given path for the socket from configuration, we don't need to loop over possible paths, so we exit the loop on the first iteration, unconditionally. But if we failed to bind() the socket to that explicit path, we should exit, instead of continuing. Otherwise we'll pretend we're up and running, but nobody can contact us, and this might be mildly confusing for users. Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474 Signed-off-by: Stefano Brivio --- tap.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/tap.c b/tap.c index 6930ad8..44bd444 100644 --- a/tap.c +++ b/tap.c @@ -1139,8 +1139,11 @@ int tap_sock_unix_open(char *sock_path) close(ex); unlink(path); - if (!bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) || - *sock_path) + ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr)); + if (*sock_path && ret) + die_perror("Failed to bind UNIX domain socket"); + + if (!ret) break; } -- 2.43.0

David Gibson

25 Jul 25 Jul

6 a.m.

New subject: [PATCH 09/11] tap: Exit if we fail to bind a UNIX domain socket with explicit path

On Wed, Jul 24, 2024 at 11:50:15PM +0200, Stefano Brivio wrote:

...

In tap_sock_unix_open(), if we have a given path for the socket from configuration, we don't need to loop over possible paths, so we exit the loop on the first iteration, unconditionally.

But if we failed to bind() the socket to that explicit path, we should exit, instead of continuing. Otherwise we'll pretend we're up and running, but nobody can contact us, and this might be mildly confusing for users.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=2299474 Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

...

--- tap.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/tap.c b/tap.c index 6930ad8..44bd444 100644 --- a/tap.c +++ b/tap.c @@ -1139,8 +1139,11 @@ int tap_sock_unix_open(char *sock_path) close(ex);

unlink(path); - if (!bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) || - *sock_path) + ret = bind(fd, (const struct sockaddr *)&addr, sizeof(addr)); + if (*sock_path && ret) + die_perror("Failed to bind UNIX domain socket"); + + if (!ret) break; }

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 10/11] tap: Discard guest data on length descriptor mismatch

This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues. Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past. While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide. Reported-by: Matej Hrica Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: } while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p); p += sizeof(uint32_t); n -= sizeof(uint32_t); + if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return; + /* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0); if ((n += rem) != l2len) return; @@ -1028,8 +1031,7 @@ redo: /* Complete the partial read above before discarding a malformed * frame, otherwise the stream will be inconsistent. */ - if (l2len < (ssize_t)sizeof(struct ethhdr) || - l2len > (ssize_t)ETH_MAX_MTU) + if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) goto next; tap_add_packet(c, l2len, p); -- 2.43.0

David Gibson

25 Jul 25 Jul

6:37 a.m.

New subject: [PATCH 10/11] tap: Discard guest data on length descriptor mismatch

On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:

...

This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.

Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.

While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.

Reported-by: Matej Hrica Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }

while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);

p += sizeof(uint32_t); n -= sizeof(uint32_t);

+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;

Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space. But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful. Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries. If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice. If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.

...

/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0); if ((n += rem) != l2len) return;

Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().

...

@@ -1028,8 +1031,7 @@ redo: /* Complete the partial read above before discarding a malformed * frame, otherwise the stream will be inconsistent. */ - if (l2len < (ssize_t)sizeof(struct ethhdr) || - l2len > (ssize_t)ETH_MAX_MTU) + if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) goto next; tap_add_packet(c, l2len, p);

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

11:15 a.m.

New subject: [PATCH 10/11] tap: Discard guest data on length descriptor mismatch

On Thu, 25 Jul 2024 14:37:43 +1000 David Gibson wrote:

...

On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:

...
This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.

Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.

While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.

Reported-by: Matej Hrica Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }

while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);

p += sizeof(uint32_t); n -= sizeof(uint32_t);

+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;

Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space.

Not really, we're just checking that the length descriptor fits the remaining buffer space. We're using this in the second recv() below, that's why it matters here.

...

But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful.

...that is, it has nothing to do with exceeding or not exceeding the buffer on recv(), that's already taken care of by the recv() call, implicitly.

...

Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries.

That might happen, of course, but it might also happen that the hypervisor sent us *one* corrupted buffer, and the next recv() will read consistent data.

...

If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice.

That's already checked below (l2len > ETH_MAX_MTU), and...

...

If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.

that's exactly what we do in that case (goto next).

...

...
/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0);

^^^^^^^^^^^^^^^^

This the reason why the check above is relevant.

...

...
if ((n += rem) != l2len) return;

Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().

Same as above, it depends on what failure you're assuming. If it's just one botched recv(), instead, we recv() again the next time and we recover. But yes, the first attempt should probably be to recv() the rest of the frame. I didn't implement this because it adds complexity and I think that, eventually, we should turn this into a proper ringbuffer anyway.

...

...
@@ -1028,8 +1031,7 @@ redo: /* Complete the partial read above before discarding a malformed * frame, otherwise the stream will be inconsistent. */ - if (l2len < (ssize_t)sizeof(struct ethhdr) || - l2len > (ssize_t)ETH_MAX_MTU) + if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) goto next; tap_add_packet(c, l2len, p);

-- Stefano

David Gibson

12:23 p.m.

New subject: [PATCH 10/11] tap: Discard guest data on length descriptor mismatch

On Thu, Jul 25, 2024 at 11:15:03AM +0200, Stefano Brivio wrote:

...

On Thu, 25 Jul 2024 14:37:43 +1000 David Gibson wrote:

...
On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:

...
This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.

Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.

While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.

Reported-by: Matej Hrica Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }

while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);

p += sizeof(uint32_t); n -= sizeof(uint32_t);

+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;

Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space.

Not really, we're just checking that the length descriptor fits the remaining buffer space. We're using this in the second recv() below, that's why it matters here.

But AFAICT, what we need to know is if the remainder of the frame fits in the buffer. That could be less than the length descriptor if we've already recv()ed part of a frame.

...

...
But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful.

...that is, it has nothing to do with exceeding or not exceeding the buffer on recv(), that's already taken care of by the recv() call, implicitly.

...
Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries.

That might happen, of course, but it might also happen that the hypervisor sent us *one* corrupted buffer, and the next recv() will read consistent data.

Well, sure, it's possible, but it doesn't seem particularly likely to me. AFAICT this is a stream which we need every length field to interpret properly. If we lose one, or it's corrupted, I think we're done for.

...

...
If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice.

That's already checked below (l2len > ETH_MAX_MTU), and...

Right. I wonder if it would make sense to do that earlier.

...

...
If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.

that's exactly what we do in that case (goto next).

Only for the case that the length is too long, but not *too* long. In particular it needs to fit in the buffer to even get there. If we sanity checked the frame length earlier we could use MSG_TRUNC to discard even a ludicrously large frame and still continue on to the next one.

...

...
...
/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0);

^^^^^^^^^^^^^^^^

This the reason why the check above is relevant.

Relevant, sure, but I still don't think it's right. Actually (TAP_BUF_BYTES - n) is an even stranger quantity than I initially thought. It's the total space of the buffer minus the current partial frame - counting *both* the stuff before our partial frame and after it. I think instead we need to check for (p + l2len > pkt_buf + TAP_BUF_BYTES).

...

...
...
if ((n += rem) != l2len) return;

Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().

Same as above, it depends on what failure you're assuming. If it's just one botched recv(), instead, we recv() again the next time and we recover.

Even if it's just one bad recv(), we still have no idea where we are w.r.t. frame boundaries, so I can't see any way we could recover.

...

But yes, the first attempt should probably be to recv() the rest of the frame. I didn't implement this because it adds complexity and I think that, eventually, we should turn this into a proper ringbuffer anyway.

...
...
@@ -1028,8 +1031,7 @@ redo: /* Complete the partial read above before discarding a malformed * frame, otherwise the stream will be inconsistent. */ - if (l2len < (ssize_t)sizeof(struct ethhdr) || - l2len > (ssize_t)ETH_MAX_MTU) + if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) goto next; tap_add_packet(c, l2len, p);

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

1:09 p.m.

New subject: [PATCH 10/11] tap: Discard guest data on length descriptor mismatch

On Thu, 25 Jul 2024 20:23:55 +1000 David Gibson wrote:

...

On Thu, Jul 25, 2024 at 11:15:03AM +0200, Stefano Brivio wrote:

...
On Thu, 25 Jul 2024 14:37:43 +1000 David Gibson wrote:

...
On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:

...
This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.

Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.

While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.

Reported-by: Matej Hrica Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }

while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);

p += sizeof(uint32_t); n -= sizeof(uint32_t);

+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;

Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space.

Not really, we're just checking that the length descriptor fits the remaining buffer space. We're using this in the second recv() below, that's why it matters here.

But AFAICT, what we need to know is if the remainder of the frame fits in the buffer.

It always does because of how TAP_BUF_BYTES and TAP_BUF_FILL are defined.

...

That could be less than the length descriptor if we've already recv()ed part of a frame.

...
...
But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful.

...that is, it has nothing to do with exceeding or not exceeding the buffer on recv(), that's already taken care of by the recv() call, implicitly.

...
Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries.

That might happen, of course, but it might also happen that the hypervisor sent us *one* corrupted buffer, and the next recv() will read consistent data.

Well, sure, it's possible, but it doesn't seem particularly likely to me. AFAICT this is a stream which we need every length field to interpret properly. If we lose one, or it's corrupted, I think we're done for.

In most cases, the content of one recv() corresponds to a given number of whole frames, so what we're doing by returning here is, practically speaking, similar to what you're suggesting below with MSG_TRUNC.

...

...
...
If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice.

That's already checked below (l2len > ETH_MAX_MTU), and...

Right. I wonder if it would make sense to do that earlier.

We can. But right now what I want to fix here is just that missing check, because it actually causes passt to crash (even though we assume a trusted hypervisor, so it's not really security relevant, but not nice either).

...

...
...
If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.

that's exactly what we do in that case (goto next).

Only for the case that the length is too long, but not *too* long. In particular it needs to fit in the buffer to even get there. If we sanity checked the frame length earlier we could use MSG_TRUNC to discard even a ludicrously large frame and still continue on to the next one.

We don't know if the length descriptor actually matches the length of the frame, though. If you have a way you think is more robust, would you mind sending a patch?

...

...
...
...
/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0);

^^^^^^^^^^^^^^^^

This the reason why the check above is relevant.

Relevant, sure, but I still don't think it's right. Actually (TAP_BUF_BYTES - n) is an even stranger quantity than I initially thought. It's the total space of the buffer minus the current partial frame - counting *both* the stuff before our partial frame and after it.

That should be enough to make sure we don't have bad side effects on this second recv(), but yes:

...

I think instead we need to check for (p + l2len > pkt_buf + TAP_BUF_BYTES).

...that is actually more accurate. I can fix this up in another version, unless you can think of a more comprehensive change that also gives us better possibilities to recover from a corrupted stream.

...

...
...
...
if ((n += rem) != l2len) return;

Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().

Same as above, it depends on what failure you're assuming. If it's just one botched recv(), instead, we recv() again the next time and we recover.

Even if it's just one bad recv(), we still have no idea where we are w.r.t. frame boundaries, so I can't see any way we could recover.

The idea is that the recv() is _usually_ big enough as to flush the buffer anyway.

...

...
But yes, the first attempt should probably be to recv() the rest of the frame. I didn't implement this because it adds complexity and I think that, eventually, we should turn this into a proper ringbuffer anyway.

...
...
@@ -1028,8 +1031,7 @@ redo: /* Complete the partial read above before discarding a malformed * frame, otherwise the stream will be inconsistent. */ - if (l2len < (ssize_t)sizeof(struct ethhdr) || - l2len > (ssize_t)ETH_MAX_MTU) + if (l2len < sizeof(struct ethhdr) || l2len > ETH_MAX_MTU) goto next; tap_add_packet(c, l2len, p);

-- Stefano

David Gibson

26 Jul 26 Jul

3:22 a.m.

New subject: [PATCH 10/11] tap: Discard guest data on length descriptor mismatch

On Thu, Jul 25, 2024 at 01:09:19PM +0200, Stefano Brivio wrote:

...

On Thu, 25 Jul 2024 20:23:55 +1000 David Gibson wrote:

...
On Thu, Jul 25, 2024 at 11:15:03AM +0200, Stefano Brivio wrote:

...
On Thu, 25 Jul 2024 14:37:43 +1000 David Gibson wrote:

...
On Wed, Jul 24, 2024 at 11:50:16PM +0200, Stefano Brivio wrote:

...
This was reported by Matej a while ago, but we forgot to fix it. Even if the hypervisor is necessarily trusted by passt, as it can in any case terminate the guest or disrupt guest connectivity, it's a good idea to be robust against possible issues.

Instead of resetting the connection to the hypervisor, just discard the data we read with a single recv(), as we had a few cases where QEMU would get the length descriptor wrong, in the past.

While at it, change l2len in tap_handler_passt() to uint32_t, as the length descriptor is logically unsigned and 32-bit wide.

Reported-by: Matej Hrica Suggested-by: Matej Hrica Signed-off-by: Stefano Brivio --- tap.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tap.c b/tap.c index 44bd444..62ba6a4 100644 --- a/tap.c +++ b/tap.c @@ -1011,15 +1011,18 @@ redo: }

while (n > (ssize_t)sizeof(uint32_t)) { - ssize_t l2len = ntohl(*(uint32_t *)p); + uint32_t l2len = ntohl(*(uint32_t *)p);

p += sizeof(uint32_t); n -= sizeof(uint32_t);

+ if (l2len > (ssize_t)TAP_BUF_BYTES - n) + return;

Neither the condition nor the action makes much sense to me here. We're testing if the frame can fit in the the remaining buffer space.

Not really, we're just checking that the length descriptor fits the remaining buffer space. We're using this in the second recv() below, that's why it matters here.

But AFAICT, what we need to know is if the remainder of the frame fits in the buffer.

It always does because of how TAP_BUF_BYTES and TAP_BUF_FILL are defined.

Assuming the frame is <= ETH_MAX_MTU bytes, which we haven't yet verified. But that's not really the point I was making. The key word above is *remainder*: we may have already read part of the frame, so comparing the *total* frame length against the remaining buffer space is incorrect.

...

...
That could be less than the length descriptor if we've already recv()ed part of a frame.

...
...
But we may have already read part (or all) of the frame - i.e. it's included in 'n'. So I don't see how that condition is useful.

...that is, it has nothing to do with exceeding or not exceeding the buffer on recv(), that's already taken care of by the recv() call, implicitly.

...
Then, simply returning doesn't seem right under pretty much any circumstances - that discards some amount of data, and leaves us in an unsynchronized state w.r.t. the frame boundaries.

That might happen, of course, but it might also happen that the hypervisor sent us *one* corrupted buffer, and the next recv() will read consistent data.

Well, sure, it's possible, but it doesn't seem particularly likely to me. AFAICT this is a stream which we need every length field to interpret properly. If we lose one, or it's corrupted, I think we're done for.

In most cases, the content of one recv() corresponds to a given number of whole frames, so what we're doing by returning here is, practically speaking, similar to what you're suggesting below with MSG_TRUNC.

Or, I guess more technically, the recv() boundaries will probably correspond to qemu's send() boundaries(), which are likely to correspond to the frame boundaries. But, I think I now see the root of our disagreement here. If the frame field lengths don't seem to make sense relative to the recv() boundaries, do we trust the former (my position) or the latter (your position, I think). Thinking about it like that, I don't think either position makes a lot of sense. If these mismatch something has already gone horribly wrong and we should probably just die(), or at least reset the tap socket.

...

...
...
...
If this is just supposed to be a sanity check on the frame length, then I think we'd be better off with a fixed limit - 64kiB is the obvious choice.

That's already checked below (l2len > ETH_MAX_MTU), and...

Right. I wonder if it would make sense to do that earlier.

We can. But right now what I want to fix here is just that missing check, because it actually causes passt to crash (even though we assume a trusted hypervisor, so it's not really security relevant, but not nice either).

Ok, fair enough.

...

...
...
...
If we hit that, we can warn() and discard data up to the end of the too-large frame. That at least has a chance of letting us recover and move on to future acceptable frames.

that's exactly what we do in that case (goto next).

Only for the case that the length is too long, but not *too* long. In particular it needs to fit in the buffer to even get there. If we sanity checked the frame length earlier we could use MSG_TRUNC to discard even a ludicrously large frame and still continue on to the next one.

We don't know if the length descriptor actually matches the length of the frame, though. If you have a way you think is more robust, would you mind sending a patch?

It's more that if the length descriptor doesn't match the actual length of the frame, I think we're beyond saving. The recv() boundaries probably correspond to to frame boundaries, but that's not guaranteed by SOCK_STREAM semantics, so I don't think we should ever trust them _over_ the length fields we receive.

...

...
...
...
...
/* At most one packet might not fit in a single read, and this * needs to be blocking. */ - if (l2len > n) { + if (l2len > (size_t)n) { rem = recv(c->fd_tap, p + n, l2len - n, 0);

^^^^^^^^^^^^^^^^

This the reason why the check above is relevant.

Relevant, sure, but I still don't think it's right. Actually (TAP_BUF_BYTES - n) is an even stranger quantity than I initially thought. It's the total space of the buffer minus the current partial frame - counting *both* the stuff before our partial frame and after it.

That should be enough to make sure we don't have bad side effects on this second recv(), but yes:

No, I don't think it is. Suppose we get exactly TAP_BUF_FILL bytes in the first recv(). (TAP_BUF_FILL-4) bytes of that is perfectly ordinary, valid frames, which we process. We've read the remaining 4 bytes as a field length, so: p == pkt_buf + TAP_BUF_FILL n == 0 Now suppose that last frame length field is bad, say l2len == 2*ETH_MAX_MTU The test you've suggested: if (l2len > (ssize_t)TAP_BUF_BYTES - n) will let us continue, but the recv() rem = recv(c->fd_tap, p + n, l2len - n, 0); will overrun pkt_buf + TAP_BUF_BYTES. That _probably_ won't happen, since the recv() is likely to end at a frame boundary instead, but again SOCK_STREAM semantics don't guarantee that. And even if the kernel does always preserve send()/recv() boundaries, that qemu might not transmit the frame length as a separate send() (maybe on a debugging / slow path).

...

...
I think instead we need to check for (p + l2len > pkt_buf + TAP_BUF_BYTES).

...that is actually more accurate. I can fix this up in another version, unless you can think of a more comprehensive change that also gives us better possibilities to recover from a corrupted stream.

So, as noted above, I don't think we should even attempt to recover from a corrupted stream. One could certainly design a stream protocol that allows for recovery from corrupted portions, but the qemu -net stream protocol isn't so designed.

...

...
...
...
...
if ((n += rem) != l2len) return;

Pre-existing, but a 'return' here basically lands us in a situation we have no meaningful chance of recovering from. A die() would be preferable. Better yet would be continuing to re-recv() until we have the whole frame, similar to what we do for write_remainder().

Same as above, it depends on what failure you're assuming. If it's just one botched recv(), instead, we recv() again the next time and we recover.

Even if it's just one bad recv(), we still have no idea where we are w.r.t. frame boundaries, so I can't see any way we could recover.

The idea is that the recv() is _usually_ big enough as to flush the buffer anyway.

Hrm, not something I'd really like to rely on. So, I just realised there's another pre-existing problem here. AFAICT there's no guarantee that frame lengths are aligned - which means after an odd sized frame, we'd be making an unaligned u32 read for the frame length, which will crash on some platforms. I'll have a look at all the above today and see what I can come up with. -- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

24 Jul 24 Jul

11:50 p.m.

New subject: [PATCH 11/11] conf: Accept addresses enclosed by square brackets in port forwarding specifiers

Even though we don't use : as delimiter for the port, making square brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6 literals. We want IPv6 addresses there, but some users might still specify them out of habit. Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for IPv4 literals, but I had reports of users actually trying to use them (they're accepted by many tools). Allow square brackets for both IPv4 and IPv6 addresses, correct or not, they're harmless anyway. Signed-off-by: Stefano Brivio --- conf.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/conf.c b/conf.c index 3cf9ed8..338742e 100644 --- a/conf.c +++ b/conf.c @@ -209,14 +209,24 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg, } - if (ifname == buf + 1) /* Interface without address */ + if (ifname == buf + 1) { /* Interface without address */ addr = NULL; - else if (inet_pton(AF_INET, buf, addr)) - af = AF_INET; - else if (inet_pton(AF_INET6, buf, addr)) - af = AF_INET6; - else - goto bad; + } else { + p = buf; + + /* Allow square brackets for IPv4 too for convenience */ + if (*p == '[' && p[strlen(p) - 1] == ']') { + p[strlen(p) - 1] = 0; + p++; + } + + if (inet_pton(AF_INET, p, addr)) + af = AF_INET; + else if (inet_pton(AF_INET6, p, addr)) + af = AF_INET6; + else + goto bad; + } } else { spec = buf; -- 2.43.0

David Gibson

25 Jul 25 Jul

6:39 a.m.

New subject: [PATCH 11/11] conf: Accept addresses enclosed by square brackets in port forwarding specifiers

On Wed, Jul 24, 2024 at 11:50:17PM +0200, Stefano Brivio wrote:

...

Even though we don't use : as delimiter for the port, making square brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6 literals. We want IPv6 addresses there, but some users might still specify them out of habit.

Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for IPv4 literals, but I had reports of users actually trying to use them (they're accepted by many tools).

Allow square brackets for both IPv4 and IPv6 addresses, correct or not, they're harmless anyway.

Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

...

--- conf.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/conf.c b/conf.c index 3cf9ed8..338742e 100644 --- a/conf.c +++ b/conf.c @@ -209,14 +209,24 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,

}

- if (ifname == buf + 1) /* Interface without address */ + if (ifname == buf + 1) { /* Interface without address */ addr = NULL; - else if (inet_pton(AF_INET, buf, addr)) - af = AF_INET; - else if (inet_pton(AF_INET6, buf, addr)) - af = AF_INET6; - else - goto bad; + } else { + p = buf; + + /* Allow square brackets for IPv4 too for convenience */ + if (*p == '[' && p[strlen(p) - 1] == ']') { + p[strlen(p) - 1] = 0;

Nit: I think = '\0' would make the intention here clearer.

...

+ p++; + } + + if (inet_pton(AF_INET, p, addr)) + af = AF_INET; + else if (inet_pton(AF_INET6, p, addr)) + af = AF_INET6; + else + goto bad; + } } else { spec = buf;

-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson

Stefano Brivio

1:26 p.m.

New subject: [PATCH 11/11] conf: Accept addresses enclosed by square brackets in port forwarding specifiers

On Thu, 25 Jul 2024 14:39:49 +1000 David Gibson wrote:

...

On Wed, Jul 24, 2024 at 11:50:17PM +0200, Stefano Brivio wrote:

...
Even though we don't use : as delimiter for the port, making square brackets unneeded, RFC 3986, section 3.2.2, mandates them for IPv6 literals. We want IPv6 addresses there, but some users might still specify them out of habit.

Same for IPv4 addresses: RFC 3986 doesn't specify square brackets for IPv4 literals, but I had reports of users actually trying to use them (they're accepted by many tools).

Allow square brackets for both IPv4 and IPv6 addresses, correct or not, they're harmless anyway.

Signed-off-by: Stefano Brivio

Reviewed-by: David Gibson

...
--- conf.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/conf.c b/conf.c index 3cf9ed8..338742e 100644 --- a/conf.c +++ b/conf.c @@ -209,14 +209,24 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,

}

- if (ifname == buf + 1) /* Interface without address */ + if (ifname == buf + 1) { /* Interface without address */ addr = NULL; - else if (inet_pton(AF_INET, buf, addr)) - af = AF_INET; - else if (inet_pton(AF_INET6, buf, addr)) - af = AF_INET6; - else - goto bad; + } else { + p = buf; + + /* Allow square brackets for IPv4 too for convenience */ + if (*p == '[' && p[strlen(p) - 1] == ']') { + p[strlen(p) - 1] = 0;

Nit: I think = '\0' would make the intention here clearer.

Right, fixed on merge. -- Stefano

Stefano Brivio

1:28 p.m.

On Wed, 24 Jul 2024 23:50:06 +0200 Stefano Brivio wrote:

...

Here are a bunch of mostly unrelated assorted fixes for small issues I found in the past few weeks but didn't really find time to report.

Stefano Brivio (11): tap: Don't quit if pasta gets EIO on writev() to tap, interface might be down tcp: Change SO_PEEK_OFF support message to debug() log: Drop newlines in the middle of the perror()-like messages log: Fix sub-second part in relative log time calculation log: Initialise timestamp for relative log time also if we use a log file test: Fix memory/passt tests, --netns-only is not a valid option for passt test: Update names of symbols and slabinfo entries test: iperf3 3.16 introduces multiple threads, drop our own implementation of that tap: Exit if we fail to bind a UNIX domain socket with explicit path tap: Discard guest data on length descriptor mismatch conf: Accept addresses enclosed by square brackets in port forwarding specifiers

I applied these except for 4/11, 5/11, and 10/11. I'll send a new version at least for 4/11 and 5/11 in a bit. -- Stefano

333

Age (days ago)

335

Last active (days ago)

List overview

Download

32 comments

2 participants

participants (2)

David Gibson
Stefano Brivio

[PATCH 00/11] Minor assorted fixes, mostly logging and tests

tags

participants (2)