We watch network namespace entries to detect when we should quit
(unless --no-netns-quit is passed), and these might stored in a tmpfs
typically mounted at /run/user/UID or /var/run/user/UID, or found in
procfs at /proc/PID/ns/.
Currently, we try to use inotify for any possible location of those
entries, but inotify, of course, doesn't work on pseudo-filesystems
(see inotify(7)).
The man page reflects this: the description of --no-netns-quit
implies that we won't quit anyway if the namespace is not "bound to
the filesystem".
Well, we won't quit, but, since commit 9e0dbc894813 ("More
deterministic detection of whether argument is a PID, PATH or NAME"),
we try. And, indeed, this is harmless, as the caveat from that
commit message states.
Now, it turns out that Buildah, a tool to create container images,
sharing its codebase with Podman, passes a procfs entry to pasta, and
expects pasta to exit once the network namespace is not needed
anymore, that is, once the original Buildah process terminates.
Get this to work by using the timer fallback mechanism if the
namespace name is passed as a path belonging to a pseudo-filesystem.
This is expected to be procfs, but I covered sysfs and devpts
pseudo-filesystems as well, because nothing actually prevents
creating this kind of directory structure and links there.
Note that statfs(), according to some versions of man pages, was
apparently "deprecated" by the LSB. My reasoning for using it is
essentially this:
https://lore.kernel.org/linux-man/f54kudgblgk643u32tb6at4cd3kkzha6hslahv24s…
...that is, there was no such thing as an LSB deprecation, and
anyway there's no other way to get the filesystem type.
Also note that, while it might sound more robust to detect the
filesystem type using fstatfs() on the file descriptor
(c->pasta_netns_fd), the reported filesystem type for it is nsfs, no
matter what path was given to pasta. If we use the parent directory,
we'll typically have either tmpfs or procfs reported.
The timer, however, still uses the file descriptor of the parent
directory later, as it has no access to the filesystem, and that
avoids possible races if the PID is recycled: if the original process
terminates, the handle we have on /proc/PID/ns still refers to it,
not to any other process with the same PID.
We could have used pidfd_open() to get a handle on the parent
process. But it's not guaranteed that the parent process is actually
the one associated to the network namespace we operate on, and if we
get a PID file descriptor for a PID (parent or not) or path that was
given on our command line, this inherently causes a race condition as
that PID might have been recycled by the time we call pidfd_open().
Even assuming the process we want to watch is the parent process, and
a race-free usage of pidfd_open() would have been possible, I'm not
very enthusiastic about enabling yet another system call in the
seccomp profile just for this, while openat() is needed anyway.
Update the man page to reflect that, even if the target network
namespace is passed as a procfs path or a PID, we'll now quit when
the procfs entry is gone.
Reported-by: Paul Holzinger <pholzing(a)redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1948200214
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
v2: Coverity Scan isn't happy if we "check" (kind of) c->netns_dir
with statfs() before opening it in a non-atomic way. Just to make
things clear, false positive or not: open it, check it, close it
if it wasn't needed: we don't rely on the check.
passt.1 | 8 ++++++--
pasta.c | 24 +++++++++++++++++++-----
2 files changed, 25 insertions(+), 7 deletions(-)
diff --git a/passt.1 b/passt.1
index dc2d719..de6e3bf 100644
--- a/passt.1
+++ b/passt.1
@@ -550,8 +550,12 @@ without \-\-userns.
.TP
.BR \-\-no-netns-quit
-If the target network namespace is bound to the filesystem (that is, if PATH or
-NAME are given as target), do not exit once the network namespace is deleted.
+If the target network namespace is bound to the filesystem, do not exit once
+that path is deleted.
+
+If the target network namespace is represented by a procfs entry, do not exit
+once that entry is removed from procfs (representing the fact that a process
+with the given PID terminated).
.TP
.BR \-\-config-net
diff --git a/pasta.c b/pasta.c
index 01d1511..465fe1a 100644
--- a/pasta.c
+++ b/pasta.c
@@ -33,6 +33,7 @@
#include <sys/timerfd.h>
#include <sys/types.h>
#include <sys/stat.h>
+#include <sys/statfs.h>
#include <fcntl.h>
#include <sys/wait.h>
#include <signal.h>
@@ -41,6 +42,7 @@
#include <netinet/in.h>
#include <net/ethernet.h>
#include <sys/syscall.h>
+#include <linux/magic.h>
#include "util.h"
#include "passt.h"
@@ -390,12 +392,21 @@ void pasta_netns_quit_init(const struct ctx *c)
union epoll_ref ref = { .type = EPOLL_TYPE_NSQUIT_INOTIFY };
struct epoll_event ev = { .events = EPOLLIN };
int flags = O_NONBLOCK | O_CLOEXEC;
- int fd;
+ struct statfs s = { 0 };
+ bool try_inotify = true;
+ int fd = -1, dir_fd;
if (c->mode != MODE_PASTA || c->no_netns_quit || !*c->netns_base)
return;
- if ((fd = inotify_init1(flags)) < 0)
+ if ((dir_fd = open(c->netns_dir, O_CLOEXEC | O_RDONLY)) < 0)
+ die("netns dir open: %s, exiting", strerror(errno));
+
+ if (statfs(c->netns_dir, &s) || s.f_type == DEVPTS_SUPER_MAGIC ||
+ s.f_type == PROC_SUPER_MAGIC || s.f_type == SYSFS_MAGIC)
+ try_inotify = false;
+
+ if (try_inotify && (fd = inotify_init1(flags)) < 0)
warn("inotify_init1(): %s, use a timer", strerror(errno));
if (fd >= 0 && inotify_add_watch(fd, c->netns_dir, IN_DELETE) < 0) {
@@ -409,11 +420,11 @@ void pasta_netns_quit_init(const struct ctx *c)
if ((fd = pasta_netns_quit_timer()) < 0)
die("Failed to set up fallback netns timer, exiting");
- ref.nsdir_fd = open(c->netns_dir, O_CLOEXEC | O_RDONLY);
- if (ref.nsdir_fd < 0)
- die("netns dir open: %s, exiting", strerror(errno));
+ ref.nsdir_fd = dir_fd;
ref.type = EPOLL_TYPE_NSQUIT_TIMER;
+ } else {
+ close(dir_fd);
}
if (fd > FD_REF_MAX)
@@ -463,6 +474,9 @@ void pasta_netns_quit_timer_handler(struct ctx *c, union epoll_ref ref)
fd = openat(ref.nsdir_fd, c->netns_base, O_PATH | O_CLOEXEC);
if (fd < 0) {
+ if (errno == EACCES) /* Expected for existing procfs entry */
+ return;
+
info("Namespace %s is gone, exiting", c->netns_base);
exit(EXIT_SUCCESS);
}
--
2.39.2
...Podman users might get confused by the fact that if we can't
find a default route for a given IP version, we'll report that as a
warning message and possibly just before actual error messages.
However, a lack of routable interface for IPv4 or IPv6 can be a
normal circumstance: don't warn about it, just state that as
informational message, if those are displayed (they're not in
non-error paths in Podman, for example).
While at it, make it clear that we're disabling IPv4 or IPv6 if
there's no routable interface for the corresponding IP version.
Reported-by: Paul Holzinger <pholzing(a)redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
v2: Report that we're disabling IPv4 or IPv6 in the message
conf.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/conf.c b/conf.c
index 5e15b66..3646700 100644
--- a/conf.c
+++ b/conf.c
@@ -579,7 +579,7 @@ static unsigned int conf_ip4(unsigned int ifi,
ifi = nl_get_ext_if(nl_sock, AF_INET);
if (!ifi) {
- warn("No external routable interface for IPv4");
+ info("No routable interface for IPv4: IPv4 is disabled");
return 0;
}
@@ -651,7 +651,7 @@ static unsigned int conf_ip6(unsigned int ifi,
ifi = nl_get_ext_if(nl_sock, AF_INET6);
if (!ifi) {
- warn("No external routable interface for IPv6");
+ info("No routable interface for IPv6: IPv6 is disabled");
return 0;
}
--
2.39.2
The new version with tag 2024_02_16.08344da includes the following changes:
08344da selinux: Allow pasta to remount procfs
338b632 conf: No routable interface for IPv4 or IPv6 is informational, not a warning
8f3f8e1 pasta: Add fallback timer mechanism to check if namespace is gone
f57a2fb conf, passt.1: Exit if we can't bind a forwarded port, except for -[tu] all
927cb84 udp: udp_sock_init_ns() partially duplicats udp_port_rebind_outbound()
96ad5c5 udp: Don't prematurely (and incorrectly) set up automatic inbound forwards
9f57983 netlink: Use const rtnh pointer
7ee4e17 log: setlogmask(0) can actually result in a system call, don't use it
78901ee tcp: Fix subtle bug in fast re-transmit path
6c7623d netlink: Add support to fetch default gateway from multipath routes
322660b icmp: Dedicated functions for starting and closing ping sequences
b6a4e20 icmp: Validate packets received on ping sockets
6e86511 icmp: Warn on receive errors from ping sockets
a325121 icmp: Consolidate icmp_sock_handler() with icmpv6_sock_handler()
70d43f9 icmp: Share more between IPv4 and IPv6 paths in icmp_tap_handler()
15be1bf icmp: Simplify socket expiry scanning
24badd0 icmp: Use -1 to represent "missing" sockets
43713af icmp: Don't attempt to match host IDs to guest IDs
8534cdb icmp: Don't attempt to handle "wrong direction" ping socket traffic
2cb2fe6 icmp: Remove redundant initialisation of sendto() address
5dffb99 icmp: Don't set "port" on destination sockaddr for ping sockets
8981a72 flow: Avoid moving flow entries to compact table
9c0881d flow: Enforce that freeing of closed flows must happen in deferred handlers
4a849e9 flow: Abstract allocation of new flows with helper function
fb7c001 flow: Move flow_count from context structure to a global
7f37bf4 flow: Move flow_log_() to near top of flow.c
02e092b tcp, tcp_splice: Avoid double layered dispatch for connected TCP sockets
70121ca epoll: Better handling of number of epoll types
36dfa8b flow, tcp: Add handling for per-flow timers
b43e448 flow, tcp: Add flow-centric dispatch for deferred flow handling
c97bb52 tcp, tcp_splice: Move per-type cleanup logic into per-type helpers
eebca11 tcp, tcp_splice: Remove redundant handling from tcp_timer()
8563e7c treewide: Standardise on 'now' for current timestamp variables
17bbab1 flow: Make flow_table.h #include the protocol specific headers it needs
00c6eb6 pif: Remove unused pif_name() function
a179ca6 treewide: Make a bunch of pointer variables pointers to const
f60c851 test: Fix passt.mbuto for cases where /usr/sbin doesn't exist
https://passt.top/passt/log/?qt=range&q=2023_12_30.f091893..2024_02_16.0834…
Packages:
- Arch Linux:
https://www.archlinux.org/packages/extra/x86_64/passt/https://archlinuxarm.org/packages/aarch64/passthttps://archlinuxarm.org/packages/armv7h/passt
- Debian tracker:
https://tracker.debian.org/pkg/passt
- Copr (CentOS Stream, EPEL, Fedora, Mageia):
https://copr.fedorainfracloud.org/coprs/sbrivio/passt/build/7022413/
permanent mirror: https://passt.top/builds/copr/0^20240216.g08344da/
- Fedora updates:
https://bodhi.fedoraproject.org/updates/?packages=passt
- Gentoo versions:
https://packages.gentoo.org/packages/net-misc/passt
- Ubuntu tracker:
https://launchpad.net/ubuntu/+source/passt
- Void Linux:
https://voidlinux.org/packages/?q=passt
- Static builds:
- Package for other RPM-based distributions, x86_64 only:
https://passt.top/builds/latest/x86_64/passt-g08344da-1.x86_64.rpm
- x86_64 static binaries:
https://passt.top/builds/latest/x86_64/
- Debian package, from x86_64 static build:
https://passt.top/builds/latest/x86_64/passt_08344da-1_all.deb
--
Stefano
...Podman users might get confused by the fact that if we can't
find a default route for a given IP version, we'll report that as a
warning message and possibly just before actual error messages.
However, a lack of routable interface for IPv4 or IPv6 can be a
normal circumstance: don't warn about it, just state that as
informational message, if those are displayed (they're not in
non-error paths in Podman, for example).
Reported-by: Paul Holzinger <pholzing(a)redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
conf.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/conf.c b/conf.c
index 5e15b66..2341007 100644
--- a/conf.c
+++ b/conf.c
@@ -579,7 +579,7 @@ static unsigned int conf_ip4(unsigned int ifi,
ifi = nl_get_ext_if(nl_sock, AF_INET);
if (!ifi) {
- warn("No external routable interface for IPv4");
+ info("No external routable interface for IPv4");
return 0;
}
@@ -651,7 +651,7 @@ static unsigned int conf_ip6(unsigned int ifi,
ifi = nl_get_ext_if(nl_sock, AF_INET6);
if (!ifi) {
- warn("No external routable interface for IPv6");
+ info("No external routable interface for IPv6");
return 0;
}
--
2.39.2
...or similar, that is, if only excluded ranges are given (implying
we'll forward any other available port). In that case, we'll usually
forward large sets of ports, and it might be inconvenient for the
user to skip excluding single ports that are already taken.
The existing behaviour, that is, exiting only if we fail to bind all
the ports for one given forwarding option, turns out to be
problematic for several aspects raised by Paul:
- Podman merges ranges anyway, so we might fail to bind all the ports
from a specific range given by the user, but we'll not fail anyway
because Podman merges it with another one where we succeed to bind
at least one port. At the same time, there should be no semantic
difference between multiple ranges given by a single option and
multiple ranges given as multiple options: it's unexpected and
not documented
- the user might actually rely on a given port to be forwarded to a
given container or a virtual machine, and if connections are
forwarded to an unrelated process, this might raise security
concerns
- given that we can try and fail to bind multiple ports before
exiting (in case we can't bind any), we don't have a specific error
code we can return to the user, so we don't give the user helpful
indication as to why we couldn't bind ports.
Exit as soon as we fail to create or bind a socket for a given
forwarded port, and report the actual error.
Keep the current behaviour, however, in case the user wants to
forward all the (available) ports for a given protocol, or all the
ports with excluded ranges only. There, it's more reasonable that
the user is expecting partial failures, and it's probably convenient
that we continue with the ports we could forward.
Update the manual page to reflect the new behaviour, and the old
behaviour too in the cases where we keep it.
Suggested-by: Paul Holzinger <pholzing(a)redhat.com>
Link: https://github.com/containers/podman/pull/21563#issuecomment-1937024642
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
conf.c | 36 +++++++++++-------------------------
passt.1 | 15 ++++++++++++---
2 files changed, 23 insertions(+), 28 deletions(-)
diff --git a/conf.c b/conf.c
index 5e15b66..f68f5d3 100644
--- a/conf.c
+++ b/conf.c
@@ -119,6 +119,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
bool exclude_only = true, bound_one = false;
uint8_t exclude[PORT_BITMAP_SIZE] = { 0 };
sa_family_t af = AF_UNSPEC;
+ unsigned i;
int ret;
if (!strcmp(optarg, "none")) {
@@ -141,8 +142,6 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
}
if (!strcmp(optarg, "all")) {
- unsigned i;
-
if (fwd->mode)
goto mode_conflict;
@@ -171,7 +170,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
}
if (!bound_one)
- goto bind_fail;
+ goto bind_all_fail;
return;
}
@@ -221,7 +220,6 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
p = spec;
do {
struct port_range xrange;
- unsigned i;
if (*p != '~') {
/* Not an exclude range, parse later */
@@ -244,8 +242,6 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
} while ((p = next_chunk(p, ',')));
if (exclude_only) {
- unsigned i;
-
for (i = 0; i < PORT_EPHEMERAL_MIN; i++) {
if (bitmap_isset(exclude, i))
continue;
@@ -271,7 +267,7 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
}
if (!bound_one)
- goto bind_fail;
+ goto bind_all_fail;
return;
}
@@ -280,7 +276,6 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
p = spec;
do {
struct port_range orig_range, mapped_range;
- unsigned i;
if (*p == '~')
/* Exclude range, already parsed */
@@ -314,28 +309,16 @@ static void conf_ports(const struct ctx *c, char optname, const char *optarg,
fwd->delta[i] = mapped_range.first - orig_range.first;
- if (optname == 't') {
+ ret = 0;
+ if (optname == 't')
ret = tcp_sock_init(c, af, addr, ifname, i);
- if (ret == -ENFILE || ret == -EMFILE)
- goto enfile;
- if (!ret)
- bound_one = true;
- } else if (optname == 'u') {
+ else if (optname == 'u')
ret = udp_sock_init(c, 0, af, addr, ifname, i);
- if (ret == -ENFILE || ret == -EMFILE)
- goto enfile;
- if (!ret)
- bound_one = true;
- } else {
- /* No way to check in advance for -T and -U */
- bound_one = true;
- }
+ if (ret)
+ goto bind_fail;
}
} while ((p = next_chunk(p, ',')));
- if (!bound_one)
- goto bind_fail;
-
return;
enfile:
die("Can't open enough sockets for port specifier: %s", optarg);
@@ -344,6 +327,9 @@ bad:
mode_conflict:
die("Port forwarding mode '%s' conflicts with previous mode", optarg);
bind_fail:
+ die("Failed to bind port %u (%s) for option '-%c %s', exiting",
+ i, strerror(-ret), optname, optarg);
+bind_all_fail:
die("Failed to bind any port for '-%c %s', exiting", optname, optarg);
}
diff --git a/passt.1 b/passt.1
index cc678ed..dc2d719 100644
--- a/passt.1
+++ b/passt.1
@@ -355,7 +355,8 @@ Don't forward any ports
.TP
.BR all
Forward all unbound, non-ephemeral ports, as permitted by current capabilities.
-For low (< 1024) ports, see \fBNOTES\fR.
+For low (< 1024) ports, see \fBNOTES\fR. No failures are reported for
+unavailable ports, unless no ports could be forwarded at all.
.TP
.BR ports
@@ -364,7 +365,11 @@ optionally, with target ports after \fI:\fR, if they differ. Specific addresses
can be bound as well, separated by \fI/\fR, and also, since Linux 5.7, limited
to specific interfaces, prefixed by \fI%\fR. Within given ranges, selected ports
and ranges can be excluded by an additional specification prefixed by \fI~\fR.
-Specifying excluded ranges only implies that all other ports are forwarded.
+
+Specifying excluded ranges only implies that all other ports are forwarded. In
+this case, no failures are reported for unavailable ports, unless no ports could
+be forwarded at all.
+
Examples:
.RS
.TP
@@ -447,7 +452,11 @@ optionally, with target ports after \fI:\fR, if they differ. Specific addresses
can be bound as well, separated by \fI/\fR, and also, since Linux 5.7, limited
to specific interfaces, prefixed by \fI%\fR. Within given ranges, selected ports
and ranges can be excluded by an additional specification prefixed by \fI~\fR.
-Specifying excluded ranges only implies that all other ports are forwarded.
+
+Specifying excluded ranges only implies that all other ports are forwarded. In
+this case, no failures are reported for unavailable ports, unless no ports could
+be forwarded at all.
+
Examples:
.RS
.TP
--
2.39.2