[PATCH v2] passt, util: Close any open file that the parent might have leaked
If a parent accidentally or due to implementation reasons leaks any
open file, we don't want to have access to them, except for the file
passed via --fd, if any.
This is the case for Podman when Podman's parent leaks files into
Podman: it's not practical for Podman to close unrelated files before
starting pasta, as reported by Paul.
Use close_range(2) to close all open files except for standard streams
and the one from --fd.
Given that parts of conf() depend on other files to be already opened,
such as the epoll file descriptor, we can't easily defer this to a
more convenient point, where --fd was already parsed. Introduce a
minimal, duplicate version of --fd parsing to keep this simple.
Suggested-by: Paul Holzinger
On Wed, Aug 07, 2024 at 10:27:45AM +0200, Stefano Brivio wrote:
If a parent accidentally or due to implementation reasons leaks any open file, we don't want to have access to them, except for the file passed via --fd, if any.
This is the case for Podman when Podman's parent leaks files into Podman: it's not practical for Podman to close unrelated files before starting pasta, as reported by Paul.
Use close_range(2) to close all open files except for standard streams and the one from --fd.
Given that parts of conf() depend on other files to be already opened, such as the epoll file descriptor, we can't easily defer this to a more convenient point, where --fd was already parsed. Introduce a minimal, duplicate version of --fd parsing to keep this simple.
Suggested-by: Paul Holzinger
Signed-off-by: Stefano Brivio
Reviewed-by: David Gibson
--- v2: Move call to close_open_files() to isolate_initial()
conf.c | 1 + isolation.c | 12 +++++++++--- isolation.h | 2 +- passt.c | 2 +- util.c | 36 ++++++++++++++++++++++++++++++++++++ util.h | 1 + 6 files changed, 49 insertions(+), 5 deletions(-)
diff --git a/conf.c b/conf.c index 14d8ece..89f5b3d 100644 --- a/conf.c +++ b/conf.c @@ -1260,6 +1260,7 @@ void conf(struct ctx *c, int argc, char **argv) c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET; c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET;
+ optind = 1; do { name = getopt_long(argc, argv, optstring, options, NULL);
diff --git a/isolation.c b/isolation.c index 4956d7e..45fba1e 100644 --- a/isolation.c +++ b/isolation.c @@ -29,7 +29,8 @@ * * Executed immediately after startup, drops capabilities we don't * need at any point during execution (or which we gain back when we - * need by joining other namespaces). + * need by joining other namespaces), and closes any leaked file we + * might have inherited from the parent process. * * 2. isolate_user() * ================= @@ -166,14 +167,17 @@ static void clamp_caps(void) }
/** - * isolate_initial() - Early, config independent self isolation + * isolate_initial() - Early, mostly config independent self isolation + * @argc: Argument count + * @argv: Command line options: only --fd (if present) is relevant here * * Should: * - drop unneeded capabilities + * - close all open files except for standard streams and the one from --fd * Musn't: * - remove filesytem access (we need to access files during setup) */ -void isolate_initial(void) +void isolate_initial(int argc, char **argv) { uint64_t keep;
@@ -207,6 +211,8 @@ void isolate_initial(void) keep |= BIT(CAP_SETFCAP) | BIT(CAP_SYS_PTRACE);
drop_caps_ep_except(keep); + + close_open_files(argc, argv); }
/** diff --git a/isolation.h b/isolation.h index 846b2af..80bb68d 100644 --- a/isolation.h +++ b/isolation.h @@ -7,7 +7,7 @@ #ifndef ISOLATION_H #define ISOLATION_H
-void isolate_initial(void); +void isolate_initial(int argc, char **argv); void isolate_user(uid_t uid, gid_t gid, bool use_userns, const char *userns, enum passt_modes mode); int isolate_prefork(const struct ctx *c); diff --git a/passt.c b/passt.c index ea5bece..4b3c306 100644 --- a/passt.c +++ b/passt.c @@ -211,7 +211,7 @@ int main(int argc, char **argv)
arch_avx2_exec(argv);
- isolate_initial(); + isolate_initial(argc, argv);
c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
diff --git a/util.c b/util.c index 07fb21c..bd65b5a 100644 --- a/util.c +++ b/util.c @@ -26,6 +26,7 @@ #include
#include #include +#include #include "util.h" #include "iov.h" @@ -694,3 +695,38 @@ const char *str_ee_origin(const struct sock_extended_err *ee)
return "<invalid>"; } + +/** + * close_open_files() - Close leaked files, but not --fd, stdin, stdout, stderr + * @argc: Argument count + * @argv: Command line options, as we need to skip any file given via --fd + */ +void close_open_files(int argc, char **argv) +{ + const struct option optfd[] = { { "fd", required_argument, NULL, 'F' }, + { 0 }, + }; + long fd = -1; + int name; + + do { + name = getopt_long(argc, argv, ":F", optfd, NULL); + + if (name == 'F') { + errno = 0; + fd = strtol(optarg, NULL, 0); + + if (fd < 0 || errno) + die("Invalid --fd: %s", optarg); + } + } while (name != -1); + + if (fd == -1) { + if (close_range(3, ~0U, CLOSE_RANGE_UNSHARE)) + die_perror("Failed to close files leaked by parent"); + } else { + if (close_range(3, fd - 1, CLOSE_RANGE_UNSHARE) || + close_range(fd + 1, ~0U, CLOSE_RANGE_UNSHARE)) + die_perror("Failed to close files leaked by parent"); + } +} diff --git a/util.h b/util.h index 83d2b53..cb4d181 100644 --- a/util.h +++ b/util.h @@ -183,6 +183,7 @@ int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); +void close_open_files(int argc, char **argv);
/** * af_name() - Return name of an address family
-- David Gibson (he or they) | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you, not the other way | around. http://www.ozlabs.org/~dgibson
On 07/08/2024 10:27, Stefano Brivio wrote:
If a parent accidentally or due to implementation reasons leaks any open file, we don't want to have access to them, except for the file passed via --fd, if any.
This is the case for Podman when Podman's parent leaks files into Podman: it's not practical for Podman to close unrelated files before starting pasta, as reported by Paul.
Use close_range(2) to close all open files except for standard streams and the one from --fd.
Given that parts of conf() depend on other files to be already opened, such as the epoll file descriptor, we can't easily defer this to a more convenient point, where --fd was already parsed. Introduce a minimal, duplicate version of --fd parsing to keep this simple.
Suggested-by: Paul Holzinger
Signed-off-by: Stefano Brivio --- v2: Move call to close_open_files() to isolate_initial() conf.c | 1 + isolation.c | 12 +++++++++--- isolation.h | 2 +- passt.c | 2 +- util.c | 36 ++++++++++++++++++++++++++++++++++++ util.h | 1 + 6 files changed, 49 insertions(+), 5 deletions(-)
diff --git a/conf.c b/conf.c index 14d8ece..89f5b3d 100644 --- a/conf.c +++ b/conf.c @@ -1260,6 +1260,7 @@ void conf(struct ctx *c, int argc, char **argv) c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET; c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET;
+ optind = 1; do { name = getopt_long(argc, argv, optstring, options, NULL);
diff --git a/isolation.c b/isolation.c index 4956d7e..45fba1e 100644 --- a/isolation.c +++ b/isolation.c @@ -29,7 +29,8 @@ * * Executed immediately after startup, drops capabilities we don't * need at any point during execution (or which we gain back when we - * need by joining other namespaces). + * need by joining other namespaces), and closes any leaked file we + * might have inherited from the parent process. * * 2. isolate_user() * ================= @@ -166,14 +167,17 @@ static void clamp_caps(void) }
/** - * isolate_initial() - Early, config independent self isolation + * isolate_initial() - Early, mostly config independent self isolation + * @argc: Argument count + * @argv: Command line options: only --fd (if present) is relevant here * * Should: * - drop unneeded capabilities + * - close all open files except for standard streams and the one from --fd * Musn't: * - remove filesytem access (we need to access files during setup) */ -void isolate_initial(void) +void isolate_initial(int argc, char **argv) { uint64_t keep;
@@ -207,6 +211,8 @@ void isolate_initial(void) keep |= BIT(CAP_SETFCAP) | BIT(CAP_SYS_PTRACE);
drop_caps_ep_except(keep); + + close_open_files(argc, argv); }
/** diff --git a/isolation.h b/isolation.h index 846b2af..80bb68d 100644 --- a/isolation.h +++ b/isolation.h @@ -7,7 +7,7 @@ #ifndef ISOLATION_H #define ISOLATION_H
-void isolate_initial(void); +void isolate_initial(int argc, char **argv); void isolate_user(uid_t uid, gid_t gid, bool use_userns, const char *userns, enum passt_modes mode); int isolate_prefork(const struct ctx *c); diff --git a/passt.c b/passt.c index ea5bece..4b3c306 100644 --- a/passt.c +++ b/passt.c @@ -211,7 +211,7 @@ int main(int argc, char **argv)
arch_avx2_exec(argv);
- isolate_initial(); + isolate_initial(argc, argv);
c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
diff --git a/util.c b/util.c index 07fb21c..bd65b5a 100644 --- a/util.c +++ b/util.c @@ -26,6 +26,7 @@ #include
#include #include +#include #include "util.h" #include "iov.h" @@ -694,3 +695,38 @@ const char *str_ee_origin(const struct sock_extended_err *ee)
return "<invalid>"; } + +/** + * close_open_files() - Close leaked files, but not --fd, stdin, stdout, stderr + * @argc: Argument count + * @argv: Command line options, as we need to skip any file given via --fd + */ +void close_open_files(int argc, char **argv) +{ + const struct option optfd[] = { { "fd", required_argument, NULL, 'F' }, + { 0 }, + }; + long fd = -1; + int name; + + do { + name = getopt_long(argc, argv, ":F", optfd, NULL); + + if (name == 'F') { + errno = 0; + fd = strtol(optarg, NULL, 0); + + if (fd < 0 || errno) + die("Invalid --fd: %s", optarg); + } + } while (name != -1); + + if (fd == -1) { + if (close_range(3, ~0U, CLOSE_RANGE_UNSHARE)) + die_perror("Failed to close files leaked by parent"); + } else { + if (close_range(3, fd - 1, CLOSE_RANGE_UNSHARE) || + close_range(fd + 1, ~0U, CLOSE_RANGE_UNSHARE)) + die_perror("Failed to close files leaked by parent");
This will fail for fd == 3 as the first range is 3 - 2 in that case which causes EINVAL which could be a common choice for the extra passed fd. On the other end of the range it is the same issue, you seem to parse the fd as long so allows values > 4294967295 which then overflows when passed to close_range as uint which causes issues and even allows us to close stderr streams. Less likely that users would pass such a number but no reason to allow it in the first place. $ strace -e close_range ./pasta --config-net --fd 4294967296 echo test close_range(3, 4294967295, CLOSE_RANGE_UNSHARE) = 0 close_range(1, 4294967295, CLOSE_RANGE_UNSHARE) = 0
+ } +} diff --git a/util.h b/util.h index 83d2b53..cb4d181 100644 --- a/util.h +++ b/util.h @@ -183,6 +183,7 @@ int __daemon(int pidfile_fd, int devnull_fd); int fls(unsigned long x); int write_file(const char *path, const char *buf); int write_remainder(int fd, const struct iovec *iov, size_t iovcnt, size_t skip); +void close_open_files(int argc, char **argv);
/** * af_name() - Return name of an address family
-- Paul Holzinger
On Wed, 7 Aug 2024 11:34:34 +0200
Paul Holzinger
On 07/08/2024 10:27, Stefano Brivio wrote:
If a parent accidentally or due to implementation reasons leaks any open file, we don't want to have access to them, except for the file passed via --fd, if any.
This is the case for Podman when Podman's parent leaks files into Podman: it's not practical for Podman to close unrelated files before starting pasta, as reported by Paul.
Use close_range(2) to close all open files except for standard streams and the one from --fd.
Given that parts of conf() depend on other files to be already opened, such as the epoll file descriptor, we can't easily defer this to a more convenient point, where --fd was already parsed. Introduce a minimal, duplicate version of --fd parsing to keep this simple.
Suggested-by: Paul Holzinger
Signed-off-by: Stefano Brivio --- v2: Move call to close_open_files() to isolate_initial() conf.c | 1 + isolation.c | 12 +++++++++--- isolation.h | 2 +- passt.c | 2 +- util.c | 36 ++++++++++++++++++++++++++++++++++++ util.h | 1 + 6 files changed, 49 insertions(+), 5 deletions(-)
diff --git a/conf.c b/conf.c index 14d8ece..89f5b3d 100644 --- a/conf.c +++ b/conf.c @@ -1260,6 +1260,7 @@ void conf(struct ctx *c, int argc, char **argv) c->tcp.fwd_in.mode = c->tcp.fwd_out.mode = FWD_UNSET; c->udp.fwd_in.mode = c->udp.fwd_out.mode = FWD_UNSET;
+ optind = 1; do { name = getopt_long(argc, argv, optstring, options, NULL);
diff --git a/isolation.c b/isolation.c index 4956d7e..45fba1e 100644 --- a/isolation.c +++ b/isolation.c @@ -29,7 +29,8 @@ * * Executed immediately after startup, drops capabilities we don't * need at any point during execution (or which we gain back when we - * need by joining other namespaces). + * need by joining other namespaces), and closes any leaked file we + * might have inherited from the parent process. * * 2. isolate_user() * ================= @@ -166,14 +167,17 @@ static void clamp_caps(void) }
/** - * isolate_initial() - Early, config independent self isolation + * isolate_initial() - Early, mostly config independent self isolation + * @argc: Argument count + * @argv: Command line options: only --fd (if present) is relevant here * * Should: * - drop unneeded capabilities + * - close all open files except for standard streams and the one from --fd * Musn't: * - remove filesytem access (we need to access files during setup) */ -void isolate_initial(void) +void isolate_initial(int argc, char **argv) { uint64_t keep;
@@ -207,6 +211,8 @@ void isolate_initial(void) keep |= BIT(CAP_SETFCAP) | BIT(CAP_SYS_PTRACE);
drop_caps_ep_except(keep); + + close_open_files(argc, argv); }
/** diff --git a/isolation.h b/isolation.h index 846b2af..80bb68d 100644 --- a/isolation.h +++ b/isolation.h @@ -7,7 +7,7 @@ #ifndef ISOLATION_H #define ISOLATION_H
-void isolate_initial(void); +void isolate_initial(int argc, char **argv); void isolate_user(uid_t uid, gid_t gid, bool use_userns, const char *userns, enum passt_modes mode); int isolate_prefork(const struct ctx *c); diff --git a/passt.c b/passt.c index ea5bece..4b3c306 100644 --- a/passt.c +++ b/passt.c @@ -211,7 +211,7 @@ int main(int argc, char **argv)
arch_avx2_exec(argv);
- isolate_initial(); + isolate_initial(argc, argv);
c.pasta_netns_fd = c.fd_tap = c.pidfile_fd = -1;
diff --git a/util.c b/util.c index 07fb21c..bd65b5a 100644 --- a/util.c +++ b/util.c @@ -26,6 +26,7 @@ #include
#include #include +#include #include "util.h" #include "iov.h" @@ -694,3 +695,38 @@ const char *str_ee_origin(const struct sock_extended_err *ee)
return "<invalid>"; } + +/** + * close_open_files() - Close leaked files, but not --fd, stdin, stdout, stderr + * @argc: Argument count + * @argv: Command line options, as we need to skip any file given via --fd + */ +void close_open_files(int argc, char **argv) +{ + const struct option optfd[] = { { "fd", required_argument, NULL, 'F' }, + { 0 }, + }; + long fd = -1; + int name; + + do { + name = getopt_long(argc, argv, ":F", optfd, NULL); + + if (name == 'F') { + errno = 0; + fd = strtol(optarg, NULL, 0); + + if (fd < 0 || errno) + die("Invalid --fd: %s", optarg); + } + } while (name != -1); + + if (fd == -1) { + if (close_range(3, ~0U, CLOSE_RANGE_UNSHARE)) + die_perror("Failed to close files leaked by parent"); + } else { + if (close_range(3, fd - 1, CLOSE_RANGE_UNSHARE) || + close_range(fd + 1, ~0U, CLOSE_RANGE_UNSHARE)) + die_perror("Failed to close files leaked by parent");
This will fail for fd == 3 as the first range is 3 - 2 in that case which causes EINVAL which could be a common choice for the extra passed fd.
Whoops.
On the other end of the range it is the same issue, you seem to parse the fd as long so allows values > 4294967295 which then overflows when passed to close_range as uint which causes issues and even allows us to close stderr streams. Less likely that users would pass such a number but no reason to allow it in the first place.
Of course. Nice catch as well. Thanks.
$ strace -e close_range ./pasta --config-net --fd 4294967296 echo test close_range(3, 4294967295, CLOSE_RANGE_UNSHARE) = 0 close_range(1, 4294967295, CLOSE_RANGE_UNSHARE) = 0
Hey, that's a cool feature! So much easier to type than >/dev/null. Sending v3. -- Stefano
participants (3)
-
David Gibson
-
Paul Holzinger
-
Stefano Brivio