l3_len was calculated from the ethernet frame size, and it
was assumed to be equal to the length stored in an IP packet.
But if the ethernet frame is padded, then l3_len calculated
that way can only be used as a bound check to validate the
length stored in an IP header. It should not be used for
calculating the l4_len.
This patch makes sure the small padded ethernet frames are
properly processed, by trusting the length stored in an IP
header.
Signed-off-by: Stas Sergeev <stsp2(a)yandex.ru>
CC: Stefano Brivio <sbrivio(a)redhat.com>
---
tap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tap.c b/tap.c
index ee79be0..8d7859c 100644
--- a/tap.c
+++ b/tap.c
@@ -615,7 +615,7 @@ resume:
continue;
hlen = iph->ihl * 4UL;
- if (hlen < sizeof(*iph) || htons(iph->tot_len) != l3_len ||
+ if (hlen < sizeof(*iph) || htons(iph->tot_len) > l3_len ||
hlen > l3_len)
continue;
@@ -623,7 +623,7 @@ resume:
if (tap4_is_fragment(iph, now))
continue;
- l4_len = l3_len - hlen;
+ l4_len = htons(iph->tot_len) - hlen;
if (iph->saddr && c->ip4.addr_seen.s_addr != iph->saddr)
c->ip4.addr_seen.s_addr = iph->saddr;
--
2.40.1
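
As an illustration of the check described in this patch, here is a minimal standalone sketch, not the actual tap.c code. The helper name ip4_l4_len() is hypothetical. It treats the frame-derived l3_len only as an upper bound and takes the payload length from the IP header. It also rejects tot_len < hlen, which is stricter than the hunk above and is an assumption of this sketch.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/ip.h>
#include <stddef.h>

/* Hypothetical helper: return the L4 payload length, or -1 if the IPv4
 * header is inconsistent. l3_len is what remains of the Ethernet frame
 * after its own header; it may include padding, so it only bounds the
 * length stored in the IP header. */
static long ip4_l4_len(const struct iphdr *iph, size_t l3_len)
{
	size_t hlen, tot_len;

	assert(iph);

	if (l3_len < sizeof(*iph))
		return -1;

	hlen = iph->ihl * 4UL;
	tot_len = ntohs(iph->tot_len);

	/* Extra check (assumption of this sketch): tot_len must cover hlen */
	if (hlen < sizeof(*iph) || hlen > tot_len || tot_len > l3_len)
		return -1;

	/* Trust the IP header, not the (possibly padded) frame size */
	return (long)(tot_len - hlen);
}
```

For a 28-byte datagram carried in a frame padded to 46 bytes of payload, this yields an L4 length of 8 rather than 26.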
This is a second draft of the first steps in implementing more general
"connection" tracking, as described at:
https://pad.passt.top/p/NewForwardingModel
This series changes the TCP connection table into a more general flow
table that can track other protocols as well (although none are
implemented yet). Each flow uniformly keeps track of all the relevant
addresses and ports, which will allow for more robust control of NAT
and port forwarding.
Caveats:
* We significantly increase the size of a connection/flow entry
- Can probably be mitigated, but I haven't investigated much yet
* We perform a number of extra getsockname() calls to know some of
the socket endpoints
- Haven't yet measured how much performance impact that has
- Can be mitigated in at least some cases, but again, haven't
tried yet
* Only TCP converted so far
Changes since v1:
* Terminology changes
- "Endpoint" address/port instead of "correspondent" address/port
- "flowside" instead of "demiflow"
* Actually move the connection table to a new flow table structure in
new files
* Significant rearrangement of earlier patches on top of that new
table, to reduce churn
David Gibson (10):
flow, tcp: Generalise connection types
flow, tcp: Move TCP connection table to unified flow table
flow, tcp: Consolidate flow pointer<->index helpers
flow: Make unified version of flow table compaction
flow: Introduce struct flowside, space for uniform tracking of
addresses
tcp: Move guest side address tracking to flow/flowside
tcp, flow: Perform TCP hash calculations based on flowside
tcp: Re-use flowside_hash for initial sequence number generation
tcp: Maintain host flowside for connections
tcp_splice: Fill out flowside information for spliced connections
Makefile | 14 +-
flow.c | 111 ++++++++++++++++
flow.h | 115 +++++++++++++++++
flow_table.h | 45 +++++++
passt.h | 3 +
siphash.c | 1 +
tcp.c | 355 ++++++++++++++++++++++++---------------------------
tcp.h | 5 -
tcp_conn.h | 54 ++------
tcp_splice.c | 78 ++++++-----
tcp_splice.h | 3 +-
11 files changed, 505 insertions(+), 279 deletions(-)
create mode 100644 flow.c
create mode 100644 flow.h
create mode 100644 flow_table.h
--
2.41.0
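
To make the idea of uniform address tracking more concrete, here is a rough sketch of what a per-side record could look like. Every name and field below (flowside_sketch, flow_sketch, eaddr, laddr and so on) is invented for illustration and is not taken from the series. The only assumption is that each side of a flow stores an endpoint address/port plus a local address/port.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: one side of a flow */
struct flowside_sketch {
	struct in6_addr eaddr;	/* endpoint address (the remote peer) */
	struct in6_addr laddr;	/* local address on our side (assumption) */
	in_port_t eport;	/* endpoint port */
	in_port_t lport;	/* local port (assumption) */
};

/* Hypothetical flow entry: a protocol tag and two sides */
struct flow_sketch {
	uint8_t proto;
	struct flowside_sketch side[2];
};

/* Compare field by field, so struct padding never matters */
static bool flowside_eq(const struct flowside_sketch *a,
			const struct flowside_sketch *b)
{
	assert(a && b);

	return !memcmp(&a->eaddr, &b->eaddr, sizeof(a->eaddr)) &&
	       !memcmp(&a->laddr, &b->laddr, sizeof(a->laddr)) &&
	       a->eport == b->eport && a->lport == b->lport;
}
```

Keeping both sides in the same shape is what would let lookups and hashing be written once per flowside rather than once per protocol.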
The hard link trick didn't actually fix the issue with SELinux file
contexts properly: as opposed to symbolic links, SELinux now
correctly associates types to the labels that are set -- except that
those labels are now shared, so we can end up (depending on how
rpm(8) extracts the archives) with /usr/bin/passt having a
pasta_exec_t context.
This got rather confusing as running restorecon(8) seemed to fix up
labels -- but that's simply toggling between passt_exec_t and
pasta_exec_t for both links, because each invocation will just "fix"
the file with the mismatching context.
Replace the hard links with copies. AppArmor's attachment, instead,
works with hard links, and if there's no LSM, we can keep symbolic
links, so keep symbolic links in the Makefile.
With copies, rpmbuild(8) will warn about duplicate Build-IDs in the
same package. Mangle them in pasta binaries by adding one to the
last byte, modulo one byte, using xxd (provided by vim-common) and
disable the automatic rehashing by find-debuginfo(1) -- we already
have per-release Build-IDs thanks to $VERSION passed on 'make'.
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
contrib/fedora/passt.spec | 27 ++++++++++++++++++++++-----
1 file changed, 22 insertions(+), 5 deletions(-)
diff --git a/contrib/fedora/passt.spec b/contrib/fedora/passt.spec
index d0c6895..51bf5a8 100644
--- a/contrib/fedora/passt.spec
+++ b/contrib/fedora/passt.spec
@@ -9,6 +9,10 @@
%global git_hash {{{ git_head }}}
%global selinuxtype targeted
+# Different Build-IDs for passt and pasta: don't let find-debuginfo touch them
+%undefine _unique_build_ids
+%global _no_recompute_build_ids 1
+
Name: passt
Version: {{{ git_version }}}
@@ -19,7 +23,7 @@ Group: System Environment/Daemons
URL: https://passt.top/
Source: https://passt.top/passt/snapshot/passt-%{git_hash}.tar.xz
-BuildRequires: gcc, make, checkpolicy, selinux-policy-devel
+BuildRequires: gcc, make, checkpolicy, selinux-policy-devel, binutils, vim-common
Requires: (%{name}-selinux = %{version}-%{release} if selinux-policy-%{selinuxtype})
%description
@@ -56,15 +60,28 @@ This package adds SELinux enforcement to passt(1) and pasta(1).
%install
%make_install DESTDIR=%{buildroot} prefix=%{_prefix} bindir=%{_bindir} mandir=%{_mandir} docdir=%{_docdir}/%{name}
-# The Makefile creates symbolic links for pasta, but we need hard links for
+# The Makefile creates symbolic links for pasta, but we need actual copies for
# SELinux file contexts to work as intended. Same with pasta.avx2 if present.
-ln -f %{buildroot}%{_bindir}/passt %{buildroot}%{_bindir}/pasta
+#
+# To avoid duplicate Build-IDs in the same package, we increase the last byte of
+# the value for pasta binaries by one (modulo one byte). Note that we already
+# have differentiated Build-IDs per release, courtesy of $VERSION, so we don't
+# need find-debuginfo(1) to recalculate them.
+rm %{buildroot}%{_bindir}/pasta
+objcopy --dump-section .note.gnu.build-id=%{buildroot}/build_id %{buildroot}%{_bindir}/passt
+printf '\x'$(printf %02x $(( ( 0x$(xxd -ps -s 35 %{buildroot}/build_id) + 1 ) % 0xff )) ) | dd of=%{buildroot}/build_id seek=35 bs=1 count=1 conv=notrunc
+objcopy --update-section .note.gnu.build-id=%{buildroot}/build_id %{buildroot}%{_bindir}/passt %{buildroot}%{_bindir}/pasta
+rm %{buildroot}/build_id
+
%ifarch x86_64
-ln -f %{buildroot}%{_bindir}/passt.avx2 %{buildroot}%{_bindir}/pasta.avx2
+rm %{buildroot}%{_bindir}/pasta.avx2
+objcopy --dump-section .note.gnu.build-id=%{buildroot}/build_id %{buildroot}%{_bindir}/passt.avx2
+printf '\x'$(printf %02x $(( ( 0x$(xxd -ps -s 35 %{buildroot}/build_id) + 1 ) % 0xff )) ) | dd of=%{buildroot}/build_id seek=35 bs=1 count=1 conv=notrunc
+objcopy --update-section .note.gnu.build-id=%{buildroot}/build_id %{buildroot}%{_bindir}/passt.avx2 %{buildroot}%{_bindir}/pasta.avx2
+rm %{buildroot}/build_id
ln -sr %{buildroot}%{_mandir}/man1/passt.1 %{buildroot}%{_mandir}/man1/passt.avx2.1
ln -sr %{buildroot}%{_mandir}/man1/pasta.1 %{buildroot}%{_mandir}/man1/pasta.avx2.1
-install -p -m 755 %{buildroot}%{_bindir}/passt.avx2 %{buildroot}%{_bindir}/pasta.avx2
%endif
pushd contrib/selinux
--
2.39.2
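
The Build-ID mangling step can be sketched in C as follows. This is only an illustration of the byte arithmetic, not a replacement for the objcopy/xxd/dd pipeline in the spec file. The helper name is hypothetical, and the 36-byte note size (a 16-byte note header plus a 20-byte SHA-1 ID) is an assumption inferred from the offset 35 used above. Note that the shell arithmetic in the spec uses "% 0xff", so it wraps 0xfe to 0x00 and 0xff to 0x01, whereas this sketch uses plain modulo-256 wraparound. Either way the result differs from the original byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bump the last byte of a dumped .note.gnu.build-id
 * section in place. The cast back to uint8_t wraps 0xff to 0x00. */
static void build_id_bump_last(uint8_t *note, size_t len)
{
	assert(note && len > 0);

	note[len - 1] = (uint8_t)(note[len - 1] + 1);
}
```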
When reading received messages with MSG_PEEK, we sometimes have to read
the leading bytes of the stream several times, only to reach the bytes
we really want. This is clearly non-optimal.
What we would want is something similar to pread/preadv(), but working
even for tcp sockets. At the same time, we obviously don't want to add
any new arguments to the recv/recvmsg() calls.
In this commit, we allow the user to set iovec.iov_base in the first
vector entry to NULL. This tells the socket to skip the first entry,
hence making the iov_len field of that entry indicate the offset value.
This way, there is no need to add any new arguments.
This change is simple and non-intrusive, and should be a safe addition to
the socket API. We have measured it to give a throughput improvement of
8-10 % for the protocol splicer 'passt', which is used in KubeVirt
containers.
Signed-off-by: Jon Maloy <jmaloy(a)redhat.com>
works with original msghdr
Signed-off-by: Jon Maloy <jmaloy(a)redhat.com>
---
net/ipv4/tcp.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 33f559f491c8..1d89337e89b6 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2428,6 +2428,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
struct tcp_sock *tp = tcp_sk(sk);
int copied = 0;
u32 peek_seq;
+ u32 peek_offset;
u32 *seq;
unsigned long used;
int err;
@@ -2435,7 +2436,6 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
long timeo;
struct sk_buff *skb, *last;
u32 urg_hole = 0;
-
err = -ENOTCONN;
if (sk->sk_state == TCP_LISTEN)
goto out;
@@ -2469,6 +2469,14 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
if (flags & MSG_PEEK) {
peek_seq = tp->copied_seq;
seq = &peek_seq;
+ if (msg->msg_iter.iov[0].iov_base == NULL) {
+ peek_offset = msg->msg_iter.iov[0].iov_len;
+ msg->msg_iter.iov = &msg->msg_iter.iov[1];
+ msg->msg_iter.nr_segs -= 1;
+ msg->msg_iter.count -= peek_offset;
+ len -= peek_offset;
+ *seq += peek_offset;
+ }
}
target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
--
2.39.0
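
From user space, the convention proposed in this patch could be used roughly as sketched below. The helper names peek_offset_prepare() and peek_at() are hypothetical. The recvmsg() call only behaves as described on a kernel carrying this patch; on other kernels, a NULL iov_base with a non-zero iov_len should be expected to fail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: set up a two-entry iovec so that a MSG_PEEK read
 * skips 'offset' bytes. Convention from the patch: the first entry has
 * iov_base == NULL and its iov_len carries the offset. */
static void peek_offset_prepare(struct msghdr *mh, struct iovec iov[2],
				size_t offset, void *buf, size_t len)
{
	assert(mh && iov && buf);

	iov[0].iov_base = NULL;
	iov[0].iov_len = offset;
	iov[1].iov_base = buf;
	iov[1].iov_len = len;

	memset(mh, 0, sizeof(*mh));
	mh->msg_iov = iov;
	mh->msg_iovlen = 2;
}

/* Hypothetical wrapper: peek at up to 'len' bytes starting 'offset'
 * bytes into the receive queue. Needs a kernel with this patch. */
static ssize_t peek_at(int fd, size_t offset, void *buf, size_t len)
{
	struct iovec iov[2];
	struct msghdr mh;

	peek_offset_prepare(&mh, iov, offset, buf, len);
	return recvmsg(fd, &mh, MSG_PEEK);
}
```

The appeal of this design is that the call signature of recvmsg() stays untouched: the offset travels inside the existing iovec array.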
Ugly as hell, but we keep breaking things otherwise, and I keep
forgetting to run this manually (as long as it's based on my local
Podman setup, that's the only alternative).
We need to clone the Podman repository as distribution packages don't
contain test scripts, typically. While at it, build the latest
version which is what really matters.
As we're planning anyway to revamp the test framework, I'd be
inclined to just add this without too many thoughts, and have it as
a nice-to-have requirement reminder for the new framework.
Link: https://github.com/containers/podman/pull/19699
Suggested-by: Paul Holzinger <pholzing(a)redhat.com>
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
v2: Use CONTAINERS_HELPER_BINARY_DIR to override pasta's path (Paul Holzinger)
test/README.md | 4 ++--
test/pasta_podman/bats | 21 +++++++++++++++++++++
test/run | 4 ++++
3 files changed, 27 insertions(+), 2 deletions(-)
create mode 100644 test/pasta_podman/bats
diff --git a/test/README.md b/test/README.md
index 03c7f57..0936b04 100644
--- a/test/README.md
+++ b/test/README.md
@@ -28,8 +28,8 @@ on a system, i.e. common utilities such as a shell are not included here.
Example for Debian, and possibly most Debian-based distributions:
- build-essential git jq strace iperf3 qemu-system-x86 tmux sipcalc bc
- clang-tidy cppcheck isc-dhcp-common psmisc linux-cpupower socat
+ build-essential git jq strace iperf3 qemu-system-x86 tmux sipcalc bats bc
+ catatonit clang-tidy cppcheck go isc-dhcp-common psmisc linux-cpupower socat
netcat-openbsd fakeroot lz4 lm-sensors qemu-system-arm qemu-system-ppc
qemu-system-misc qemu-system-x86 valgrind
diff --git a/test/pasta_podman/bats b/test/pasta_podman/bats
new file mode 100644
index 0000000..21446f0
--- /dev/null
+++ b/test/pasta_podman/bats
@@ -0,0 +1,21 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+# for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+# for network namespace/tap device mode
+#
+# test/pasta_podman/bats - Build Podman, run pasta system test with bats
+#
+# Copyright (c) 2022 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio(a)redhat.com>
+
+htools git make go bats catatonit ip jq socat
+
+test Podman system test with bats
+
+host git -C __STATEDIR__ clone https://github.com/containers/podman.git
+host make -C __STATEDIR__/podman
+hout WD pwd
+host PODMAN="__STATEDIR__/podman/bin/podman" CONTAINERS_HELPER_BINARY_DIR="__WD__" bats __STATEDIR__/podman/test/system/505-networking-pasta.bats
diff --git a/test/run b/test/run
index 8f4f845..3b37663 100755
--- a/test/run
+++ b/test/run
@@ -82,6 +82,10 @@ run() {
test pasta_options/log_to_file
teardown pasta_options
+ setup build
+ test pasta_podman/bats
+ teardown build
+
setup memory
test memory/passt
teardown memory
--
2.39.2
Host routes can include a preferred source address (RTA_PREFSRC), which
must be one of the host's addresses. However when using pasta with -a the
namespace might be given a different address, not on the host. This seems
to occur pretty routinely depending on the network configuration systems
in place on the host.
With --config-net we will try to copy host routes to the namespace. If
one of those includes an RTA_PREFSRC, but the namespace doesn't have the
host address, this will fail with -EINVAL, causing pasta to fail.
Fix this by stripping off RTA_PREFSRC attributes from routes as we copy
them to the namespace. This is by no means infallible, but it should at
least handle common cases for the time being.
Link: https://bugs.passt.top/show_bug.cgi?id=71
Link: https://github.com/containers/podman/pull/19699#issuecomment-1688769287
Signed-off-by: David Gibson <david(a)gibson.dropbear.id.au>
---
netlink.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/netlink.c b/netlink.c
index f55f2c3..98f08e7 100644
--- a/netlink.c
+++ b/netlink.c
@@ -462,8 +462,21 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
for (rta = RTM_RTA(rtm), na = RTM_PAYLOAD(nh); RTA_OK(rta, na);
rta = RTA_NEXT(rta, na)) {
- if (rta->rta_type == RTA_OIF)
+ if (rta->rta_type == RTA_OIF) {
+ /* The host obviously lists the host interface
+ * id here, we need to change it to the
+ * namespace's interface id
+ */
*(unsigned int *)RTA_DATA(rta) = ifi_dst;
+ } else if (rta->rta_type == RTA_PREFSRC) {
+ /* Host routes might include a preferred source
+ * address, which must be one of the host's
+ * addresses. However, with -a pasta will use a
+ * different namespace address, making such a
+ * route invalid in the namespace. Strip off
+ * RTA_PREFSRC attributes to avoid that. */
+ rta->rta_type = RTA_UNSPEC;
+ }
}
}
--
2.41.0
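
The attribute rewrite in this patch can be tried in isolation with a small standalone sketch. The function name route_attrs_fixup() is hypothetical, and unlike nl_route_dup() it works on a bare attribute buffer rather than on a full netlink message. It uses only the standard RTA_* macros from <linux/rtnetlink.h>.

```c
#include <assert.h>
#include <linux/rtnetlink.h>
#include <string.h>

/* Hypothetical helper: walk a run of route attributes, point RTA_OIF at
 * the namespace interface, and neutralise RTA_PREFSRC by retyping it as
 * RTA_UNSPEC. Returns the number of RTA_PREFSRC attributes stripped. */
static int route_attrs_fixup(struct rtattr *rta, int len, unsigned int ifi_dst)
{
	int stripped = 0;

	assert(rta);

	for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
		if (rta->rta_type == RTA_OIF &&
		    RTA_PAYLOAD(rta) >= sizeof(ifi_dst)) {
			/* memcpy avoids assuming payload alignment */
			memcpy(RTA_DATA(rta), &ifi_dst, sizeof(ifi_dst));
		} else if (rta->rta_type == RTA_PREFSRC) {
			rta->rta_type = RTA_UNSPEC;
			stripped++;
		}
	}

	return stripped;
}
```

Retyping the attribute in place, as the patch does, avoids having to shift the remaining attributes or adjust any length fields.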
Ugly as hell, but we keep breaking things otherwise, and I keep
forgetting to run this manually (as long as it's based on my local
Podman setup, that's the only alternative).
We need to clone the Podman repository as distribution packages don't
contain test scripts, typically. While at it, build the latest
version which is what really matters.
As we're planning anyway to revamp the test framework, I'd be
inclined to just add this without too many thoughts, and have it as
a nice-to-have requirement reminder for the new framework.
Link: https://github.com/containers/podman/pull/19699
Suggested-by: Paul Holzinger <pholzing(a)redhat.com>
Signed-off-by: Stefano Brivio <sbrivio(a)redhat.com>
---
test/README.md | 4 ++--
test/pasta_podman/bats | 22 ++++++++++++++++++++++
test/run | 4 ++++
3 files changed, 28 insertions(+), 2 deletions(-)
create mode 100644 test/pasta_podman/bats
diff --git a/test/README.md b/test/README.md
index 03c7f57..0936b04 100644
--- a/test/README.md
+++ b/test/README.md
@@ -28,8 +28,8 @@ on a system, i.e. common utilities such as a shell are not included here.
Example for Debian, and possibly most Debian-based distributions:
- build-essential git jq strace iperf3 qemu-system-x86 tmux sipcalc bc
- clang-tidy cppcheck isc-dhcp-common psmisc linux-cpupower socat
+ build-essential git jq strace iperf3 qemu-system-x86 tmux sipcalc bats bc
+ catatonit clang-tidy cppcheck go isc-dhcp-common psmisc linux-cpupower socat
netcat-openbsd fakeroot lz4 lm-sensors qemu-system-arm qemu-system-ppc
qemu-system-misc qemu-system-x86 valgrind
diff --git a/test/pasta_podman/bats b/test/pasta_podman/bats
new file mode 100644
index 0000000..f36da7c
--- /dev/null
+++ b/test/pasta_podman/bats
@@ -0,0 +1,22 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# PASST - Plug A Simple Socket Transport
+# for qemu/UNIX domain socket mode
+#
+# PASTA - Pack A Subtle Tap Abstraction
+# for network namespace/tap device mode
+#
+# test/pasta_podman/bats - Build Podman, run pasta system test with bats
+#
+# Copyright (c) 2022 Red Hat GmbH
+# Author: Stefano Brivio <sbrivio(a)redhat.com>
+
+htools git make go bats catatonit ip jq socat
+
+test Podman system test with bats
+
+host git -C __STATEDIR__ clone https://github.com/containers/podman.git
+host make -C __STATEDIR__/podman
+hout WD pwd
+host printf "[engine]\nhelper_binaries_dir=['__WD__']\n" > __STATEDIR__/containers.conf
+host PODMAN="__STATEDIR__/podman/bin/podman" CONTAINERS_CONF_OVERRIDE="__STATEDIR__/containers.conf" bats __STATEDIR__/podman/test/system/505-networking-pasta.bats
diff --git a/test/run b/test/run
index 8f4f845..3b37663 100755
--- a/test/run
+++ b/test/run
@@ -82,6 +82,10 @@ run() {
test pasta_options/log_to_file
teardown pasta_options
+ setup build
+ test pasta_podman/bats
+ teardown build
+
setup memory
test memory/passt
teardown memory
--
2.39.2