[PATCH v2 00/12] Cleanups to packet pool handling and sizing
This... is not any of the things I said I would be working on. I can
only say that a herd of very hairy yaks led me astray. Looking at bug
66 I spotted some problems with our handling of MTUs / maximum frame
sizes. Looking at that I found some weirdness and some real, if minor,
bugs in the sizing and handling of the packet pools.

Changes in v2:
 * Stefano convinced me that packet_check_range() is still worthwhile.
 * So don't remove it... but in looking at it I spotted various flaws
   in the checks, so address those in a number of new patches.

David Gibson (12):
  test focus hack: stop on fail, but not perf fail
  make passt dumpable
  packet: Use flexible array member in struct pool
  packet: Don't pass start and offset separately to packet_check_range()
  packet: Don't hard code maximum packet size to UINT16_MAX
  packet: Remove unhelpful packet_get_try() macro
  util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers
  packet: Distinguish severities of different packet_{add,get}_do() errors
  packet: Move packet length checks into packet_check_range()
  tap: Don't size pool_tap[46] for the maximum number of packets
  packet: More cautious checks to avoid pointer arithmetic UB

 dhcpv6.c      |   2 +-
 ip.c          |   2 +-
 isolation.c   |   2 +-
 packet.c      | 106 ++++++++++++++++++++++----------------------
 packet.h      |  19 ++++++---
 passt.h       |   2 -
 tap.c         |  18 +++++++--
 tap.h         |   3 +-
 test/lib/term |   1 +
 test/lib/test |   4 +-
 test/run      |  38 +++++++++---------
 util.c        |  19 +++++++++
 util.h        |  25 +++++-------
 vu_common.c   |  34 ++++++++------
 14 files changed, 153 insertions(+), 122 deletions(-)

-- 
2.47.1
---
 test/run | 38 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/test/run b/test/run
index f188d8ea..fc710475 100755
--- a/test/run
+++ b/test/run
@@ -81,9 +81,9 @@ run() {
 	test passt/shutdown
 	teardown pasta

-	setup pasta_options
-	test pasta_options/log_to_file
-	teardown pasta_options
+	#setup pasta_options
+	#test pasta_options/log_to_file
+	#teardown pasta_options

 	setup build
 	test pasta_podman/bats
@@ -132,24 +132,24 @@ run() {
 	VALGRIND=0
 	VHOST_USER=0
-	setup passt_in_ns
-	test passt/ndp
-	test passt_in_ns/dhcp
-	test perf/passt_tcp
-	test perf/passt_udp
-	test perf/pasta_tcp
-	test perf/pasta_udp
-	test passt_in_ns/shutdown
-	teardown passt_in_ns
+	#setup passt_in_ns
+	#test passt/ndp
+	#test passt_in_ns/dhcp
+	#test perf/passt_tcp
+	#test perf/passt_udp
+	#test perf/pasta_tcp
+	#test perf/pasta_udp
+	#test passt_in_ns/shutdown
+	#teardown passt_in_ns

 	VHOST_USER=1
-	setup passt_in_ns
-	test passt_vu/ndp
-	test passt_vu_in_ns/dhcp
-	test perf/passt_vu_tcp
-	test perf/passt_vu_udp
-	test passt_vu_in_ns/shutdown
-	teardown passt_in_ns
+	#setup passt_in_ns
+	#test passt_vu/ndp
+	#test passt_vu_in_ns/dhcp
+	#test perf/passt_vu_tcp
+	#test perf/passt_vu_udp
+	#test passt_vu_in_ns/shutdown
+	#teardown passt_in_ns

 	# TODO: Make those faster by at least pre-installing gcc and make on
 	# non-x86 images, then re-enable.
-- 
2.47.1
---
 test/lib/term | 1 +
 test/lib/test | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/test/lib/term b/test/lib/term
index ed690de8..81191714 100755
--- a/test/lib/term
+++ b/test/lib/term
@@ -413,6 +413,7 @@ info_failed() {
 	[ ${FAST} -eq 1 ] || status_bar_blink

+	exit 1
 	pause_continue \
 		"Press any key to pause test session" \
 		"Resuming in " \
diff --git a/test/lib/test b/test/lib/test
index e6726bea..91729af7 100755
--- a/test/lib/test
+++ b/test/lib/test
@@ -101,7 +101,7 @@ test_one_line() {
 		IFS="${__ifs}"
 		;;
 	"test")
-		[ ${TEST_ONE_perf_nok} -eq 0 ] || TEST_ONE_nok=1
+		# [ ${TEST_ONE_perf_nok} -eq 0 ] || TEST_ONE_nok=1
 		[ ${TEST_ONE_nok} -eq 1 ] && status_test_fail
 		[ ${TEST_ONE_nok} -eq 0 ] && status_test_ok
@@ -374,7 +374,7 @@ test_one() {
 	[ ${DEMO} -eq 1 ] && return
 	[ ${TEST_ONE_skip} -eq 1 ] && status_test_skip && return

-	[ ${TEST_ONE_perf_nok} -eq 0 ] || TEST_ONE_nok=1
+	# [ ${TEST_ONE_perf_nok} -eq 0 ] || TEST_ONE_nok=1
 	[ ${TEST_ONE_nok} -eq 0 ] && status_test_ok || status_test_fail
 }
-- 
2.47.1
---
 isolation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/isolation.c b/isolation.c
index c944fb35..df58bb87 100644
--- a/isolation.c
+++ b/isolation.c
@@ -377,7 +377,7 @@ void isolate_postfork(const struct ctx *c)
 {
 	struct sock_fprog prog;

-	prctl(PR_SET_DUMPABLE, 0);
+//	prctl(PR_SET_DUMPABLE, 0);

 	switch (c->mode) {
 	case MODE_PASST:
-- 
2.47.1
Currently we have a dummy pkt[1] array, which we alias with an array of
a different size via various macros. However, we already require C11 which
includes flexible array members, so we can do better.
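To illustrate, a minimal sketch of the change; the struct layout and
field names here are assumed for illustration, not copied from packet.h:

	#include <stddef.h>
	#include <sys/uio.h>

	/* Before: dummy one-element array, aliased to arrays of other
	 * sizes through macros ("_old" suffix only for this sketch) */
	struct pool_old {
		char *buf;		/* backing buffer */
		size_t buf_size;
		size_t size, count;
		struct iovec pkt[1];
	};

	/* After: C11 flexible array member; the real element count is
	 * fixed where the pool's storage is declared or allocated, with
	 * no type aliasing involved */
	struct pool {
		char *buf;
		size_t buf_size;
		size_t size, count;
		struct iovec pkt[];
	};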
Signed-off-by: David Gibson
Fundamentally what packet_check_range() does is to check whether a given
memory range is within the allowed / expected memory set aside for packets
from a particular pool. That range could represent a whole packet (from
packet_add_do()) or part of a packet (from packet_get_do()), but it doesn't
really matter which.
However, we pass the start of the range as two parameters: @start which is
the start of the packet, and @offset which is the offset within the packet
of the range we're interested in. We never use these separately, only as
(start + offset). Simplify the interface of packet_check_range() and
vu_packet_check_range() to directly take the start of the relevant range.
This will allow some additional future improvements.
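A sketch of the interface change; parameter names and order are assumed
rather than taken from the actual patch (the "_old" name exists only so
both declarations can coexist in the sketch):

	struct pool;

	/* Before: the checked range starts at (start + offset) */
	int packet_check_range_old(const struct pool *p, size_t offset,
				   size_t len, const char *start,
				   const char *func, int line);

	/* After: the caller does the addition and passes the start of
	 * the range it actually wants validated */
	int packet_check_range(const struct pool *p, const char *ptr,
			       size_t len, const char *func, int line);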
Signed-off-by: David Gibson
We verify that every packet we store in a pool, and every partial packet
we retrieve from it, has a length no longer than UINT16_MAX. This
originated in the older packet pool implementation, which stored packet
lengths in a uint16_t. Now that packets are represented by a struct
iovec with its size_t length, this check serves only as a sanity / security
check that we don't have some wildly out of range length due to a bug
elsewhere.
However, UINT16_MAX (65535) isn't quite enough, because the "packet" as
stored in the pool is in fact an entire frame including both L2 and any
backend specific headers. We can exceed this in passt mode, even with the
default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header +
4 bytes of qemu stream length header = 65538 bytes.
Introduce our own define for the maximum length of a packet in the pool and
set it slightly larger, allowing 128 bytes for L2 and/or other backend
specific headers. We'll use different amounts of that depending on the
tap backend, but since this is just a sanity check, the bound doesn't need
to be 100% tight.
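One plausible shape for that define, reading the allowance literally;
the exact name and value are assumptions based on the description above:

	#include <stddef.h>
	#include <stdint.h>

	/* Maximum length of a packet stored in a pool: a 65535-byte
	 * frame plus an allowance of 128 bytes for L2 and/or other
	 * backend specific headers */
	#define PACKET_MAX_LEN	((size_t)UINT16_MAX + 128)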
Signed-off-by: David Gibson
On Fri, 20 Dec 2024 19:35:29 +1100 David Gibson wrote:
We verify that every packet we store in a pool, and every partial packet we retrieve from it, has a length no longer than UINT16_MAX. This originated in the older packet pool implementation, which stored packet lengths in a uint16_t. Now that packets are represented by a struct iovec with its size_t length, this check serves only as a sanity / security check that we don't have some wildly out of range length due to a bug elsewhere.
However, UINT16_MAX (65535) isn't quite enough, because the "packet" as stored in the pool is in fact an entire frame including both L2 and any backend specific headers. We can exceed this in passt mode, even with the default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header + 4 bytes of qemu stream length header = 65538 bytes.
Introduce our own define for the maximum length of a packet in the pool and set it slightly larger, allowing 128 bytes for L2 and/or other backend specific headers. We'll use different amounts of that depending on the tap backend, but since this is just a sanity check, the bound doesn't need to be 100% tight.
I couldn't find the time to check what's the maximum amount of bytes we
can get here depending on hypervisor and interface, but if this patch
fixes an actual issue as you seem to imply, actually checking that with
QEMU and muvm would be nice.

By the way, as you mention a specific calculation, does it really make
sense to use a "good enough" value here? Can we ever exceed 65538 bytes,
or can we use that as limit? It would be good to find out, while at it.

-- 
Stefano
On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:29 +1100 David Gibson wrote:
We verify that every packet we store in a pool, and every partial packet we retrieve from it, has a length no longer than UINT16_MAX. This originated in the older packet pool implementation, which stored packet lengths in a uint16_t. Now that packets are represented by a struct iovec with its size_t length, this check serves only as a sanity / security check that we don't have some wildly out of range length due to a bug elsewhere.
However, UINT16_MAX (65535) isn't quite enough, because the "packet" as stored in the pool is in fact an entire frame including both L2 and any backend specific headers. We can exceed this in passt mode, even with the default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header + 4 bytes of qemu stream length header = 65538 bytes.
Introduce our own define for the maximum length of a packet in the pool and set it slightly larger, allowing 128 bytes for L2 and/or other backend specific headers. We'll use different amounts of that depending on the tap backend, but since this is just a sanity check, the bound doesn't need to be 100% tight.
I couldn't find the time to check what's the maximum amount of bytes we can get here depending on hypervisor and interface, but if this patch
So, it's a separate calculation for each backend type, and some of them
are pretty tricky.

For anything based on the kernel tap device it is 65535, because it has
an internal frame size limit of 65535, already including any L2 headers
(it explicitly limits the MTU to 65535 - hard_header_len). There is no
"hardware" header.

For the qemu stream protocol it gets pretty complicated, because there
are multiple layers which could clamp the maximum size. It doesn't look
like the socket protocol code itself imposes a limit beyond the
structural one of (2^32-1 + 4) (well, and putting it into an ssize_t,
which could be less for 32-bit systems). AFAICT, it's not theoretically
impossible to have gigabyte frames with a weird virtual NIC model...
though obviously that wouldn't be IP, and probably not even Ethernet.

Each virtual NIC could have its own limit. I suspect that's going to be
in the vicinity of 64k. But, I'm really struggling to figure out what
it is just for virtio-net, so I really don't want to try to figure it
out for all of them. With a virtio-net NIC, I seem to be able to set
MTU all the way up to 65535 successfully, which implies a maximum frame
size of 65535 + 14 (L2 header) + 4 (stream protocol header) = 65553 at
least.

Similar situation for vhost-user, where I'm finding it even more
inscrutable to figure out what limits are imposed at the sub-IP levels.
At the moment the "hardware" header (virtio_net_hdr_mrg_rxbuf) doesn't
count towards what we store in the packet.c layer, but we might have
reasons to change that.

So, any sub-IP limits for qemu, I'm basically not managing to find.
However, we (more or less) only care about IP, which imposes a more
practical limit of: 65535 + L2 header size + "hardware" header size.

At present that maxes out at 65553, as above, but if we ever support
other L2 encapsulations, or other backend protocols with larger
"hardware" headers, that could change.
fixes an actual issue as you seem to imply, actually checking that with QEMU and muvm would be nice.
By the way, as you mention a specific calculation, does it really make sense to use a "good enough" value here? Can we ever exceed 65538 bytes, or can we use that as limit? It would be good to find out, while at it.
So, yes, I think we can exceed 65538. But more significantly, trying to
make the limit tight here feels like a borderline layering violation.
The packet layer doesn't really care about the frame size as long as
it's "sane".

Fwiw, in the draft changes I have improving MTU handling, it's my
intention that individual backends calculate and/or enforce tighter
limits of their own where practical, and BUILD_ASSERT() that those fit
within the packet layer's frame size limit.

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
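The per-backend BUILD_ASSERT() idea mentioned above might look roughly
like this; the frame-size name and the exact bound are hypothetical, and
C11 static_assert stands in for the project's BUILD_ASSERT() macro:

	#include <assert.h>		/* static_assert (C11) */
	#include <limits.h>		/* USHRT_MAX */
	#include <linux/if_ether.h>	/* ETH_HLEN */
	#include <stddef.h>
	#include <stdint.h>

	/* Packet layer's sanity bound, as sketched earlier */
	#define PACKET_MAX_LEN		((size_t)UINT16_MAX + 128)

	/* Hypothetical bound for the qemu stream backend: 65535-byte IP
	 * datagram + Ethernet header + 4-byte length descriptor */
	#define TAP_PASST_FRAME_MAX	(USHRT_MAX + ETH_HLEN + \
					 sizeof(uint32_t))

	/* Fail the build if the backend's frames can't fit in a pool */
	static_assert(TAP_PASST_FRAME_MAX <= PACKET_MAX_LEN,
		      "backend frames must fit the packet pool limit");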
On Thu, 2 Jan 2025 12:00:30 +1100 David Gibson wrote:
On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:29 +1100 David Gibson wrote:
We verify that every packet we store in a pool, and every partial packet we retrieve from it, has a length no longer than UINT16_MAX. This originated in the older packet pool implementation, which stored packet lengths in a uint16_t. Now that packets are represented by a struct iovec with its size_t length, this check serves only as a sanity / security check that we don't have some wildly out of range length due to a bug elsewhere.
However, UINT16_MAX (65535) isn't quite enough, because the "packet" as stored in the pool is in fact an entire frame including both L2 and any backend specific headers. We can exceed this in passt mode, even with the default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header + 4 bytes of qemu stream length header = 65538 bytes.
Introduce our own define for the maximum length of a packet in the pool and set it slightly larger, allowing 128 bytes for L2 and/or other backend specific headers. We'll use different amounts of that depending on the tap backend, but since this is just a sanity check, the bound doesn't need to be 100% tight.
I couldn't find the time to check what's the maximum amount of bytes we can get here depending on hypervisor and interface, but if this patch
So, it's a separate calculation for each backend type, and some of them are pretty tricky.
For anything based on the kernel tap device it is 65535, because it has an internal frame size limit of 65535, already including any L2 headers (it explicitly limits the MTU to 65535 - hard_header_len). There is no "hardware" header.
For the qemu stream protocol it gets pretty complicated, because there are multiple layers which could clamp the maximum size. It doesn't look like the socket protocol code itself imposes a limit beyond the structural one of (2^32-1 + 4) (well, and putting it into an ssize_t, which could be less for 32-bit systems). AFAICT, it's not theoretically impossible to have gigabyte frames with a weird virtual NIC model... though obviously that wouldn't be IP, and probably not even Ethernet.
Theoretically speaking, it could actually be IPv6 with Jumbograms. They never really gained traction (because of Ethernet, I think) and we don't support them, but the only attempt to deprecate them I'm aware of didn't succeed (yet): https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/ ...and I actually wonder if somebody will ever try to revive them for virtual networks, where they might make somewhat more sense (you could transfer filesystems with one packet per file or similar silly tricks).
Each virtual NIC could have its own limit. I suspect that's going to be in the vicinity of 64k. But, I'm really struggling to figure out what it is just for virtio-net, so I really don't want to try to figure it out for all of them. With a virtio-net NIC, I seem to be able to set MTU all the way up to 65535 successfully, which implies a maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol header) = 65553 at least.
The Layer-2 header is included (because that also happens to be ETH_MAX_MTU, on Linux); it wouldn't be on top.
Similar situation for vhost-user, where I'm finding it even more inscrutable to figure out what limits are imposed at the sub-IP levels. At the moment the "hardware" header (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the packet.c layer, but we might have reasons to change that.
So, any sub-IP limits for qemu, I'm basically not managing to find. However, we (more or less) only care about IP, which imposes a more practical limit of: 65535 + L2 header size + "hardware" header size.
At present that maxes out at 65553, as above, but if we ever support other L2 encapsulations, or other backend protocols with larger "hardware" headers, that could change.
Okay. I was thinking of a more practical approach, based on the fact
that we only support Ethernet anyway, with essentially four types of
adapters (three virtio-net implementations, and tap), plus rather rare
reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107), and
we could actually test things.

Well, I did that anyway, because I'm not quite comfortable dropping the
UINT16_MAX check in packet_add_do() without testing... and yes, with
this patch we can trigger a buffer overflow with vhost-user. In detail:

- muvm, virtio-net with 65535 bytes MTU:
  - ping -s 65492 works (65534 bytes "on wire" from pcap)
  - ping -s 65493 doesn't, because the guest fragments: 65530 plus 39
    bytes on wire (that's with a newer kernel, probably that's the
    reason why it's not the same limit as QEMU, below)

- QEMU, virtio-net without vhost-user, 65535 bytes MTU:
  - ping -s 65493 works (65535 bytes "on wire" from pcap)
  - with -s 65494: "Bad frame size from guest, resetting connection"

- QEMU, virtio-net with vhost-user, 65535 bytes MTU:
  - ping -s 65493 works (65535 bytes "on wire" from pcap)
  - ping -s 65494:
      *** buffer overflow detected ***: terminated
    without this patch, we catch that in packet_add_do() (and without
    9/12 we don't crash)

- tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the
  correct maximum though):
  - ping -s 65493 works (65535 bytes on wire)
  - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)

So, I guess we should find out the issue with vhost-user first.

Other than that, it looks like we can reach at least 65539 bytes if we
add the "virtio-net" length descriptors, but still the frame size
itself (which is actually what matters for the functions in packet.c)
can't exceed 65535 bytes, at least from my tests.

Then, yes, I agree that it's not actually correct, even though it fits
all the use cases we have, because we *could* have an implementation
exceeding that value (at the moment, it looks like we don't).
fixes an actual issue as you seem to imply, actually checking that with QEMU and muvm would be nice.
By the way, as you mention a specific calculation, does it really make sense to use a "good enough" value here? Can we ever exceed 65538 bytes, or can we use that as limit? It would be good to find out, while at it.
So, yes, I think we can exceed 65538. But more significantly, trying to make the limit tight here feels like a borderline layering violation. The packet layer doesn't really care about the frame size as long as it's "sane".
It might still be convenient for some back-ends to define "sane" as 64 KiB. I'm really not sure if it is, I didn't look into the matching part of vhost-user in detail. If it's buggy because we have a user that can exceed that, sure, let's fix it. If not... also fine by me as it's some kind of theoretical flaw, but we shouldn't crash.
Fwiw, in the draft changes I have improving MTU handling, it's my intention that individual backends calculate and/or enforce tighter limits of their own where practical, and BUILD_ASSERT() that those fit within the packet layer's frame size limit.
Ah, nice, that definitely sounds like an improvement.

-- 
Stefano
On Thu, Jan 02, 2025 at 10:59:48PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 12:00:30 +1100 David Gibson wrote:
On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:29 +1100 David Gibson wrote:
We verify that every packet we store in a pool, and every partial packet we retrieve from it, has a length no longer than UINT16_MAX. This originated in the older packet pool implementation, which stored packet lengths in a uint16_t. Now that packets are represented by a struct iovec with its size_t length, this check serves only as a sanity / security check that we don't have some wildly out of range length due to a bug elsewhere.
However, UINT16_MAX (65535) isn't quite enough, because the "packet" as stored in the pool is in fact an entire frame including both L2 and any backend specific headers. We can exceed this in passt mode, even with the default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header + 4 bytes of qemu stream length header = 65538 bytes.
Introduce our own define for the maximum length of a packet in the pool and set it slightly larger, allowing 128 bytes for L2 and/or other backend specific headers. We'll use different amounts of that depending on the tap backend, but since this is just a sanity check, the bound doesn't need to be 100% tight.
I couldn't find the time to check what's the maximum amount of bytes we can get here depending on hypervisor and interface, but if this patch
So, it's a separate calculation for each backend type, and some of them are pretty tricky.
For anything based on the kernel tap device it is 65535, because it has an internal frame size limit of 65535, already including any L2 headers (it explicitly limits the MTU to 65535 - hard_header_len). There is no "hardware" header.
For the qemu stream protocol it gets pretty complicated, because there are multiple layers which could clamp the maximum size. It doesn't look like the socket protocol code itself imposes a limit beyond the structural one of (2^32-1 + 4) (well, and putting it into an ssize_t, which could be less for 32-bit systems). AFAICT, it's not theoretically impossible to have gigabyte frames with a weird virtual NIC model... though obviously that wouldn't be IP, and probably not even Ethernet.
Theoretically speaking, it could actually be IPv6 with Jumbograms. They never really gained traction (because of Ethernet, I think) and we don't support them, but the only attempt to deprecate them I'm aware of didn't succeed (yet):
https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/
...and I actually wonder if somebody will ever try to revive them for virtual networks, where they might make somewhat more sense (you could transfer filesystems with one packet per file or similar silly tricks).
Hm, yes. Well, one problem at a time. Well, ok, 2 or 3 problems at a time.
Each virtual NIC could have its own limit. I suspect that's going to be in the vicinity of 64k. But, I'm really struggling to figure out what it is just for virtio-net, so I really don't want to try to figure it out for all of them. With a virtio-net NIC, I seem to be able to set MTU all the way up to 65535 successfully, which implies a maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol header) = 65553 at least.
The Layer-2 header is included (because that also happens to be ETH_MAX_MTU, on Linux), it wouldn't be on top.
No. Or if so, that's a guest side bug. The MTU set with ip-link is the
maximum L3 size - at the default 1500, the L2 frame can be 1514 bytes.
If the driver can't send an L2 frame greater than 65535 bytes, then it
should clamp the MTU to (65535 - hard_header_len) like tuntap already
does.

I do think ETH_MAX_MTU is a confusing name: is it the maximum (IP) MTU
which can be had in an ethernet frame (that's longer), or is it the
maximum ethernet frame size (leading to an IP MTU of 65521)? I plan to
eliminate use of ETH_MAX_MTU in favour of clearer things in my MTU
series. I should merge that with this one, it might make the context
clearer.

AFAICT there is *no* structural limit on the size of an ethernet frame;
the length isn't in any header, it's just assumed to be reported out of
band by the hardware. No theoretical reason that reporting mechanism
couldn't allow lengths > 65535, whether slightly (65535 bytes of
payload + header & FCS) or vastly.
Similar situation for vhost-user, where I'm finding it even more inscrutable to figure out what limits are imposed at the sub-IP levels. At the moment the "hardware" header (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the packet.c layer, but we might have reasons to change that.
So, any sub-IP limits for qemu, I'm basically not managing to find. However, we (more or less) only care about IP, which imposes a more practical limit of: 65535 + L2 header size + "hardware" header size.
At present that maxes out at 65553, as above, but if we ever support other L2 encapsulations, or other backend protocols with larger "hardware" headers, that could change.
Okay. I was thinking of a more practical approach, based on the fact that we only support Ethernet anyway, with essentially four types of adapters (three virtio-net implementations, and tap), plus rather rare reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107) and we could actually test things.
Well, I did that anyway, because I'm not quite comfortable dropping the UINT16_MAX check in packet_add_do() without testing...
We're not dropping it, just raising the limit, fairly slightly.
and yes, with this patch we can trigger a buffer overflow with vhost-user. In detail:
- muvm, virtio-net with 65535 bytes MTU:
- ping -s 65492 works (65534 bytes "on wire" from pcap)
- ping -s 65493 doesn't, because the guest fragments: 65530 plus 39
  bytes on wire (that's with a newer kernel, probably that's the reason
  why it's not the same limit as QEMU, below)
That's a guest driver bug. If the MTU is 65535, it should be able to send a 65535 byte IP datagram without fragmentation. Looks like it needs to clamp the MTU based on L2 limitations.
- QEMU, virtio-net without vhost-user, 65535 bytes MTU:
- ping -s 65493 works (65535 bytes "on wire" from pcap)
- with -s 65494: "Bad frame size from guest, resetting connection"
That's our check, which I plan to fix in the MTU series.
- QEMU, virtio-net with vhost-user, 65535 bytes MTU:
- ping -s 65493 works (65535 bytes "on wire" from pcap)
- ping -s 65494:
*** buffer overflow detected ***: terminated
without this patch, we catch that in packet_add_do() (and without 9/12 we don't crash)
Ouch. That's a bug in our vhost-user code. The easy fix would be to clamp the MTU to 65521; arguably more correct would be to decouple its notion of maximum frame size from ETH_MAX_MTU.
- tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the correct maximum though):
No, 65521 is correct: 65521+14 = 65535 which is the maximum allowed tap frame size. Seems like there are other drivers that should also be clamping their max MTU similarly, but aren't.
- ping -s 65493 works (65535 bytes on wire)
- ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)
This time the fragmentation is correct, because the MTU is only 65521.
So, I guess we should find out the issue with vhost-user first.
Yeah. I definitely need to intermingle the MTU series with this one to get the order of operations right.
Other than that, it looks like we can reach at least 65539 bytes if we add the "virtio-net" length descriptors, but still the frame size itself (which is actually what matters for the functions in packet.c)
Well... sort of. The packet.c functions really care nothing about any
of the layers, it's just a blob of data to them.

I did miss that tap_passt_input() excludes the qemu header before
inserting the frame into the pool, so indeed, the pool layer won't
currently see a length greater than 65535. Of course, since that frame
header *is* stored in pkt_buf with all the rest, that's arguably not
using the pool layer's buffer bound checks to the full extent we could.
can't exceed 65535 bytes, at least from my tests.
Then, yes, I agree that it's not actually correct, even though it fits all the use cases we have, because we *could* have an implementation exceeding that value (at the moment, it looks like we don't).
So, it seems like the Linux drivers might not actually generate ethernet frames > 65535 bytes - although they don't all correctly reduce their maximum MTU to reflect that. I don't think we should rely on that; AFAICT it would be reasonable for a driver + VMM implementation to allow ethernet frames that are at least 65535 bytes + L2 header. That might also allow for 16 byte 802.1q vlan L2 headers.
fixes an actual issue as you seem to imply, actually checking that with QEMU and muvm would be nice.
By the way, as you mention a specific calculation, does it really make sense to use a "good enough" value here? Can we ever exceed 65538 bytes, or can we use that as limit? It would be good to find out, while at it.
So, yes, I think we can exceed 65538. But more significantly, trying to make the limit tight here feels like a borderline layering violation. The packet layer doesn't really care about the frame size as long as it's "sane".
It might still be convenient for some back-ends to define "sane" as 64 KiB. I'm really not sure if it is, I didn't look into the matching part of vhost-user in detail.
That's fair, but I don't think the pool layer should impose that limit on the backends, because I think it's equally reasonable for another backend to allow slightly larger frames with a 65535 byte L3 payload. Hence setting the limit to 64k + "a little bit".
If it's buggy because we have a user that can exceed that, sure, let's fix it. If not... also fine by me as it's some kind of theoretical flaw, but we shouldn't crash.
Fwiw, in the draft changes I have improving MTU handling, it's my intention that individual backends calculate and/or enforce tighter limits of their own where practical, and BUILD_ASSERT() that those fit within the packet layer's frame size limit.
Ah, nice, that definitely sounds like an improvement.
-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson
On Fri, 3 Jan 2025 12:16:45 +1100 David Gibson wrote:
On Thu, Jan 02, 2025 at 10:59:48PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 12:00:30 +1100 David Gibson wrote:
On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:29 +1100 David Gibson wrote:
We verify that every packet we store in a pool, and every partial packet we retrieve from it, has a length no longer than UINT16_MAX. This originated in the older packet pool implementation, which stored packet lengths in a uint16_t. Now that packets are represented by a struct iovec with its size_t length, this check serves only as a sanity / security check that we don't have some wildly out of range length due to a bug elsewhere.
However, UINT16_MAX (65535) isn't quite enough, because the "packet" as stored in the pool is in fact an entire frame including both L2 and any backend specific headers. We can exceed this in passt mode, even with the default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header + 4 bytes of qemu stream length header = 65538 bytes.
Introduce our own define for the maximum length of a packet in the pool and set it slightly larger, allowing 128 bytes for L2 and/or other backend specific headers. We'll use different amounts of that depending on the tap backend, but since this is just a sanity check, the bound doesn't need to be 100% tight.
I couldn't find the time to check what's the maximum amount of bytes we can get here depending on hypervisor and interface, but if this patch
So, it's a separate calculation for each backend type, and some of them are pretty tricky.
For anything based on the kernel tap device it is 65535, because it has an internal frame size limit of 65535, already including any L2 headers (it explicitly limits the MTU to 65535 - hard_header_len). There is no "hardware" header.
For the qemu stream protocol it gets pretty complicated, because there are multiple layers which could clamp the maximum size. It doesn't look like the socket protocol code itself imposes a limit beyond the structural one of (2^32-1 + 4) (well, and putting it into an ssize_t, which could be less for 32-bit systems). AFAICT, it's not theoretically impossible to have gigabyte frames with a weird virtual NIC model... though obviously that wouldn't be IP, and probably not even Ethernet.
Theoretically speaking, it could actually be IPv6 with Jumbograms. They never really gained traction (because of Ethernet, I think) and we don't support them, but the only attempt to deprecate them I'm aware of didn't succeed (yet):
https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/
...and I actually wonder if somebody will ever try to revive them for virtual networks, where they might make somewhat more sense (you could transfer filesystems with one packet per file or similar silly tricks).
Hm, yes. Well, one problem at a time. Well, ok, 2 or 3 problems at a time.
Each virtual NIC could have its own limit. I suspect that's going to be in the vicinity of 64k. But, I'm really struggling to figure out what it is just for virtio-net, so I really don't want to try to figure it out for all of them. With a virtio-net NIC, I seem to be able to set MTU all the way up to 65535 successfully, which implies a maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol header) = 65553 at least.
The Layer-2 header is included (because that also happens to be ETH_MAX_MTU, on Linux), it wouldn't be on top.
No. Or if so, that's a guest side bug.
I see your point, but it's pretty much a universal bug. I've never seen
a packet interface on Linux that's able to send more than 65535 bytes,
and yet, there are several drivers that allow the MTU to be 65535
bytes.

Yes, they should be fixed, eventually, but I guess the obstacle to
fixing them is that there are, of course, two ways to fix that.

The correct one (enabling frames to be bigger than 64 KiB) would
probably uncover all kinds of issues and perhaps kill throughput in
many cases. The wrong one (clamping the MTU) is... well, wrong. But it
would be the only sane option, I suppose.
The MTU set with ip-link is the maximum L3 size - at the default 1500, the L2 frame can be 1514 bytes. If the driver can't send an L2 frame greater than 65535 bytes, then it should clamp the MTU to (65535 - hard_header_len) like tuntap already does.
I do think ETH_MAX_MTU is a confusing name: is it the maximum (IP) MTU which can be had in an ethernet frame (that's longer), or is it the maximum ethernet frame size (leading to an IP mtu of 65521).
Historically, I think it used to be something along the lines of "maximum frame size of something that looks like Ethernet". Note that the maximum frame size allowed by 802.3 is 1500 bytes. With Jumbo frames, one can typically have 9000 or 9216 bytes, but there's no defined standard.
I plan to eliminate use of ETH_MAX_MTU in favour of clearer things in my MTU series. I should merge that with this one, it might make the context clearer.
Well but it's used in the kernel anyway, and that's where the confusion comes from.
AFAICT there is *no* structural limit on the size of an ethernet frame; the length isn't in any header, it's just assumed to be reported out of band by the hardware. No theoretical reason that reporting mechanism couldn't allow lengths > 65535, whether slightly (65535 bytes of payload + header & FCS) or vastly.
Same as my understanding.
Similar situation for vhost-user, where I'm finding it even more inscrutable to figure out what limits are imposed at the sub-IP levels. At the moment the "hardware" header (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the packet.c layer, but we might have reasons to change that.
So, any sub-IP limits for qemu, I'm basically not managing to find. However, we (more or less) only care about IP, which imposes a more practical limit of: 65535 + L2 header size + "hardware" header size.
At present that maxes out at 65553, as above, but if we ever support other L2 encapsulations, or other backend protocols with larger "hardware" headers, that could change.
Okay. I was thinking of a more practical approach, based on the fact that we only support Ethernet anyway, with essentially four types of adapters (three virtio-net implementations, and tap), plus rather rare reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107) and we could actually test things.
Well, I did that anyway, because I'm not quite comfortable dropping the UINT16_MAX check in packet_add_do() without testing...
We're not dropping it, just raising the limit, fairly slightly.
and yes, with this patch we can trigger a buffer overflow with vhost-user. In detail:
- muvm, virtio-net with 65535 bytes MTU:
- ping -s 65492 works (65534 bytes "on wire" from pcap) - ping -s 65493 doesn't because the guest fragments: 65530 plus 39 bytes on wire (that's with a newer kernel, probably that's the reason why it's not the same limit as QEMU, below)
That's a guest driver bug. If the MTU is 65535, it should be able to send a 65535 byte IP datagram without fragmentation. Looks like it needs to clamp the MTU based on L2 limitations.
I would have assumed that the issue is in the virtio-net driver in the Linux kernel (not the network implementation in libkrun or in the virtio-drivers Rust crate), but what's interesting is that we have one byte of difference with virtio-net as implemented by QEMU... so probably my assumption is wrong.
- QEMU, virtio-net without vhost-user, 65535 bytes MTU:
- ping -s 65493 works (65535 bytes "on wire" from pcap) - with -s 65494: "Bad frame size from guest, resetting connection"
That's our check, which I plan to fix in the MTU series.
- QEMU, virtio-net with vhost-user, 65535 bytes MTU:
- ping -s 65493 works (65535 bytes "on wire" from pcap) - ping -s 65494:
*** buffer overflow detected ***: terminated
without this patch, we catch that in packet_add_do() (and without 9/12 we don't crash)
Ouch. That's a bug in our vhost-user code. The easy fix would be to clamp MTU to 65521, arguably more correct would be to decouple its notion of maximum frame size from ETH_MAX_MTU.
Where would you clamp that? I'm not sure if it's something that we can negotiate over the vhost-user protocol. If we can't, then we need to find another solution for compatibility, even if we fix it in the kernel.
- tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the correct maximum though):
No, 65521 is correct: 65521+14 = 65535 which is the maximum allowed tap frame size. Seems like there are other drivers that should also be clamping their max MTU similarly, but aren't.
I don't remember exactly why now, but because of some combination of requirements from normative references, you can't really have an MTU that's not a multiple of (32-bit) IPv4 words (and things actually break with 65521... maybe TCP?). See also passt(1). That's why we use 65520 by default. I'll try to find out if you're interested.
- ping -s 65493 works (65535 bytes on wire) - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)
This time the fragmentation is correct, because the MTU is only 65521.
So, I guess we should find out the issue with vhost-user first.
Yeah. I definitely need to intermingle the MTU series with this one to get the order of operations right.
Other than that, it looks like we can reach at least 65539 bytes if we add the "virtio-net" length descriptors, but still the frame size itself (which is actually what matters for the functions in packet.c)
Well.. sort of. The packet.c functions really care nothing about any of the layers, it's just a blob of data to them. I did miss that tap_passt_input() excludes the qemu header before inserting the frame into the pool, so indeed, the pool layer won't currently see length greater than 65535. Of course, since that frame header *is* stored in pkt_buf with all the rest, that's arguably not using the pool layer's buffer bound checks to the full extent we could.
Hm, yes, true. But on the other hand, if we just say "plus 128", then we're not using bound checks to the fullest extent we can, either.
can't exceed 65535 bytes, at least from my tests.
Then, yes, I agree that it's not actually correct, even though it fits all the use cases we have, because we *could* have an implementation exceeding that value (at the moment, it looks like we don't).
So, it seems like the Linux drivers might not actually generate ethernet frames > 65535 bytes - although they don't all correctly reduce their maximum MTU to reflect that. I don't think we should rely on that; AFAICT it would be reasonable for a driver + VMM implementation to allow ethernet frames that are at least 65535 bytes + L2 header. That might also allow for 16 byte 802.1q vlan L2 headers.
If it's convenient in our implementation (I think it is, especially on
the opposite path), then I think we can kind of rely on it, in the
sense that we could "simply" be robust to drivers that send out frames
bigger than 65535 bytes (I've seen none to date, not even with VLANs),
and change things if somebody ever needs that.

I mean, probably, the people who're the most likely to try and add
support for bigger frames in hypervisors/kernel at some point in the
future are actually us, because with user-mode networking the
guest-facing MTU is almost entirely useless.
fixes an actual issue as you seem to imply, actually checking that with QEMU and muvm would be nice.
By the way, as you mention a specific calculation, does it really make sense to use a "good enough" value here? Can we ever exceed 65538 bytes, or can we use that as limit? It would be good to find out, while at it.
So, yes, I think we can exceed 65538. But more significantly, trying to make the limit tight here feels like a borderline layering violation. The packet layer doesn't really care about the frame size as long as it's "sane".
It might still be convenient for some back-ends to define "sane" as 64 KiB. I'm really not sure if it is, I didn't look into the matching part of vhost-user in detail.
That's fair, but I don't think the pool layer should impose that limit on the backends, because I think it's equally reasonable for another backend to allow slightly larger frames with a 65535 byte L3 payload. Hence setting the limit to 64k + "a little bit".
If it's buggy because we have a user that can exceed that, sure, let's fix it. If not... also fine by me as it's some kind of theoretical flaw, but we shouldn't crash.
Fwiw, in the draft changes I have improving MTU handling, it's my intention that individual backends calculate and/or enforce tighter limits of their own where practical, and BUILD_ASSERT() that those fit within the packet layer's frame size limit.
Ah, nice, that definitely sounds like an improvement.
-- 
Stefano
On Mon, Jan 06, 2025 at 12:43:27AM +0100, Stefano Brivio wrote:
On Fri, 3 Jan 2025 12:16:45 +1100 David Gibson wrote:
On Thu, Jan 02, 2025 at 10:59:48PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 12:00:30 +1100 David Gibson wrote:
On Wed, Jan 01, 2025 at 10:54:33PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:29 +1100 David Gibson wrote:
We verify that every packet we store in a pool, and every partial packet we retrieve from it, has a length no longer than UINT16_MAX. This originated in the older packet pool implementation, which stored packet lengths in a uint16_t. Now that packets are represented by a struct iovec with its size_t length, this check serves only as a sanity / security check that we don't have some wildly out of range length due to a bug elsewhere.
However, UINT16_MAX (65535) isn't quite enough, because the "packet" as stored in the pool is in fact an entire frame including both L2 and any backend specific headers. We can exceed this in passt mode, even with the default MTU: 65520 bytes of IP datagram + 14 bytes of Ethernet header + 4 bytes of qemu stream length header = 65538 bytes.
Introduce our own define for the maximum length of a packet in the pool and set it slightly larger, allowing 128 bytes for L2 and/or other backend specific headers. We'll use different amounts of that depending on the tap backend, but since this is just a sanity check, the bound doesn't need to be 100% tight.
I couldn't find the time to check what's the maximum amount of bytes we can get here depending on hypervisor and interface, but if this patch
So, it's a separate calculation for each backend type, and some of them are pretty tricky.
For anything based on the kernel tap device it is 65535, because it has an internal frame size limit of 65535, already including any L2 headers (it explicitly limits the MTU to 65535 - hard_header_len). There is no "hardware" header.
For the qemu stream protocol it gets pretty complicated, because there are multiple layers which could clamp the maximum size. It doesn't look like the socket protocol code itself imposes a limit beyond the structural one of (2^32-1 + 4) (well, and putting it into an ssize_t, which could be less for 32-bit systems). AFAICT, it's not theoretically impossible to have gigabyte frames with a weird virtual NIC model... though obviously that wouldn't be IP, and probably not even Ethernet.
Theoretically speaking, it could actually be IPv6 with Jumbograms. They never really gained traction (because of Ethernet, I think) and we don't support them, but the only attempt to deprecate them I'm aware of didn't succeed (yet):
https://datatracker.ietf.org/doc/draft-jones-6man-historic-rfc2675/
...and I actually wonder if somebody will ever try to revive them for virtual networks, where they might make somewhat more sense (you could transfer filesystems with one packet per file or similar silly tricks).
Hm, yes. Well, one problem at a time. Well, ok, 2 or 3 problems at a time.
Each virtual NIC could have its own limit. I suspect that's going to be in the vicinity of 64k. But, I'm really struggling to figure out what it is just for virtio-net, so I really don't want to try to figure it out for all of them. With a virtio-net NIC, I seem to be able to set MTU all the way up to 65535 successfully, which implies a maximum frame size of 65535 + 14 (L2 header) + 4 (stream protocol header) = 65553 at least.
The Layer-2 header is included (because that also happens to be ETH_MAX_MTU, on Linux), it wouldn't be on top.
No. Or if so, that's a guest side bug.
I see your point, but it's pretty much a universal bug. I've never seen a packet interface on Linux that's able to send more than 65535 bytes, and yet, there are several drivers that allow the MTU to be 65535 bytes.
That's unfortunate :/.
Yes, they should be fixed, eventually, but I guess the obstacle to fixing them is that there are, of course, two ways to fix that.
The correct one (enabling frames to be bigger than 64 KiB) would probably uncover all kinds of issues and perhaps kill throughput in many cases. The wrong one (clamping the MTU) is... well, wrong. But it would be the only sane option, I suppose.
I don't think clamping the MTU is wrong, particularly. A driver for a
physical NIC should clamp the MTU based on the largest frame the device
will let you send (which I suspect will often be much less than 64k,
particularly for older devices). So, I think it's reasonable for a
virtual NIC to clamp the MTU based on limitations of the "virtual
hardware" - like having 64k buffers for the frames.

But my point here is that because the 64k frame limit isn't structural
or normative in Ethernet, I don't think we should rely on it. At least,
not if it's relatively easy to not do so, which I think it is. The 64k
datagram limit *is* structural in IP, so I think it makes more sense to
base our hard limits on that - which means allowing L2 frames slightly
larger.
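Spelled out, the clamping rule endorsed here (and already applied by
tuntap) is just the following; the helper name is made up for the
sketch:

	/* If the device can't carry an L2 frame longer than max_frame
	 * bytes, the MTU it advertises must shrink by the link-layer
	 * header length: e.g. 65535 - 14 = 65521 for Ethernet */
	static unsigned int clamped_mtu(unsigned int max_frame,
					unsigned int hard_header_len)
	{
		return max_frame - hard_header_len;
	}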
The MTU set with ip-link is the maximum L3 size - at the default 1500, the L2 frame can be 1514 bytes. If the driver can't send an L2 frame greater than 65535 bytes, then it should clamp the MTU to (65535 - hard_header_len) like tuntap already does.
I do think ETH_MAX_MTU is a confusing name: is it the maximum (IP) MTU which can be had in an ethernet frame (that's longer), or is it the maximum ethernet frame size (leading to an IP mtu of 65521).
Historically, I think it used to be something on the lines of "maximum frame size of something that looks like Ethernet".
Note that the maximum frame size allowed by 802.3 is 1500 bytes.
802.3 with the length field, as opposed to Ethernet with EtherType? AFAICT (though I'm struggling to find a normative reference) the limit there too is 1500 bytes for *payload*, meaning 1514 byte frames, or maybe a little more with vlan or other extensions (or if you're counting the FCS).
With Jumbo frames, one can typically have 9000 or 9216 bytes, but there's no defined standard.
I plan to eliminate use of ETH_MAX_MTU in favour of clearer things in my MTU series. I should merge that with this one, it might make the context clearer.
Well but it's used in the kernel anyway, and that's where the confusion comes from.
Right, but for what? As a limit on the IP MTU you can set on an Ethernet device, or as a limit on the L2 frame size for an Ethernet device. I mean, I suspect both, increasing the confusion, but just because there's an existing mess doesn't mean we should copy it.
AFAICT there is *no* structural limit on the size of an ethernet frame; the length isn't in any header, it's just assumed to be reported out of band by the hardware. No theoretical reason that reporting mechanism couldn't allow lengths > 65535, whether slightly (65535 bytes of payload + header & FCS) or vastly.
Same as my understanding.
Similar situation for vhost-user, where I'm finding it even more inscrutable to figure out what limits are imposed at the sub-IP levels. At the moment the "hardware" header (virtio_net_hdr_mrg_rxbuf) doesn't count towards what we store in the packet.c layer, but we might have reasons to change that.
So, any sub-IP limits for qemu, I'm basically not managing to find. However, we (more or less) only care about IP, which imposes a more practical limit of: 65535 + L2 header size + "hardware" header size.
At present that maxes out at 65553, as above, but if we ever support other L2 encapsulations, or other backend protocols with larger "hardware" headers, that could change.
Okay. I was thinking of a more practical approach, based on the fact that we only support Ethernet anyway, with essentially four types of adapters (three virtio-net implementations, and tap), plus rather rare reports with e1000e (https://bugs.passt.top/show_bug.cgi?id=107) and we could actually test things.
Well, I did that anyway, because I'm not quite comfortable dropping the UINT16_MAX check in packet_add_do() without testing...
We're not dropping it, just raising the limit, fairly slightly.
and yes, with this patch we can trigger a buffer overflow with vhost-user. In detail:
- muvm, virtio-net with 65535 bytes MTU:
- ping -s 65492 works (65534 bytes "on wire" from pcap) - ping -s 65493 doesn't because the guest fragments: 65530 plus 39 bytes on wire (that's with a newer kernel, probably that's the reason why it's not the same limit as QEMU, below)
That's a guest driver bug. If the MTU is 65535, it should be able to send a 65535 byte IP datagram without fragmentation. Looks like it needs to clamp the MTU based on L2 limitations.
I would have assumed that the issue is in the virtio-net driver in the Linux kernel (not the network implementation in libkrun or in the virtio-drivers Rust crate), but what's interesting is that we have one byte of difference with virtio-net as implemented by QEMU... so probably my assumption is wrong.
- QEMU, virtio-net without vhost-user, 65535 bytes MTU:
- ping -s 65493 works (65535 bytes "on wire" from pcap) - with -s 65494: "Bad frame size from guest, resetting connection"
That's our check, which I plan to fix in the MTU series.
- QEMU, virtio-net with vhost-user, 65535 bytes MTU:
- ping -s 65493 works (65535 bytes "on wire" from pcap) - ping -s 65494:
*** buffer overflow detected ***: terminated
without this patch, we catch that in packet_add_do() (and without 9/12 we don't crash)
Ouch. That's a bug in our vhost-user code. The easy fix would be to clamp MTU to 65521, arguably more correct would be to decouple its notion of maximum frame size from ETH_MAX_MTU.
Where would you clamp that? I'm not sure if it's something that we can negotiate over the vhost-user protocol. If we can't, then we need to find another solution for compatibility, even if we fix it in the kernel.
So, what I was thinking of was clamping allowed values to --mtu so we don't _tell_ the guest to do something that will break us. But, yes, we also need to enforce our safety in the vhost-user backend as well. Depending on the details that could either mean allowing it to accept frames slightly larger than 65535 bytes, or filtering out such frames early enough that we don't blow up.
- tap, 65521 bytes MTU (maximum allowed, I think 65520 would be the correct maximum though):
No, 65521 is correct: 65521+14 = 65535 which is the maximum allowed tap frame size. Seems like there are other drivers that should also be clamping their max MTU similarly, but aren't.
I don't remember exactly why now, but because of some combination of requirements from normative references, you can't really have an MTU that's not a multiple of (32-bit) IPv4 words (and things actually break with 65521... maybe TCP?). See also passt(1).
I mean, you absolutely can set that as an MTU, and send at least ICMP frames of exactly that length. It's probably not a good idea to send odd-length frames, but you can do it.
That's why we use 65520 by default. I'll try to find out if you're interested.
Right, and I think 65520 is the right default: 65521 rounded down to a sane multiple. The enforced limit is a different matter.
- ping -s 65493 works (65535 bytes on wire) - ping -s 65494 doesn't (65530 bytes + 40 bytes fragments)
This time the fragmentation is correct, because the MTU is only 65521.
So, I guess we should find out the issue with vhost-user first.
Yeah. I definitely need to intermingle the MTU series with this one to get the order of operations right.
Other than that, it looks like we can reach at least 65539 bytes if we add the "virtio-net" length descriptors, but still the frame size itself (which is actually what matters for the functions in packet.c)
Well.. sort of. The packet.c functions really care nothing about any of the layers, it's just a blob of data to them. I did miss that tap_passt_input() excludes the qemu header before inserting the frame into the pool, so indeed, the pool layer won't currently see length greater than 65535. Of course, since that frame header *is* stored in pkt_buf with all the rest, that's arguably not using the pool layer's buffer bound checks to the full extent we could.
Hm, yes, true.
But on the other hand, if we just say "plus 128", then we're not using bound checks to the fullest extent we can, either.
Well, by "fullest extent" I'm not meaning the tightest possible bounds. I'm just meaning that we check that every byte which should live in the packet buffer *does* live in the packet buffer. Changing the length limit doesn't alter that, but excluding the "hardware" header from the pool descriptor does.
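The property being described might be checked as below; names are
assumed, and the comparisons are arranged so that no intermediate
pointer arithmetic can overflow (cf. patch 12 of this series):

	#include <stdbool.h>
	#include <stddef.h>

	/* Does [ptr, ptr + len) lie within [buf, buf + buf_size)? */
	static bool range_in_buf(const char *buf, size_t buf_size,
				 const char *ptr, size_t len)
	{
		if (ptr < buf || ptr > buf + buf_size)
			return false;

		/* Remaining space after ptr must cover len */
		return len <= buf_size - (size_t)(ptr - buf);
	}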
can't exceed 65535 bytes, at least from my tests.
Then, yes, I agree that it's not actually correct, even though it fits all the use cases we have, because we *could* have an implementation exceeding that value (at the moment, it looks like we don't).
So, it seems like the Linux drivers might not actually generate ethernet frames > 65535 bytes - although they don't all correctly reduce their maximum MTU to reflect that. I don't think we should rely on that; AFAICT it would be reasonable for a driver + VMM implementation to allow ethernet frames that are at least 65535 bytes + L2 header. That might also allow for 16 byte 802.1q vlan L2 headers.
If it's convenient in our implementation (I think it is, especially on the opposite path), then I think we can kind of rely on it, in the sense that we could "simply" be robust to drivers that send out frames bigger than 65535 bytes (I've seen none to date, not even with VLANs), and change things if somebody ever needs that.
Yes, that's all I'm suggesting, that we tolerate frames with 65535 bytes at L3. At least when the backend allows it - and in cases where it doesn't we shouldn't let the user set an MTU that high. I'm not suggesting we change our default MTU from 65520.
I mean, probably, the people who're the most likely to try and add support for bigger frames in hypervisors/kernel at some point in the future are actually us, because with user-mode networking the guest-facing MTU is almost entirely useless.
fixes an actual issue as you seem to imply, actually checking that with QEMU and muvm would be nice.
By the way, as you mention a specific calculation, does it really make sense to use a "good enough" value here? Can we ever exceed 65538 bytes, or can we use that as limit? It would be good to find out, while at it.
So, yes, I think we can exceed 65538. But more significantly, trying to make the limit tight here feels like a borderline layering violation. The packet layer doesn't really care about the frame size as long as it's "sane".
It might still be convenient for some back-ends to define "sane" as 64 KiB. I'm really not sure if it is, I didn't look into the matching part of vhost-user in detail.
That's fair, but I don't think the pool layer should impose that limit on the backends, because I think it's equally reasonable for another backend to allow slightly larger frames with a 65535 byte L3 payload. Hence setting the limit to 64k + "a little bit".
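As a sketch only - the 128 byte slack is the "plus 128" figure floated above, not a settled value - such a loose limit could look like:

/* Loose frame size limit for the pool layer: a full 65535 byte L3
 * payload plus generous room for any sane L2 header.  Sketch only,
 * under the assumptions discussed in this thread.
 */
#define PACKET_MAX_LEN	((size_t)UINT16_MAX + 128)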
If it's buggy because we have a user that can exceed that, sure, let's fix it. If not... also fine by me as it's some kind of theoretical flaw, but we shouldn't crash.
Fwiw, in the draft changes I have improving MTU handling, it's my intention that individual backends calculate and/or enforce tighter limits of their own where practical, and BUILD_ASSERT() that those fit within the packet layer's frame size limit.
Ah, nice, that definitely sounds like an improvement.
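As a rough illustration of that intention (the backend-side names below are hypothetical; BUILD_ASSERT() and PACKET_MAX_LEN are the identifiers this series actually uses):

/* Hypothetical backend limit: Ethernet plus an 802.1q tag on top of a
 * full 65535 byte L3 payload.
 */
#define MY_BACKEND_L2_MAX	(14 + 4)
#define MY_BACKEND_FRAME_MAX	(65535 + MY_BACKEND_L2_MAX)

/* Fails the build if the backend's limit ever outgrows what the packet
 * layer accepts.
 */
BUILD_ASSERT(MY_BACKEND_FRAME_MAX <= PACKET_MAX_LEN);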
--
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.  http://www.ozlabs.org/~dgibson
Two places in the code use the packet_get_try() variant on packet_get().
The difference is that packet_get_try() passes a NULL 'func' to
packet_get_do(), which suppresses log messages. The places we use this
are ones where we expect to sometimes have a failure retrieving the packet
range, even in normal cases. So, suppressing the log messages seems like
it makes sense, except:
1) It suppresses log messages on all errors. We (somewhat) expect to hit
the case where the requested range is not within the received packet.
However, it also suppresses messages in cases where the requested packet
index doesn't exist, the requested range has a nonsensical length or
doesn't lie in even the right vague part of memory. None of those
latter cases are expected, and the message would be helpful if we ever
actually hit them.
2) The suppressed messages aren't actually that disruptive. For the case
in ip.c, we'd log only if we run out of IPv6 packet before reaching a
(non-option) L4 header. That shouldn't be the case in normal operation
so getting a message (at trace level) is not unreasonable.
For the case in dhcpv6.c we do suppress a message every time we look for
but don't find a specific DHCPv6 option. That can happen in perfectly
ok cases, but, again these are trace() level and DHCPv6 transactions
aren't that common. Suppressing the message for this case doesn't
seem worth the drawbacks of (1).
Signed-off-by: David Gibson
On Fri, 20 Dec 2024 19:35:30 +1100
David Gibson
Two places in the code use the packet_get_try() variant on packet_get(). The difference is that packet_get_try() passes a NULL 'func' to packet_get_do(), which suppresses log messages. The places we use this are ones where we expect to sometimes have a failure retreiving the packet range, even in normal cases. So, suppressing the log messages seems like it makes sense, except:
1) It suppresses log messages on all errors. We (somewhat) expect to hit the case where the requested range is not within the received packet. However, it also suppresses message in cases where the requested packet index doesn't exist, the requested range has a nonsensical length or doesn't lie in even the right vague part of memory. None of those latter cases are expected, and the message would be helpful if we ever actually hit them.
2) The suppressed messages aren't actually that disruptive. For the case in ip.c, we'd log only if we run out of IPv6 packet before reaching a (non-option) L4 header. That shouldn't be the case in normal operation so getting a message (at trave level) is not unreasonable. For the case in dhcpv6.c we do suppress a message every time we look for but don't find a specific DHCPv6 option. That can happen in perfectly ok cases, but, again these are trace() level and DHCPv6 transactions aren't that common. Suppressing the message for this case doesn't seem worth the drawbacks of (1).
The reason why I implemented packet_get_try() is that the messages from packet_get_do() indicate serious issues, and if I'm debugging something by looking at traces it's not great to have messages indicating that we hit a serious issue while we're simply validating identity associations. It's not about the amount of logged messages, it's about the type of message being logged and the distracting noise possibly resulting in a substantial time waste.

About 1): dhcpv6_opt() always picks pool index 0, and the base offset was already checked by the caller. In ipv6_l4hdr(), the index was already validated by a previous call to packet_get(), and the starting offset as well.

I guess the main reason to have this patch in this series is to make a later change simpler, but I'm not sure which one, so I don't know what to suggest as an alternative.
Signed-off-by: David Gibson
---
 dhcpv6.c |  2 +-
 ip.c     |  2 +-
 packet.c | 18 +++++-------------
 packet.h |  3 ---
 4 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/dhcpv6.c b/dhcpv6.c
index 0523bba8..c0d05131 100644
--- a/dhcpv6.c
+++ b/dhcpv6.c
@@ -271,7 +271,7 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset,
 	if (!*offset)
 		*offset = sizeof(struct udphdr) + sizeof(struct msg_hdr);

-	while ((o = packet_get_try(p, 0, *offset, sizeof(*o), &left))) {
+	while ((o = packet_get(p, 0, *offset, sizeof(*o), &left))) {
 		unsigned int opt_len = ntohs(o->l) + sizeof(*o);

 		if (ntohs(o->l) > left)

diff --git a/ip.c b/ip.c
index 2cc7f654..e391f81b 100644
--- a/ip.c
+++ b/ip.c
@@ -51,7 +51,7 @@ char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto,
 	if (!IPV6_NH_OPT(nh))
 		goto found;

-	while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) {
+	while ((o = packet_get(p, idx, offset, sizeof(*o), dlen))) {
 		nh = o->nexthdr;
 		hdrlen = (o->hdrlen + 1) * 8;

diff --git a/packet.c b/packet.c
index bcac0375..c921aa15 100644
--- a/packet.c
+++ b/packet.c
@@ -112,27 +112,19 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 	char *ptr;

 	if (idx >= p->size || idx >= p->count) {
-		if (func) {
-			trace("packet %zu from pool size: %zu, count: %zu, "
-			      "%s:%i", idx, p->size, p->count, func, line);
-		}
+		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
+		      idx, p->size, p->count, func, line);
If you change this, the documentation of the arguments for packet_get_do() should be updated as well.
 		return NULL;
 	}

 	if (len > PACKET_MAX_LEN) {
-		if (func) {
-			trace("packet data length %zu, %s:%i",
-			      len, func, line);
-		}
+		trace("packet data length %zu, %s:%i", len, func, line);
 		return NULL;
 	}

 	if (len + offset > p->pkt[idx].iov_len) {
-		if (func) {
-			trace("data length %zu, offset %zu from length %zu, "
-			      "%s:%i", len, offset, p->pkt[idx].iov_len,
-			      func, line);
-		}
+		trace("data length %zu, offset %zu from length %zu, %s:%i",
+		      len, offset, p->pkt[idx].iov_len, func, line);
 		return NULL;
 	}

diff --git a/packet.h b/packet.h
index 05d37695..f95cda08 100644
--- a/packet.h
+++ b/packet.h
@@ -45,9 +45,6 @@ void pool_flush(struct pool *p);
 #define packet_get(p, idx, offset, len, left)				\
 	packet_get_do(p, idx, offset, len, left, __func__, __LINE__)

-#define packet_get_try(p, idx, offset, len, left)			\
-	packet_get_do(p, idx, offset, len, left, NULL, 0)
-
 #define PACKET_POOL_DECL(_name, _size, _buf)				\
 struct _name ## _t {							\
 	char *buf;							\
-- Stefano
On Wed, Jan 01, 2025 at 10:54:37PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:30 +1100 David Gibson
wrote: Two places in the code use the packet_get_try() variant on packet_get(). The difference is that packet_get_try() passes a NULL 'func' to packet_get_do(), which suppresses log messages. The places we use this are ones where we expect to sometimes have a failure retreiving the packet range, even in normal cases. So, suppressing the log messages seems like it makes sense, except:
1) It suppresses log messages on all errors. We (somewhat) expect to hit the case where the requested range is not within the received packet. However, it also suppresses message in cases where the requested packet index doesn't exist, the requested range has a nonsensical length or doesn't lie in even the right vague part of memory. None of those latter cases are expected, and the message would be helpful if we ever actually hit them.
2) The suppressed messages aren't actually that disruptive. For the case in ip.c, we'd log only if we run out of IPv6 packet before reaching a (non-option) L4 header. That shouldn't be the case in normal operation so getting a message (at trave level) is not unreasonable. For the case in dhcpv6.c we do suppress a message every time we look for but don't find a specific DHCPv6 option. That can happen in perfectly ok cases, but, again these are trace() level and DHCPv6 transactions aren't that common. Suppressing the message for this case doesn't seem worth the drawbacks of (1).
The reason why I implemented packet_get_try() is that the messages from packet_get_do() indicate serious issues, and if I'm debugging something by looking at traces it's not great to have messages indicating that we hit a serious issue while we're simply validating identity associations.
I'm not following your argument here. It's exactly because (most of) the messages indicate serious issues that I don't want to suppress them. I don't know what you mean by "validating identity associations".
It's not about the amount of logged messages, it's about the type of message being logged and the distracting noise possibly resulting in a substantial time waste.
About 1): dhcpv6_opt() always picks pool index 0, and the base offset was already checked by the caller.
Right, but dhcpv6_opt() is called in a loop, that only stops when it returns NULL. So, by construction the last call to dhcpv6_opt(), which terminates the loop, _will_ have a failing call to packet_get(). At this point - at least assuming a correctly constructed packet - the offset will point to just past the last option, which should be exactly at the end of the packet.
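To make the shape of that loop concrete, a simplified sketch of the calling pattern (the option-type argument to dhcpv6_opt() is elided here; this is not the verbatim dhcpv6.c code):

struct opt_hdr *o;
size_t offset = 0;	/* dhcpv6_opt() rewinds this to the first option */

/* On a well-formed packet, the packet_get() inside dhcpv6_opt() fails
 * exactly once: on the final iteration, when 'offset' sits just past
 * the last option, i.e. at the end of the packet.
 */
while ((o = dhcpv6_opt(p, &offset))) {
	/* ...process option 'o'... */
}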
In ipv6_l4hdr(), the index was already validated by a previous call to packet_get(), and the starting offset as well.
Not AFAICT, the initial packet_get just validates the basic IPv6 header. The calls to packet_get_try() in the loop validate additional IP options. I don't think it will ever fail on a well-constructed packet, but it could on a bogus (or truncated) packet, where the nexthdr field indicates an option that's actually missing. This is kind of my point: it will only trip on a badly constructed packet, in which case I don't think we want to suppress messages.
I guess the main reason to have this patch in this series is to make a later change simpler, but I'm not sure which one, so I don't know what to suggest as an alternative.
Mostly the next one - making more distinction between the different error severities that can occur here. Having to deal with the possibility of func being NULL complicates things.
Signed-off-by: David Gibson
--- dhcpv6.c | 2 +- ip.c | 2 +- packet.c | 18 +++++------------- packet.h | 3 --- 4 files changed, 7 insertions(+), 18 deletions(-) diff --git a/dhcpv6.c b/dhcpv6.c index 0523bba8..c0d05131 100644 --- a/dhcpv6.c +++ b/dhcpv6.c @@ -271,7 +271,7 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset, if (!*offset) *offset = sizeof(struct udphdr) + sizeof(struct msg_hdr);
- while ((o = packet_get_try(p, 0, *offset, sizeof(*o), &left))) { + while ((o = packet_get(p, 0, *offset, sizeof(*o), &left))) { unsigned int opt_len = ntohs(o->l) + sizeof(*o);
if (ntohs(o->l) > left) diff --git a/ip.c b/ip.c index 2cc7f654..e391f81b 100644 --- a/ip.c +++ b/ip.c @@ -51,7 +51,7 @@ char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto, if (!IPV6_NH_OPT(nh)) goto found;
- while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) { + while ((o = packet_get(p, idx, offset, sizeof(*o), dlen))) { nh = o->nexthdr; hdrlen = (o->hdrlen + 1) * 8;
diff --git a/packet.c b/packet.c index bcac0375..c921aa15 100644 --- a/packet.c +++ b/packet.c @@ -112,27 +112,19 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset, char *ptr;
if (idx >= p->size || idx >= p->count) { - if (func) { - trace("packet %zu from pool size: %zu, count: %zu, " - "%s:%i", idx, p->size, p->count, func, line); - } + trace("packet %zu from pool size: %zu, count: %zu, %s:%i", + idx, p->size, p->count, func, line);
If you change this, the documentation of the arguments for packet_get_do() should be updated as well.
Fixed. The doc for packet_add_do() also changed; it was already wrong there.
return NULL; }
if (len > PACKET_MAX_LEN) { - if (func) { - trace("packet data length %zu, %s:%i", - len, func, line); - } + trace("packet data length %zu, %s:%i", len, func, line); return NULL; }
if (len + offset > p->pkt[idx].iov_len) { - if (func) { - trace("data length %zu, offset %zu from length %zu, " - "%s:%i", len, offset, p->pkt[idx].iov_len, - func, line); - } + trace("data length %zu, offset %zu from length %zu, %s:%i", + len, offset, p->pkt[idx].iov_len, func, line); return NULL; }
diff --git a/packet.h b/packet.h index 05d37695..f95cda08 100644 --- a/packet.h +++ b/packet.h @@ -45,9 +45,6 @@ void pool_flush(struct pool *p); #define packet_get(p, idx, offset, len, left) \ packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
-#define packet_get_try(p, idx, offset, len, left) \ - packet_get_do(p, idx, offset, len, left, NULL, 0) - #define PACKET_POOL_DECL(_name, _size, _buf) \ struct _name ## _t { \ char *buf; \
--
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.  http://www.ozlabs.org/~dgibson
On Thu, 2 Jan 2025 13:15:40 +1100
David Gibson
On Wed, Jan 01, 2025 at 10:54:37PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:30 +1100 David Gibson
wrote: Two places in the code use the packet_get_try() variant on packet_get(). The difference is that packet_get_try() passes a NULL 'func' to packet_get_do(), which suppresses log messages. The places we use this are ones where we expect to sometimes have a failure retreiving the packet range, even in normal cases. So, suppressing the log messages seems like it makes sense, except:
1) It suppresses log messages on all errors. We (somewhat) expect to hit the case where the requested range is not within the received packet. However, it also suppresses message in cases where the requested packet index doesn't exist, the requested range has a nonsensical length or doesn't lie in even the right vague part of memory. None of those latter cases are expected, and the message would be helpful if we ever actually hit them.
2) The suppressed messages aren't actually that disruptive. For the case in ip.c, we'd log only if we run out of IPv6 packet before reaching a (non-option) L4 header. That shouldn't be the case in normal operation so getting a message (at trave level) is not unreasonable. For the case in dhcpv6.c we do suppress a message every time we look for but don't find a specific DHCPv6 option. That can happen in perfectly ok cases, but, again these are trace() level and DHCPv6 transactions aren't that common. Suppressing the message for this case doesn't seem worth the drawbacks of (1).
The reason why I implemented packet_get_try() is that the messages from packet_get_do() indicate serious issues, and if I'm debugging something by looking at traces it's not great to have messages indicating that we hit a serious issue while we're simply validating identity associations.
I'm not following your argument here. It's exactly because (most of) the message indicate serious issues that I don't want to suppress them. I don't know what you mean by "validating identity associations".
But dhcpv6_opt() trying to get data that doesn't exist is *not* an issue, including not a serious one, so if I'm debugging something with --trace and I see one of these messages I'll shout at "memory" or "packet" badness and waste time thinking it's an actual issue. Validating identity associations (IA_NA, IA_TA, RFC 3315) is what dhcpv6_ia_notonlink() does. That's the most common case where we'll routinely call dhcpv6_opt() to fetch data which isn't there.
It's not about the amount of logged messages, it's about the type of message being logged and the distracting noise possibly resulting in a substantial time waste.
About 1): dhcpv6_opt() always picks pool index 0, and the base offset was already checked by the caller.
Right, but dhcpv6_opt() is called in a loop, that only stops when it returns NULL. So, by construction the last call to dhcpv6_opt(), which terminates the loop, _will_ have a failing call to packet_get(). At this point - at least assuming a correctly constructed packet - the offset will point to just past the last option, which should be exactly at the end of the packet.
Yes, I get that, and:

- I would be happy if that one were *not* reported as failure
- the calls before that one should always be enough to check if we have an actual issue with the packet
In ipv6_l4hdr(), the index was already validated by a previous call to packet_get(), and the starting offset as well.
Not AFAICT, the initial packet_get just validates the basic IPv6 header. The calls to packet_get_try() in the loop validate additional IP options. I don't think it will ever fail on a well-constructed packet, but it could on a bogus (or truncated) packet, where the nexthdr field indicates an option that's actually missing.
This is kind of my point: it will only trip on a badly constructed packet, in which case I don't think we want to suppress messages.
There, I used packet_get_try() because a missing option or payload doesn't indicate a bad packet at the _data_ level. On the other hand, it's bad at the network level anyway, because option 59 *must* be there otherwise (I just realised), so while I'd still prefer another wording of the warning (not mentioning packet/buffer ranges... something more network-y), I would be fine with it.
I guess the main reason to have this patch in this series is to make a later change simpler, but I'm not sure which one, so I don't know what to suggest as an alternative.
Mostly the next one - making more distinction between the different error severities that can occur here. Having to deal with the possibility of func being NULL complicates things.
I fail to see that much of a complication as a result, and I'd still very much prefer to *not* have a warning for that case in dhcpv6_opt() (and a slight preference to have a different message for ipv6_l4hdr()), but if it really adds a ton of lines for whatever reason I'm missing, so be it...
Signed-off-by: David Gibson
--- dhcpv6.c | 2 +- ip.c | 2 +- packet.c | 18 +++++------------- packet.h | 3 --- 4 files changed, 7 insertions(+), 18 deletions(-) diff --git a/dhcpv6.c b/dhcpv6.c index 0523bba8..c0d05131 100644 --- a/dhcpv6.c +++ b/dhcpv6.c @@ -271,7 +271,7 @@ static struct opt_hdr *dhcpv6_opt(const struct pool *p, size_t *offset, if (!*offset) *offset = sizeof(struct udphdr) + sizeof(struct msg_hdr);
- while ((o = packet_get_try(p, 0, *offset, sizeof(*o), &left))) { + while ((o = packet_get(p, 0, *offset, sizeof(*o), &left))) { unsigned int opt_len = ntohs(o->l) + sizeof(*o);
if (ntohs(o->l) > left) diff --git a/ip.c b/ip.c index 2cc7f654..e391f81b 100644 --- a/ip.c +++ b/ip.c @@ -51,7 +51,7 @@ char *ipv6_l4hdr(const struct pool *p, int idx, size_t offset, uint8_t *proto, if (!IPV6_NH_OPT(nh)) goto found;
- while ((o = packet_get_try(p, idx, offset, sizeof(*o), dlen))) { + while ((o = packet_get(p, idx, offset, sizeof(*o), dlen))) { nh = o->nexthdr; hdrlen = (o->hdrlen + 1) * 8;
diff --git a/packet.c b/packet.c index bcac0375..c921aa15 100644 --- a/packet.c +++ b/packet.c @@ -112,27 +112,19 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset, char *ptr;
if (idx >= p->size || idx >= p->count) { - if (func) { - trace("packet %zu from pool size: %zu, count: %zu, " - "%s:%i", idx, p->size, p->count, func, line); - } + trace("packet %zu from pool size: %zu, count: %zu, %s:%i", + idx, p->size, p->count, func, line);
If you change this, the documentation of the arguments for packet_get_do() should be updated as well.
Fixed. The doc for packet_add_do() also changed, where it was already wrong.
return NULL; }
if (len > PACKET_MAX_LEN) { - if (func) { - trace("packet data length %zu, %s:%i", - len, func, line); - } + trace("packet data length %zu, %s:%i", len, func, line); return NULL; }
if (len + offset > p->pkt[idx].iov_len) { - if (func) { - trace("data length %zu, offset %zu from length %zu, " - "%s:%i", len, offset, p->pkt[idx].iov_len, - func, line); - } + trace("data length %zu, offset %zu from length %zu, %s:%i", + len, offset, p->pkt[idx].iov_len, func, line); return NULL; }
diff --git a/packet.h b/packet.h index 05d37695..f95cda08 100644 --- a/packet.h +++ b/packet.h @@ -45,9 +45,6 @@ void pool_flush(struct pool *p); #define packet_get(p, idx, offset, len, left) \ packet_get_do(p, idx, offset, len, left, __func__, __LINE__)
-#define packet_get_try(p, idx, offset, len, left) \ - packet_get_do(p, idx, offset, len, left, NULL, 0) - #define PACKET_POOL_DECL(_name, _size, _buf) \ struct _name ## _t { \ char *buf; \
-- Stefano
On Thu, Jan 02, 2025 at 11:00:04PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 13:15:40 +1100 David Gibson
wrote: On Wed, Jan 01, 2025 at 10:54:37PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:30 +1100 David Gibson
wrote: Two places in the code use the packet_get_try() variant on packet_get(). The difference is that packet_get_try() passes a NULL 'func' to packet_get_do(), which suppresses log messages. The places we use this are ones where we expect to sometimes have a failure retreiving the packet range, even in normal cases. So, suppressing the log messages seems like it makes sense, except:
1) It suppresses log messages on all errors. We (somewhat) expect to hit the case where the requested range is not within the received packet. However, it also suppresses message in cases where the requested packet index doesn't exist, the requested range has a nonsensical length or doesn't lie in even the right vague part of memory. None of those latter cases are expected, and the message would be helpful if we ever actually hit them.
2) The suppressed messages aren't actually that disruptive. For the case in ip.c, we'd log only if we run out of IPv6 packet before reaching a (non-option) L4 header. That shouldn't be the case in normal operation so getting a message (at trave level) is not unreasonable. For the case in dhcpv6.c we do suppress a message every time we look for but don't find a specific DHCPv6 option. That can happen in perfectly ok cases, but, again these are trace() level and DHCPv6 transactions aren't that common. Suppressing the message for this case doesn't seem worth the drawbacks of (1).
The reason why I implemented packet_get_try() is that the messages from packet_get_do() indicate serious issues, and if I'm debugging something by looking at traces it's not great to have messages indicating that we hit a serious issue while we're simply validating identity associations.
I'm not following your argument here. It's exactly because (most of) the message indicate serious issues that I don't want to suppress them. I don't know what you mean by "validating identity associations".
But dhcpv6_opt() trying to get data that doesn't exist is *not* an issue, including not a serious one, so if I'm debugging something with --trace and I see one of these messages I'll shout at "memory" or "packet" badness and waste time thinking it's an actual issue.
Oh.. I think I see the confusion. dhcpv6_opt() trying to get data that's not in the packet is not an issue. dhcpv6_opt() trying to get data that is (theoretically) within the packet, but *not* in the buffer indicates something very bad has happened. The former is exactly one check; every other one is in the second class - trying to separate those cases is the purpose of the later "different severities" patch.

The difficulty is that passing func==NULL to indicate the "try" case doesn't work if we want to still give useful errors for the serious cases: we need the function name for those too.

I had been considering printing occasional trace level messages for the ok case an acceptable tradeoff for not suppressing the messages which are serious. But I see your case for that being too confusing when debugging. I did have a draft where I used an explicit boolean flag to enable/disable the non-serious errors, but gave up on it for simplicity.

I'll look into a way to continue suppressing the non-serious error here. Maybe moving the (single) non-serious error case message into the caller with a wrapper.
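A minimal sketch of that wrapper idea - name and shape tentative, and assuming the caller has already validated the packet index, as both current users have:

/* Quiet variant at the call site: the one expected failure (range not
 * present in this packet) returns NULL silently; everything else still
 * goes through packet_get() and its usual messages.
 */
static void *packet_get_quiet(const struct pool *p, size_t idx,
			      size_t offset, size_t len, size_t *left)
{
	/* Assumes idx was validated by an earlier packet_get() */
	if (offset + len > p->pkt[idx].iov_len)
		return NULL;

	return packet_get(p, idx, offset, len, left);
}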
Validating identity associations (IA_NA, IA_TA, RFC 3315) is what dhcpv6_ia_notonlink() does. That's the most common case where we'll routinely call dhcpv6_opt() to fetch data which isn't there.
Ok.
It's not about the amount of logged messages, it's about the type of message being logged and the distracting noise possibly resulting in a substantial time waste.
About 1): dhcpv6_opt() always picks pool index 0, and the base offset was already checked by the caller.
Right, but dhcpv6_opt() is called in a loop, that only stops when it returns NULL. So, by construction the last call to dhcpv6_opt(), which terminates the loop, _will_ have a failing call to packet_get(). At this point - at least assuming a correctly constructed packet - the offset will point to just past the last option, which should be exactly at the end of the packet.
Yes, I get that, and:
- I would be happy if that one were *not* reported as failure
Right, that's also my preference, but as above I compromised on this to simplify preserving the error cases that do matter.
- the calls before that one should always be enough to check if we have an actual issue with the packet
Yes, in this case I think that's correct.
In ipv6_l4hdr(), the index was already validated by a previous call to packet_get(), and the starting offset as well.
Not AFAICT, the initial packet_get just validates the basic IPv6 header. The calls to packet_get_try() in the loop validate additional IP options. I don't think it will ever fail on a well-constructed packet, but it could on a bogus (or truncated) packet, where the nexthdr field indicates an option that's actually missing.
This is kind of my point: it will only trip on a badly constructed packet, in which case I don't think we want to suppress messages.
There, I used packet_get_try() because a missing option or payload doesn't indicate a bad packet at the _data_ level.
Not really sure what you mean by the data level, here.
On the other hand, it's bad at the network level anyway, because option 59 *must* be there otherwise (I just realised), so while I'd still prefer another wording of the warning (not mentioning packet/buffer ranges... something more network-y), I would be fine with it.
That sounds like another argument for moving the message for the "requested range is outside packet" case into the caller.

--
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.  http://www.ozlabs.org/~dgibson
On Fri, 3 Jan 2025 15:48:47 +1100
David Gibson
On Thu, Jan 02, 2025 at 11:00:04PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 13:15:40 +1100 David Gibson
wrote: On Wed, Jan 01, 2025 at 10:54:37PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:30 +1100 David Gibson
wrote: Two places in the code use the packet_get_try() variant on packet_get(). The difference is that packet_get_try() passes a NULL 'func' to packet_get_do(), which suppresses log messages. The places we use this are ones where we expect to sometimes have a failure retreiving the packet range, even in normal cases. So, suppressing the log messages seems like it makes sense, except:
1) It suppresses log messages on all errors. We (somewhat) expect to hit the case where the requested range is not within the received packet. However, it also suppresses message in cases where the requested packet index doesn't exist, the requested range has a nonsensical length or doesn't lie in even the right vague part of memory. None of those latter cases are expected, and the message would be helpful if we ever actually hit them.
2) The suppressed messages aren't actually that disruptive. For the case in ip.c, we'd log only if we run out of IPv6 packet before reaching a (non-option) L4 header. That shouldn't be the case in normal operation so getting a message (at trave level) is not unreasonable. For the case in dhcpv6.c we do suppress a message every time we look for but don't find a specific DHCPv6 option. That can happen in perfectly ok cases, but, again these are trace() level and DHCPv6 transactions aren't that common. Suppressing the message for this case doesn't seem worth the drawbacks of (1).
The reason why I implemented packet_get_try() is that the messages from packet_get_do() indicate serious issues, and if I'm debugging something by looking at traces it's not great to have messages indicating that we hit a serious issue while we're simply validating identity associations.
I'm not following your argument here. It's exactly because (most of) the message indicate serious issues that I don't want to suppress them. I don't know what you mean by "validating identity associations".
But dhcpv6_opt() trying to get data that doesn't exist is *not* an issue, including not a serious one, so if I'm debugging something with --trace and I see one of these messages I'll shout at "memory" or "packet" badness and waste time thinking it's an actual issue.
Oh.. I think I see the confusion. dhcpv6_opt() trying to get data that's not in the packet is not an issue. dhcpv6_opt() trying to get data that is (theoretically) within the packet, but *not* in the buffer indicates something very bad has happened. The former is exactly one check, every other one is the second class - trying to separate those cases is the purpose of the later "different severities" patch.
The difficulty is that passing func==NULL to indicate the "try" case doesn't work if we want to still give useful errors for the serious cases: we need the function name for those too.
I had been considering printing occasional trace level messages for the ok case an acceptable tradeoff for not suppressing the messages which are serious. But I see your case or that being too confusing when debugging. I did have a draft where I used an explicit boolean flag to enable/disable the non-serious errors, but gave up on it for simplicity.
I'll look into a way to continue suppressing the non-serious error here. Maybe moving the (single) non-serious error case message into the caller with a wrapper.
Okay, yes, thanks, that would be helpful.
Validating identity associations (IA_NA, IA_TA, RFC 3315) is what dhcpv6_ia_notonlink() does. That's the most common case where we'll routinely call dhcpv6_opt() to fetch data which isn't there.
Ok.
It's not about the amount of logged messages, it's about the type of message being logged and the distracting noise possibly resulting in a substantial time waste.
About 1): dhcpv6_opt() always picks pool index 0, and the base offset was already checked by the caller.
Right, but dhcpv6_opt() is called in a loop, that only stops when it returns NULL. So, by construction the last call to dhcpv6_opt(), which terminates the loop, _will_ have a failing call to packet_get(). At this point - at least assuming a correctly constructed packet - the offset will point to just past the last option, which should be exactly at the end of the packet.
Yes, I get that, and:
- I would be happy if that one were *not* reported as failure
Right, that's also my preference, but as above I compromised on this to simplify preserving the error cases that do matter.
- the calls before that one should always be enough to check if we have an actual issue with the packet
Yes, in this case I think that's correct.
In ipv6_l4hdr(), the index was already validated by a previous call to packet_get(), and the starting offset as well.
Not AFAICT, the initial packet_get just validates the basic IPv6 header. The calls to packet_get_try() in the loop validate additional IP options. I don't think it will ever fail on a well-constructed packet, but it could on a bogus (or truncated) packet, where the nexthdr field indicates an option that's actually missing.
This is kind of my point: it will only trip on a badly constructed packet, in which case I don't think we want to suppress messages.
There, I used packet_get_try() because a missing option or payload doesn't indicate a bad packet at the _data_ level.
Not really sure what you mean by the data level, here.
I meant that in the Layer 2 sense: the packet is fine at Layer 2, as there's no missing data, but it's bad at Layer 3 because IPv6 doesn't allow a missing next option.
On the other hand, it's bad at the network level anyway, because option 59 *must* be there otherwise (I just realised), so while I'd still prefer another wording of the warning (not mentioning packet/buffer ranges... something more network-y), I would be fine with it.
That sounds like another argument for moving the message for the "requested range is outside packet" case into the caller.
-- Stefano
On Mon, Jan 06, 2025 at 11:55:22AM +0100, Stefano Brivio wrote:
On Fri, 3 Jan 2025 15:48:47 +1100 David Gibson
wrote: On Thu, Jan 02, 2025 at 11:00:04PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 13:15:40 +1100 David Gibson
wrote: On Wed, Jan 01, 2025 at 10:54:37PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:30 +1100 David Gibson
wrote: Two places in the code use the packet_get_try() variant on packet_get(). The difference is that packet_get_try() passes a NULL 'func' to packet_get_do(), which suppresses log messages. The places we use this are ones where we expect to sometimes have a failure retreiving the packet range, even in normal cases. So, suppressing the log messages seems like it makes sense, except:
1) It suppresses log messages on all errors. We (somewhat) expect to hit the case where the requested range is not within the received packet. However, it also suppresses message in cases where the requested packet index doesn't exist, the requested range has a nonsensical length or doesn't lie in even the right vague part of memory. None of those latter cases are expected, and the message would be helpful if we ever actually hit them.
2) The suppressed messages aren't actually that disruptive. For the case in ip.c, we'd log only if we run out of IPv6 packet before reaching a (non-option) L4 header. That shouldn't be the case in normal operation so getting a message (at trave level) is not unreasonable. For the case in dhcpv6.c we do suppress a message every time we look for but don't find a specific DHCPv6 option. That can happen in perfectly ok cases, but, again these are trace() level and DHCPv6 transactions aren't that common. Suppressing the message for this case doesn't seem worth the drawbacks of (1).
The reason why I implemented packet_get_try() is that the messages from packet_get_do() indicate serious issues, and if I'm debugging something by looking at traces it's not great to have messages indicating that we hit a serious issue while we're simply validating identity associations.
I'm not following your argument here. It's exactly because (most of) the message indicate serious issues that I don't want to suppress them. I don't know what you mean by "validating identity associations".
But dhcpv6_opt() trying to get data that doesn't exist is *not* an issue, including not a serious one, so if I'm debugging something with --trace and I see one of these messages I'll shout at "memory" or "packet" badness and waste time thinking it's an actual issue.
Oh.. I think I see the confusion. dhcpv6_opt() trying to get data that's not in the packet is not an issue. dhcpv6_opt() trying to get data that is (theoretically) within the packet, but *not* in the buffer indicates something very bad has happened. The former is exactly one check, every other one is the second class - trying to separate those cases is the purpose of the later "different severities" patch.
The difficulty is that passing func==NULL to indicate the "try" case doesn't work if we want to still give useful errors for the serious cases: we need the function name for those too.
I had been considering printing occasional trace level messages for the ok case an acceptable tradeoff for not suppressing the messages which are serious. But I see your case or that being too confusing when debugging. I did have a draft where I used an explicit boolean flag to enable/disable the non-serious errors, but gave up on it for simplicity.
I'll look into a way to continue suppressing the non-serious error here. Maybe moving the (single) non-serious error case message into the caller with a wrapper.
Okay, yes, thanks, that would be helpful.
Validating identity associations (IA_NA, IA_TA, RFC 3315) is what dhcpv6_ia_notonlink() does. That's the most common case where we'll routinely call dhcpv6_opt() to fetch data which isn't there.
Ok.
It's not about the amount of logged messages, it's about the type of message being logged and the distracting noise possibly resulting in a substantial time waste.
About 1): dhcpv6_opt() always picks pool index 0, and the base offset was already checked by the caller.
Right, but dhcpv6_opt() is called in a loop, that only stops when it returns NULL. So, by construction the last call to dhcpv6_opt(), which terminates the loop, _will_ have a failing call to packet_get(). At this point - at least assuming a correctly constructed packet - the offset will point to just past the last option, which should be exactly at the end of the packet.
Yes, I get that, and:
- I would be happy if that one were *not* reported as failure
Right, that's also my preference, but as above I compromised on this to simplify preserving the error cases that do matter.
- the calls before that one should always be enough to check if we have an actual issue with the packet
Yes, in this case I think that's correct.
In ipv6_l4hdr(), the index was already validated by a previous call to packet_get(), and the starting offset as well.
Not AFAICT, the initial packet_get just validates the basic IPv6 header. The calls to packet_get_try() in the loop validate additional IP options. I don't think it will ever fail on a well-constructed packet, but it could on a bogus (or truncated) packet, where the nexthdr field indicates an option that's actually missing.
This is kind of my point: it will only trip on a badly constructed packet, in which case I don't think we want to suppress messages.
There, I used packet_get_try() because a missing option or payload doesn't indicate a bad packet at the _data_ level.
Not really sure what you mean by the data level, here.
I meant that in the sense of Layer-2: the packet is fine at Layer 2 in the sense that there's no missing data, but it's bad at Layer-3 level because IPv6 doesn't allow a missing next-option.
Oh, right. That's true, but in this case the caller is code dealing with L3 structure, so it seems reasonable that a failure should give an error.
On the other hand, it's bad at the network level anyway, because option 59 *must* be there otherwise (I just realised), so while I'd still prefer another wording of the warning (not mentioning packet/buffer ranges... something more network-y), I would be fine with it.
That sounds like another argument for moving the message for the "requested range is outside packet" case into the caller.
--
David Gibson (he or they)      | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you, not the other way
                               | around.  http://www.ozlabs.org/~dgibson
We already have the ASSERT() macro which will abort() passt based on a
condition. It always has a fixed error message based on its location and
the asserted expression. We have some upcoming cases where we want to
customise the message when hitting an assert.
Add abort_with_msg() and ASSERT_WITH_MSG() helpers to allow this.
Signed-off-by: David Gibson
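A sketch of what such helpers could look like, using plain stderr output to stay self-contained (the real patch presumably routes through passt's logging, and the exact prototypes may differ):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Like an assertion failure, but with a caller-supplied message */
void abort_with_msg(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vfprintf(stderr, fmt, ap);
	va_end(ap);
	fputc('\n', stderr);

	abort();
}

/* Like ASSERT(), but the message is chosen by the caller */
#define ASSERT_WITH_MSG(expr, ...)					\
	do {								\
		if (!(expr))						\
			abort_with_msg(__VA_ARGS__);			\
	} while (0)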
packet_add() and packet_get() can fail for a number of reasons, and we
currently treat them all basically the same: we log a trace() level message
and for packet_get() we return NULL. However the different causes of
error are quite different and suggest different handling:
1) If we run out of space in the pool on add, that's (rightly) non-fatal,
but is an unusual situation which we might want to know about. Promote
its message to debug() level.
2) All packet_check_range() errors and the checks on packet length indicate
a serious problem. Due to either a bug in calling code, or some sort
of memory-clobbering, we've been given addresses or sizes of packets
that are nonsensical. If things are this bad, we shouldn't try to carry
on; replace these with asserts.
3) Requesting a packet index that doesn't exist in the pool in packet_get()
indicates a bug in the caller - it should always check the index first.
Replace this with an assert.
4) On packet_get() requesting a section of the packet beyond its bounds is
correct already. This happens routinely from some callers when they
probe to see if certain parts of the packet are present. At worst it
indicates that the guest has generated a malformed packet, which we
should quietly drop as we already do.
Signed-off-by: David Gibson
On Fri, 20 Dec 2024 19:35:32 +1100
David Gibson
packet_add() and packet_get() can fail for a number of reasons, and we currently treat them all basically the same: we log a trace() level message and for packet_get() we return NULL. However the different causes of error are quite different and suggest different handling:
1) If we run out of space in the pool on add, that's (rightly) non-fatal, but is an unusual situation which we might want to know about. Promote it's message to debug() level.
2) All packet_check_range() errors and the checks on packet length indicate a serious problem. Due to either a bug in calling code, or some sort of memory-clobbering, we've been given addresses or sizes of packets that are nonsensical. If things are this bad, we shouldn't try to carry on replace these with asserts.
3) Requesting a packet index that doesn't exist in the pool in packet_get() indicates a bug in the caller - it should always check the index first. Replace this with an assert.
The reasons for 2) and 3) being trace() messages (they should probably be debug() following the same reasoning as 1)) are graceful degradation and security (availability).

If we have a packet triggering some sort of "memory clobbering", or an issue in the caller, that might very well be just one packet or a few of them. I think it's a disservice to users to crash in that case, and it has the potential to worsen a security flaw if this packet can be built on purpose. It's also a disservice to package maintainers because it has the potential to turn a harmless issue into an annoying situation with associated time pressure and everything.

Those are not warn() calls, by the way, because we don't want to make it too easy, in that (perhaps unlikely) case, to flood logs, and possibly hide information due to rotation.

I suppose that the argument for asserts here is so that we discover functional issues as soon as possible, which I understand of course, but it relies on two assumptions that might not hold:

1. the crash will be reported, while a debug() message will go unnoticed.

   Well, the crash is more noticeable, but that doesn't mean that it will be reported. Users might/will try to hack something around it or try to survive with slirp4netns and all its flaws. On the other hand, a malfunction (say, failed connections) might be equally likely to be reported, along with debug() messages.

2. the check and the assertion themselves are correct.

   This was not the case for the two most recent severe issues (commit 61c0b0d0f199, bug #105) we discovered through assertions.

Issues such as https://github.com/containers/podman/issues/22925, on the other hand, are good examples of how asserts saved us a lot of time and effort later on (compared to, say, "connections stop working after three days"). But I would argue that those are substantially different cases, where assertions are checking something that's much more inherent to the implementation.

In this case, I would argue that there's no need to crash, so we shouldn't. We can presumably carry on after that, and we're not leaking any resource or sneakily leaving behind some condition that will cause apparently unrelated issues at a later time.
4) On packet_get() requesting a section of the packet beyond its bounds is correct already. This happens routinely from some callers when they probe to see if certain parts of the packet are present. At worst it indicates that the guest has generate a malformed packet, which we should quietly drop as we already do.
Signed-off-by: David Gibson
---
 packet.c    | 71 ++++++++++++++++++++---------------------------------
 packet.h    |  3 ++-
 vu_common.c | 13 +++++++---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/packet.c b/packet.c
index c921aa15..24f12448 100644
--- a/packet.c
+++ b/packet.c
@@ -30,37 +30,27 @@
  * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
- * Return: 0 if the range is valid, -1 otherwise
+ * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
+ * the pool).
  */
-static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
-			      const char *func, int line)
+static void packet_check_range(const struct pool *p,
+			       const char *ptr, size_t len,
+			       const char *func, int line)
 {
 	if (p->buf_size == 0) {
-		int ret;
-
-		ret = vu_packet_check_range((void *)p->buf, ptr, len);
-
-		if (ret == -1)
-			trace("cannot find region, %s:%i", func, line);
-
-		return ret;
-	}
-
-	if (ptr < p->buf) {
-		trace("packet range start %p before buffer start %p, %s:%i",
-		      (void *)ptr, (void *)p->buf, func, line);
-		return -1;
-	}
-
-	if (ptr + len > p->buf + p->buf_size) {
-		trace("packet range end %p after buffer end %p, %s:%i",
-		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
-		      func, line);
-		return -1;
+		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
+		return;
 	}

-	return 0;
+	ASSERT_WITH_MSG(ptr >= p->buf,
+			"packet range start %p before buffer start %p, %s:%i",
+			(void *)ptr, (void *)p->buf, func, line);
+	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
+			"packet range end %p after buffer end %p, %s:%i",
+			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
+			func, line);
 }
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	size_t idx = p->count;

 	if (idx >= p->size) {
-		trace("add packet index %zu to pool with size %zu, %s:%i",
+		debug("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
 	}

-	if (packet_check_range(p, start, len, func, line))
-		return;
+	packet_check_range(p, start, len, func, line);

-	if (len > PACKET_MAX_LEN) {
-		trace("add packet length %zu, %s:%i", len, func, line);
-		return;
-	}
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"add packet length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
@@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 {
 	char *ptr;

-	if (idx >= p->size || idx >= p->count) {
-		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
-		      idx, p->size, p->count, func, line);
-		return NULL;
-	}
-
-	if (len > PACKET_MAX_LEN) {
-		trace("packet data length %zu, %s:%i", len, func, line);
-		return NULL;
-	}
+	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
+			"packet %zu from pool size: %zu, count: %zu, %s:%i",
+			idx, p->size, p->count, func, line);
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"packet range length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	if (len + offset > p->pkt[idx].iov_len) {
 		trace("data length %zu, offset %zu from length %zu, %s:%i",
@@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,

 	ptr = (char *)p->pkt[idx].iov_base + offset;

-	if (packet_check_range(p, ptr, len, func, line))
-		return NULL;
+	packet_check_range(p, ptr, len, func, line);

 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;

diff --git a/packet.h b/packet.h
index f95cda08..b164f77e 100644
--- a/packet.h
+++ b/packet.h
@@ -31,7 +31,8 @@ struct pool {
 	struct iovec pkt[];
 };

-int vu_packet_check_range(void *buf, const char *ptr, size_t len);
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,

diff --git a/vu_common.c b/vu_common.c
index 531f8786..528b9b08 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -24,10 +24,13 @@
  * @buf:	Array of the available memory regions
  * @ptr:	Start of desired data range
  * @size:	Length of desired data range
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
If those arguments are mandatory, I would drop the "For tracing" specification.
  *
- * Return: 0 if the zone is in a mapped memory region, -1 otherwise
+ * Aborts if the zone isn't in any of the device regions
  */
-int vu_packet_check_range(void *buf, const char *ptr, size_t len)
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line)
 {
 	struct vu_dev_region *dev_region;

@@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)

 		if (m <= ptr &&
 		    ptr + len <= m + dev_region->mmap_offset +
 				 dev_region->size)
-			return 0;
+			return;
 	}

-	return -1;
+	abort_with_msg(
+		"package range at %p, length %zd not within dev region %s:%i",
s/package/packet/
+		(void *)ptr, len, func, line);
 }

 /**
-- Stefano
On Wed, Jan 01, 2025 at 10:54:41PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:32 +1100 David Gibson
wrote: packet_add() and packet_get() can fail for a number of reasons, and we currently treat them all basically the same: we log a trace() level message and for packet_get() we return NULL. However the different causes of error are quite different and suggest different handling:
1) If we run out of space in the pool on add, that's (rightly) non-fatal, but is an unusual situation which we might want to know about. Promote it's message to debug() level.
2) All packet_check_range() errors and the checks on packet length indicate a serious problem. Due to either a bug in calling code, or some sort of memory-clobbering, we've been given addresses or sizes of packets that are nonsensical. If things are this bad, we shouldn't try to carry on replace these with asserts.
3) Requesting a packet index that doesn't exist in the pool in packet_get() indicates a bug in the caller - it should always check the index first. Replace this with an assert.
The reasons for 2) and 3) being trace() messages (they should probably be debug() following the same reasoning as 1)) are graceful degradation and security (availability).
Hmm..
If we have a packet triggering some sort of "memory clobbering", or an issue in the caller, that might very well be just one packet or a few of them.
I see the case for packet_add_do(). In that case it could be that this is just a bad packet triggering bugs in the initial processing code, and just dropping it will be sufficient to save us from damaging state. For packet_get_do(), however, this indicates that we have a packet that's out of bounds, despite the fact that we checked its bounds when it was added to the pool. That means we *already* have corrupted memory, and we have no way to assess the extent of the damage. It's entirely possible we're already badly compromised. Under those circumstances I think it is much better to terminate immediately, rather than risk doing any more hard to trace damage.
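Summarising how the failure modes line up with this argument, keyed to the checks visible in the diff (comments only, no behaviour implied beyond the discussion):

/* Checks in packet_get_do(), by severity as argued here:
 *
 *	idx >= p->size || idx >= p->count    caller bug: index never checked
 *	len > PACKET_MAX_LEN                 nonsensical length from caller
 *	len + offset > p->pkt[idx].iov_len   range not in packet: expected,
 *	                                     the one routine, non-fatal case
 *	packet_check_range() failure         range not in buffer: pool state
 *	                                     already corrupt
 */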
I think it's a disservice to users to crash in that case, and it has the potential to worsen a security flaw if this packet can be built on purpose. It's also a disservice to package maintainers because it has the potential to turn a harmless issue into an annoying situation with associated time pressure and everything.
Ok, that argument makes sense to me for packet_add_do(), but again, not for packet_get_do().
Those are not warn() calls, by the way, because we don't want to make it too easy, in that (perhaps unlikely) case, to flood logs, and possibly hide information due to rotation.
Right, I see the argument, but it also worries me that the message - which even if we are successfully containing the damage here definitely indicates a bug in the caller - could easily never be seen at trace or even debug level.
I suppose that the argument for asserts here is so that we discover functional issues as soon as possible, which I understand of course,
That's the main reason. The secondary reason is that it clarifies the meaning of a NULL return from packet_get_do(). With this change it means: you requested a range that's not in the packet, whereas previously it could mean either that or "something is horribly wrong in the pool state, and we don't know what".
but it relies on two assumptions that might not hold:
1. the crash will be reported, while a debug() message will go unnoticed.
Well, the crash is more noticeable, but that doesn't mean that it will be reported. Users might/will try to hack something around it or try to survive with slirp4netns and all its flaws.
I suppose so.
On the other hand, a malfunction (say, failed connections) might be equally likely to be reported, along with debug() messages.
That only applies if this causes malfunctions in intended behaviour. If it's triggered due to an attack, rather than a flaw handling expected behaviour, then there's no reason it would be reported.
2. the check and the assertion themselves are correct. This was not the case for the two most recent severe issues (commit 61c0b0d0f199, bug #105) we discovered through assertions.
That's fair, I can see why you'd be gun shy about asserts given that.
Issues such as https://github.com/containers/podman/issues/22925, on the other hand, are good examples of how asserts saved us a lot of time and effort later on (compared to, say, "connections stop working after three days"). But I would argue that those are substantially different cases where assertions are checking something that's much more inherent to the implementation.
Eh... I can't really see a way these assertions are different other than that one turned out to be wrong and the other didn't.
In this case, I would argue that there's no need to crash, so we shouldn't. We can presumably carry on after that, and we're not leaking any resource or sneakily leave behind some condition that will cause apparently unrelated issues at a later time.
Again, true (plausibly) for packet_add_do(), but not for packet_get_do().
4) On packet_get() requesting a section of the packet beyond its bounds is correct already. This happens routinely from some callers when they probe to see if certain parts of the packet are present. At worst it indicates that the guest has generate a malformed packet, which we should quietly drop as we already do.
This patch doesn't change that. We check against the stated packet bounds first and return NULL *before* we check against the buffer bounds and possibly trigger an assert.

Ok... here's a tentative proposal, let me know what you think:

* I promote the packet_check_range() messages to warn(), or even err(). Yes, this might allow for log flooding, but I would argue that if we have a bug allowing this path to be triggered and something is frequently triggering it, we *want* that to be noisy. We *think* we're containing the damage here, but we don't really know.

* I make packet_check_range() failures in packet_add_do() drop the packet and continue, as they used to

* packet_check_range() failures in packet_get_do() will still assert()

* I leave the assert()s for packet _index_ out of range
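In code terms, the proposal amounts to this severity table (a sketch of intent, not the final patch):

/* Tentative scheme proposed above:
 *
 *	packet_check_range() messages        warn()/err() instead of trace()
 *	range failure in packet_add_do()     drop the packet, carry on
 *	range failure in packet_get_do()     still assert(): pool corrupt
 *	bad packet index in packet_get_do()  still assert(): caller bug
 */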
Signed-off-by: David Gibson
---
 packet.c    | 71 ++++++++++++++++++++---------------------------------
 packet.h    |  3 ++-
 vu_common.c | 13 +++++++---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/packet.c b/packet.c
index c921aa15..24f12448 100644
--- a/packet.c
+++ b/packet.c
@@ -30,37 +30,27 @@
  * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
- * Return: 0 if the range is valid, -1 otherwise
+ * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
+ * the pool).
  */
-static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
-			      const char *func, int line)
+static void packet_check_range(const struct pool *p,
+			       const char *ptr, size_t len,
+			       const char *func, int line)
 {
 	if (p->buf_size == 0) {
-		int ret;
-
-		ret = vu_packet_check_range((void *)p->buf, ptr, len);
-
-		if (ret == -1)
-			trace("cannot find region, %s:%i", func, line);
-
-		return ret;
-	}
-
-	if (ptr < p->buf) {
-		trace("packet range start %p before buffer start %p, %s:%i",
-		      (void *)ptr, (void *)p->buf, func, line);
-		return -1;
-	}
-
-	if (ptr + len > p->buf + p->buf_size) {
-		trace("packet range end %p after buffer end %p, %s:%i",
-		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
-		      func, line);
-		return -1;
+		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
+		return;
 	}

-	return 0;
+	ASSERT_WITH_MSG(ptr >= p->buf,
+			"packet range start %p before buffer start %p, %s:%i",
+			(void *)ptr, (void *)p->buf, func, line);
+	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
+			"packet range end %p after buffer end %p, %s:%i",
+			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
+			func, line);
 }
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	size_t idx = p->count;

 	if (idx >= p->size) {
-		trace("add packet index %zu to pool with size %zu, %s:%i",
+		debug("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
 	}

-	if (packet_check_range(p, start, len, func, line))
-		return;
+	packet_check_range(p, start, len, func, line);

-	if (len > PACKET_MAX_LEN) {
-		trace("add packet length %zu, %s:%i", len, func, line);
-		return;
-	}
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"add packet length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
@@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 {
 	char *ptr;

-	if (idx >= p->size || idx >= p->count) {
-		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
-		      idx, p->size, p->count, func, line);
-		return NULL;
-	}
-
-	if (len > PACKET_MAX_LEN) {
-		trace("packet data length %zu, %s:%i", len, func, line);
-		return NULL;
-	}
+	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
+			"packet %zu from pool size: %zu, count: %zu, %s:%i",
+			idx, p->size, p->count, func, line);
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"packet range length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	if (len + offset > p->pkt[idx].iov_len) {
 		trace("data length %zu, offset %zu from length %zu, %s:%i",
@@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,

 	ptr = (char *)p->pkt[idx].iov_base + offset;

-	if (packet_check_range(p, ptr, len, func, line))
-		return NULL;
+	packet_check_range(p, ptr, len, func, line);

 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;
diff --git a/packet.h b/packet.h
index f95cda08..b164f77e 100644
--- a/packet.h
+++ b/packet.h
@@ -31,7 +31,8 @@ struct pool {
 	struct iovec pkt[];
 };

-int vu_packet_check_range(void *buf, const char *ptr, size_t len);
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/vu_common.c b/vu_common.c
index 531f8786..528b9b08 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -24,10 +24,13 @@
  * @buf:	Array of the available memory regions
  * @ptr:	Start of desired data range
  * @size:	Length of desired data range
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
If those arguments are mandatory, I would drop the "For tracing" specification.
  *
- * Return: 0 if the zone is in a mapped memory region, -1 otherwise
+ * Aborts if the zone isn't in any of the device regions
  */
-int vu_packet_check_range(void *buf, const char *ptr, size_t len)
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line)
 {
 	struct vu_dev_region *dev_region;

@@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)

 		if (m <= ptr &&
 		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
-			return 0;
+			return;
 	}

-	return -1;
+	abort_with_msg(
+		"package range at %p, length %zd not within dev region %s:%i",
s/package/packet/
+		(void *)ptr, len, func, line);
 }
/**
--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.  http://www.ozlabs.org/~dgibson
On Thu, 2 Jan 2025 13:58:19 +1100, David Gibson wrote:
On Wed, Jan 01, 2025 at 10:54:41PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:32 +1100, David Gibson wrote:

packet_add() and packet_get() can fail for a number of reasons, and we currently treat them all basically the same: we log a trace() level message, and for packet_get() we return NULL. However, the different causes of error are quite different and suggest different handling:
1) If we run out of space in the pool on add, that's (rightly) non-fatal, but is an unusual situation which we might want to know about. Promote its message to debug() level.
2) All packet_check_range() errors and the checks on packet length indicate a serious problem. Due to either a bug in calling code, or some sort of memory-clobbering, we've been given addresses or sizes of packets that are nonsensical. If things are this bad, we shouldn't try to carry on; replace these with asserts.
3) Requesting a packet index that doesn't exist in the pool in packet_get() indicates a bug in the caller - it should always check the index first. Replace this with an assert.
The reasons for 2) and 3) being trace() messages (they should probably be debug() following the same reasoning as 1)) are graceful degradation and security (availability).
Hmm..
If we have a packet triggering some sort of "memory clobbering", or an issue in the caller, that might very well be just one packet or a few of them.
I see the case for packet_add_do(). In that case it could be that this is just a bad packet triggering bugs in the initial processing code, and just dropping it will be sufficient to save us from damaging state.
For packet_get_do(), however, this indicates that we have a packet that's out of bounds, despite the fact that we checked its bounds when it was added to the pool. That means we *already* have corrupted memory, and we have no way to assess the extent of the damage. It's entirely possible we're already badly compromised. Under those circumstances I think it is much better to terminate immediately, rather than risk doing any more hard to trace damage.
It depends on which part of packet_get_do(). If we're exceeding the packet count or the size of a pool, then I guess yes, otherwise not so much.
I think it's a disservice to users to crash in that case, and it has the potential to worsen a security flaw if this packet can be built on purpose. It's also a disservice to package maintainers because it has the potential to turn a harmless issue into an annoying situation with associated time pressure and everything.
Ok, that argument makes sense to me for packet_add_do(), but again, not for packet_get_do().
Those are not warn() calls, by the way, because we don't want to make it too easy, in that (perhaps unlikely) case, to flood logs, and possibly hide information due to rotation.
Right, I see the argument, but it also worries me that the message - which even if we are successfully containing the damage here definitely indicates a bug in the caller - could easily never be seen at trace or even debug level.
I suppose that the argument for asserts here is so that we discover functional issues as soon as possible, which I understand of course,
That's the main reason. The secondary reason is that it clarifies the meaning of a NULL return from packet_get_do(). With this change it means: you requested a range that's not in the packet, whereas previously it could mean either that or "something is horribly wrong in the pool state, and we don't know what".
but it relies on two assumptions that might not hold:
1. the crash will be reported, while a debug() message will go unnoticed.
Well, the crash is more noticeable, but that doesn't mean that it will be reported. Users might/will try to hack something around it or try to survive with slirp4netns and all its flaws.
I suppose so.
On the other hand, a malfunction (say, failed connections) might be equally likely to be reported, along with debug() messages.
That only applies if this causes malfunctions in intended behaviour. If it's triggered due to an attack, rather than a flaw handling expected behaviour, then there's no reason it would be reported.
2. the check and the assertion themselves are correct. This was not the case for the two most recent severe issues (commit 61c0b0d0f199, bug #105) we discovered through assertions.
That's fair, I can see why you'd be gun shy about asserts given that.
Issues such as https://github.com/containers/podman/issues/22925, on the other hand, are good examples of how asserts saved us a lot of time and effort later on (compared to, say, "connections stop working after three days"). But I would argue that those are substantially different cases where assertions are checking something that's much more inherent to the implementation.
Eh... I can't really see a way these assertions are different other than that one turned out to be wrong and the other didn't.
Ah, sorry, I meant that there is a substantial difference between the ones related to Podman issue 22925 *and what might happen in packet_get_range()*. Not that there's a difference between those related to Podman issue 22925 and those from commit 61c0b0d0f199 and bug #105.
In this case, I would argue that there's no need to crash, so we shouldn't. We can presumably carry on after that, and we're not leaking any resource or sneakily leaving behind some condition that will cause apparently unrelated issues at a later time.
Again, true (plausibly) for packet_add_do(), but not for packet_get_do().
4) On packet_get(), requesting a section of the packet beyond its bounds is handled correctly already. This happens routinely from some callers when they probe to see if certain parts of the packet are present. At worst it indicates that the guest has generated a malformed packet, which we should quietly drop, as we already do.
This patch doesn't change that. We check against the stated packet bounds first and return NULL *before* we check against the buffer bounds and possibly trigger an assert.
Yes... I wasn't actually disputing that (quote is from yourself).
Ok... here's a tentative proposal, let me know what you think:
So, first off, I have to say that it's a bit hard to reason about something that is not giving users any problem at the moment (well, at least not without a part of this series), and which didn't really give us problems in the past.

It's difficult to quantify with experience and examples (even leaving the priority of the whole matter aside for a moment...). Anyway, let me try:
* I promote the packet_check_range() messages to warn(), or even err(). Yes, this might allow for log flooding, but I would argue that if we have a bug allowing this path to be triggered and something is frequently triggering it, we *want* that to be noisy. We *think* we're containing the damage here, but we don't really know.
I still think it's our responsibility to not flood the system log in *any* circumstance.

Remember that this is not necessarily *our* log where the worst that can happen is that we hide information. Even with sane system log configurations, one message per packet can get close to a substantial denial of service (especially with systemd-syslogd, see also https://github.com/systemd/systemd/issues/13425).

I guess you don't realise until you run passt without -f, for example with libvirt, for a large number of virtual machines. Or in KubeVirt. Or for a bunch of containers. It's a common and very much intended usage.

It's not true in general that we don't know. We can quantify the difference in damage very clearly if the attacker manages to make packet_check_range() print one line for each packet (and nothing else).
* I make packet_check_range() failures in packet_add_do() drop the packet and continue, as they used to

* packet_check_range() failures in packet_get_do() will still assert()
I guess the (start < p->buf) one makes sense, but not so much the one where we check that offset plus length is within the pool, because that one could be very well triggered by a specific packet itself.
* I leave the assert()s for packet _index_ out of range
Okay, yes, those make sense to me as well.
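To make the split concrete, here's a rough, self-contained sketch of what's being proposed. The names (range_ok(), add_packet(), get_packet()) are invented for illustration, it's not the passt code, and it ignores the pointer-arithmetic hardening handled elsewhere in this series:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Report-only range check: the policy on failure is left to the caller */
static bool range_ok(const char *buf, size_t buf_size,
		     const char *ptr, size_t len)
{
	if (ptr < buf || ptr > buf + buf_size)
		return false;
	return len <= (size_t)(buf + buf_size - ptr);
}

/* "add" path: a bad range drops the packet and carries on */
static void add_packet(const char *buf, size_t buf_size,
		       const char *start, size_t len)
{
	if (!range_ok(buf, buf_size, start, len)) {
		fprintf(stderr, "dropping packet with bad range\n");
		return;
	}
	/* ...store the descriptor... */
}

/* "get" path: the range was vetted on add, so a failure here means the
 * pool state itself is corrupted: terminate instead of carrying on */
static const char *get_packet(const char *buf, size_t buf_size,
			      const char *start, size_t len)
{
	if (!range_ok(buf, buf_size, start, len)) {
		fprintf(stderr, "corrupted pool state\n");
		abort();
	}
	return start;
}

int main(void)
{
	char buf[16];

	add_packet(buf, sizeof(buf), buf, 32);	/* dropped, no crash */
	get_packet(buf, sizeof(buf), buf, 8);	/* within bounds: fine */
	return 0;
}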
Signed-off-by: David Gibson
---
 packet.c    | 71 ++++++++++++++++++++---------------------------------
 packet.h    |  3 ++-
 vu_common.c | 13 +++++++---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/packet.c b/packet.c
index c921aa15..24f12448 100644
--- a/packet.c
+++ b/packet.c
@@ -30,37 +30,27 @@
  * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
- * Return: 0 if the range is valid, -1 otherwise
+ * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
+ * the pool).
  */
-static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
-			      const char *func, int line)
+static void packet_check_range(const struct pool *p,
+			       const char *ptr, size_t len,
+			       const char *func, int line)
 {
 	if (p->buf_size == 0) {
-		int ret;
-
-		ret = vu_packet_check_range((void *)p->buf, ptr, len);
-
-		if (ret == -1)
-			trace("cannot find region, %s:%i", func, line);
-
-		return ret;
-	}
-
-	if (ptr < p->buf) {
-		trace("packet range start %p before buffer start %p, %s:%i",
-		      (void *)ptr, (void *)p->buf, func, line);
-		return -1;
-	}
-
-	if (ptr + len > p->buf + p->buf_size) {
-		trace("packet range end %p after buffer end %p, %s:%i",
-		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
-		      func, line);
-		return -1;
+		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
+		return;
 	}

-	return 0;
+	ASSERT_WITH_MSG(ptr >= p->buf,
+			"packet range start %p before buffer start %p, %s:%i",
+			(void *)ptr, (void *)p->buf, func, line);
+	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
+			"packet range end %p after buffer end %p, %s:%i",
+			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
+			func, line);
 }
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	size_t idx = p->count;

 	if (idx >= p->size) {
-		trace("add packet index %zu to pool with size %zu, %s:%i",
+		debug("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
 	}

-	if (packet_check_range(p, start, len, func, line))
-		return;
+	packet_check_range(p, start, len, func, line);

-	if (len > PACKET_MAX_LEN) {
-		trace("add packet length %zu, %s:%i", len, func, line);
-		return;
-	}
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"add packet length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
@@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 {
 	char *ptr;

-	if (idx >= p->size || idx >= p->count) {
-		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
-		      idx, p->size, p->count, func, line);
-		return NULL;
-	}
-
-	if (len > PACKET_MAX_LEN) {
-		trace("packet data length %zu, %s:%i", len, func, line);
-		return NULL;
-	}
+	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
+			"packet %zu from pool size: %zu, count: %zu, %s:%i",
+			idx, p->size, p->count, func, line);
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"packet range length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	if (len + offset > p->pkt[idx].iov_len) {
 		trace("data length %zu, offset %zu from length %zu, %s:%i",
@@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,

 	ptr = (char *)p->pkt[idx].iov_base + offset;

-	if (packet_check_range(p, ptr, len, func, line))
-		return NULL;
+	packet_check_range(p, ptr, len, func, line);

 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;
diff --git a/packet.h b/packet.h
index f95cda08..b164f77e 100644
--- a/packet.h
+++ b/packet.h
@@ -31,7 +31,8 @@ struct pool {
 	struct iovec pkt[];
 };

-int vu_packet_check_range(void *buf, const char *ptr, size_t len);
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/vu_common.c b/vu_common.c
index 531f8786..528b9b08 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -24,10 +24,13 @@
  * @buf:	Array of the available memory regions
  * @ptr:	Start of desired data range
  * @size:	Length of desired data range
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
If those arguments are mandatory, I would drop the "For tracing" specification.
  *
- * Return: 0 if the zone is in a mapped memory region, -1 otherwise
+ * Aborts if the zone isn't in any of the device regions
  */
-int vu_packet_check_range(void *buf, const char *ptr, size_t len)
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line)
 {
 	struct vu_dev_region *dev_region;

@@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)

 		if (m <= ptr &&
 		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
-			return 0;
+			return;
 	}

-	return -1;
+	abort_with_msg(
+		"package range at %p, length %zd not within dev region %s:%i",
s/package/packet/
+		(void *)ptr, len, func, line);
 }
/**
--
Stefano
On Thu, Jan 02, 2025 at 11:00:08PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 13:58:19 +1100, David Gibson wrote:

On Wed, Jan 01, 2025 at 10:54:41PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:32 +1100, David Gibson wrote:

packet_add() and packet_get() can fail for a number of reasons, and we currently treat them all basically the same: we log a trace() level message, and for packet_get() we return NULL. However, the different causes of error are quite different and suggest different handling:
1) If we run out of space in the pool on add, that's (rightly) non-fatal, but is an unusual situation which we might want to know about. Promote its message to debug() level.
2) All packet_check_range() errors and the checks on packet length indicate a serious problem. Due to either a bug in calling code, or some sort of memory-clobbering, we've been given addresses or sizes of packets that are nonsensical. If things are this bad, we shouldn't try to carry on; replace these with asserts.
3) Requesting a packet index that doesn't exist in the pool in packet_get() indicates a bug in the caller - it should always check the index first. Replace this with an assert.
The reasons for 2) and 3) being trace() messages (they should probably be debug() following the same reasoning as 1)) are graceful degradation and security (availability).
Hmm..
If we have a packet triggering some sort of "memory clobbering", or an issue in the caller, that might very well be just one packet or a few of them.
I see the case for packet_add_do(). In that case it could be that this is just a bad packet triggering bugs in the initial processing code, and just dropping it will be sufficient to save us from damaging state.
For packet_get_do(), however, this indicates that we have a packet that's out of bounds, despite the fact that we checked its bounds when it was added to the pool. That means we *already* have corrupted memory, and we have no way to assess the extent of the damage. It's entirely possible we're already badly compromised. Under those circumstances I think it is much better to terminate immediately, rather than risk doing any more hard to trace damage.
It depends on which part of packet_get_do(). If we're exceeding the packet count or the size of a pool, then I guess yes, otherwise not so much.
There's one, maybe two of the tests in packet_get_do() that don't indicate a serious error:

	if (len + offset > p->pkt[idx].iov_len) {

is definitely ok, and you've convinced me I should try harder to make sure this doesn't print any error (even trace()) in at least the "try" cases.

	if (len > UINT16_MAX) {

is questionable. Technically it could be ok, but since that parameter is generally a constant, chances are it does indicate a bad error in the caller.

All the other checks, from packet_check_range(), must indicate something very bad.
I think it's a disservice to users to crash in that case, and it has the potential to worsen a security flaw if this packet can be built on purpose. It's also a disservice to package maintainers because it has the potential to turn a harmless issue into an annoying situation with associated time pressure and everything.
Ok, that argument makes sense to me for packet_add_do(), but again, not for packet_get_do().
Those are not warn() calls, by the way, because we don't want to make it too easy, in that (perhaps unlikely) case, to flood logs, and possibly hide information due to rotation.
Right, I see the argument, but it also worries me that the message - which even if we are successfully containing the damage here definitely indicates a bug in the caller - could easily never be seen at trace or even debug level.
I suppose that the argument for asserts here is so that we discover functional issues as soon as possible, which I understand of course,
That's the main reason. The secondary reason is that it clarifies the meaning of a NULL return from packet_get_do(). With this change it means: you requested a range that's not in the packet, whereas previously it could mean either that or "something is horribly wrong in the pool state, and we don't know what".
but it relies on two assumptions that might not hold:
1. the crash will be reported, while a debug() message will go unnoticed.
Well, the crash is more noticeable, but that doesn't mean that it will be reported. Users might/will try to hack something around it or try to survive with slirp4netns and all its flaws.
I suppose so.
On the other hand, a malfunction (say, failed connections) might be equally likely to be reported, along with debug() messages.
That only applies if this causes malfunctions in intended behaviour. If it's triggered due to an attack, rather than a flaw handling expected behaviour, then there's no reason it would be reported.
2. the check and the assertion themselves are correct. This was not the case for the two most recent severe issues (commit 61c0b0d0f199, bug #105) we discovered through assertions.
That's fair, I can see why you'd be gun shy about asserts given that.
Issues such as https://github.com/containers/podman/issues/22925, on the other hand, are good examples of how asserts saved us a lot of time and effort later on (compared to, say, "connections stop working after three days"). But I would argue that those are substantially different cases where assertions are checking something that's much more inherent to the implementation.
Eh... I can't really see a way these assertions are different other than that one turned out to be wrong and the other didn't.
Ah, sorry, I meant that there is a substantial difference between the ones related to Podman issue 22925 *and what might happen in packet_get_range()*.
Not that there's a difference between those related to Podman issue 22925 and those from commit 61c0b0d0f199 and bug #105.
Ok, but I'm still not really seeing how to distinguish the cases without the benefit of hindsight.
In this case, I would argue that there's no need to crash, so we shouldn't. We can presumably carry on after that, and we're not leaking any resource or sneakily leaving behind some condition that will cause apparently unrelated issues at a later time.
Again, true (plausibly) for packet_add_do(), but not for packet_get_do().
4) On packet_get(), requesting a section of the packet beyond its bounds is handled correctly already. This happens routinely from some callers when they probe to see if certain parts of the packet are present. At worst it indicates that the guest has generated a malformed packet, which we should quietly drop, as we already do.
This patch doesn't change that. We check against the stated packet bounds first and return NULL *before* we check against the buffer bounds and possibly trigger an assert.
Yes... I wasn't actually disputing that (quote is from yourself).
Ok... here's a tentative proposal, let me know what you think:
So, first off, I have to say that it's a bit hard to reason about something that is not giving users any problem at the moment (well, at least not without a part of this series), and which didn't really give us problems in the past.
It's difficult to quantify with experience and examples (even leaving the priority of the whole matter aside for a moment...). Anyway, let me try:
I mean, I see your point. This is basically following on from the MTU stuff - I was tracing what enforces / relies on packet/datagram/frame size limits and making sure they're consistent through the entire chain. At the moment we allow the user to set --mtu 65535 (i.e. 65535 byte IP datagrams, not including L2), but the pool layer (amongst other things) doesn't allow us to actually deliver it. This does cause user-visible (though minor) problems, such as bug 65 & bug 66.

I hope merging the two series will make what's going on clearer.
* I promote the packet_check_range() messages to warn(), or even err(). Yes, this might allow for log flooding, but I would argue that if we have a bug allowing this path to be triggered and something is frequently triggering it, we *want* that to be noisy. We *think* we're containing the damage here, but we don't really know.
I still think it's our responsibility to not flood the system log in *any* circumstance.
Remember that this is not necessarily *our* log where the worst that can happen is that we hide information. Even with sane system log configurations, one message per packet can get close to a substantial denial of service (especially with systemd-syslogd, see also https://github.com/systemd/systemd/issues/13425).
I guess you don't realise until you run passt without -f, for example with libvirt, for a large number of virtual machines. Or in KubeVirt. Or for a bunch of containers. It's a common and very much intended usage.
Hm, fair point. I feel like at some point we're probably going to have to bite the bullet and implement rate-limited messages and/or some sort of "warn once" thing.
It's not true in general that we don't know. We can quantify the difference in damage very clearly if the attacker manages to make packet_check_range() print one line for each packet (and nothing else).
* I make packet_check_range() failures in packet_add_do() drop the packet and continue, as they used to

* packet_check_range() failures in packet_get_do() will still assert()
I guess the (start < p->buf) one makes sense, but not so much the one where we check that offset plus length is within the pool, because that one could be very well triggered by a specific packet itself.
I really don't see the difference. Since we're talking packet_get_do() here, this means that we've *already* inserted a packet in the pool which extends outside the buffer(s) it's supposed to be in.

But we can't have inserted this packet with packet_add_do(), because it checked the same things non-fatally. Which means either a packet has been inserted bypassing packet_add_do() or the pool metadata has been corrupted since then. Either of those situations indicates something very, very bad.
* I leave the assert()s for packet _index_ out of range
Okay, yes, those make sense to me as well.
Signed-off-by: David Gibson
---
 packet.c    | 71 ++++++++++++++++++++---------------------------------
 packet.h    |  3 ++-
 vu_common.c | 13 +++++++---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/packet.c b/packet.c
index c921aa15..24f12448 100644
--- a/packet.c
+++ b/packet.c
@@ -30,37 +30,27 @@
  * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
- * Return: 0 if the range is valid, -1 otherwise
+ * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
+ * the pool).
  */
-static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
-			      const char *func, int line)
+static void packet_check_range(const struct pool *p,
+			       const char *ptr, size_t len,
+			       const char *func, int line)
 {
 	if (p->buf_size == 0) {
-		int ret;
-
-		ret = vu_packet_check_range((void *)p->buf, ptr, len);
-
-		if (ret == -1)
-			trace("cannot find region, %s:%i", func, line);
-
-		return ret;
-	}
-
-	if (ptr < p->buf) {
-		trace("packet range start %p before buffer start %p, %s:%i",
-		      (void *)ptr, (void *)p->buf, func, line);
-		return -1;
-	}
-
-	if (ptr + len > p->buf + p->buf_size) {
-		trace("packet range end %p after buffer end %p, %s:%i",
-		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
-		      func, line);
-		return -1;
+		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
+		return;
 	}

-	return 0;
+	ASSERT_WITH_MSG(ptr >= p->buf,
+			"packet range start %p before buffer start %p, %s:%i",
+			(void *)ptr, (void *)p->buf, func, line);
+	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
+			"packet range end %p after buffer end %p, %s:%i",
+			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
+			func, line);
 }
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	size_t idx = p->count;

 	if (idx >= p->size) {
-		trace("add packet index %zu to pool with size %zu, %s:%i",
+		debug("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
 	}

-	if (packet_check_range(p, start, len, func, line))
-		return;
+	packet_check_range(p, start, len, func, line);

-	if (len > PACKET_MAX_LEN) {
-		trace("add packet length %zu, %s:%i", len, func, line);
-		return;
-	}
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"add packet length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
@@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 {
 	char *ptr;

-	if (idx >= p->size || idx >= p->count) {
-		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
-		      idx, p->size, p->count, func, line);
-		return NULL;
-	}
-
-	if (len > PACKET_MAX_LEN) {
-		trace("packet data length %zu, %s:%i", len, func, line);
-		return NULL;
-	}
+	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
+			"packet %zu from pool size: %zu, count: %zu, %s:%i",
+			idx, p->size, p->count, func, line);
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"packet range length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	if (len + offset > p->pkt[idx].iov_len) {
 		trace("data length %zu, offset %zu from length %zu, %s:%i",
@@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,

 	ptr = (char *)p->pkt[idx].iov_base + offset;

-	if (packet_check_range(p, ptr, len, func, line))
-		return NULL;
+	packet_check_range(p, ptr, len, func, line);

 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;
diff --git a/packet.h b/packet.h
index f95cda08..b164f77e 100644
--- a/packet.h
+++ b/packet.h
@@ -31,7 +31,8 @@ struct pool {
 	struct iovec pkt[];
 };

-int vu_packet_check_range(void *buf, const char *ptr, size_t len);
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/vu_common.c b/vu_common.c
index 531f8786..528b9b08 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -24,10 +24,13 @@
  * @buf:	Array of the available memory regions
  * @ptr:	Start of desired data range
  * @size:	Length of desired data range
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
If those arguments are mandatory, I would drop the "For tracing" specification.
  *
- * Return: 0 if the zone is in a mapped memory region, -1 otherwise
+ * Aborts if the zone isn't in any of the device regions
  */
-int vu_packet_check_range(void *buf, const char *ptr, size_t len)
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line)
 {
 	struct vu_dev_region *dev_region;

@@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)

 		if (m <= ptr &&
 		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
-			return 0;
+			return;
 	}

-	return -1;
+	abort_with_msg(
+		"package range at %p, length %zd not within dev region %s:%i",
s/package/packet/
+		(void *)ptr, len, func, line);
 }
/**
--
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.  http://www.ozlabs.org/~dgibson
On Fri, 3 Jan 2025 16:06:57 +1100, David Gibson wrote:
On Thu, Jan 02, 2025 at 11:00:08PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 13:58:19 +1100, David Gibson wrote:

On Wed, Jan 01, 2025 at 10:54:41PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:32 +1100, David Gibson wrote:

packet_add() and packet_get() can fail for a number of reasons, and we currently treat them all basically the same: we log a trace() level message, and for packet_get() we return NULL. However, the different causes of error are quite different and suggest different handling:
1) If we run out of space in the pool on add, that's (rightly) non-fatal, but is an unusual situation which we might want to know about. Promote its message to debug() level.
2) All packet_check_range() errors and the checks on packet length indicate a serious problem. Due to either a bug in calling code, or some sort of memory-clobbering, we've been given addresses or sizes of packets that are nonsensical. If things are this bad, we shouldn't try to carry on; replace these with asserts.
3) Requesting a packet index that doesn't exist in the pool in packet_get() indicates a bug in the caller - it should always check the index first. Replace this with an assert.
The reasons for 2) and 3) being trace() messages (they should probably be debug() following the same reasoning as 1)) are graceful degradation and security (availability).
Hmm..
If we have a packet triggering some sort of "memory clobbering", or an issue in the caller, that might very well be just one packet or a few of them.
I see the case for packet_add_do(). In that case it could be that this is just a bad packet triggering bugs in the initial processing code, and just dropping it will be sufficient to save us from damaging state.
For packet_get_do(), however, this indicates that we have a packet that's out of bounds, despite the fact that we checked its bounds when it was added to the pool. That means we *already* have corrupted memory, and we have no way to assess the extent of the damage. It's entirely possible we're already badly compromised. Under those circumstances I think it is much better to terminate immediately, rather than risk doing any more hard to trace damage.
It depends on which part of packet_get_do(). If we're exceeding the packet count or the size of a pool, then I guess yes, otherwise not so much.
There's one, maybe two of the tests in packet_get_do() that don't indicate a serious error:

	if (len + offset > p->pkt[idx].iov_len) {

is definitely ok, and you've convinced me I should try harder to make sure this doesn't print any error (even trace()) in at least the "try" cases.

	if (len > UINT16_MAX) {

is questionable. Technically it could be ok, but since that parameter is generally a constant, chances are it does indicate a bad error in the caller.
...generally, yes, but I'm sure we don't want to crash on:

- a malformed DHCP request plus subtle bug (not saying that there's one, just that I could see myself adding one), dhcp():

	packet_get(p, 0, offset + opt_off + 2, *olen, NULL);

- a malformed TCP header plus some kind of bug, tcp_data_from_tap():

	data = packet_get(p, i, off, len, NULL);

Yes, those would need "bad errors". But the bad error here would crash us with an assert; without it, we'd just throw away one packet.
All the other checks, from packet_check_range(), must indicate something very bad.
I think it's a disservice to users to crash in that case, and it has the potential to worsen a security flaw if this packet can be built on purpose. It's also a disservice to package maintainers because it has the potential to turn a harmless issue into an annoying situation with associated time pressure and everything.
Ok, that argument makes sense to me for packet_add_do(), but again, not for packet_get_do().
Those are not warn() calls, by the way, because we don't want to make it too easy, in that (perhaps unlikely) case, to flood logs, and possibly hide information due to rotation.
Right, I see the argument, but it also worries me that the message - which even if we are successfully containing the damage here definitely indicates a bug in the caller - could easily never be seen at trace or even debug level.
I suppose that the argument for asserts here is so that we discover functional issues as soon as possible, which I understand of course,
That's the main reason. The secondary reason is that it clarifies the meaning of a NULL return from packet_get_do(). With this change it means: you requested a range that's not in the packet, whereas previously it could mean either that or "something is horribly wrong in the pool state, and we don't know what".
but it relies on two assumptions that might not hold:
1. the crash will be reported, while a debug() message will go unnoticed.
Well, the crash is more noticeable, but that doesn't mean that it will be reported. Users might/will try to hack something around it or try to survive with slirp4netns and all its flaws.
I suppose so.
On the other hand, a malfunction (say, failed connections) might be equally likely to be reported, along with debug() messages.
That only applies if this causes malfunctions in intended behaviour. If it's triggered due to an attack, rather than a flaw handling expected behaviour, then there's no reason it would be reported.
2. the check and the assertion themselves are correct. This was not the case for the two most recent severe issues (commit 61c0b0d0f199, bug #105) we discovered through assertions.
That's fair, I can see why you'd be gun shy about asserts given that.
Issues such as https://github.com/containers/podman/issues/22925, on the other hand, are good examples of how asserts saved us a lot of time and effort later on (compared to, say, "connections stop working after three days"). But I would argue that those are substantially different cases where assertions are checking something that's much more inherent to the implementation.
Eh... I can't really see a way these assertions are different other than that one turned out to be wrong and the other didn't.
Ah, sorry, I meant that there is a substantial difference between the ones related to Podman issue 22925 *and what might happen in packet_get_range()*.
Not that there's a difference between those related to Podman issue 22925 and those from commit 61c0b0d0f199 and bug #105.
Ok, but I'm still not really seeing how to distinguish the cases without the benefit of hindsight.
In this case, I would argue that there's no need to crash, so we shouldn't. We can presumably carry on after that, and we're not leaking any resource or sneakily leaving behind some condition that will cause apparently unrelated issues at a later time.
Again, true (plausibly) for packet_add_do(), but not for packet_get_do().
4) On packet_get(), requesting a section of the packet beyond its bounds is handled correctly already. This happens routinely from some callers when they probe to see if certain parts of the packet are present. At worst it indicates that the guest has generated a malformed packet, which we should quietly drop, as we already do.
This patch doesn't change that. We check against the stated packet bounds first and return NULL *before* we check against the buffer bounds and possibly trigger an assert.
Yes... I wasn't actually disputing that (quote is from yourself).
Ok... here's a tentative proposal, let me know what you think:
So, first off, I have to say that it's a bit hard to reason about something that is not giving users any problem at the moment (well, at least not without a part of this series), and which didn't really give us problems in the past.
It's difficult to quantify with experience and examples (even leaving the priority of the whole matter aside for a moment...). Anyway, let me try:
I mean, I see your point. This is basically following on from the MTU stuff - I was tracing what enforces / relies on packet/datagram/frame size limits and making sure they're consistent through the entire chain. At the moment we allow the user to set --mtu 65535 (i.e. 65535 byte IP datagrams, not including L2), but the pool layer (amongst other things) doesn't allow us to actually deliver it. This does cause user-visible (though minor) problems, such as bug 65 & bug 66.
I see the relationship to that up to patch 6/12. After that, not so much.

I understand that there might be inconsistencies (at least in your interpretation) but you wouldn't be making them any worse.
I hope merging the two series will make what's going on clearer.
* I promote the packet_check_range() messages to warn(), or even err(). Yes, this might allow for log flooding, but I would argue that if we have a bug allowing this path to be triggered and something is frequently triggering it, we *want* that to be noisy. We *think* we're containing the damage here, but we don't really know.
I still think it's our responsibility to not flood the system log in *any* circumstance.
Remember that this is not necessarily *our* log where the worst that can happen is that we hide information. Even with sane system log configurations, one message per packet can get close to a substantial denial of service (especially with systemd-syslogd, see also https://github.com/systemd/systemd/issues/13425).
I guess you don't realise until you run passt without -f, for example with libvirt, for a large number of virtual machines. Or in KubeVirt. Or for a bunch of containers. It's a common and very much intended usage.
Hm, fair point. I feel like at some point we're probably going to have to bite the bullet and implement rate-limited messages and/or some sort of "warn once" thing.
I thought you implemented something like that, by the way, but I can't find it right now.

Yes, a "warn once" thing would fix all this. I think we could happily warn() or err() on anything like that if we had it.

By the way, same here, I guess that wouldn't really need to be in the same series as patches up to 6/12 here.
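A "warn once" helper wouldn't need much. As a minimal sketch (warn_once() is a hypothetical name here, not something in the passt tree), each call site gets its own static flag, so a flood of bad packets produces a single log line:

#include <stdbool.h>
#include <stdio.h>

/* Log the message at most once per call site: each macro expansion gets
 * its own static flag in its own block scope */
#define warn_once(...)						\
	do {							\
		static bool warned;				\
		if (!warned) {					\
			warned = true;				\
			fprintf(stderr, __VA_ARGS__);		\
			fputc('\n', stderr);			\
		}						\
	} while (0)

int main(void)
{
	for (int i = 0; i < 5; i++)
		warn_once("bad packet range (reported once only)");
	return 0;
}

A real version would presumably feed into passt's own logging rather than stderr, and true rate limiting (n messages per interval) would need a timestamp instead of a flag, but the per-call-site cost stays one static variable either way.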
It's not true in general that we don't know. We can quantify the difference in damage very clearly if the attacker manages to make packet_check_range() print one line for each packet (and nothing else).
* I make packet_check_range() failures in packet_add_do() drop the packet and continue, as they used to

* packet_check_range() failures in packet_get_do() will still assert()
I guess the (start < p->buf) one makes sense, but not so much the one where we check that offset plus length is within the pool, because that one could be very well triggered by a specific packet itself.
I really don't see the difference. Since we're talking packet_get_do() here, this means that we've *already* inserted a packet in the pool which extends outside the buffer(s) it's supposed to be in.
That's not necessarily the case: it could be that we just use an offset that extends beyond the end of the pool, due to bad/missing handling of some malformed packet/request. Maybe the packet is fine, but we're asking for something that's not. I'm talking about this one:

	if (start + len + offset > p->buf + p->buf_size) {
But we can't have inserted this packet with packet_add_do(), because it checked the same things non-fatally. Which means either a packet has been inserted bypassing packet_add_do() or the pool metadata has been corrupted since then. Either of those situations indicates something very, very bad.
* I leave the assert()s for packet _index_ out of range
Okay, yes, those make sense to me as well.
Signed-off-by: David Gibson
---
 packet.c    | 71 ++++++++++++++++++++---------------------------------
 packet.h    |  3 ++-
 vu_common.c | 13 +++++++---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/packet.c b/packet.c
index c921aa15..24f12448 100644
--- a/packet.c
+++ b/packet.c
@@ -30,37 +30,27 @@
  * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
- * Return: 0 if the range is valid, -1 otherwise
+ * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
+ * the pool).
  */
-static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
-			      const char *func, int line)
+static void packet_check_range(const struct pool *p,
+			       const char *ptr, size_t len,
+			       const char *func, int line)
 {
 	if (p->buf_size == 0) {
-		int ret;
-
-		ret = vu_packet_check_range((void *)p->buf, ptr, len);
-
-		if (ret == -1)
-			trace("cannot find region, %s:%i", func, line);
-
-		return ret;
-	}
-
-	if (ptr < p->buf) {
-		trace("packet range start %p before buffer start %p, %s:%i",
-		      (void *)ptr, (void *)p->buf, func, line);
-		return -1;
-	}
-
-	if (ptr + len > p->buf + p->buf_size) {
-		trace("packet range end %p after buffer end %p, %s:%i",
-		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
-		      func, line);
-		return -1;
+		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
+		return;
 	}

-	return 0;
+	ASSERT_WITH_MSG(ptr >= p->buf,
+			"packet range start %p before buffer start %p, %s:%i",
+			(void *)ptr, (void *)p->buf, func, line);
+	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
+			"packet range end %p after buffer end %p, %s:%i",
+			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
+			func, line);
 }
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	size_t idx = p->count;

 	if (idx >= p->size) {
-		trace("add packet index %zu to pool with size %zu, %s:%i",
+		debug("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
 	}

-	if (packet_check_range(p, start, len, func, line))
-		return;
+	packet_check_range(p, start, len, func, line);

-	if (len > PACKET_MAX_LEN) {
-		trace("add packet length %zu, %s:%i", len, func, line);
-		return;
-	}
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"add packet length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
@@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 {
 	char *ptr;

-	if (idx >= p->size || idx >= p->count) {
-		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
-		      idx, p->size, p->count, func, line);
-		return NULL;
-	}
-
-	if (len > PACKET_MAX_LEN) {
-		trace("packet data length %zu, %s:%i", len, func, line);
-		return NULL;
-	}
+	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
+			"packet %zu from pool size: %zu, count: %zu, %s:%i",
+			idx, p->size, p->count, func, line);
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"packet range length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	if (len + offset > p->pkt[idx].iov_len) {
 		trace("data length %zu, offset %zu from length %zu, %s:%i",
@@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,

 	ptr = (char *)p->pkt[idx].iov_base + offset;

-	if (packet_check_range(p, ptr, len, func, line))
-		return NULL;
+	packet_check_range(p, ptr, len, func, line);

 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;
diff --git a/packet.h b/packet.h
index f95cda08..b164f77e 100644
--- a/packet.h
+++ b/packet.h
@@ -31,7 +31,8 @@ struct pool {
 	struct iovec pkt[];
 };

-int vu_packet_check_range(void *buf, const char *ptr, size_t len);
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/vu_common.c b/vu_common.c
index 531f8786..528b9b08 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -24,10 +24,13 @@
  * @buf:	Array of the available memory regions
  * @ptr:	Start of desired data range
  * @size:	Length of desired data range
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
If those arguments are mandatory, I would drop the "For tracing" specification.
  *
- * Return: 0 if the zone is in a mapped memory region, -1 otherwise
+ * Aborts if the zone isn't in any of the device regions
  */
-int vu_packet_check_range(void *buf, const char *ptr, size_t len)
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line)
 {
 	struct vu_dev_region *dev_region;

@@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)

 		if (m <= ptr &&
 		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
-			return 0;
+			return;
 	}

-	return -1;
+	abort_with_msg(
+		"package range at %p, length %zd not within dev region %s:%i",
s/package/packet/
+		(void *)ptr, len, func, line);
 }
/**
--
Stefano
On Mon, Jan 06, 2025 at 11:55:25AM +0100, Stefano Brivio wrote:
On Fri, 3 Jan 2025 16:06:57 +1100, David Gibson wrote:

On Thu, Jan 02, 2025 at 11:00:08PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 13:58:19 +1100, David Gibson wrote:

On Wed, Jan 01, 2025 at 10:54:41PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:32 +1100, David Gibson wrote:

packet_add() and packet_get() can fail for a number of reasons, and we currently treat them all basically the same: we log a trace() level message, and for packet_get() we return NULL. However, the different causes of error are quite different and suggest different handling:
1) If we run out of space in the pool on add, that's (rightly) non-fatal, but is an unusual situation which we might want to know about. Promote its message to debug() level.
2) All packet_check_range() errors and the checks on packet length indicate a serious problem. Due to either a bug in calling code, or some sort of memory-clobbering, we've been given addresses or sizes of packets that are nonsensical. If things are this bad, we shouldn't try to carry on; replace these with asserts.
3) Requesting a packet index that doesn't exist in the pool in packet_get() indicates a bug in the caller - it should always check the index first. Replace this with an assert.
The reasons for 2) and 3) being trace() messages (they should probably be debug() following the same reasoning as 1)) are graceful degradation and security (availability).
Hmm..
If we have a packet triggering some sort of "memory clobbering", or an issue in the caller, that might very well be just one packet or a few of them.
I see the case for packet_add_do(). In that case it could be that this is just a bad packet triggering bugs in the initial processing code, and just dropping it will be sufficient to save us from damaging state.
For packet_get_do(), however, this indicates that we have a packet that's out of bounds, despite the fact that we checked its bounds when it was added to the pool. That means we *already* have corrupted memory, and we have no way to assess the extent of the damage. It's entirely possible we're already badly compromised. Under those circumstances I think it is much better to terminate immediately, rather than risk doing any more hard to trace damage.
It depends on which part of packet_get_do(). If we're exceeding the packet count or the size of a pool, then I guess yes, otherwise not so much.
There's one, maybe two of the tests in packet_get_do() that don't indicate a serious error:

	if (len + offset > p->pkt[idx].iov_len) {

is definitely ok, and you've convinced me I should try harder to make sure this doesn't print any error (even trace()) in at least the "try" cases.

	if (len > UINT16_MAX) {

is questionable. Technically it could be ok, but since that parameter is generally a constant, chances are it does indicate a bad error in the caller.
...generally, yes, but I'm sure we don't want to crash on:
- a malformed DHCP request plus subtle bug (not saying that there's one, just that I could see myself adding one), dhcp():
packet_get(p, 0, offset + opt_off + 2, *olen, NULL);
- a malformed TCP header plus some kind of bug, tcp_data_from_tap():
data = packet_get(p, i, off, len, NULL);
Yes, those would need "bad errors". But the bad error here would crash us with an assert; without it, we'd just throw away one packet.
Actually, I realised this borderline case is addressed by the end of the series. The "hard" length check is moved into packet_check_range() and is always fatal. However, *before* we do that, we indirectly check the range length against the packet length with:

	if (len + offset > p->pkt[idx].iov_len) {

If that fails, we hit a "soft" error (returning NULL) before we call packet_check_range() and possibly trigger a "hard" error (assert). I'll attempt to get the order of operations right when I rework this, so we don't have interim patches where this breaks.
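As a minimal sketch of that ordering (a toy pool with invented names, not the passt code): the "soft" check against the stated packet length returns NULL before the "hard" buffer-bounds check ever runs, so only genuine pool corruption can trip the assert:

#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/uio.h>

struct mini_pool {
	char *buf;
	size_t buf_size;
	size_t count;
	struct iovec pkt[4];
};

static void *get_range(const struct mini_pool *p, size_t idx,
		       size_t offset, size_t len)
{
	char *ptr;

	assert(idx < p->count);		/* index out of range: caller bug */

	/* Soft check first: probing beyond the packet is routine, or at
	 * worst a malformed packet, so just report "not there" */
	if (len + offset > p->pkt[idx].iov_len)
		return NULL;

	ptr = (char *)p->pkt[idx].iov_base + offset;

	/* Hard check second: reaching this with a bad range means the
	 * vetted pool state was corrupted after the packet was added */
	assert(ptr >= p->buf && ptr + len <= p->buf + p->buf_size);

	return ptr;
}

int main(void)
{
	char buf[64] = "example";
	struct mini_pool p = {
		.buf = buf, .buf_size = sizeof(buf), .count = 1,
		.pkt = { { .iov_base = buf, .iov_len = 7 } },
	};

	printf("%p\n", get_range(&p, 0, 0, 7));		/* valid range */
	printf("%p\n", get_range(&p, 0, 4, 10));	/* NULL: beyond packet */
	return 0;
}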
All the other checks, from packet_check_range(), must indicate something very bad.
I think it's a disservice to users to crash in that case, and it has the potential to worsen a security flaw if this packet can be built on purpose. It's also a disservice to package maintainers because it has the potential to turn a harmless issue into an annoying situation with associated time pressure and everything.
Ok, that argument makes sense to me for packet_add_do(), but again, not for packet_get_do().
Those are not warn() calls, by the way, because we don't want to make it too easy, in that (perhaps unlikely) case, to flood logs, and possibly hide information due to rotation.
Right, I see the argument, but it also worries me that the message - which even if we are successfully containing the damage here definitely indicates a bug in the caller - could easily never be seen at trace or even debug level.
I suppose that the argument for asserts here is so that we discover functional issues as soon as possible, which I understand of course,
That's the main reason. The secondary reason is that it clarifies the meaning of a NULL return from packet_get_do(). With this change it means: you requested a range that's not in the packet, whereas previously it could mean either that or "something is horribly wrong in the pool state, and we don't know what".
but it relies on two assumptions that might not hold:
1. the crash will be reported, while a debug() message will go unnoticed.
Well, the crash is more noticeable, but that doesn't mean that it will be reported. Users might/will try to hack something around it or try to survive with slirp4netns and all its flaws.
I suppose so.
On the other hand, a malfunction (say, failed connections) might be equally likely to be reported, along with debug() messages.
That only applies if this causes malfunctions in intended behaviour. If it's triggered due to an attack, rather than a flaw handling expected behaviour, then there's no reason it would be reported.
2. the check and the assertion themselves are correct. This was not the case for the two most recent severe issues (commit 61c0b0d0f199, bug #105) we discovered through assertions.
That's fair, I can see why you'd be gun shy about asserts given that.
Issues such as https://github.com/containers/podman/issues/22925, on the other hand, are good examples of how asserts saved us a lot of time and effort later on (compared to, say, "connections stop working after three days"). But I would argue that those are substantially different cases where assertions are checking something that's much more inherent to the implementation.
Eh... I can't really see a way these assertions are different other than that one turned out to be wrong and the other didn't.
Ah, sorry, I meant that there is a substantial difference between the ones related to Podman issue 22925 *and what might happen in packet_get_range()*.
Not that there's a difference between those related to Podman issue 22925 and those from commit 61c0b0d0f199 and bug #105.
Ok, but I'm still not really seeing how to distinguish the cases without the benefit of hindsight.
In this case, I would argue that there's no need to crash, so we shouldn't. We can presumably carry on after that, and we're not leaking any resource or sneakily leaving behind some condition that will cause apparently unrelated issues at a later time.
Again, true (plausibly) for packet_add_do(), but not for packet_get_do().
4) On packet_get(), requesting a section of the packet beyond its bounds is handled correctly already. This happens routinely from some callers when they probe to see if certain parts of the packet are present. At worst it indicates that the guest has generated a malformed packet, which we should quietly drop, as we already do.
This patch doesn't change that. We check against the stated packet bounds first and return NULL *before* we check against the buffer bounds and possibly trigger an assert.
Yes... I wasn't actually disputing that (the quote is from you).
Ok... here's a tentative proposal, let me know what you think:
So, first off, I have to say that it's a bit hard to reason about something that is not giving users any problem at the moment (well, at least not without a part of this series), and which didn't really give us problems in the past.
It's difficult to quantify with experience and examples (even leaving aside the priority of the whole matter for a moment...). Anyway, let me try:
I mean, I see your point. This is basically following on from the MTU stuff - I was tracing what enforces / relies on packet/datagram/frame size limits and making sure they're consistent through the entire chain. At the moment we allow the user to set --mtu 65535 (i.e. 65535-byte IP datagrams, not including L2), but the pool layer (amongst other things) doesn't allow us to actually deliver it. This does cause user-visible (though minor) problems, such as bugs 65 and 66.
I see the relationship to that up to patch 6/12. After that, not so much.
True, the later patches are for different reasons. I'll reexamine what lives in the series.
I understand that there might be inconsistencies (at least in your interpretation) but you wouldn't be making them any worse.
I hope merging the two series will make what's going on clearer.
* I promote the packet_check_range() messages to warn(), or even err(). Yes, this might allow for log flooding, but I would argue that if we have a bug allowing this path to be triggered and something is frequently triggering it, we *want* that to be noisy. We *think* we're containing the damage here, but we don't really know.
I still think it's our responsibility to not flood the system log in *any* circumstance.
Remember that this is not necessarily *our* log where the worst that can happen is that we hide information. Even with sane system log configurations, one message per packet can get close to a substantial denial of service (especially with systemd-syslogd, see also https://github.com/systemd/systemd/issues/13425).
I guess you don't realise until you run passt without -f, for example with libvirt, for a large number of virtual machines. Or in KubeVirt. Or for a bunch of containers. It's a common and very much intended usage.
Hm, fair point. I feel like at some point we're probably going to have to bite the bullet and implement rate-limited messages and/or some sort of "warn once" thing.
I thought you implemented something like that, by the way, but I can't find it right now.
Yes, a "warn once" thing would fix all this. I think we could happily warn() or err() on anything like that if we had it.
By the way, same here, I guess that wouldn't really need to be in the same series as patches up to 6/12 here.
No.
It's not true in general that we don't know. We can quantify the difference in damage very clearly if the attacker manages to make packet_check_range() print one line for each packet (and nothing else).
* I make packet_check_range() failures in packet_add_do() drop the packet and continue, as they used to
* packet_check_range() failures in packet_get_do() will still assert()
I guess the (start < p->buf) one makes sense, but not so much the one where we check that offset plus length is within the pool, because that one could very well be triggered by a specific packet itself.
I really don't see the difference. Since we're talking packet_get_do() here, this means that we've *already* inserted a packet in the pool which extends outside the buffer(s) it's supposed to be in.
That's not necessarily the case: it could be that we just use an offset that extends beyond the end of the pool, due to bad/missing handling of some malformed packet/request.
Maybe the packet is fine, but we're asking for something that's not.
No. We've already checked the requested range against the bounds of the packet (and returned a soft error if it's out of bounds). If the range is bad once we reach packet_check_range(), then the packet is bad too.
I'm talking about this one:
if (start + len + offset > p->buf + p->buf_size) {
But we can't have inserted this packet with packet_add_do(), because it checked the same things non-fatally. Which means either a packet has been inserted bypassing packet_add_do() or the pool metadata has been corrupted since then. Either of those situations indicates something very, very bad.
* I leave the assert()s for packet _index_ out of range
Okay, yes, those make sense to me as well.
Signed-off-by: David Gibson
---
 packet.c    | 71 ++++++++++++++++++++---------------------------------
 packet.h    |  3 ++-
 vu_common.c | 13 +++++++---
 3 files changed, 38 insertions(+), 49 deletions(-)

diff --git a/packet.c b/packet.c
index c921aa15..24f12448 100644
--- a/packet.c
+++ b/packet.c
@@ -30,37 +30,27 @@
  * @func:	For tracing: name of calling function
  * @line:	For tracing: caller line of function call
  *
- * Return: 0 if the range is valid, -1 otherwise
+ * ASSERT()s if the given range isn't valid (within the expected buffer(s) for
+ * the pool).
  */
-static int packet_check_range(const struct pool *p, const char *ptr, size_t len,
-			      const char *func, int line)
+static void packet_check_range(const struct pool *p,
+			       const char *ptr, size_t len,
+			       const char *func, int line)
 {
 	if (p->buf_size == 0) {
-		int ret;
-
-		ret = vu_packet_check_range((void *)p->buf, ptr, len);
-
-		if (ret == -1)
-			trace("cannot find region, %s:%i", func, line);
-
-		return ret;
-	}
-
-	if (ptr < p->buf) {
-		trace("packet range start %p before buffer start %p, %s:%i",
-		      (void *)ptr, (void *)p->buf, func, line);
-		return -1;
-	}
-
-	if (ptr + len > p->buf + p->buf_size) {
-		trace("packet range end %p after buffer end %p, %s:%i",
-		      (void *)(ptr + len), (void *)(p->buf + p->buf_size),
-		      func, line);
-		return -1;
+		vu_packet_check_range((void *)p->buf, ptr, len, func, line);
+		return;
 	}

-	return 0;
+	ASSERT_WITH_MSG(ptr >= p->buf,
+			"packet range start %p before buffer start %p, %s:%i",
+			(void *)ptr, (void *)p->buf, func, line);
+	ASSERT_WITH_MSG(ptr + len <= p->buf + p->buf_size,
+			"packet range end %p after buffer end %p, %s:%i",
+			(void *)(ptr + len), (void *)(p->buf + p->buf_size),
+			func, line);
 }
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -75,18 +65,16 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 	size_t idx = p->count;

 	if (idx >= p->size) {
-		trace("add packet index %zu to pool with size %zu, %s:%i",
+		debug("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
 	}

-	if (packet_check_range(p, start, len, func, line))
-		return;
+	packet_check_range(p, start, len, func, line);

-	if (len > PACKET_MAX_LEN) {
-		trace("add packet length %zu, %s:%i", len, func, line);
-		return;
-	}
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"add packet length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	p->pkt[idx].iov_base = (void *)start;
 	p->pkt[idx].iov_len = len;
@@ -111,16 +99,12 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,
 {
 	char *ptr;

-	if (idx >= p->size || idx >= p->count) {
-		trace("packet %zu from pool size: %zu, count: %zu, %s:%i",
-		      idx, p->size, p->count, func, line);
-		return NULL;
-	}
-
-	if (len > PACKET_MAX_LEN) {
-		trace("packet data length %zu, %s:%i", len, func, line);
-		return NULL;
-	}
+	ASSERT_WITH_MSG(idx < p->size && idx < p->count,
+			"packet %zu from pool size: %zu, count: %zu, %s:%i",
+			idx, p->size, p->count, func, line);
+	ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
+			"packet range length %zu (max %zu), %s:%i",
+			len, PACKET_MAX_LEN, func, line);

 	if (len + offset > p->pkt[idx].iov_len) {
 		trace("data length %zu, offset %zu from length %zu, %s:%i",
@@ -130,8 +114,7 @@ void *packet_get_do(const struct pool *p, size_t idx, size_t offset,

 	ptr = (char *)p->pkt[idx].iov_base + offset;

-	if (packet_check_range(p, ptr, len, func, line))
-		return NULL;
+	packet_check_range(p, ptr, len, func, line);

 	if (left)
 		*left = p->pkt[idx].iov_len - offset - len;
diff --git a/packet.h b/packet.h
index f95cda08..b164f77e 100644
--- a/packet.h
+++ b/packet.h
@@ -31,7 +31,8 @@ struct pool {
 	struct iovec pkt[];
 };

-int vu_packet_check_range(void *buf, const char *ptr, size_t len);
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line);
 void packet_add_do(struct pool *p, size_t len, const char *start,
 		   const char *func, int line);
 void *packet_get_do(const struct pool *p, const size_t idx,
diff --git a/vu_common.c b/vu_common.c
index 531f8786..528b9b08 100644
--- a/vu_common.c
+++ b/vu_common.c
@@ -24,10 +24,13 @@
  * @buf:	Array of the available memory regions
  * @ptr:	Start of desired data range
  * @size:	Length of desired data range
+ * @func:	For tracing: name of calling function
+ * @line:	For tracing: caller line of function call
If those arguments are mandatory, I would drop the "For tracing" specification.
  *
- * Return: 0 if the zone is in a mapped memory region, -1 otherwise
+ * Aborts if the zone isn't in any of the device regions
  */
-int vu_packet_check_range(void *buf, const char *ptr, size_t len)
+void vu_packet_check_range(void *buf, const char *ptr, size_t len,
+			   const char *func, int line)
 {
 	struct vu_dev_region *dev_region;

@@ -37,10 +40,12 @@ int vu_packet_check_range(void *buf, const char *ptr, size_t len)

 		if (m <= ptr &&
 		    ptr + len <= m + dev_region->mmap_offset + dev_region->size)
-			return 0;
+			return;
 	}

-	return -1;
+	abort_with_msg(
+		"package range at %p, length %zd not within dev region %s:%i",
s/package/packet/
+		(void *)ptr, len, func, line);
 }

 /**
--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
Both packet_add_do() and packet_get_do() have a check on the given length,
essentially sanity checking it before validating that it's in an expected
memory region. This can be folded into packet_check_range() which performs
similar checks for both functions.
Signed-off-by: David Gibson
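In other words, roughly this shape (a sketch of the fold under the assumption that the length limit simply moves into the shared helper; not the verbatim patch):

	static void packet_check_range(const struct pool *p,
				       const char *ptr, size_t len,
				       const char *func, int line)
	{
		/* Previously duplicated in packet_add_do() and
		 * packet_get_do(); now both callers get it via this single
		 * helper.
		 */
		ASSERT_WITH_MSG(len <= PACKET_MAX_LEN,
				"packet range length %zu (max %zu), %s:%i",
				len, PACKET_MAX_LEN, func, line);

		/* ...existing buffer bounds checks follow... */
	}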
Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf, TAP_MSGS. However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size. But, we don't enforce that L2 frames
are at least ETH_ZLEN when we receive them from the tap backend, and since
we're dealing with virtual interfaces we don't have the physical Ethernet
limitations requiring that length. Indeed it is possible to generate a
legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't
limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with
even larger arrays, which in practice rarely accumulate more than a handful
of packets. So, instead, put an arbitrary cap on the number of packets we
can put in a batch, and if we run out of space, process and flush the
batch.
Signed-off-by: David Gibson
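A minimal sketch of the cap-and-flush idea (pool_full() and packet_add() are assumed from packet.h; add_frame() and flush_batch() are hypothetical stand-ins for whatever the series actually wires into the tap receive paths):

	static void add_frame(struct ctx *c, struct pool *pool,
			      const char *frame, size_t len)
	{
		/* Batch is full: process and flush what we've queued so
		 * far, then carry on, instead of dropping the frame.
		 */
		if (pool_full(pool))
			flush_batch(c, pool);	/* hypothetical */

		packet_add(pool, len, frame);
	}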
On Fri, 20 Dec 2024 19:35:34 +1100, David Gibson wrote:
Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf, TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But, we don't enforce that L2 frames are at least ETH_ZLEN when we receive them from the tap backend, and since we're dealing with virtual interfaces we don't have the physical Ethernet limitations requiring that length. Indeed it is possible to generate a legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with even larger arrays, which in practice rarely accumulate more than a handful of packets. So, instead, put an arbitrary cap on the number of packets we can put in a batch, and if we run out of space, process and flush the batch.
I ran a few more tests with this, keeping TAP_MSGS at 256, and in general I couldn't really see a difference in latency (especially for UDP streams with small packets) or throughput. Figures from short throughput tests (such as the ones from the test suite) look a bit more variable, but I don't have any statistically meaningful data.

Then I looked into how many messages we might have in the array without this change, and I realised that, with the throughput tests from the suite, we very easily exceed the 256 limit.

Perhaps surprisingly we get the highest buffer counts with TCP transfers and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval, the extra system call overhead from forcibly flushing batches might become rather relevant.

With lower MTUs, it looks like we have a lower CPU load and transmissions are scheduled differently (resulting in smaller batches), but I didn't really trace things.

So I start thinking that this has the *potential* to introduce a performance regression in some cases and we shouldn't just assume that some arbitrary 256 limit is good enough. I didn't check with perf(1), though.

Right now that array takes, effectively, less than 100 KiB (it's ~5000 copies of struct iovec, 16 bytes each), and in theory that could be ~2.5 MiB (at 161319 items). Even if we double or triple that (let's assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will have no practical effect anyway.

All in all, I think we shouldn't change this limit without a deeper understanding of the practical impact. While this change doesn't bring any practical advantage, the current behaviour is somewhat tested by now, and a small limit isn't.

--
Stefano
On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:34 +1100, David Gibson wrote:

Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf, TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But, we don't enforce that L2 frames are at least ETH_ZLEN when we receive them from the tap backend, and since we're dealing with virtual interfaces we don't have the physical Ethernet limitations requiring that length. Indeed it is possible to generate a legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with even larger arrays, which in practice rarely accumulate more than a handful of packets. So, instead, put an arbitrary cap on the number of packets we can put in a batch, and if we run out of space, process and flush the batch.
I ran a few more tests with this, keeping TAP_MSGS at 256, and in general I couldn't really see a difference in latency (especially for UDP streams with small packets) or throughput. Figures from short throughput tests (such as the ones from the test suite) look a bit more variable, but I don't have any statistically meaningful data.
Then I looked into how many messages we might have in the array without this change, and I realised that, with the throughput tests from the suite, we very easily exceed the 256 limit.
Ah, interesting.
Perhaps surprisingly we get the highest buffer counts with TCP transfers and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval, the extra system call overhead from forcibly flushing batches might become rather relevant.
Really? I thought syscall overhead (as in the part that's per-syscall, rather than per-work) was generally in the tens of µs range, rather than the ms range. But in any case, I'd be fine with upping the size of the array to 4k or 8k based on that empirical data. That's still much smaller than the >150k we have now.
With lower MTUs, it looks like we have a lower CPU load and transmissions are scheduled differently (resulting in smaller batches), but I didn't really trace things.
Ok. I wonder if with the small MTUs we're hitting throughput bottlenecks elsewhere which mean this particular path isn't over-exercised.
So I start thinking that this has the *potential* to introduce a performance regression in some cases and we shouldn't just assume that some arbitrary 256 limit is good enough. I didn't check with perf(1), though.
Right now that array takes, effectively, less than 100 KiB (it's ~5000 copies of struct iovec, 16 bytes each), and in theory that could be ~2.5 MiB (at 161319 items). Even if we double or triple that (let's assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will have no practical effect anyway.
Yeah, I guess. I don't love the fact that currently for correctness (not spuriously dropping packets) we rely on a fairly complex calculation that's based on information from different layers: the buffer size and enforcement is in the packet pool layer and is independent of packet layout, but the minimum frame size comes from the tap layer and depends quite specifically on which L2 encapsulation we're using.
All in all, I think we shouldn't change this limit without a deeper understanding of the practical impact. While this change doesn't bring any practical advantage, the current behaviour is somewhat tested by now, and a small limit isn't.
--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Thu, 2 Jan 2025 14:46:45 +1100, David Gibson wrote:
On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:34 +1100, David Gibson wrote:

Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf, TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But, we don't enforce that L2 frames are at least ETH_ZLEN when we receive them from the tap backend, and since we're dealing with virtual interfaces we don't have the physical Ethernet limitations requiring that length. Indeed it is possible to generate a legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with even larger arrays, which in practice rarely accumulate more than a handful of packets. So, instead, put an arbitrary cap on the number of packets we can put in a batch, and if we run out of space, process and flush the batch.
I ran a few more tests with this, keeping TAP_MSGS at 256, and in general I couldn't really see a difference in latency (especially for UDP streams with small packets) or throughput. Figures from short throughput tests (such as the ones from the test suite) look a bit more variable, but I don't have any statistically meaningful data.
Then I looked into how many messages we might have in the array without this change, and I realised that, with the throughput tests from the suite, we very easily exceed the 256 limit.
Ah, interesting.
Perhaps surprisingly we get the highest buffer counts with TCP transfers and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval, the extra system call overhead from forcibly flushing batches might become rather relevant.
Really? I thought syscall overhead (as in the part that's per-syscall, rather than per-work) was generally in the tens of µs range, rather than the ms range.
Tens or hundreds of µs (mind that it's several of them), so there could be just one order of magnitude between the two.
But in any case, I'd be fine with upping the size of the array to 4k or 8k based on that empirical data. That's still much smaller than the >150k we have now.
I would even go with 32k -- there are embedded systems with a ton of memory but still much slower clocks compared to my typical setup. Go figure. Again, I think we should test and profile this, ideally, but if not, then let's pick something that's ~10x of what I see.
With lower MTUs, it looks like we have a lower CPU load and transmissions are scheduled differently (resulting in smaller batches), but I didn't really trace things.
Ok. I wonder if with the small MTUs we're hitting throughput bottlenecks elsewhere which mean this particular path isn't over-exercised.
So I start thinking that this has the *potential* to introduce a performance regression in some cases and we shouldn't just assume that some arbitrary 256 limit is good enough. I didn't check with perf(1), though.
Right now that array takes, effectively, less than 100 KiB (it's ~5000 copies of struct iovec, 16 bytes each), and in theory that could be ~2.5 MiB (at 161319 items). Even if we double or triple that (let's assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will have no practical effect anyway.
Yeah, I guess. I don't love the fact that currently for correctness (not spuriously dropping packets) we rely on a fairly complex calculation that's based on information from different layers: the buffer size and enforcement is in the packet pool layer and is independent of packet layout, but the minimum frame size comes from the tap layer and depends quite specifically on which L2 encapsulation we're using.
Well but it's exactly one line, and we're talking about the same project and tool, not something that's separated by several API layers. By the way: on one hand you have that. On the other hand, you're adding an arbitrary limit that comes from a test I just did, which is also based on information from different layers.
All in all, I think we shouldn't change this limit without a deeper understanding of the practical impact. While this change doesn't bring any practical advantage, the current behaviour is somewhat tested by now, and a small limit isn't.
...and this still applies, I think.

--
Stefano
On Thu, Jan 02, 2025 at 11:00:14PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 14:46:45 +1100, David Gibson wrote:

On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:34 +1100, David Gibson wrote:

Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf, TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But, we don't enforce that L2 frames are at least ETH_ZLEN when we receive them from the tap backend, and since we're dealing with virtual interfaces we don't have the physical Ethernet limitations requiring that length. Indeed it is possible to generate a legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with even larger arrays, which in practice rarely accumulate more than a handful of packets. So, instead, put an arbitrary cap on the number of packets we can put in a batch, and if we run out of space, process and flush the batch.
I ran a few more tests with this, keeping TAP_MSGS at 256, and in general I couldn't really see a difference in latency (especially for UDP streams with small packets) or throughput. Figures from short throughput tests (such as the ones from the test suite) look a bit more variable, but I don't have any statistically meaningful data.
Then I looked into how many messages we might have in the array without this change, and I realised that, with the throughput tests from the suite, we very easily exceed the 256 limit.
Ah, interesting.
Perhaps surprisingly we get the highest buffer counts with TCP transfers and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval, the extra system call overhead from forcibly flushing batches might become rather relevant.
Really? I thought syscall overhead (as in the part that's per-syscall, rather than per-work) was generally in the tens of µs range, rather than the ms range.
Tens or hundreds of µs (mind that it's several of them), so there could be just one order of magnitude between the two.
Hm, ok.
But in any case, I'd be fine with upping the size of the array to 4k or 8k based on that empirical data. That's still much smaller than the >150k we have now.
I would even go with 32k -- there are embedded systems with a ton of memory but still much slower clocks compared to my typical setup. Go figure. Again, I think we should test and profile this, ideally, but if not, then let's pick something that's ~10x of what I see.
With lower MTUs, it looks like we have a lower CPU load and transmissions are scheduled differently (resulting in smaller batches), but I didn't really trace things.
Ok. I wonder if with the small MTUs we're hitting throughput bottlenecks elsewhere which mean this particular path isn't over-exercised.
So I start thinking that this has the *potential* to introduce a performance regression in some cases and we shouldn't just assume that some arbitrary 256 limit is good enough. I didn't check with perf(1), though.
Right now that array takes, effectively, less than 100 KiB (it's ~5000 copies of struct iovec, 16 bytes each), and in theory that could be ~2.5 MiB (at 161319 items). Even if we double or triple that (let's assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will have no practical effect anyway.
Yeah, I guess. I don't love the fact that currently for correctness (not spuriously dropping packets) we rely on a fairly complex calculation that's based on information from different layers: the buffer size and enforcement is in the packet pool layer and is independent of packet layout, but the minimum frame size comes from the tap layer and depends quite specifically on which L2 encapsulation we're using.
Well but it's exactly one line, and we're talking about the same project and tool, not something that's separated by several API layers.
By the way: on one hand you have that. On the other hand, you're adding an arbitrary limit that comes from a test I just did, which is also based on information from different layers.
The difference is that it used to be a hard limit: if we exceeded it we dropped packets. Now, we just do a little more work.

Actually... that makes me realise the actual size of the batches isn't the relevant factor in choosing the size. The amortized cost of processing a packet will be pretty much:

	(per byte costs) * (packet size)
	+ (per packet costs)
	+ (per batch costs) / (batch size)

So, we just need to allow (batch size) to become large enough to make the last term negligible compared to the other two. I find it hard to believe that we'd need more than ~1000 to do that.
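To put rough, purely illustrative numbers on that (assuming per-batch syscall overhead of ~50 µs, in the tens-of-µs range mentioned above, and a per-packet cost on the order of 1 µs; neither figure is measured):

	(per batch costs) / (batch size) = 50 µs / 1000 = 0.05 µs per packet

i.e. about 5% of the assumed per-packet cost at a batch size of 1000, and that share halves each time the batch size doubles.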
All in all, I think we shouldn't change this limit without a deeper understanding of the practical impact. While this change doesn't bring any practical advantage, the current behaviour is somewhat tested by now, and a small limit isn't.
...and this still applies, I think.
--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Fri, 3 Jan 2025 17:06:49 +1100, David Gibson wrote:
On Thu, Jan 02, 2025 at 11:00:14PM +0100, Stefano Brivio wrote:
On Thu, 2 Jan 2025 14:46:45 +1100, David Gibson wrote:

On Wed, Jan 01, 2025 at 10:54:44PM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 19:35:34 +1100, David Gibson wrote:

Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf, TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But, we don't enforce that L2 frames are at least ETH_ZLEN when we receive them from the tap backend, and since we're dealing with virtual interfaces we don't have the physical Ethernet limitations requiring that length. Indeed it is possible to generate a legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with even larger arrays, which in practice rarely accumulate more than a handful of packets. So, instead, put an arbitrary cap on the number of packets we can put in a batch, and if we run out of space, process and flush the batch.
I ran a few more tests with this, keeping TAP_MSGS at 256, and in general I couldn't really see a difference in latency (especially for UDP streams with small packets) or throughput. Figures from short throughput tests (such as the ones from the test suite) look a bit more variable, but I don't have any statistically meaningful data.
Then I looked into how many messages we might have in the array without this change, and I realised that, with the throughput tests from the suite, we very easily exceed the 256 limit.
Ah, interesting.
Perhaps surprisingly we get the highest buffer counts with TCP transfers and intermediate MTUs: we're at about 4000-5000 with 1500 bytes (and more like ~1000 with 1280 bytes) meaning that we move 6 to 8 megabytes in one shot, every 5-10ms (at 8 Gbps). With that kind of time interval, the extra system call overhead from forcibly flushing batches might become rather relevant.
Really? I thought syscall overhead (as in the part that's per-syscall, rather than per-work) was generally in the tens of µs range, rather than the ms range.
Tens or hundreds of µs (mind that it's several of them), so there could be just one order of magnitude between the two.
Hm, ok.
But in any case, I'd be fine with upping the size of the array to 4k or 8k based on that empirical data. That's still much smaller than the >150k we have now.
I would even go with 32k -- there are embedded systems with a ton of memory but still much slower clocks compared to my typical setup. Go figure. Again, I think we should test and profile this, ideally, but if not, then let's pick something that's ~10x of what I see.
With lower MTUs, it looks like we have a lower CPU load and transmissions are scheduled differently (resulting in smaller batches), but I didn't really trace things.
Ok. I wonder if with the small MTUs we're hitting throughput bottlenecks elsewhere which mean this particular path isn't over-exercised.
So I start thinking that this has the *potential* to introduce a performance regression in some cases and we shouldn't just assume that some arbitrary 256 limit is good enough. I didn't check with perf(1), though.
Right now that array takes, effectively, less than 100 KiB (it's ~5000 copies of struct iovec, 16 bytes each), and in theory that could be ~2.5 MiB (at 161319 items). Even if we double or triple that (let's assume we use 2 * ETH_ALEN to keep it simple) it's not much... and will have no practical effect anyway.
Yeah, I guess. I don't love the fact that currently for correctness (not spuriously dropping packets) we rely on a fairly complex calculation that's based on information from different layers: the buffer size and enforcement is in the packet pool layer and is independent of packet layout, but the minimum frame size comes from the tap layer and depends quite specifically on which L2 encapsulation we're using.
Well but it's exactly one line, and we're talking about the same project and tool, not something that's separated by several API layers.
By the way: on one hand you have that. On the other hand, you're adding an arbitrary limit that comes from a test I just did, which is also based on information from different layers.
The difference is that it used to be a hard limit: if we exceeded it we dropped packets. Now, we just do a little more work. Actually... that makes me realise the actual size of the batches isn't the relevant factor in choosing the size. The amortized cost of processing a packet will be pretty much:
(per byte costs) * (packet size) + (per packet costs)
Which is not independent of batch size, though (cache locality).
+ (per batch costs) / (batch size)
So, we just need to allow (batch size) to become large enough to make the last term negligible compared to the other two. I find it hard to believe that we'd need more than ~1000 to do that.
...but anyway, yes, that's probably a minor detail, so in theory it's like that, and probably also in practice. But my whole point here is: let's not change the behaviour (regardless of the theory) if we don't have time to profile things, because it's not great to leave that testing to users, and the current behaviour with the current threshold is something we've had for a long time. Quoting myself:
Again, I think we should test and profile this, ideally, but if not, then let's pick something that's ~10x of what I see.
--
Stefano
packet_check_range() (and vu_packet_check_range()) verify that the packet or
section of packet we're interested in lies in the packet buffer pool we
expect it to. However, in doing so it doesn't avoid the possibility of
an integer overflow while performing pointer arithmetic, which is UB. In
fact, AFAICT it's UB even to use arbitrary pointer arithmetic to construct
a pointer outside of a known valid buffer.
To do this safely, we can't calculate the end of a memory region with
pointer addition, at least if we're treating the length as untrusted.
Instead we must work out the offset of one memory region within another
using pointer subtraction, then do integer checks against the length of
the outer region. We then need to be careful about the order of checks so
that those integer checks can't themselves overflow.
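As a sketch of that pattern (simplified, with a hypothetical function name; not the verbatim passt code), checking that [ptr, ptr + len) lies within [buf, buf + buf_size) without ever forming an out-of-bounds pointer:

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	static bool range_in_buf(const char *buf, size_t buf_size,
				 const char *ptr, size_t len)
	{
		/* Compare as integers: even a relational comparison of
		 * unrelated pointers is undefined behaviour.
		 */
		uintptr_t b = (uintptr_t)buf, q = (uintptr_t)ptr;
		size_t offset;

		if (q < b)
			return false;

		offset = q - b;	/* offset within the buffer, not an end pointer */

		/* Order matters: bound the offset first, then check the
		 * length against the space remaining, so neither integer
		 * comparison can overflow.
		 */
		return offset <= buf_size && len <= buf_size - offset;
	}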
While addressing this, correct a flaw in vu_packet_check_range() where it
only checked that the given packet piece was above the region's mmap
address, not the base of the specific section of the mmap we expect to find
it in (dev_region->mmap_addr + dev_region->mmap_offset).
Signed-off-by: David Gibson
On Fri, Dec 20, 2024 at 07:35:23PM +1100, David Gibson wrote:
This... is not any of the things I said I would be working on. I can only say that a herd of very hairy yaks led me astray. Looking at bug 66 I spotted some problems with our handling of MTUs / maximum frame sizes. Looking at that I found some weirdness and some real, if minor, bugs in the sizing and handling of the packet pools.
Changes in v2:
 * Stefano convinced me that packet_check_range() is still worthwhile.
 * So don't remove it... but in looking at it I spotted various flaws in the checks, so address those in a number of new patches.
David Gibson (12):
  test focus hack: stop on fail, but not perf fail
  make passt dumpable
Argh, oops. Of course, 1..3 are test/debug patches that should not be included in the series.
  packet: Use flexible array member in struct pool
  packet: Don't pass start and offset separately too packet_check_range()
  packet: Don't hard code maximum packet size to UINT16_MAX
  packet: Remove unhelpful packet_get_try() macro
  util: Add abort_with_msg() and ASSERT_WITH_MSG() helpers
  packet: Distinguish severities of different packet_{add,git}_do() errors
  packet: Move packet length checks into packet_check_range()
  tap: Don't size pool_tap[46] for the maximum number of packets
  packet: More cautious checks to avoid pointer arithmetic UB
 dhcpv6.c      |   2 +-
 ip.c          |   2 +-
 isolation.c   |   2 +-
 packet.c      | 106 ++++++++++++++++++++++----------------------------
 packet.h      |  19 ++++++---
 passt.h       |   2 -
 tap.c         |  18 +++++++--
 tap.h         |   3 +-
 test/lib/term |   1 +
 test/lib/test |   4 +-
 test/run      |  38 +++++++++---------
 util.c        |  19 +++++++++
 util.h        |  25 +++++-------
 vu_common.c   |  34 ++++++++++------
 14 files changed, 153 insertions(+), 122 deletions(-)
--
David Gibson (he or they)       | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you, not the other way
                                | around.
http://www.ozlabs.org/~dgibson
On Fri, 20 Dec 2024 20:00:13 +1100, David Gibson wrote:
On Fri, Dec 20, 2024 at 07:35:23PM +1100, David Gibson wrote:
This... is not any of the things I said I would be working on. I can only say that a herd of very hairy yaks led me astray. Looking at bug 66 I spotted some problems with our handling of MTUs / maximum frame sizes. Looking at that I found some weirdness and some real, if minor, bugs in the sizing and handling of the packet pools.
Changes in v2:
 * Stefano convinced me that packet_check_range() is still worthwhile.
 * So don't remove it... but in looking at it I spotted various flaws in the checks, so address those in a number of new patches.
David Gibson (12):
  test focus hack: stop on fail, but not perf fail
  make passt dumpable
Argh, oops. Of course, 1..3 are test/debug patches that should not be included in the series.
Yeah, it's fine, 4/12 to 12/12 apply cleanly anyway.

--
Stefano