[PATCH 0/3] Cleanups to packet pool handling and sizing
This... is not any of the things I said I would be working on. I can only
say that a herd of very hairy yaks led me astray. Looking at bug 66 I
spotted some problems with our handling of MTUs / maximum frame sizes.
Looking at that I found some weirdness and some real, if minor, bugs in
the sizing and handling of the packet pools.

David Gibson (3):
  packet: Use flexible array member in struct pool
  packet: Don't have struct pool specify its buffer
  tap: Don't size pool_tap[46] for the maximum number of packets

 packet.c     | 63 ++++++----------------------------------------
 packet.h     | 41 ++++++++++++----------------------
 passt.h      |  2 --
 tap.c        | 63 +++++++++++++++++++++++++---------------------
 tap.h        |  4 ++--
 vhost_user.c |  2 --
 vu_common.c  | 31 ++------------------------

 7 files changed, 55 insertions(+), 151 deletions(-)

-- 
2.47.1
Currently we have a dummy pkt[1] array, which we alias with an array of
a different size via various macros. However, we already require C11, which
includes flexible array members, so we can do better.
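To illustrate the difference, here is a simplified sketch (field and macro
names are illustrative, not the exact ones in packet.h):

	#include <stddef.h>
	#include <sys/uio.h>

	/* Old approach: dummy pkt[1], with differently-sized variants of the
	 * same layout declared via a macro and accessed through aliasing. */
	#define SIZED_POOL_T(npkt)					\
		struct {						\
			size_t size, count;				\
			struct iovec pkt[npkt];				\
		}

	struct pool_old {
		size_t size, count;
		struct iovec pkt[1];	/* dummy; real size comes from the macro */
	};

	/* C11 approach: one definition, flexible array member at the end. */
	struct pool_fam {
		size_t size;		/* usable descriptors */
		size_t count;		/* descriptors in use */
		struct iovec pkt[];	/* flexible array member */
	};

With a flexible array member, a sized pool can be declared as a wrapper
struct whose last member provides the storage; the real macros may do this
differently than the sketch suggests.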
Signed-off-by: David Gibson
struct pool, which represents a batch of packets, includes values giving
the buffer in which all the packets lie - or, for vhost_user, a link to the
vu_dev_region array in which the packets sit. Originally that made sense
because we stored each packet as an offset and length within that buffer.
However dd143e389 ("packet: replace struct desc by struct iovec") replaced
the offset and length with a struct iovec which can directly reference a
packet anywhere in memory. This means we no longer need the buffer
reference to interpret packets from the pool. So there's really no need
to check where the packet sits. We can remove the buf reference and all
checks associated with it. As a bonus this removes the special case for
vhost-user.
Similarly, the old representation used a 16-bit length, so there were some
checks that packets didn't exceed that. Those checks are also no longer
necessary with the struct iovec, which uses a size_t length.
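A rough before/after sketch of the representation change (types simplified
and renamed; the exact field types of the old struct desc are assumed here,
not taken from packet.h):

	#include <stddef.h>
	#include <stdint.h>
	#include <sys/uio.h>

	/* Before: entries only make sense relative to the pool's buffer,
	 * and the 16-bit length is what forced the old size checks. */
	struct desc_old {
		uint32_t offset;	/* offset of packet within buf */
		uint16_t len;		/* 16-bit length */
	};

	struct pool_before {
		char *buf;		/* buffer holding every packet */
		size_t buf_size;
		size_t size, count;
		struct desc_old pkt[];
	};

	/* After: each entry is self-describing, so no buffer reference (and
	 * no vhost-user special case) is needed, and iov_len is a size_t. */
	struct pool_after {
		size_t size, count;
		struct iovec pkt[];
	};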
I think under an unlikely set of circumstances it might have been possible
to hit that 16-bit limit for a legitimate packet: other parts of the code
place a limit of 65535 bytes on the L2 frame; however, that doesn't include
the length tag used by the qemu socket protocol. That tag *is* included in
the packet as stored in the pool, meaning we could get a 65539-byte packet
at this level.
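For concreteness, the arithmetic behind that figure (constant names below
are invented for the example):

	#include <stdint.h>

	#define L2_FRAME_MAX	65535u			/* limit applied elsewhere in the code */
	#define QEMU_LEN_TAG	sizeof(uint32_t)	/* 4-byte length prefix on the qemu socket */

	/* 65535 + 4 = 65539: the largest packet that can land in the pool. */
	_Static_assert(L2_FRAME_MAX + QEMU_LEN_TAG == 65539, "max pooled packet size");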
Signed-off-by: David Gibson
On Fri, 13 Dec 2024 23:01:55 +1100
David Gibson
struct pool, which represents a batch of packets includes values giving the buffer in which all the packets lie - or for vhost_user a link to the vu_dev_region array in which the packets sit. Originally that made sense because we stored each packet as an offset and length within that buffer.
However dd143e389 ("packet: replace struct desc by struct iovec") replaced the offset and length with a struct iovec which can directly reference a packet anywhere in memory. This means we no longer need the buffer reference to interpret packets from the pool. So there's really no need to check where the packet sits. We can remove the buf reference and all checks associated with it. As a bonus this removes the special case for vhost-user.
Similarly the old representation used a 16-bit length, so there were some checks that packets didn't exceed that. That's also no longer necessary with the struct iovec which uses a size_t length.
I think under an unlikely set of circumstances it might have been possible to hit that 16-bit limit for a legitimate packet: other parts of the code place a limit of 65535 bytes on the L2 frame, however that doesn't include the length tag used by the qemu socket protocol. That tag *is* included in the packet as stored in the pool, however, meaning we could get a 65539 byte packet at this level.
As I mentioned in the call on Monday: sure, we need to fix this, but at the same time I'm not quite convinced that it's a good idea to drop all these sanity checks.

Even if they're not based on offsets anymore, I think it's still valuable to ensure that the packets are not exactly _anywhere_ in memory, but only where we expect them to be.

If it's doable, I would rather keep these checks, and change the ones on the length to allow a maximum value of 65539 bytes. I mean, there's a big difference between 65539 and, say, 4294967296.

By the way, I haven't checked what happens with MTUs slightly bigger than 65520 bytes: virtio-net (at least with QEMU) doesn't budge if I set more than 65520, but I didn't actually send big packets. I'll try to have a look (also with muvm) unless you already checked.

-- 
Stefano
On Thu, Dec 19, 2024 at 10:00:11AM +0100, Stefano Brivio wrote:
On Fri, 13 Dec 2024 23:01:55 +1100 David Gibson
wrote: struct pool, which represents a batch of packets includes values giving the buffer in which all the packets lie - or for vhost_user a link to the vu_dev_region array in which the packets sit. Originally that made sense because we stored each packet as an offset and length within that buffer.
However dd143e389 ("packet: replace struct desc by struct iovec") replaced the offset and length with a struct iovec which can directly reference a packet anywhere in memory. This means we no longer need the buffer reference to interpret packets from the pool. So there's really no need to check where the packet sits. We can remove the buf reference and all checks associated with it. As a bonus this removes the special case for vhost-user.
Similarly the old representation used a 16-bit length, so there were some checks that packets didn't exceed that. That's also no longer necessary with the struct iovec which uses a size_t length.
I think under an unlikely set of circumstances it might have been possible to hit that 16-bit limit for a legitimate packet: other parts of the code place a limit of 65535 bytes on the L2 frame, however that doesn't include the length tag used by the qemu socket protocol. That tag *is* included in the packet as stored in the pool, however, meaning we could get a 65539 byte packet at this level.
As I mentioned in the call on Monday: sure, we need to fix this, but at the same time I'm not quite convinced that it's a good idea to drop all these sanity checks.
Even if they're not based on offsets anymore, I think it's still valuable to ensure that the packets are not exactly _anywhere_ in memory, but only where we expect them to be.
If it's doable, I would rather keep these checks, and change the ones on the length to allow a maximum value of 65539 bytes. I mean, there's a big difference between 65539 and, say, 4294967296.
Right, I have draft patches that do basically this.
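Something along these lines, presumably - a minimal sketch of a bounded
length check, not the actual draft patches; the macro and helper names are
invented here:

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	/* Largest legitimate pooled packet: a 65535-byte L2 frame plus the
	 * 4-byte qemu length tag, rather than the old UINT16_MAX bound. */
	#define PACKET_LEN_MAX	(65535 + sizeof(uint32_t))	/* 65539 */

	static inline bool packet_len_sane(size_t len)
	{
		return len != 0 && len <= PACKET_LEN_MAX;
	}

The placement checks (packet lying within an expected region) could stay,
just expressed against the iovec base and length instead of buffer offsets.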
By the way, I haven't checked what happens with MTUs slightly bigger than 65520 bytes: virtio-net (at least with QEMU) doesn't budge if I set more than 65520, but I didn't actually send big packets. I'll try to have a look (also with muvm) unless you already checked.
I'm not sure what you mean by "doesn't budge". No, I haven't checked with either qemu or muvm. There could of course be limits applied by either VMM, or by the guest virtio-net driver.

-- 
David Gibson (he or they)          | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au     | minimalist, thank you, not the other way
                                   | around.
http://www.ozlabs.org/~dgibson
On Fri, 20 Dec 2024 11:59:09 +1100
David Gibson
On Thu, Dec 19, 2024 at 10:00:11AM +0100, Stefano Brivio wrote:
On Fri, 13 Dec 2024 23:01:55 +1100 David Gibson
wrote: struct pool, which represents a batch of packets includes values giving the buffer in which all the packets lie - or for vhost_user a link to the vu_dev_region array in which the packets sit. Originally that made sense because we stored each packet as an offset and length within that buffer.
However dd143e389 ("packet: replace struct desc by struct iovec") replaced the offset and length with a struct iovec which can directly reference a packet anywhere in memory. This means we no longer need the buffer reference to interpret packets from the pool. So there's really no need to check where the packet sits. We can remove the buf reference and all checks associated with it. As a bonus this removes the special case for vhost-user.
Similarly the old representation used a 16-bit length, so there were some checks that packets didn't exceed that. That's also no longer necessary with the struct iovec which uses a size_t length.
I think under an unlikely set of circumstances it might have been possible to hit that 16-bit limit for a legitimate packet: other parts of the code place a limit of 65535 bytes on the L2 frame, however that doesn't include the length tag used by the qemu socket protocol. That tag *is* included in the packet as stored in the pool, however, meaning we could get a 65539 byte packet at this level.
As I mentioned in the call on Monday: sure, we need to fix this, but at the same time I'm not quite convinced that it's a good idea to drop all these sanity checks.
Even if they're not based on offsets anymore, I think it's still valuable to ensure that the packets are not exactly _anywhere_ in memory, but only where we expect them to be.
If it's doable, I would rather keep these checks, and change the ones on the length to allow a maximum value of 65539 bytes. I mean, there's a big difference between 65539 and, say, 4294967296.
Right, I have draft patches that do basically this.
By the way, I haven't checked what happens with MTUs slightly bigger than 65520 bytes: virtio-net (at least with QEMU) doesn't budge if I set more than 65520, but I didn't actually send big packets. I'll try to have a look (also with muvm) unless you already checked.
I'm not sure what you mean by "doesn't budge". No, I haven't checked with either qemu or muvm. There could of course be limits applied by either VMM, or by the guest virtio-net driver.
Oh, sorry, I was deep in the perspective of trying to make things crash... and it didn't do anything, just accepted the setting and kept sending packets out.

Let me try that then, with and without your new series...

-- 
Stefano
On Fri, Dec 20, 2024 at 10:51:33AM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 11:59:09 +1100 David Gibson
wrote: On Thu, Dec 19, 2024 at 10:00:11AM +0100, Stefano Brivio wrote:
On Fri, 13 Dec 2024 23:01:55 +1100 David Gibson
wrote: struct pool, which represents a batch of packets includes values giving the buffer in which all the packets lie - or for vhost_user a link to the vu_dev_region array in which the packets sit. Originally that made sense because we stored each packet as an offset and length within that buffer.
However dd143e389 ("packet: replace struct desc by struct iovec") replaced the offset and length with a struct iovec which can directly reference a packet anywhere in memory. This means we no longer need the buffer reference to interpret packets from the pool. So there's really no need to check where the packet sits. We can remove the buf reference and all checks associated with it. As a bonus this removes the special case for vhost-user.
Similarly the old representation used a 16-bit length, so there were some checks that packets didn't exceed that. That's also no longer necessary with the struct iovec which uses a size_t length.
I think under an unlikely set of circumstances it might have been possible to hit that 16-bit limit for a legitimate packet: other parts of the code place a limit of 65535 bytes on the L2 frame, however that doesn't include the length tag used by the qemu socket protocol. That tag *is* included in the packet as stored in the pool, however, meaning we could get a 65539 byte packet at this level.
As I mentioned in the call on Monday: sure, we need to fix this, but at the same time I'm not quite convinced that it's a good idea to drop all these sanity checks.
Even if they're not based on offsets anymore, I think it's still valuable to ensure that the packets are not exactly _anywhere_ in memory, but only where we expect them to be.
If it's doable, I would rather keep these checks, and change the ones on the length to allow a maximum value of 65539 bytes. I mean, there's a big difference between 65539 and, say, 4294967296.
Right, I have draft patches that do basically this.
By the way, I haven't checked what happens with MTUs slightly bigger than 65520 bytes: virtio-net (at least with QEMU) doesn't budge if I set more than 65520, but I didn't actually send big packets. I'll try to have a look (also with muvm) unless you already checked.
I'm not sure what you mean by "doesn't budge". No, I haven't checked with either qemu or muvm. There could of course be limits applied by either VMM, or by the guest virtio-net driver.
Oh, sorry, I was deep in the perspective of trying to make things crash... and it didn't do anything, just accepted the setting and kept sending packets out.
Right. Even without my packet pool changes, I'm not aware of a way to make it crash, just ways to cause packets to be dropped when they shouldn't.
Let me try that then, with and without your new series...
-- 
David Gibson (he or they)          | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au     | minimalist, thank you, not the other way
                                   | around.
http://www.ozlabs.org/~dgibson
Currently we attempt to size pool_tap[46] so they have room for the maximum
possible number of packets that could fit in pkt_buf, TAP_MSGS. However,
the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as
the minimum possible L2 frame size. But we don't enforce that L2 frames
are at least ETH_ZLEN when we receive them from the tap backend, and since
we're dealing with virtual interfaces we don't have the physical Ethernet
limitations requiring that length. Indeed, it is possible to generate a
legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on
the 'pasta' backend is only 42 bytes long).
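The 42-byte figure is just Ethernet, IPv4 and UDP headers with no payload
(illustrative constants below, not the kernel's macros):

	#include <assert.h>

	#define ETH_HDR_BYTES	14	/* destination MAC + source MAC + EtherType */
	#define IPV4_HDR_BYTES	20	/* minimal IPv4 header, no options */
	#define UDP_HDR_BYTES	8	/* UDP header */

	static_assert(ETH_HDR_BYTES + IPV4_HDR_BYTES + UDP_HDR_BYTES == 42,
		      "zero-payload UDP/IPv4 frame on 'pasta'");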
It's also unclear if this limit is sufficient for vhost-user, which isn't
limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with
even larger arrays, which in practice rarely accumulate more than a handful
of packets. So, instead, put an arbitrary cap on the number of packets we
can put in a batch, and if we run out of space, process and flush the
batch.
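A minimal, self-contained sketch of that batching behaviour (simplified
types and helper names, not the actual tap.c/packet.c code):

	#include <stdbool.h>
	#include <stddef.h>
	#include <sys/uio.h>

	#define BATCH_MAX 256			/* analogous to the new TAP_MSGS */

	struct batch {
		size_t count;
		struct iovec pkt[BATCH_MAX];
	};

	static bool batch_full(const struct batch *b)
	{
		return b->count >= BATCH_MAX;
	}

	static void batch_flush(struct batch *b)
	{
		b->count = 0;
	}

	static void process_batch(struct batch *b)
	{
		(void)b;		/* stand-in for the tap4/tap6 handlers */
	}

	static void rx_loop(struct batch *b, const struct iovec *frames, size_t n)
	{
		size_t i;

		for (i = 0; i < n; i++) {
			if (batch_full(b)) {	/* cap reached: handle and restart */
				process_batch(b);
				batch_flush(b);
			}
			b->pkt[b->count++] = frames[i];
		}
		process_batch(b);		/* handle the final partial batch */
		batch_flush(b);
	}

Hitting the cap then costs one extra processing pass rather than any
dropped frames, so the cap only needs to be large enough to amortise the
per-batch overhead.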
Signed-off-by: David Gibson
On Fri, 13 Dec 2024 23:01:56 +1100
David Gibson
Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf, TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But, we don't enforce that L2 frames are at least ETH_ZLEN when we receive them from the tap backend, and since we're dealing with virtual interfaces we don't have the physical Ethernet limitations requiring that length. Indeed it is possible to generate a legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with even larger arrays, which in practice rarely accumulate more than a handful of packets. So, instead, put an arbitrary cap on the number of packets we can put in a batch, and if we run out of space, process and flush the batch.
Signed-off-by: David Gibson
---
 packet.c    | 13 ++++++++++++-
 packet.h    |  3 +++
 passt.h     |  2 --
 tap.c       | 18 +++++++++++++++---
 tap.h       |  3 ++-
 vu_common.c |  3 ++-
 6 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/packet.c b/packet.c
index 5bfa7304..b68580cc 100644
--- a/packet.c
+++ b/packet.c
@@ -22,6 +22,17 @@
 #include "util.h"
 #include "log.h"
 
+/**
+ * pool_full() - Is a packet pool full?
+ * @p:	Pointer to packet pool
+ *
+ * Return: true if the pool is full, false if more packets can be added
+ */
+bool pool_full(const struct pool *p)
+{
+	return p->count >= p->size;
+}
+
 /**
  * packet_add_do() - Add data as packet descriptor to given pool
  * @p:		Existing pool
@@ -35,7 +46,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 {
 	size_t idx = p->count;
 
-	if (idx >= p->size) {
+	if (pool_full(p)) {
 		trace("add packet index %zu to pool with size %zu, %s:%i",
 		      idx, p->size, func, line);
 		return;
diff --git a/packet.h b/packet.h
index 98eb8812..3618f213 100644
--- a/packet.h
+++ b/packet.h
@@ -6,6 +6,8 @@
 #ifndef PACKET_H
 #define PACKET_H
 
+#include <stdbool.h>
+
 /**
  * struct pool - Generic pool of packets stored in nmemory
  * @size:	Number of usable descriptors for the pool
@@ -23,6 +25,7 @@ void packet_add_do(struct pool *p, size_t len, const char *start,
 void *packet_get_do(const struct pool *p, const size_t idx,
 		    size_t offset, size_t len, size_t *left,
 		    const char *func, int line);
+bool pool_full(const struct pool *p);
 void pool_flush(struct pool *p);
 
 #define packet_add(p, len, start)					\
diff --git a/passt.h b/passt.h
index 0dd4efa0..81b2787f 100644
--- a/passt.h
+++ b/passt.h
@@ -70,8 +70,6 @@ static_assert(sizeof(union epoll_ref) <= sizeof(union epoll_data),
 
 #define TAP_BUF_BYTES							\
 	ROUND_DOWN(((ETH_MAX_MTU + sizeof(uint32_t)) * 128), PAGE_SIZE)
-#define TAP_MSGS							\
-	DIV_ROUND_UP(TAP_BUF_BYTES, ETH_ZLEN - 2 * ETH_ALEN + sizeof(uint32_t))
 
 #define PKT_BUF_BYTES		MAX(TAP_BUF_BYTES, 0)
 extern char pkt_buf		[PKT_BUF_BYTES];
diff --git a/tap.c b/tap.c
index 68231f09..42370a26 100644
--- a/tap.c
+++ b/tap.c
@@ -61,6 +61,8 @@
 #include "vhost_user.h"
 #include "vu_common.h"
+#define TAP_MSGS 256
Sorry, I stopped at 2/3, had just a quick look at this one, and I missed this.

Assuming 4 KiB pages, this changes from 161319 to 256. You mention that in practice we never have more than a handful of messages, which is probably almost always the case, but I wonder if that's also the case with UDP "real-time" streams, where we could have bursts of a few hundred (thousand?) messages at a time.

I wonder: how bad would it be to correct the calculation, instead? We wouldn't actually use more memory, right?

-- 
Stefano
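For reference, a rough reconstruction of where a number of that order comes
from under the old formula (assumed values: ETH_MAX_MTU 65535, ETH_ZLEN 60,
ETH_ALEN 6, 4 KiB pages; back-of-the-envelope only, names suffixed to avoid
clashing with the real macros):

	#define PAGE_SIZE_4K		4096UL
	#define TAP_BUF_BYTES_OLD	(((65535UL + 4) * 128) / PAGE_SIZE_4K * PAGE_SIZE_4K)	/* 8388608 */
	#define MIN_FRAME_OLD		(60UL - 2 * 6 + 4)	/* ETH_ZLEN - 2 * ETH_ALEN + 4 = 52 */

	/* The real macro rounds up rather than down, so it lands within one
	 * of this figure either way. */
	_Static_assert(TAP_BUF_BYTES_OLD / MIN_FRAME_OLD == 161319,
		       "old TAP_MSGS is on the order of 161319 entries per pool");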
On Thu, Dec 19, 2024 at 10:00:15AM +0100, Stefano Brivio wrote:
On Fri, 13 Dec 2024 23:01:56 +1100 David Gibson
wrote: Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf, TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But, we don't enforce that L2 frames are at least ETH_ZLEN when we receive them from the tap backend, and since we're dealing with virtual interfaces we don't have the physical Ethernet limitations requiring that length. Indeed it is possible to generate a legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with even larger arrays, which in practice rarely accumulate more than a handful of packets. So, instead, put an arbitrary cap on the number of packets we can put in a batch, and if we run out of space, process and flush the batch.
Signed-off-by: David Gibson
[...]
+#define TAP_MSGS 256
Sorry, I stopped at 2/3, had just a quick look at this one, and I missed this.
Assuming 4 KiB pages, this changes from 161319 to 256. You mention that
Yes. I'm certainly open to arguments on what the number should be.
in practice we never have more than a handful of messages, which is probably almost always the case, but I wonder if that's also the case with UDP "real-time" streams, where we could have bursts of a few hundred (thousand?) messages at a time.
Maybe. If we are getting them in large bursts, then we're no longer really succeeding at the streams being "real-time", but sure, we should try to catch up as best we can.
I wonder: how bad would it be to correct the calculation, instead? We wouldn't actually use more memory, right?
It was pretty painful when I tried, and it would use more memory. The safe option would be to use ETH_HLEN as the minimum size (which is pretty much all we enforce in the tap layer), which would expand the iovec array here by 2-3x. It's not enormous, but it's not nothing. Or do you mean the unused pages of the array would never be instantiated? In which case, yeah, I guess not.

Remember that with the changes in this patch, if we exceed TAP_MSGS nothing particularly bad happens: we don't crash, and we don't drop packets; we just process things in batches of TAP_MSGS frames at a time. So this doesn't need to be large enough to handle any burst we could ever get, just large enough to adequately mitigate the per-batch costs, which I don't think are _that_ large. 256 was a first guess at that. Maybe it's not enough, but I'd be pretty surprised if it needed to be greater than ~1000 to make the per-batch costs negligible compared to the per-frame costs. UDP_MAX_FRAMES, which is on the reverse path but serves a similar function, is only 32.

-- 
David Gibson (he or they)          | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au     | minimalist, thank you, not the other way
                                   | around.
http://www.ozlabs.org/~dgibson
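To put rough numbers on "not enormous, but it's not nothing" (assuming a
16-byte struct iovec on 64-bit; purely illustrative, not measured):

	#include <stdio.h>
	#include <sys/uio.h>

	int main(void)
	{
		/* Per-pool iovec array, old calculation vs. the new fixed cap;
		 * an ETH_HLEN-based minimum would be roughly 2-3x the old figure. */
		printf("old: ~%zu KiB\n", 161319 * sizeof(struct iovec) / 1024);	/* ~2520 KiB */
		printf("new:  %zu KiB\n", (size_t)256 * sizeof(struct iovec) / 1024);	/* 4 KiB */
		return 0;
	}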
On Fri, 20 Dec 2024 12:13:23 +1100
David Gibson
On Thu, Dec 19, 2024 at 10:00:15AM +0100, Stefano Brivio wrote:
On Fri, 13 Dec 2024 23:01:56 +1100 David Gibson
wrote: Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf, TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But, we don't enforce that L2 frames are at least ETH_ZLEN when we receive them from the tap backend, and since we're dealing with virtual interfaces we don't have the physical Ethernet limitations requiring that length. Indeed it is possible to generate a legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with even larger arrays, which in practice rarely accumulate more than a handful of packets. So, instead, put an arbitrary cap on the number of packets we can put in a batch, and if we run out of space, process and flush the batch.
Signed-off-by: David Gibson
[...]
+#define TAP_MSGS 256
Sorry, I stopped at 2/3, had just a quick look at this one, and I missed this.
Assuming 4 KiB pages, this changes from 161319 to 256. You mention that
Yes. I'm certainly open to arguments on what the number should be.
No idea, until now I just thought we'd have a limit that we can't hit in practice. Let me have a look at what happens with 256 (your new series) and iperf3 or udp_stream from neper.
in practice we never have more than a handful of messages, which is probably almost always the case, but I wonder if that's also the case with UDP "real-time" streams, where we could have bursts of a few hundred (thousand?) messages at a time.
Maybe. If we are getting them in large bursts, then we're no longer really suceeding at the streams being "real-time", but sure, we should try to catch up as best we can.
It could even be that flushing more frequently actually improves things. I'm not sure. I was just pointing out that, quite likely, we can actually hit the new limit.
I wonder: how bad would it be to correct the calculation, instead? We wouldn't actually use more memory, right?
I was pretty painful when I tried, and it would use more memory. The safe option would be to use ETH_HLEN as the minimum size (which is pretty much all we enforce in the tap layer), which would expand the iovec array here by 2-3x. It's not enormous, but it's not nothing. Or do you mean the unused pages of the array would never be instantiated? In which case, yeah, I guess not.
Yes, that's what I meant.
Remember that with the changes in this patch if we exceed TAP_MSGS, nothing particularly bad happens: we don't crash, and we don't drop packets; we just process things in batches of TAP_MSGS frames at a time. So this doesn't need to be large enough to handle any burst we could ever get, just large enough to adequately mitigate the per-batch costs, which I don't think are _that_ large. 256 was a first guess at that. Maybe it's not enough, but I'd be pretty surprised if it needed to be greater than ~1000 to make the per-batch costs negligible compared to the per-frame costs. UDP_MAX_FRAMES, which is on the reverse path but serves a similar function, is only 32.
Okay, let me give that a try. I guess you didn't see a change in UDP throughput tests... but there we use fairly large messages.

-- 
Stefano
On Fri, Dec 20, 2024 at 10:51:36AM +0100, Stefano Brivio wrote:
On Fri, 20 Dec 2024 12:13:23 +1100 David Gibson
wrote: On Thu, Dec 19, 2024 at 10:00:15AM +0100, Stefano Brivio wrote:
On Fri, 13 Dec 2024 23:01:56 +1100 David Gibson
wrote: Currently we attempt to size pool_tap[46] so they have room for the maximum possible number of packets that could fit in pkt_buf, TAP_MSGS. However, the calculation isn't quite correct: TAP_MSGS is based on ETH_ZLEN (60) as the minimum possible L2 frame size. But, we don't enforce that L2 frames are at least ETH_ZLEN when we receive them from the tap backend, and since we're dealing with virtual interfaces we don't have the physical Ethernet limitations requiring that length. Indeed it is possible to generate a legitimate frame smaller than that (e.g. a zero-payload UDP/IPv4 frame on the 'pasta' backend is only 42 bytes long).
It's also unclear if this limit is sufficient for vhost-user which isn't limited by the size of pkt_buf as the other modes are.
We could attempt to correct the calculation, but that would leave us with even larger arrays, which in practice rarely accumulate more than a handful of packets. So, instead, put an arbitrary cap on the number of packets we can put in a batch, and if we run out of space, process and flush the batch.
Signed-off-by: David Gibson
[...]
+#define TAP_MSGS 256
Sorry, I stopped at 2/3, had just a quick look at this one, and I missed this.
Assuming 4 KiB pages, this changes from 161319 to 256. You mention that
Yes. I'm certainly open to arguments on what the number should be.
No idea, until now I just thought we'd have a limit that we can't hit in practice. Let me have a look at what happens with 256 (your new series) and iperf3 or udp_stream from neper.
in practice we never have more than a handful of messages, which is probably almost always the case, but I wonder if that's also the case with UDP "real-time" streams, where we could have bursts of a few hundred (thousand?) messages at a time.
Maybe. If we are getting them in large bursts, then we're no longer really suceeding at the streams being "real-time", but sure, we should try to catch up as best we can.
It could even be that flushing more frequently actually improves things. I'm not sure. I was just pointing out that, quite likely, we can actually hit the new limit.
I wonder: how bad would it be to correct the calculation, instead? We wouldn't actually use more memory, right?
I was pretty painful when I tried, and it would use more memory. The safe option would be to use ETH_HLEN as the minimum size (which is pretty much all we enforce in the tap layer), which would expand the iovec array here by 2-3x. It's not enormous, but it's not nothing. Or do you mean the unused pages of the array would never be instantiated? In which case, yeah, I guess not.
Yes, that's what I meant.
Remember that with the changes in this patch if we exceed TAP_MSGS, nothing particularly bad happens: we don't crash, and we don't drop packets; we just process things in batches of TAP_MSGS frames at a time. So this doesn't need to be large enough to handle any burst we could ever get, just large enough to adequately mitigate the per-batch costs, which I don't think are _that_ large. 256 was a first guess at that. Maybe it's not enough, but I'd be pretty surprised if it needed to be greater than ~1000 to make the per-batch costs negligible compared to the per-frame costs. UDP_MAX_FRAMES, which is on the reverse path but serves a similar function, is only 32.
Okay, let me give that a try. I guess you didn't see a change in UDP throughput tests... but there we use fairly large messages.
No, but TBH I didn't look that closely.

-- 
David Gibson (he or they)          | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au     | minimalist, thank you, not the other way
                                   | around.
http://www.ozlabs.org/~dgibson