[PATCH] tcp_splice: splice() all we have to the writing side, not what we just read

24 Oct 2024

In tcp_splice_sock_handler(), we try to calculate how much we can move
from the pipe to the writing socket: if we just read some bytes, we'll
use that amount, but if we haven't, we just try to empty the pipe.

However, if we just read something, that doesn't mean that that's all
the data we have on the pipe, as it's obvious from this sequence, where:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 98304 from read-side call
  Flow 0 (TCP connection (spliced)): 33615 from write-side call (passed 98304)
  Flow 0 (TCP connection (spliced)): -1 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 524288)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580
  Flow 0 (TCP connection (spliced)): OUT_WAIT_0

we first pile up 98304 - 33615 = 64689 pending bytes, that we read but
couldn't write, as the receiver buffer is full, and we set the
corresponding OUT_WAIT flag. Then:

  pasta: epoll event on connected spliced TCP socket 54 (events: 0x00000001)
  Flow 0 (TCP connection (spliced)): 32768 from read-side call
  Flow 0 (TCP connection (spliced)): -1 from write-side call (passed 32768)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:580

we splice() 32768 more bytes from our receiving side to the pipe. At
some point:

  pasta: epoll event on connected spliced TCP socket 49 (events: 0x00000004)
  Flow 0 (TCP connection (spliced)): event at tcp_splice_sock_handler:489
  Flow 0 (TCP connection (spliced)): ~OUT_WAIT_0
  Flow 0 (TCP connection (spliced)): 1320 from read-side call
  Flow 0 (TCP connection (spliced)): 1320 from write-side call (passed 1320)

the receiver is signalling to us that it's ready for more data
(EPOLLOUT). We reset the OUT_WAIT flag, read 1320 more bytes from
our receiving socket into the pipe, and that's what we write to the
receiver, forgetting about the pending 97457 bytes we had, which the
receiver might never get (not the same 97547 bytes: we'll actually
send 1320 of those).

This condition is rather hard to reproduce, and it was observed with
Podman pulling container images via HTTPS. In the traces above, the
client is side 0 (the initiating peer), and the server is sending the
whole data.

Instead of splicing from pipe to socket the amount of data we just
read, we need to splice all the pending data we piled up until that
point. We could do that using 'read' and 'written' counters, but
there's actually no need, as the kernel also keeps track of how much
data is available on the pipe.

So, to make this simple and more robust, just give the whole pipe size
as length to splice(). The kernel knows what to do with it.

Later in the function, we used 'to_write' for an optimisation meant
to reduce wakeups which retries right away to splice() in both
directions if we couldn't write to the receiver the whole amount of
pending data. Calculate a 'pending' value instead, only if we reach
that point.

Now that we check for the actual amount of pending data in that
optimisation, we need to make sure we don't compare a zero or negative
'written' value: if we met that, it means that the receiver signalled
end-of-file, an error, or to try again later. In those three cases,
the optimisation doesn't make any sense, so skip it.

Reported-by: Ed Santiago 
Reported-by: Paul Holzinger 
Analysed-by: Paul Holzinger 
Link: https://github.com/containers/podman/issues/24219
Signed-off-by: Stefano Brivio 
---
 tcp_splice.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/tcp_splice.c b/tcp_splice.c
index 9f5cc27..f112cfe 100644
--- a/tcp_splice.c
+++ b/tcp_splice.c
@@ -503,7 +503,7 @@ swap:
 	lowat_act_flag = RCVLOWAT_ACT(fromsidei);
 
 	while (1) {
-		ssize_t readlen, to_write = 0, written;
+		ssize_t readlen, written, pending;
 		int more = 0;
 
 retry:
@@ -518,14 +518,11 @@ retry:
 
 			if (errno != EAGAIN)
 				goto close;
-
-			to_write = c->tcp.pipe_size;
 		} else if (!readlen) {
 			eof = 1;
-			to_write = c->tcp.pipe_size;
 		} else {
 			never_read = 0;
-			to_write += readlen;
+
 			if (readlen >= (long)c->tcp.pipe_size * 90 / 100)
 				more = SPLICE_F_MORE;
 
@@ -535,10 +532,10 @@ retry:
 
 eintr:
 		written = splice(conn->pipe[fromsidei][0], NULL,
-				 conn->s[!fromsidei], NULL, to_write,
+				 conn->s[!fromsidei], NULL, c->tcp.pipe_size,
 				 SPLICE_F_MOVE | more | SPLICE_F_NONBLOCK);
 		flow_trace(conn, "%zi from write-side call (passed %zi)",
-			   written, to_write);
+			   written, c->tcp.pipe_size);
 
 		/* Most common case: skip updating counters. */
 		if (readlen > 0 && readlen == written) {
@@ -584,10 +581,9 @@ eintr:
 		if (never_read && written == (long)(c->tcp.pipe_size))
 			goto retry;
 
-		if (!never_read && written < to_write) {
-			to_write -= written;
+		pending = conn->read[fromsidei] - conn->written[fromsidei];
+		if (!never_read && written > 0 && written < pending)
 			goto retry;
-		}
 
 		if (eof)
 			break;
-- 
2.43.0

    

Stefano Brivio

David Gibson

tags

participants (2)