From: Jon Maloy <jmaloy(a)redhat.com>

When reading received messages from a socket with MSG_PEEK, we may want
to read the contents with an offset, like we can do with pread/preadv()
when reading files. Currently, it is not possible to do that.

In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
in a similar way it is done for Unix Domain sockets.

In the iperf3 log examples shown below, we can observe a throughput
improvement of 15-20% in the direction host->namespace when using the
protocol splicer 'pasta' (https://passt.top). This is a consistent result.

pasta(1) and passt(1) implement user-mode networking for network
namespaces (containers) and virtual machines by means of a translation
layer between Layer-2 network interface and native Layer-4 sockets
(TCP, UDP, ICMP/ICMPv6 echo).

Received, pending TCP data to the container/guest is kept in kernel
buffers until acknowledged, so the tool routinely needs to fetch new
data from the socket, skipping data that was already sent. At the
moment this is implemented using a dummy buffer passed to recvmsg().
With this change, we don't need the dummy buffer and the related buffer
copy (copy_to_user()) anymore.

passt and pasta are supported in KubeVirt and libvirt/qemu.

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF not supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 44822
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 44832
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.02 GBytes  8.78 Gbits/sec
[  5]   1.00-2.00   sec  1.06 GBytes  9.08 Gbits/sec
[  5]   2.00-3.00   sec  1.07 GBytes  9.15 Gbits/sec
[  5]   3.00-4.00   sec  1.10 GBytes  9.46 Gbits/sec
[  5]   4.00-5.00   sec  1.03 GBytes  8.85 Gbits/sec
[  5]   5.00-6.00   sec  1.10 GBytes  9.44 Gbits/sec
[  5]   6.00-7.00   sec  1.11 GBytes  9.56 Gbits/sec
[  5]   7.00-8.00   sec  1.07 GBytes  9.20 Gbits/sec
[  5]   8.00-9.00   sec   667 MBytes  5.59 Gbits/sec
[  5]   9.00-10.00  sec  1.03 GBytes  8.83 Gbits/sec
[  5]  10.00-10.04  sec  30.1 MBytes  6.36 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.3 GBytes  8.78 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
jmaloy@freyr:~/passt#
logout
[ perf record: Woken up 23 times to write data ]
[ perf record: Captured and wrote 5.696 MB perf.data (35580 samples) ]
jmaloy@freyr:~/passt$

jmaloy@freyr:~/passt$ perf record -g ./pasta --config-net -f
SO_PEEK_OFF supported by kernel.

jmaloy@freyr:~/passt# iperf3 -s
-----------------------------------------------------------
Server listening on 5201 (test #1)
-----------------------------------------------------------
Accepted connection from 192.168.122.1, port 52084
[  5] local 192.168.122.180 port 5201 connected to 192.168.122.1 port 52098
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.32 GBytes  11.3 Gbits/sec
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec
[  5]   2.00-3.00   sec  1.26 GBytes  10.8 Gbits/sec
[  5]   3.00-4.00   sec  1.36 GBytes  11.7 Gbits/sec
[  5]   4.00-5.00   sec  1.33 GBytes  11.4 Gbits/sec
[  5]   5.00-6.00   sec  1.21 GBytes  10.4 Gbits/sec
[  5]   6.00-7.00   sec  1.31 GBytes  11.2 Gbits/sec
[  5]   7.00-8.00   sec  1.25 GBytes  10.7 Gbits/sec
[  5]   8.00-9.00   sec  1.33 GBytes  11.5 Gbits/sec
[  5]   9.00-10.00  sec  1.24 GBytes  10.7 Gbits/sec
[  5]  10.00-10.04  sec  56.0 MBytes  12.1 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  12.9 GBytes  11.0 Gbits/sec                  receiver
-----------------------------------------------------------
Server listening on 5201 (test #2)
-----------------------------------------------------------
^Ciperf3: interrupt - the server has terminated
logout
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.040 MB perf.data (33411 samples) ]
jmaloy@freyr:~/passt$

The perf record confirms this result. Below, we can observe that the
CPU spends significantly less time in the function ____sys_recvmsg()
when we have offset support.

Without offset support:
-----------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
    -p ____sys_recvmsg -x --stdio -i perf.data | head -1
    46.32%  0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

With offset support:
--------------------
jmaloy@freyr:~/passt$ perf report -q --symbol-filter=do_syscall_64 \
    -p ____sys_recvmsg -x --stdio -i perf.data | head -1
    28.12%  0.00%  passt.avx2  [kernel.vmlinux]  [k] do_syscall_64  ____sys_recvmsg

Suggested-by: Paolo Abeni <pabeni(a)redhat.com>
Signed-off-by: Jon Maloy <jmaloy(a)redhat.com>

---
v3: - Applied changes suggested by Stefano Brivio and Paolo Abeni
---
 net/ipv4/af_inet.c |  1 +
 net/ipv4/tcp.c     | 16 ++++++++++------
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 4e635dd3d3c8..5f0e5d10c416 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1071,6 +1071,7 @@ const struct proto_ops inet_stream_ops = {
 #endif
 	.splice_eof	   = inet_splice_eof,
 	.splice_read	   = tcp_splice_read,
+	.set_peek_off	   = sk_set_peek_off,
 	.read_sock	   = tcp_read_sock,
 	.read_skb	   = tcp_read_skb,
 	.sendmsg_locked    = tcp_sendmsg_locked,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7e2481b9eae1..1c8cab14a32c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1415,8 +1415,6 @@ static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
 	struct sk_buff *skb;
 	int copied = 0, err = 0;
 
-	/* XXX -- need to support SO_PEEK_OFF */
-
 	skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
 		err = skb_copy_datagram_msg(skb, 0, msg, skb->len);
 		if (err)
@@ -2327,6 +2325,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
+	u32 peek_offset = 0;
 	u32 urg_hole = 0;
 
 	err = -ENOTCONN;
@@ -2360,7 +2359,8 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 
 	seq = &tp->copied_seq;
 	if (flags & MSG_PEEK) {
-		peek_seq = tp->copied_seq;
+		peek_offset = max(sk_peek_offset(sk, flags), 0);
+		peek_seq = tp->copied_seq + peek_offset;
 		seq = &peek_seq;
 	}
 
@@ -2463,11 +2463,11 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		}
 
 		if ((flags & MSG_PEEK) &&
-		    (peek_seq - copied - urg_hole != tp->copied_seq)) {
+		    (peek_seq - peek_offset - copied - urg_hole != tp->copied_seq)) {
 			net_dbg_ratelimited("TCP(%s:%d): Application bug, race in MSG_PEEK\n",
 					    current->comm,
 					    task_pid_nr(current));
-			peek_seq = tp->copied_seq;
+			peek_seq = tp->copied_seq + peek_offset;
 		}
 		continue;
 
@@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
 		WRITE_ONCE(*seq, *seq + used);
 		copied += used;
 		len -= used;
-
+		if (flags & MSG_PEEK)
+			sk_peek_offset_fwd(sk, used);
+		else
+			sk_peek_offset_bwd(sk, used);
 		tcp_rcv_space_adjust(sk);
 
 skip_copy:
@@ -3007,6 +3010,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	__skb_queue_purge(&sk->sk_receive_queue);
 	WRITE_ONCE(tp->copied_seq, tp->rcv_nxt);
 	WRITE_ONCE(tp->urg_data, 0);
+	sk_set_peek_off(sk, -1);
 	tcp_write_queue_purge(sk);
 	tcp_fastopen_active_disable_ofo_check(sk);
 	skb_rbtree_purge(&tp->out_of_order_queue);
-- 
2.42.0
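For readers not familiar with the pasta use case, here is a minimal
userspace sketch of the before/after pattern the commit message
describes. The descriptor, buffer, and size names are illustrative,
not taken from the passt sources:

	#include <sys/socket.h>
	#include <sys/uio.h>

	/* Before this patch: skip already-forwarded bytes by peeking them
	 * into a dummy buffer -- an extra copy_to_user() per call.  The
	 * caller would subtract already_sent from the return value; a real
	 * implementation would also size/loop the dummy buffer properly. */
	static ssize_t peek_old(int s, size_t already_sent, void *buf, size_t len)
	{
		static char dummy[65536];	/* illustrative scratch buffer */
		struct iovec iov[2] = {
			{ .iov_base = dummy, .iov_len = already_sent },
			{ .iov_base = buf,   .iov_len = len },
		};
		struct msghdr mh = { .msg_iov = iov, .msg_iovlen = 2 };

		return recvmsg(s, &mh, MSG_PEEK);
	}

	/* With this patch: tell the kernel the offset once and let it skip
	 * the data -- no dummy buffer, no extra copy. */
	static ssize_t peek_new(int s, int already_sent, void *buf, size_t len)
	{
		if (setsockopt(s, SOL_SOCKET, SO_PEEK_OFF,
			       &already_sent, sizeof(already_sent)) < 0)
			return -1;	/* e.g. kernel without SO_PEEK_OFF for TCP */

		return recv(s, buf, len, MSG_PEEK);
	}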
On Fri, 9 Feb 2024 17:12:33 -0500 jmaloy(a)redhat.com wrote:
> From: Jon Maloy <jmaloy(a)redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> In this commit, we add support for the SO_PEEK_OFF socket option for TCP,
> in a similar way it is done for Unix Domain sockets.
>
> [...]

Reviewed-by: Stefano Brivio <sbrivio(a)redhat.com>

-- 
Stefano
Oops, I just noticed Eric is missing from the recipients list, adding
him now.

On Fri, 2024-02-09 at 17:12 -0500, jmaloy(a)redhat.com wrote:
> From: Jon Maloy <jmaloy(a)redhat.com>
>
> When reading received messages from a socket with MSG_PEEK, we may want
> to read the contents with an offset, like we can do with pread/preadv()
> when reading files. Currently, it is not possible to do that.
>
> [...]
On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni <pabeni(a)redhat.com> wrote:
> Oops, I just noticed Eric is missing from the recipients list, adding
> him now.

Hmmm thanks.

> On Fri, 2024-02-09 at 17:12 -0500, jmaloy(a)redhat.com wrote:
> > From: Jon Maloy <jmaloy(a)redhat.com>
> >
> > [...]
> >
> > @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> >  		WRITE_ONCE(*seq, *seq + used);
> >  		copied += used;
> >  		len -= used;
> > -
> > +		if (flags & MSG_PEEK)
> > +			sk_peek_offset_fwd(sk, used);
> > +		else
> > +			sk_peek_offset_bwd(sk, used);

Yet another cache miss in TCP fast path...

We need to move sk_peek_off in a better location before we accept this
patch.

I always thought MSG_PEEK was very inefficient, I am surprised we
allow arbitrary loops in recvmsg().
On Tue, 2024-02-13 at 13:24 +0100, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni <pabeni(a)redhat.com> wrote:
> > [...]
>
> Yet another cache miss in TCP fast path...
>
> We need to move sk_peek_off in a better location before we accept this
> patch.
>
> I always thought MSG_PEEK was very inefficient, I am surprised we
> allow arbitrary loops in recvmsg().

Let me double check I read the above correctly: are you concerned by
the 'skb_queue_walk(&sk->sk_receive_queue, skb) {' loop that could
touch a lot of skbs/cachelines before reaching the relevant skb?

The end goal here is allowing a user-space application to read
incrementally/sequentially the received data while leaving them in the
receive buffer. I don't see a better option than MSG_PEEK, am I missing
something?

Thanks,

Paolo
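As a concrete illustration of that end goal, here is a hedged userspace
sketch, with illustrative names, of the sequential-peek pattern under
discussion: once the offset is enabled, each MSG_PEEK read starts where
the previous one stopped, while the data stays queued in the kernel:

	#include <sys/socket.h>

	/* Illustration only; assumes a non-blocking socket, so the loop
	 * ends with EAGAIN once all queued data has been peeked. */
	static void peek_sequentially(int s)
	{
		char chunk[4096];
		int off = 0;
		ssize_t n;

		/* Enable peek-with-offset, starting at the queue head. */
		setsockopt(s, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));

		while ((n = recv(s, chunk, sizeof(chunk), MSG_PEEK)) > 0) {
			/* forward chunk[0..n) here; nothing is removed from
			 * the receive queue, and the kernel advances the
			 * peek offset by n for the next call */
		}
	}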
On Tue, Feb 13, 2024 at 2:02 PM Paolo Abeni <pabeni(a)redhat.com> wrote:
> Let me double check I read the above correctly: are you concerned by
> the 'skb_queue_walk(&sk->sk_receive_queue, skb) {' loop that could
> touch a lot of skbs/cachelines before reaching the relevant skb?
>
> The end goal here is allowing a user-space application to read
> incrementally/sequentially the received data while leaving them in the
> receive buffer. I don't see a better option than MSG_PEEK, am I missing
> something?

This sk_peek_offset protocol, needing sk_peek_offset_bwd() in the non
MSG_PEEK case, is very strange IMO.

Ideally, we should read/write over sk_peek_offset only when MSG_PEEK
is used by the caller. That would only touch non-fast paths.

Since the API is mono-threaded anyway, the caller should not rely on
the fact that a normal recvmsg() call would 'consume' sk_peek_offset.
On Tue, 2024-02-13 at 14:34 +0100, Eric Dumazet wrote:
> This sk_peek_offset protocol, needing sk_peek_offset_bwd() in the non
> MSG_PEEK case, is very strange IMO.
>
> Ideally, we should read/write over sk_peek_offset only when MSG_PEEK
> is used by the caller. That would only touch non-fast paths.
>
> Since the API is mono-threaded anyway, the caller should not rely on
> the fact that a normal recvmsg() call would 'consume' sk_peek_offset.

Storing in sk_peek_seq the next TCP sequence number to be peeked should
avoid changes in the non-MSG_PEEK cases.

AFAICS that would need a new get_peek_off() sock_op and a bit somewhere
(in sk_flags?) to discriminate when sk_peek_seq is actually set. Would
that be acceptable?

Thanks!

Paolo
On Tue, Feb 13, 2024 at 4:28 PM Paolo Abeni <pabeni(a)redhat.com> wrote:
> Storing in sk_peek_seq the next TCP sequence number to be peeked should
> avoid changes in the non-MSG_PEEK cases.
>
> AFAICS that would need a new get_peek_off() sock_op and a bit somewhere
> (in sk_flags?) to discriminate when sk_peek_seq is actually set. Would
> that be acceptable?

We could have a parallel SO_PEEK_OFFSET option, reusing the same socket
field.

The new semantic would be: supported by TCP (so far), and tcp recvmsg()
only reads/writes this field when MSG_PEEK is used. Applications would
have to clear the values themselves.

BTW, I see the man pages say SO_PEEK_OFF "is currently supported only
for unix(7) sockets".
On Tue, 2024-02-13 at 16:49 +0100, Eric Dumazet wrote:
> We could have a parallel SO_PEEK_OFFSET option, reusing the same socket
> field.
>
> The new semantic would be: supported by TCP (so far), and tcp recvmsg()
> only reads/writes this field when MSG_PEEK is used. Applications would
> have to clear the values themselves.

I feel like there is some misunderstanding, or at least I can't follow.
Let me be more verbose, to try to clarify my reasoning.

Two consecutive recvmsg(MSG_PEEK) calls for TCP after SO_PEEK_OFF will
return adjacent data. AFAICS this is the same semantic currently
implemented by UDP and unix sockets.

Currently 'sk_peek_off' maintains the next offset to be peeked into the
current receive queue. To implement the above behaviour, tcp_recvmsg()
has to update 'sk_peek_off' after MSG_PEEK, to move the offset to the
next data, and after a plain read, to account for the data removed from
the receive queue.

I proposed to introduce a tcp-specific set_peek_off doing something
like:

	WRITE_ONCE(sk->sk_peek_off, tcp_sk(sk)->copied_seq + val);

so that recvmsg() will need to update sk_peek_off only for MSG_PEEK,
while retaining the semantic described above.

To keep the userspace interface unchanged, that will need a paired
tcp_get_peek_off(), so that getsockopt(SO_PEEK_OFF) could return to the
user a plain offset. An additional bit flag will be needed to store the
information "the user-space enabled peek with offset".

I don't understand how a setsockopt(PEEK_OFFSET) variant would help
avoid touching sk->sk_peek_offset?

Thanks!

Paolo
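To make the idea concrete, here is a rough sketch of what such paired
ops could look like, expanding on the WRITE_ONCE() line above. The
function bodies, the SOCK_PEEK_OFF_SET flag bit, and the clamping in
the getter are assumptions for illustration (and the 32-bit sequence
wraparound is ignored here); this is not code from this thread:

	/* Sketch only: keep an absolute "peek sequence" in sk_peek_off so
	 * that plain reads never have to rewind it. */
	static int tcp_set_peek_off(struct sock *sk, int val)
	{
		if (val < 0) {		/* disable peek with offset */
			sock_reset_flag(sk, SOCK_PEEK_OFF_SET);	/* hypothetical bit */
			WRITE_ONCE(sk->sk_peek_off, -1);
			return 0;
		}

		/* Anchor the user's relative offset at the current read
		 * position. */
		WRITE_ONCE(sk->sk_peek_off, tcp_sk(sk)->copied_seq + val);
		sock_set_flag(sk, SOCK_PEEK_OFF_SET);
		return 0;
	}

	/* Paired getter: convert back to a plain offset for
	 * getsockopt(SO_PEEK_OFF), clamped at 0 if plain reads have
	 * already consumed data past the stored sequence. */
	static int tcp_get_peek_off(struct sock *sk)
	{
		if (!sock_flag(sk, SOCK_PEEK_OFF_SET))
			return -1;

		return max_t(int, READ_ONCE(sk->sk_peek_off) -
				  tcp_sk(sk)->copied_seq, 0);
	}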
On Tue, Feb 13, 2024 at 7:39 PM Paolo Abeni <pabeni(a)redhat.com> wrote:
> I feel like there is some misunderstanding, or at least I can't follow.
> Let me be more verbose, to try to clarify my reasoning.
>
> [...]
>
> I don't understand how a setsockopt(PEEK_OFFSET) variant would help
> avoid touching sk->sk_peek_offset?

I was trying to avoid using extra storage, I was not trying to
implement the alternative myself :0)

If recvmsg(MSG_PEEK) is supposed to auto-advance the peek_offset, we
probably need more than a mere 32-bit field.
On 2024-02-13 14:31, Eric Dumazet wrote:
> I was trying to avoid using extra storage, I was not trying to
> implement the alternative myself :0)
>
> If recvmsg(MSG_PEEK) is supposed to auto-advance the peek_offset, we
> probably need more than a mere 32-bit field.

I wonder if the following could be acceptable:

	if (flags & MSG_PEEK)
		sk_peek_offset_fwd(sk, used);
	else if (peek_offset > 0)
		sk_peek_offset_bwd(sk, used);

peek_offset is already present in the data cache, and if it has the
value zero it means either that sk->sk_peek_off is unused (-1) or
actually is zero. Either way, no rewind is needed in that case.

///jon
Note: please send text-only email to netdev.

On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> I wonder if the following could be acceptable:
>
> 	if (flags & MSG_PEEK)
> 		sk_peek_offset_fwd(sk, used);
> 	else if (peek_offset > 0)
> 		sk_peek_offset_bwd(sk, used);
>
> peek_offset is already present in the data cache, and if it has the
> value zero it means either that sk->sk_peek_off is unused (-1) or
> actually is zero. Either way, no rewind is needed in that case.

I agree the above should avoid touching cold cachelines in the
fastpath, and looks functionally correct to me.

The last word is up to Eric :)

Cheers,

Paolo
On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni(a)redhat.com> wrote:
> Note: please send text-only email to netdev.
>
> On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> > I wonder if the following could be acceptable:
> >
> > 	if (flags & MSG_PEEK)
> > 		sk_peek_offset_fwd(sk, used);
> > 	else if (peek_offset > 0)
> > 		sk_peek_offset_bwd(sk, used);
> >
> > [...]
>
> I agree the above should avoid touching cold cachelines in the
> fastpath, and looks functionally correct to me.
>
> The last word is up to Eric :)

An actual patch seems needed.

In the current form, the local variable peek_offset is 0 when
!MSG_PEEK, so the "else if (peek_offset > 0)" would always be false.
On 2024-02-15 12:46, Eric Dumazet wrote:
> An actual patch seems needed.
>
> In the current form, the local variable peek_offset is 0 when
> !MSG_PEEK, so the "else if (peek_offset > 0)" would always be false.

Yes, of course. This wouldn't work unless we read sk->sk_peek_off at
the beginning of the function. I will look at the other suggestions.

///jon
On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
> Yes, of course. This wouldn't work unless we read sk->sk_peek_off at
> the beginning of the function. I will look at the other suggestions.

I *think* that moving sk_peek_off this way:

---
diff --git a/include/net/sock.h b/include/net/sock.h
index a9d99a9c583f..576a6a6abb03 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -413,7 +413,7 @@ struct sock {
 	unsigned int		sk_napi_id;
 #endif
 	int			sk_rcvbuf;
-	int			sk_disconnects;
+	int			sk_peek_off;
 
 	struct sk_filter __rcu	*sk_filter;
 	union {
@@ -439,7 +439,7 @@ struct sock {
 		struct rb_root	tcp_rtx_queue;
 	};
 	struct sk_buff_head	sk_write_queue;
-	__s32			sk_peek_off;
+	int			sk_disconnects;
 	int			sk_write_pending;
 	__u32			sk_dst_pending_confirm;
 	u32			sk_pacing_status; /* see enum sk_pacing */
---

should avoid problematic accesses. The relevant cacheline layout is as
follows:

	/* --- cacheline 4 boundary (256 bytes) --- */
		struct sk_buff *   tail;                 /*   256     8 */
	} sk_backlog;                                    /*   240    24 */
	int                        sk_forward_alloc;     /*   264     4 */
	u32                        sk_reserved_mem;      /*   268     4 */
	unsigned int               sk_ll_usec;           /*   272     4 */
	unsigned int               sk_napi_id;           /*   276     4 */
	int                        sk_rcvbuf;            /*   280     4 */
	int                        sk_disconnects;       /*   284     4 */ // will become sk_peek_off
	struct sk_filter *         sk_filter;            /*   288     8 */
	union {
		struct socket_wq * sk_wq;                /*   296     8 */
		struct socket_wq * sk_wq_raw;            /*   296     8 */
	};                                               /*   296     8 */
	struct xfrm_policy *       sk_policy[2];         /*   304    16 */
	/* --- cacheline 5 boundary (320 bytes) --- */

	// ...

	/* --- cacheline 6 boundary (384 bytes) --- */
	__s32                      sk_peek_off;          /*   384     4 */ // will become sk_disconnects
	int                        sk_write_pending;     /*   388     4 */
	__u32                      sk_dst_pending_confirm; /*  392     4 */
	u32                        sk_pacing_status;     /*   396     4 */
	long int                   sk_sndtimeo;          /*   400     8 */
	struct timer_list          sk_timer;             /*   408    40 */
	/* XXX last struct has 4 bytes of padding */
	/* --- cacheline 7 boundary (448 bytes) --- */

sk_peek_off will be in the same cacheline as sk_forward_alloc /
sk_reserved_mem / backlog tail, which are already touched by the
tcp_recvmsg_locked() main loop.

WDYT?

Thanks!

Paolo
On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni <pabeni(a)redhat.com> wrote:
> I *think* that moving sk_peek_off this way:
>
> [...]
>
> sk_peek_off will be in the same cacheline as sk_forward_alloc /
> sk_reserved_mem / backlog tail, which are already touched by the
> tcp_recvmsg_locked() main loop.
>
> WDYT?

I was about to send a similar change, also moving sk_rcvtimeo, and
adding __cacheline_group_begin()/__cacheline_group_end() annotations.

I can finish this today.
On 2024-02-16 04:21, Eric Dumazet wrote:
> On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni <pabeni(a)redhat.com> wrote:
> > [...]
>
> I was about to send a similar change, also moving sk_rcvtimeo, and
> adding __cacheline_group_begin()/__cacheline_group_end() annotations.
>
> I can finish this today.

There is also the following alternative:

	if (flags & MSG_PEEK)
		sk_peek_offset_fwd(sk, used);
	else if (flags & MSG_TRUNC)
		sk_peek_offset_bwd(sk, used);

This is the way we use it, and probably the typical usage. It would
force a user to drain the receive queue with MSG_TRUNC whenever he is
using SO_PEEK_OFF, but I don't really see that as a limitation.

Anyway, if Paolo's suggestion solves the problem this shouldn't be
necessary.

///jon
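For context, here is a hedged sketch, with illustrative names, of the
userspace pattern this alternative assumes: already-forwarded data is
drained with MSG_TRUNC, without any copy to userspace, and under the
variant above that MSG_TRUNC read is also what rewinds the kernel's
peek offset:

	#include <sys/socket.h>

	/* Discard bytes the guest/namespace has already acknowledged.
	 * MSG_TRUNC makes the kernel drop them without copying; error
	 * handling is elided in this sketch. */
	static void drain_acked(int s, size_t acked)
	{
		while (acked > 0) {
			ssize_t n = recv(s, NULL, acked, MSG_TRUNC);

			if (n <= 0)
				break;
			acked -= (size_t)n;
		}
	}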
On Fri, Feb 16, 2024 at 11:13 AM Jon Maloy <jmaloy(a)redhat.com> wrote:
> There is also the following alternative:
>
> 	if (flags & MSG_PEEK)
> 		sk_peek_offset_fwd(sk, used);
> 	else if (flags & MSG_TRUNC)
> 		sk_peek_offset_bwd(sk, used);
>
> [...]
>
> Anyway, if Paolo's suggestion solves the problem this shouldn't be
> necessary.

I think the suggestion to move sk_peek_off came from my first message
on this thread ;)

"We need to move sk_peek_off in a better location before we accept
this patch."

Anyway, a complete reorg of 'struct sock' was overdue, I am working on
it.
On Fri, Feb 16, 2024 at 05:13:34AM -0500, Jon Maloy wrote:
> On 2024-02-16 04:21, Eric Dumazet wrote:
> > On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni <pabeni(a)redhat.com> wrote:
> > > On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
> > > > On 2024-02-15 12:46, Eric Dumazet wrote:
> > > > > On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni(a)redhat.com> wrote:
> > > > > > Note: please send text-only email to netdev.
> > > > > >
> > > > > > On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
> > > > > > > I wonder if the following could be acceptable:
> > > > > > >
> > > > > > >     if (flags & MSG_PEEK)
> > > > > > >             sk_peek_offset_fwd(sk, used);
> > > > > > >     else if (peek_offset > 0)
> > > > > > >             sk_peek_offset_bwd(sk, used);
> > > > > > >
> > > > > > > peek_offset is already present in the data cache, and if it
> > > > > > > has the value zero it means either that sk->sk_peek_off is
> > > > > > > unused (-1) or actually is zero. Either way, no rewind is
> > > > > > > needed in that case.
> > > > > >
> > > > > > I agree the above should avoid touching cold cachelines in
> > > > > > the fastpath, and looks functionally correct to me.
> > > > > >
> > > > > > The last word is up to Eric :)
> > > > >
> > > > > An actual patch seems needed.
> > > > >
> > > > > In the current form, local variable peek_offset is 0 when
> > > > > !MSG_PEEK. So the "else if (peek_offset > 0)" would always be
> > > > > false.
> > > >
> > > > Yes, of course. This wouldn't work unless we read
> > > > sk->sk_peek_off at the beginning of the function. I will look at
> > > > the other suggestions.
> > >
> > > I *think* that moving sk_peek_off this way:
> > >
> > > ---
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index a9d99a9c583f..576a6a6abb03 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -413,7 +413,7 @@ struct sock {
> > >  	unsigned int		sk_napi_id;
> > >  #endif
> > >  	int			sk_rcvbuf;
> > > -	int			sk_disconnects;
> > > +	int			sk_peek_off;
> > >
> > >  	struct sk_filter __rcu	*sk_filter;
> > >  	union {
> > > @@ -439,7 +439,7 @@ struct sock {
> > >  		struct rb_root	tcp_rtx_queue;
> > >  	};
> > >  	struct sk_buff_head	sk_write_queue;
> > > -	__s32			sk_peek_off;
> > > +	int			sk_disconnects;
> > >  	int			sk_write_pending;
> > >  	__u32			sk_dst_pending_confirm;
> > >  	u32			sk_pacing_status; /* see enum sk_pacing */
> > > ---
> > >
> > > should avoid problematic accesses. The relevant cacheline layout is
> > > as follows:
> > >
> > > /* --- cacheline 4 boundary (256 bytes) --- */
> > > 		struct sk_buff *   tail;              /* 256   8 */
> > > 	} sk_backlog;                                 /* 240  24 */
> > > 	int                  sk_forward_alloc;       /* 264   4 */
> > > 	u32                  sk_reserved_mem;        /* 268   4 */
> > > 	unsigned int         sk_ll_usec;             /* 272   4 */
> > > 	unsigned int         sk_napi_id;             /* 276   4 */
> > > 	int                  sk_rcvbuf;              /* 280   4 */
> > > 	int                  sk_disconnects;         /* 284   4 */ // will become sk_peek_off
> > > 	struct sk_filter *   sk_filter;              /* 288   8 */
> > > 	union {
> > > 		struct socket_wq * sk_wq;             /* 296   8 */
> > > 		struct socket_wq * sk_wq_raw;         /* 296   8 */
> > > 	};                                            /* 296   8 */
> > > 	struct xfrm_policy * sk_policy[2];           /* 304  16 */
> > > /* --- cacheline 5 boundary (320 bytes) --- */
> > > // ...
> > > /* --- cacheline 6 boundary (384 bytes) --- */
> > > 	__s32                sk_peek_off;            /* 384   4 */ // will become sk_disconnects
> > > 	int                  sk_write_pending;       /* 388   4 */
> > > 	__u32                sk_dst_pending_confirm; /* 392   4 */
> > > 	u32                  sk_pacing_status;       /* 396   4 */
> > > 	long int             sk_sndtimeo;            /* 400   8 */
> > > 	struct timer_list    sk_timer;               /* 408  40 */
> > >
> > > 	/* XXX last struct has 4 bytes of padding */
> > > /* --- cacheline 7 boundary (448 bytes) --- */
> > >
> > > sk_peek_off will be in the same cacheline as sk_forward_alloc /
> > > sk_reserved_mem / backlog tail, which are already touched by the
> > > tcp_recvmsg_locked() main loop.
> > >
> > > WDYT?
> >
> > I was about to send a similar change, also moving sk_rcvtimeo, and
> > adding __cacheline_group_begin()/__cacheline_group_end() annotations.
> > I can finish this today.
>
> There is also the following alternative:
>
>     if (flags & MSG_PEEK)
>             sk_peek_offset_fwd(sk, used);
>     else if (flags & MSG_TRUNC)
>             sk_peek_offset_bwd(sk, used);
>
> This is the way we use it, and probably the typical usage. It would
> force a user to drain the receive queue with MSG_TRUNC whenever he is
> using SO_PEEK_OFF, but I don't really see that as a limitation.

I really don't like this, although it would certainly do what we need
for passt/pasta. SO_PEEK_OFF has established semantics for Unix
sockets, which include regular recv() adjusting the offset. Having it
behave subtly differently for TCP seems like a very bad idea.

> Anyway, if Paolo's suggestion solves the problem this shouldn't be
> necessary.
>
> ///jon

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
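[For context, the established unix(7) semantics David refers to, as
documented in socket(7): MSG_PEEK reads advance the peek offset, and a
regular consuming recv() moves it backwards by the amount removed, so
peeking resumes where it left off. A minimal self-contained
demonstration, error handling omitted:]

#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	int sv[2], off = 0;
	char b[3] = { 0 };

	socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
	send(sv[0], "aabbccdd", 8, 0);

	/* Arm the peek offset at zero */
	setsockopt(sv[1], SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));

	recv(sv[1], b, 2, MSG_PEEK);	/* peeks "aa", offset 0 -> 2 */
	recv(sv[1], b, 2, MSG_PEEK);	/* peeks "bb", offset 2 -> 4 */
	recv(sv[1], b, 2, 0);		/* consumes "aa", offset 4 -> 2 */
	recv(sv[1], b, 2, MSG_PEEK);	/* peeks "cc": the consuming read
					 * rewound the offset, so peeking
					 * continues where it left off */
	printf("%s\n", b);		/* prints "cc" */
	return 0;
}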
On Tue, Feb 13, 2024 at 04:49:01PM +0100, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 4:28 PM Paolo Abeni <pabeni(a)redhat.com> wrote:
> > On Tue, 2024-02-13 at 14:34 +0100, Eric Dumazet wrote:
> > > On Tue, Feb 13, 2024 at 2:02 PM Paolo Abeni <pabeni(a)redhat.com> wrote:
> > > > On Tue, 2024-02-13 at 13:24 +0100, Eric Dumazet wrote:
> > > > > On Tue, Feb 13, 2024 at 11:49 AM Paolo Abeni <pabeni(a)redhat.com> wrote:
> > > > > >
> > > > > > > @@ -2508,7 +2508,10 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len,
> > > > > > >  			WRITE_ONCE(*seq, *seq + used);
> > > > > > >  			copied += used;
> > > > > > >  			len -= used;
> > > > > > > -
> > > > > > > +			if (flags & MSG_PEEK)
> > > > > > > +				sk_peek_offset_fwd(sk, used);
> > > > > > > +			else
> > > > > > > +				sk_peek_offset_bwd(sk, used);
> > > > >
> > > > > Yet another cache miss in TCP fast path...
> > > > >
> > > > > We need to move sk_peek_off in a better location before we
> > > > > accept this patch.
> > > > >
> > > > > I always thought MSG_PEEK was very inefficient, I am surprised
> > > > > we allow arbitrary loops in recvmsg().
> > > >
> > > > Let me double check I read the above correctly: are you
> > > > concerned by the 'skb_queue_walk(&sk->sk_receive_queue, skb) {'
> > > > loop that could touch a lot of skbs/cachelines before reaching
> > > > the relevant skb?
> > > >
> > > > The end goal here is allowing a user-space application to read
> > > > incrementally/sequentially the received data while leaving them
> > > > in the receive buffer.
> > > >
> > > > I don't see a better option than MSG_PEEK, am I missing something?
> > >
> > > This sk_peek_offset protocol, needing sk_peek_offset_bwd() in the
> > > non MSG_PEEK case, is very strange IMO.
> > >
> > > Ideally, we should read/write over sk_peek_offset only when
> > > MSG_PEEK is used by the caller. That would only touch non fast
> > > paths.
> > >
> > > Since the API is mono-threaded anyway, the caller should not rely
> > > on the fact that a normal recvmsg() call would 'consume'
> > > sk_peek_offset.
> >
> > Storing in sk_peek_seq the tcp next sequence number to be peeked
> > should avoid changes in the non MSG_PEEK cases.
> >
> > AFAICS that would need a new get_peek_off() sock_op and a bit
> > somewhere (in sk_flags?) to discriminate when sk_peek_seq is
> > actually set. Would that be acceptable?
>
> We could have a parallel SO_PEEK_OFFSET option, reusing the same
> socket field.
>
> The new semantic would be: supported by TCP (so far), and tcp
> recvmsg() only reads/writes this field when MSG_PEEK is used.
> Applications would have to clear the values themselves.

Those semantics would likely defeat the purpose of using SO_PEEK_OFF
for our use case, since we'd need an additional setsockopt() for every
non-PEEK recv() (which are all MSG_TRUNC in our case).

> BTW I see the man pages say SO_PEEK_OFF "is currently supported only
> for unix(7) sockets"

Yes, this patch is explicitly aiming to change that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
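[To make Paolo's "end goal" concrete: on a kernel without this patch,
TCP MSG_PEEK always copies from the head of the receive queue, so an
application that peeks incrementally re-reads everything it has already
seen. A minimal runnable illustration over loopback TCP, with error
handling omitted:]

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in a = { .sin_family = AF_INET,
				 .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
	socklen_t alen = sizeof(a);
	char b[4] = { 0 };
	int l, c, s;

	l = socket(AF_INET, SOCK_STREAM, 0);
	bind(l, (struct sockaddr *)&a, sizeof(a));	/* ephemeral port */
	listen(l, 1);
	getsockname(l, (struct sockaddr *)&a, &alen);

	c = socket(AF_INET, SOCK_STREAM, 0);
	connect(c, (struct sockaddr *)&a, sizeof(a));
	s = accept(l, NULL, NULL);

	send(c, "abcdef", 6, 0);

	recv(s, b, 3, MSG_PEEK);
	printf("%s\n", b);	/* "abc" */
	recv(s, b, 3, MSG_PEEK);
	printf("%s\n", b);	/* "abc" again: no peek offset on TCP */
	return 0;
}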
On Wed, Feb 14, 2024 at 12:34 AM David Gibson <david(a)gibson.dropbear.id.au> wrote:
> > BTW I see the man pages say SO_PEEK_OFF "is currently supported only
> > for unix(7) sockets"
>
> Yes, this patch is explicitly aiming to change that.

I was pointing out that UDP is supposed to support it already, since
2016. The manpage should mention UDP already.

Thanks.
On Wed, Feb 14, 2024 at 04:41:14AM +0100, Eric Dumazet wrote:
> On Wed, Feb 14, 2024 at 12:34 AM David Gibson
> <david(a)gibson.dropbear.id.au> wrote:
> > > BTW I see the man pages say SO_PEEK_OFF "is currently supported
> > > only for unix(7) sockets"
> >
> > Yes, this patch is explicitly aiming to change that.
>
> I was pointing out that UDP is supposed to support it already, since
> 2016. The manpage should mention UDP already.

Oh, sorry, I misunderstood. I tend to forget this applies to datagram
sockets, since we have no use for that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
On Wed, Feb 14, 2024 at 10:34:50AM +1100, David Gibson wrote:
> On Tue, Feb 13, 2024 at 04:49:01PM +0100, Eric Dumazet wrote:
> > [snip: requoted Feb 13 discussion between Eric and Paolo on
> > sk_peek_off cache misses, sk_peek_seq, and a parallel SO_PEEK_OFFSET
> > option, quoted in full in the previous message]
> >
> > We could have a parallel SO_PEEK_OFFSET option, reusing the same
> > socket field.
> >
> > The new semantic would be: supported by TCP (so far), and tcp
> > recvmsg() only reads/writes this field when MSG_PEEK is used.
> > Applications would have to clear the values themselves.
>
> Those semantics would likely defeat the purpose of using SO_PEEK_OFF
> for our use case, since we'd need an additional setsockopt() for
> every non-PEEK recv() (which are all MSG_TRUNC in our case).

Btw, Eric, if you're concerned about the extra access added to the
"regular" TCP path, would you be happier with the original approach Jon
proposed? That allowed a user to essentially supply an offset to an
individual MSG_PEEK recvmsg() by inserting a dummy entry as msg_iov[0]
with a NULL pointer and a length to skip.

It did the job for us, although I admit it's a little ugly, which I
presume is why Paolo suggested we investigate SO_PEEK_OFF instead. I
think the SO_PEEK_OFF approach is more elegant, but maybe the
performance impact outweighs that.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
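[For reference, the discard-buffer workaround usable on stock kernels,
sketched from the cover letter's description: since a TCP MSG_PEEK
always starts at the head of the receive queue, the already-forwarded
bytes are peeked into a scratch buffer via msg_iov[0] and thrown away.
Jon's original (rejected) tweak would have let that first iov_base be
NULL so the kernel could skip the copy entirely. Names and buffer
sizes are illustrative, not the actual pasta code:]

#include <stddef.h>
#include <sys/socket.h>
#include <sys/uio.h>

static char discard[131072];	/* scratch for already-forwarded data */
static char data[131072];	/* new data to forward */

/* 'already_sent' is however many bytes the splicer has forwarded but
 * not yet seen acknowledged; the new bytes, if any, land in data[].
 */
static ssize_t peek_new_data(int s, size_t already_sent)
{
	struct iovec iov[2] = {
		{ .iov_base = discard, .iov_len = already_sent },
		{ .iov_base = data,    .iov_len = sizeof(data) },
	};
	struct msghdr mh = { .msg_iov = iov, .msg_iovlen = 2 };
	ssize_t n;

	n = recvmsg(s, &mh, MSG_PEEK);
	if (n <= (ssize_t)already_sent)
		return 0;		/* nothing new */
	return n - already_sent;	/* new bytes now in data[] */
}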
On Thu, Feb 15, 2024 at 4:21 AM David Gibson <david(a)gibson.dropbear.id.au> wrote:
> Btw, Eric, if you're concerned about the extra access added to the
> "regular" TCP path, would you be happier with the original approach
> Jon proposed? That allowed a user to essentially supply an offset to
> an individual MSG_PEEK recvmsg() by inserting a dummy entry as
> msg_iov[0] with a NULL pointer and a length to skip.
>
> It did the job for us, although I admit it's a little ugly, which I
> presume is why Paolo suggested we investigate SO_PEEK_OFF instead. I
> think the SO_PEEK_OFF approach is more elegant, but maybe the
> performance impact outweighs that.

Sorry, this was too ugly. SO_PEEK_OFF is way better/standard; we just
have to polish the implementation so that it is zero-cost for the
99.9999 % of users not using MSG_PEEK.