On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:On 2024-02-15 12:46, Eric Dumazet wrote:I *think* that moving sk_peek_off this way: --- diff --git a/include/net/sock.h b/include/net/sock.h index a9d99a9c583f..576a6a6abb03 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -413,7 +413,7 @@ struct sock { unsigned int sk_napi_id; #endif int sk_rcvbuf; - int sk_disconnects; + int sk_peek_off; struct sk_filter __rcu *sk_filter; union { @@ -439,7 +439,7 @@ struct sock { struct rb_root tcp_rtx_queue; }; struct sk_buff_head sk_write_queue; - __s32 sk_peek_off; + int sk_disconnects; int sk_write_pending; __u32 sk_dst_pending_confirm; u32 sk_pacing_status; /* see enum sk_pacing */ --- should avoid problematic accesses, The relevant cachelines layout is as follow: /* --- cacheline 4 boundary (256 bytes) --- */ struct sk_buff * tail; /* 256 8 */ } sk_backlog; /* 240 24 */ int sk_forward_alloc; /* 264 4 */ u32 sk_reserved_mem; /* 268 4 */ unsigned int sk_ll_usec; /* 272 4 */ unsigned int sk_napi_id; /* 276 4 */ int sk_rcvbuf; /* 280 4 */ int sk_disconnects; /* 284 4 */ // will become sk_peek_off struct sk_filter * sk_filter; /* 288 8 */ union { struct socket_wq * sk_wq; /* 296 8 */ struct socket_wq * sk_wq_raw; /* 296 8 */ }; /* 296 8 */ struct xfrm_policy * sk_policy[2]; /* 304 16 */ /* --- cacheline 5 boundary (320 bytes) --- */ // ... /* --- cacheline 6 boundary (384 bytes) --- */ __s32 sk_peek_off; /* 384 4 */ // will become sk_diconnects int sk_write_pending; /* 388 4 */ __u32 sk_dst_pending_confirm; /* 392 4 */ u32 sk_pacing_status; /* 396 4 */ long int sk_sndtimeo; /* 400 8 */ struct timer_list sk_timer; /* 408 40 */ /* XXX last struct has 4 bytes of padding */ /* --- cacheline 7 boundary (448 bytes) --- */ sk_peek_off will be in the same cachline of sk_forward_alloc / sk_reserved_mem / backlog tail, that are already touched by the tcp_recvmsg_locked() main loop. WDYT? thanks! PaoloOn Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni(a)redhat.com> wrote:Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the beginning of the function. I will look at the other suggestions.Note: please send text-only email to netdev. On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:An actual patch seems needed. In the current form, local variable peek_offset is 0 when !MSG_PEEK. So the "else if (peek_offset > 0)" would always be false.I wonder if the following could be acceptable: if (flags & MSG_PEEK) sk_peek_offset_fwd(sk, used); else if (peek_offset > 0) sk_peek_offset_bwd(sk, used); peek_offset is already present in the data cache, and if it has the value zero it means either that that sk->sk_peek_off is unused (-1) or actually is zero. Either way, no rewind is needed in that case.I agree the above should avoid touching cold cachelines in the fastpath, and looks functionally correct to me. The last word is up to Eric :)