On 2024-02-16 04:21, Eric Dumazet wrote:
On Fri, Feb 16, 2024 at 10:14 AM Paolo Abeni <pabeni@redhat.com> wrote:
On Thu, 2024-02-15 at 17:24 -0500, Jon Maloy wrote:
On 2024-02-15 12:46, Eric Dumazet wrote:
On Thu, Feb 15, 2024 at 6:41 PM Paolo Abeni <pabeni@redhat.com> wrote:
Note: please send text-only email to netdev.

On Thu, 2024-02-15 at 10:11 -0500, Jon Maloy wrote:
I wonder if the following could be acceptable:

  if (flags & MSG_PEEK)
          sk_peek_offset_fwd(sk, used);
  else if (peek_offset > 0)
          sk_peek_offset_bwd(sk, used);

  peek_offset is already present in the data cache, and if it has the value
  zero, that means either that sk->sk_peek_off is unused (-1) or that it actually
  is zero. Either way, no rewind is needed in that case.
I agree the above should avoid touching cold cachelines in the
fastpath, and looks functionally correct to me.

The last word is up to Eric :)

An actual patch seems needed.

In the current form, local variable peek_offset is 0 when !MSG_PEEK.

So the "else if (peek_offset > 0)" would always be false.

Yes, of course. This wouldn't work unless we read sk->sk_peek_off at the
beginning of the function.
I will look at the other suggestions.
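Something like this, i.e. reading sk->sk_peek_off once up front (untested
sketch, not a formal patch; the tcp_recvmsg_locked() context around it is
elided):

        /* Read once at function entry; -1 means SO_PEEK_OFF is unused. */
        int peek_offset = READ_ONCE(sk->sk_peek_off);

        /* ... later in the receive loop, after 'used' bytes of the
         * current skb have been consumed ...
         */
        if (flags & MSG_PEEK)
                sk_peek_offset_fwd(sk, used);
        else if (peek_offset > 0)
                sk_peek_offset_bwd(sk, used);

IIRC sk_peek_offset_bwd() already checks sk_peek_off >= 0 internally, so
the peek_offset > 0 test just spares the common non-peeking path the
extra read-modify-write.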
I *think* that moving sk_peek_off this way:

---
diff --git a/include/net/sock.h b/include/net/sock.h
index a9d99a9c583f..576a6a6abb03 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -413,7 +413,7 @@ struct sock {
        unsigned int            sk_napi_id;
 #endif
        int                     sk_rcvbuf;
-       int                     sk_disconnects;
+       int                     sk_peek_off;

        struct sk_filter __rcu  *sk_filter;
        union {
@@ -439,7 +439,7 @@ struct sock {
                struct rb_root  tcp_rtx_queue;
        };
        struct sk_buff_head     sk_write_queue;
-       __s32                   sk_peek_off;
+       int                     sk_disconnects;
        int                     sk_write_pending;
        __u32                   sk_dst_pending_confirm;
        u32                     sk_pacing_status; /* see enum sk_pacing */
---

should avoid problematic accesses.

The relevant cacheline layout is as follows:

                        /* --- cacheline 4 boundary (256 bytes) --- */
                struct sk_buff *   tail;                 /*   256     8 */
        } sk_backlog;                                    /*   240    24 */
        int                        sk_forward_alloc;     /*   264     4 */
        u32                        sk_reserved_mem;      /*   268     4 */
        unsigned int               sk_ll_usec;           /*   272     4 */
        unsigned int               sk_napi_id;           /*   276     4 */
        int                        sk_rcvbuf;            /*   280     4 */
        int                        sk_disconnects;       /*   284     4 */
                                // will become sk_peek_off
        struct sk_filter *         sk_filter;            /*   288     8 */
        union {
                struct socket_wq * sk_wq;                /*   296     8 */
                struct socket_wq * sk_wq_raw;            /*   296     8 */
        };                                               /*   296     8 */
        struct xfrm_policy *       sk_policy[2];         /*   304    16 */
        /* --- cacheline 5 boundary (320 bytes) --- */

        //  ...

        /* --- cacheline 6 boundary (384 bytes) --- */
        __s32                      sk_peek_off;          /*   384     4 */
                                // will become sk_disconnects
        int                        sk_write_pending;     /*   388     4 */
        __u32                      sk_dst_pending_confirm; /*   392     4 */
        u32                        sk_pacing_status;     /*   396     4 */
        long int                   sk_sndtimeo;          /*   400     8 */
        struct timer_list          sk_timer;             /*   408    40 */

        /* XXX last struct has 4 bytes of padding */

        /* --- cacheline 7 boundary (448 bytes) --- */

sk_peek_off will be in the same cacheline as sk_forward_alloc /
sk_reserved_mem / backlog tail, which are already touched by the
tcp_recvmsg_locked() main loop.

WDYT?
I was about to send a similar change, also moving sk_rcvtimeo, and
adding __cacheline_group_begin()/__cacheline_group_end()
annotations.

I can finish this today.
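For reference, those annotations are the zero-size markers from
include/linux/cache.h. A rough sketch of what the annotated struct could
look like (the group name below is made up for illustration; the real
grouping is whatever the final patch defines):

        struct sock {
                /* ... */
                __cacheline_group_begin(sock_recv_hot);
                int                     sk_rcvbuf;
                int                     sk_peek_off;
                long                    sk_rcvtimeo;
                __cacheline_group_end(sock_recv_hot);
                /* ... */
        };

The layout can then be enforced with CACHELINE_ASSERT_GROUP_MEMBER() /
CACHELINE_ASSERT_GROUP_SIZE() checks, as is already done for struct
net_device, so later field additions don't silently regress it.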

There is also the following alternative:

if (flags & MSG_PEEK)
       sk_peek_offset_fwd(sk, used);
else if (flags & MSG_TRUNC)
       sk_peek_offset_bwd(sk, used);

This is the way we use it, and probably the typical usage.
It would force a user to drain the receive queue with MSG_TRUNC whenever they
are using MSG_PEEK_OFF, but I don't really see that as a limitation.
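
To illustrate the pattern (user-space sketch, helper name made up, error
handling trimmed):

        #include <stdio.h>
        #include <sys/socket.h>

        static void peek_then_drain(int fd)
        {
                char buf[4096];
                int off = 0;
                ssize_t n;

                /* Enable peek-offset tracking; the kernel advances the
                 * offset on every MSG_PEEK read.
                 */
                setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));

                /* Peek: data stays queued, sk_peek_off moves forward. */
                n = recv(fd, buf, sizeof(buf), MSG_PEEK);
                if (n <= 0)
                        return;

                /* Drain: MSG_TRUNC discards the bytes without copying
                 * them out again, and sk_peek_off is rewound.
                 */
                n = recv(fd, buf, (size_t)n, MSG_TRUNC);
                printf("dropped %zd bytes\n", n);
        }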

Anyway, if Paolo's suggestion solves the problem, this shouldn't be necessary.

///jon