On Sat, Apr 6, 2024 at 8:37 PM Eric Dumazet wrote:
On Sat, Apr 6, 2024 at 8:21 PM wrote:
From: Jon Maloy
Testing of the previous commit ("tcp: add support for SO_PEEK_OFF") in this series along with the pasta protocol splicer revealed a bug in the way tcp handles window advertising during extreme memory squeeze situations.
The excerpt of the logging session below shows what is happening:
[5201<->54494]: ==== Activating log @ tcp_select_window()/268 ====
[5201<->54494]: (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) --> TRUE
[5201<->54494]: tcp_select_window(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354, returning 0
[5201<->54494]: ADVERTISING WINDOW SIZE 0
[5201<->54494]: __tcp_transmit_skb(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 83
[...]
I would prefer a packetdrill test, it is not clear what is happening...
In particular, have you used SO_RCVBUF ?
[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 131072 bytes, window now: 250164, qlen: 1

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: (win_now: 250164, new_win: 262144 >= (2 * win_now): 500328))? --> time_to_ack: 0
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning 57036 bytes, window now: 250164, qlen: 0

[5201<->54494]: tcp_recvmsg_locked(->)
[5201<->54494]: __tcp_cleanup_rbuf(->) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: NOT calling tcp_send_ack()
[5201<->54494]: __tcp_cleanup_rbuf(<-) tp->rcv_wup: 2812454294, tp->rcv_wnd: 5812224, tp->rcv_nxt 2818016354
[5201<->54494]: tcp_recvmsg_locked(<-) returning -11 bytes, window now: 250164, qlen: 0
We can see that although we are advertising a window size of zero, tp->rcv_wnd is not updated accordingly. This leads to a discrepancy between this side's and the peer's view of the current window size.
- The peer thinks the window is zero, and stops sending.
- This side ends up in a cycle where it repeatedly calculates a new window size it finds too small to advertise.

Hence no messages are received, no acknowledgments are sent, and the situation remains locked even after the last queued receive buffer has been consumed.
We fix this by setting tp->rcv_wnd to 0 before we return from the function tcp_select_window() in this particular case. Further testing shows that the connection recovers neatly from the squeeze situation, and traffic can continue indefinitely.
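As a rough illustration of the arithmetic in the log above, the following user-space sketch recomputes win_now from the three logged fields and applies the same comparison. The struct and function names are hypothetical stand-ins, not kernel code. The clamping at zero and the non-zero check on new_win are assumptions about how the kernel helpers behave, not something the log itself shows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three struct tcp_sock fields shown in the log. */
struct rcv_state {
    uint32_t rcv_wup;  /* rcv_nxt at the time of the last window update */
    uint32_t rcv_wnd;  /* window this side believes it last advertised */
    uint32_t rcv_nxt;  /* next sequence number expected */
};

/* Remaining window as this side sees it, clamped at zero (assumed behavior). */
static uint32_t receive_window(const struct rcv_state *s)
{
    int32_t win = (int32_t)(s->rcv_wup + s->rcv_wnd - s->rcv_nxt);

    return win < 0 ? 0 : (uint32_t)win;
}

/*
 * The comparison printed in the log: an ACK is sent only if the newly
 * computed window is at least twice the window this side thinks is open.
 * The new_win != 0 check is an assumption, not visible in the log.
 */
static bool time_to_ack(const struct rcv_state *s, uint32_t new_win)
{
    uint32_t win_now = receive_window(s);

    return new_win != 0 && new_win >= 2 * win_now;
}
```

With the logged values (rcv_wup 2812454294, rcv_wnd 5812224, rcv_nxt 2818016354), receive_window() gives 250164 and time_to_ack() is false for new_win = 262144, matching the log. With rcv_wnd set to 0, the computed window clamps to 0 and the same new_win makes time_to_ack() true.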
Reviewed-by: Stefano Brivio
Signed-off-by: Jon Maloy
I do not think this patch is good. If we reach zero window, it is a sign something is wrong.

TCP has heuristics to slow down the sender if the receiver does not drain the receive queue fast enough. MSG_PEEK is an obvious reason, and SO_RCVLOWAT too.

I suggest you take a look at tcp_set_rcvlowat(), see what is needed for SO_PEEK_OFF (ab)use ?

In short, when SO_PEEK_OFF is in action:
- TCP needs to not delay ACK when receive queue starts to fill
- TCP needs to make sure sk_rcvbuf and tp->window_clamp grow (if autotuning is enabled)