On Wed, 29 Oct 2025 11:35:29 +1100
David Gibson wrote:
On Wed, Oct 29, 2025 at 12:13:30AM +0100, Stefano Brivio wrote:
On Mon, 20 Oct 2025 20:17:10 +1100 David Gibson wrote:
On Mon, Oct 20, 2025 at 07:11:07AM +0200, Stefano Brivio wrote:
On Mon, 20 Oct 2025 11:20:19 +1100 David Gibson wrote:
[snip]
Rather than the local link, I was thinking of whatever monitor or liveness probe in KubeVirt that might have a 60-second period, or some firewall agent, or how long it typically takes for guests to stop and resume again in KubeVirt.
Right, I hadn't considered those. Although... do those actually reuse a single connection? I would have guessed they use a new connection each time, making the timeouts here irrelevant.
It depends on the definition of "each time", because we don't time out host-side connections immediately.
Hm, ok. Is your concern that getting a negative answer from the probe will take too long?
More like getting a positive answer taking too long, because we retry so infrequently.
Right, but it will only be slow if we lose the first probe, which should be very rare.
No, because again, that might be due to the guest doing something with its firewall, or stopping, resuming, coming online, etc. It's not necessarily rare. If that situation persists for at least 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds, then without a clamp the next retries come only after further waits of 64 and 128 seconds, that is, 127 and 255 seconds after the first attempt. In this case, to me, it looks more reasonable to retry every minute instead.
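For illustration only (this is not passt code): a minimal Python sketch of the retry schedule under discussion, assuming a 1-second initial timeout that doubles on every retry, and a hypothetical 60-second clamp standing in for "retry every minute". The function names and parameters are made up for this example.

```python
from itertools import accumulate


def retry_intervals(attempts, initial=1, clamp=None):
    """Return the wait (in seconds) before each retry.

    The wait doubles on every retry, starting from `initial`.
    If `clamp` is given, no single wait exceeds it.
    """
    intervals = []
    wait = initial
    for _ in range(attempts):
        intervals.append(wait if clamp is None else min(wait, clamp))
        wait *= 2
    return intervals


def retry_times(attempts, initial=1, clamp=None):
    """Return the elapsed time (in seconds) at which each retry is sent."""
    return list(accumulate(retry_intervals(attempts, initial, clamp)))


if __name__ == "__main__":
    for clamp in (None, 60):
        print("clamp:", clamp)
        print("  waits:  ", retry_intervals(8, clamp=clamp))
        print("  sent at:", retry_times(8, clamp=clamp))
```

Comparing the two runs shows the point of the clamp: the early retries are identical, and only the long tail of the schedule is shortened.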
Pretending passt isn't there, the timeout would come from the default values for TCP connections. It looks like there's no specific SO_SNDTIMEO value set for those probes, and you can't configure the timeout, at least according to:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-...
My guess would be that the probe would probably time out at the application level long before the TCP layer times out, but I don't know for sure.
I don't think so. What I was pointing out is that I couldn't find any place in the implementation of those probes where a particular *handshake timeout* (not probe interval) is set on top of Linux's defaults, so timeouts at TCP layer and application level should be the same (no additional timeout in application logic).
Huh, that's mildly surprising to me.
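To make the distinction above concrete, here is a hedged Python sketch of a TCP probe, not taken from Kubernetes or KubeVirt. `tcp_probe` and `handshake_timeout` are hypothetical names. With `handshake_timeout=None` the socket is left in blocking mode, so `connect()` only gives up when the kernel does (on Linux, governed by settings such as `tcp_syn_retries`); with a number, the application imposes its own, shorter limit on the handshake.

```python
import socket


def tcp_probe(host, port, handshake_timeout=None):
    """Return True if a TCP connection to (host, port) can be established.

    handshake_timeout=None: blocking connect, the kernel's own TCP
    timeouts decide when to give up.
    handshake_timeout=<seconds>: the application gives up after that long.
    """
    try:
        with socket.create_connection((host, port), timeout=handshake_timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Hypothetical target, for demonstration only.
    print(tcp_probe("127.0.0.1", 8080, handshake_timeout=1.0))
```

If a probe implementation never sets anything like `handshake_timeout`, the two timeouts coincide, which is the situation described above.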
-- Stefano