TIME_WAIT: Random Notes

bithacks wrote 08/22/2015 at 08:12 • 2 min read

[Just a draft - please ignore for now]

We have all seen it: TIME_WAIT sockets pile up and all hell breaks loose. This is an attempt to document all there is to know about this state, so that we can learn how to address the issue better and move on with our lives.

TIME_WAIT has been around for a while. It first appears in good old RFC 793 (Sept 1981):

    TIME-WAIT - represents waiting for enough time to pass to be sure
    the remote TCP received the acknowledgment of its connection
    termination request.

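RFC 793 specifies that a connection stays in TIME-WAIT for twice the Maximum Segment Lifetime (2 × MSL). With the RFC's MSL of two minutes, that is a maximum of four minutes. Modern Linux kernels are shorter: the TIME_WAIT period itself is fixed at 60 seconds (the TCP_TIMEWAIT_LEN constant in the kernel source), a value that is often loosely quoted as the MSL.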
Open Question: Does it make sense to have a uniform tcp_fin_timeout value for all sockets? Shouldn't it be routing-sensitive? Values for localhost connections should be an order of magnitude lower.

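The fundamental problem here is that TIME_WAIT is entered by the side that close()s the connection first. Once that side has received the other side's FIN and sent the final ACK, its socket is still left hanging around, because that last ACK is itself never acknowledged. Given that there's no further handshake to confirm it arrived, the closing side can only wait blindly, long enough to answer a retransmitted FIN if the ACK was lost, hence the sleep.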
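As a rough illustration of the sequence described above, the active-close path of the RFC 793 state diagram can be written as a small transition table. This is only a sketch: the event names (close, recv_ack, recv_fin, timeout_2msl) are made up for the example, and simultaneous close (the CLOSING state) is left out.

```python
# Active-close path of the RFC 793 TCP state diagram.
# Event names are hypothetical labels chosen for this sketch.
ACTIVE_CLOSE = {
    ("ESTABLISHED", "close"): "FIN_WAIT_1",      # we send our FIN
    ("FIN_WAIT_1", "recv_ack"): "FIN_WAIT_2",    # peer ACKs our FIN
    ("FIN_WAIT_2", "recv_fin"): "TIME_WAIT",     # peer's FIN arrives, we send the final ACK
    ("TIME_WAIT", "timeout_2msl"): "CLOSED",     # nothing confirms the ACK, so we just wait
}


def run(events, state="ESTABLISHED"):
    """Apply a list of events to the table and return the resulting state."""
    for event in events:
        state = ACTIVE_CLOSE[(state, event)]
    return state


print(run(["close", "recv_ack", "recv_fin"]))
print(run(["close", "recv_ack", "recv_fin", "timeout_2msl"]))
```

The only way out of TIME_WAIT in this table is the timer, which is the "sleep" the paragraph refers to.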

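Can parallels be drawn between TIME_WAIT sockets and "zombie" processes left behind by fork()? Not quite, as TIME_WAIT sockets have a time limit and go away on their own, whereas a zombie stays around until its parent reaps it.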

The classical problem is the combination of a fixed rate of incoming connections, a 2- or 4-minute TIME_WAIT and a limit on the number of file descriptors. Given a request volume of X queries/sec, each closing one connection, the expected number of sockets in TIME_WAIT becomes:

    N = X × 2 × MSL

So for a host with 10 req/sec, taking the MSL as 60 sec (a TIME_WAIT period of 2 × MSL = 120 sec), the expected number of TIME_WAIT sockets on the host will be 10 × 120 = 1200. With a standard ulimit of 1024 open files, that means we're already deep in trouble.
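The estimate can be sketched as a small calculation. It assumes that every request ends with one connection closed on this host, that the request rate is steady, and that TIME_WAIT lasts 2 × MSL. The function name and parameters are made up for the example.

```python
def expected_time_wait_sockets(requests_per_sec, msl_seconds):
    """Steady-state number of sockets in TIME_WAIT.

    Assumes one locally closed connection per request and a
    TIME_WAIT period of 2 * MSL (both are assumptions of this sketch).
    """
    time_wait_seconds = 2 * msl_seconds
    return requests_per_sec * time_wait_seconds


# The numbers from the example: 10 req/sec with the MSL taken as 60 sec.
print(expected_time_wait_sockets(10, 60))
```

Halving either the request rate or the MSL halves the estimate, since the relationship is linear in both.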

Useful Links