From bd773295ae61210befceba7a8af7784b04706c5c Mon Sep 17 00:00:00 2001
From: Willy Tarreau
Date: Wed, 16 Oct 2024 18:08:39 +0200
Subject: [PATCH] BUG/MEDIUM: queue: make sure never to queue when there's no more served conns

Since commit 53f52e67a0 ("BUG/MEDIUM: queue: always dequeue the backend
when redistributing the last server"), we've got two reports again still
showing the theoretically impossible condition in pendconn_add(),
including a single threaded one.

Thanks to the traces, the issue could be tracked down to the redispatch
part. In fact, in non-determinist LB algorithms (RR, LC, FAS), we don't
perform the LB if there are pending connections in the backend, since it
indicates that previous attempts already failed, so we directly return
SRV_STATUS_FULL. And contrary to a previous belief, it is possible to
meet this condition with be->served==0 when redispatching (and likely
with maxconn not greater than the number of threads). The problem is
that in this case, the entry is queued and then the
pendconn_must_try_again() function checks if any connections are
currently being served to detect if we missed a race, and tries again,
but that situation is not caused by a concurrent thread and will never
fix itself, resulting in the loop.

All that part around pendconn_must_try_again() is still quite brittle,
and a safer approach would involve a sequence counter to detect new
arrivals and dequeues during the pendconn_add() call. But it's more
sensitive work, probably for a later fix.

This fix must be backported wherever the fix above was backported.
Thanks to Patrick Hemmer, as well as Damien Claisse and Basha
Mougamadou from Criteo for their help on tracking this one!

(cherry picked from commit ca275d99ce02e72d707fc87da133d739cdda5146)
Signed-off-by: Christopher Faulet
---
 src/backend.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend.c b/src/backend.c
index ff698fe..4fb4cc1 100644
--- a/src/backend.c
+++ b/src/backend.c
@@ -681,7 +681,7 @@ int assign_server(struct stream *s)
 	/* if there's some queue on the backend, with certain algos we
 	 * know it's because all servers are full.
 	 */
-	if (s->be->queue.length && s->be->queue.length != s->be->beconn &&
+	if (s->be->queue.length && s->be->served && s->be->queue.length != s->be->beconn &&
 	    (((s->be->lbprm.algo & (BE_LB_KIND|BE_LB_NEED|BE_LB_PARM)) == BE_LB_ALGO_FAS)||   // first
 	     ((s->be->lbprm.algo & (BE_LB_KIND|BE_LB_NEED|BE_LB_PARM)) == BE_LB_ALGO_RR) ||   // roundrobin
 	     ((s->be->lbprm.algo & (BE_LB_KIND|BE_LB_NEED|BE_LB_PARM)) == BE_LB_ALGO_SRR))) { // static-rr
-- 
1.7.10.4
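
Editor's note: the retry loop described above can be illustrated with a
simplified, self-contained model. This is a sketch only; the names
(struct backend, try_assign, must_retry_after_queuing) are hypothetical
and do not correspond to HAProxy's real structures or functions (unlike
assign_server(), pendconn_must_try_again() and SRV_STATUS_FULL cited in
the message). It only shows why the queue short-cut must also require
served > 0: with served == 0 and a non-empty backend queue, nothing can
ever dequeue the entry, so "try again" never succeeds.

#include <stdio.h>

struct backend {
	int queue_length;  /* pending connections queued on the backend */
	int served;        /* connections currently being served */
	int beconn;        /* total connections handled by the backend */
};

/* Returns 1 when the request may be dispatched to a server, 0 when it is
 * queued (the equivalent of returning SRV_STATUS_FULL in the description).
 * When require_served is set, the short-cut is skipped if nothing is
 * currently served, which models the fix in this patch.
 */
static int try_assign(const struct backend *be, int require_served)
{
	if (be->queue_length &&
	    (!require_served || be->served) &&
	    be->queue_length != be->beconn)
		return 0;
	return 1;
}

/* Model of the "missed race" check: when nothing is being served, no other
 * connection will ever dequeue us, so the caller un-queues and retries,
 * assuming the FULL status was caused by a race that has since passed.
 */
static int must_retry_after_queuing(const struct backend *be)
{
	return be->served == 0;
}

int main(void)
{
	/* State observed during redispatch: one queued entry remains but
	 * nothing is being served any more.
	 */
	struct backend be = { .queue_length = 1, .served = 0, .beconn = 2 };
	int tries = 0;

	/* Old behaviour: every attempt queues, every retry is allowed, and
	 * the state never changes, so the loop would spin forever; it is
	 * capped at 5 iterations here purely for demonstration.
	 */
	while (!try_assign(&be, 0) && must_retry_after_queuing(&be) && ++tries < 5)
		;
	printf("old check: assigned=%d after %d retries\n", try_assign(&be, 0), tries);

	/* New behaviour: with served == 0 the queue short-cut is skipped,
	 * so load balancing is attempted instead of queuing again.
	 */
	printf("new check: assigned=%d\n", try_assign(&be, 1));
	return 0;
}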