When the dns session callback (dns_session_release()) is called upon
error (i.e.: when some pending queries were not sent), we try our best
to re-create the applet in order to preserve the pending queries and
give them a chance to be retried. This is done at the end of
dns_session_release().
However, doing so exposes us to an issue: if the error preventing
queries from being sent keeps occurring over and over, the dns session
could stay there indefinitely. Meanwhile, other dns sessions may be
created on the same dns_stream_server periodically. If previous failing
dns sessions don't terminate while we also keep creating new ones, we
end up accumulating failing sessions on a given dns_stream_server,
which can eventually cause a resource shortage.
This issue was found when trying to address ("BUG/MINOR: dns: add tempo
between 2 connection attempts for dns servers").
To fix it, we track the number of consecutive failed sessions for a
given dns server. When we reach the threshold (set to 100), we consider
that the link to the dns server is broken (at least temporarily) and we
force dns_session_new() to fail, so that we stop creating new sessions
until one of the existing ones eventually succeeds.
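As an illustration of the mechanism (a minimal standalone sketch using
C11 atomics instead of haproxy's HA_ATOMIC_* wrappers; all names below
are made up and only mirror the logic of the patch):

  #include <stdatomic.h>
  #include <stdbool.h>

  #define MAX_CONSECUTIVE_ERRORS 100

  static atomic_int link_errors; /* per-server consecutive error count */

  /* a query could be sent: the link works, re-open the gate */
  static void on_query_sent(void)
  {
      atomic_store(&link_errors, 0);
  }

  /* a session was released with unsent queries: count the failure,
   * capping the counter at the threshold so it can never overflow
   */
  static void on_session_error(void)
  {
      int errors = atomic_load(&link_errors);

      while (errors < MAX_CONSECUTIVE_ERRORS &&
             !atomic_compare_exchange_weak(&link_errors, &errors,
                                           errors + 1))
          ; /* failed CAS refreshed 'errors', retry */
  }

  /* gate for session creation: refuse while the link looks broken */
  static bool may_create_session(void)
  {
      return atomic_load(&link_errors) < MAX_CONSECUTIVE_ERRORS;
  }

The first session that succeeds again resets the counter to 0, which
re-allows dns_session_new() to create sessions.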
A workaround for this issue consists in setting the "maxconn" parameter
on the nameserver directive (under the resolvers section) to a
reasonable value so that no more than "maxconn" sessions may co-exist on
the same server at a given time.
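For example, a hypothetical resolvers section (the section name,
nameserver name and address are made up; only "maxconn" matters here):

  resolvers mydns
      nameserver ns1 tcp@192.0.2.53:53 maxconn 4
      # at most 4 dns sessions may co-exist towards ns1 at any time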
This may be backported to all stable versions.
("CLEANUP: dns: remove unused dns_stream_server struct member") may be
backported to ease the backport.
(cherry picked from commit 5288b39011b2449bfa896f7932c7702b5a85ee77)
Signed-off-by: Willy Tarreau <w@1wt.eu>
(cherry picked from commit 020348974b09cb011159260c08a8b821456029a6)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>
#define DNS_TCP_MSG_MAX_SIZE 65535
#define DNS_TCP_MSG_RING_MAX_SIZE (1 + 1 + 3 + DNS_TCP_MSG_MAX_SIZE) // varint_bytes(DNS_TCP_MSG_MAX_SIZE) == 3
+/* threshold to consider that the link to dns server is failing
+ * and we should stop creating new sessions
+ */
+#define DNS_MAX_DSS_CONSECUTIVE_ERRORS 100
+
/* DNS request or response header structure */
struct dns_header {
        uint16_t id;

struct dns_stream_server {
        struct server *srv;
        struct dns_ring *ring_req;
+       int consecutive_errors; /* number of errors since last successful query (atomically updated without lock) */
        int maxconn;
        int idle_conns;
        int cur_conns;

        LIST_APPEND(&ds->queries, &query->list);
        eb32_insert(&ds->query_ids, &query->qid);
        ds->onfly_queries++;
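+       /* the query was successfully sent: reset the consecutive error counter */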
+       HA_ATOMIC_STORE(&ds->dss->consecutive_errors, 0);
}

/* update the tx_offset to handle output in 16k streams */
{
        struct dns_session *ds = appctx->svcctx;
        struct dns_stream_server *dss __maybe_unused;
+       int consecutive_errors;

        if (!ds)
                return;

        /* reset offset to be sure to start from message start */
        ds->tx_msg_offset = 0;
+       consecutive_errors = HA_ATOMIC_LOAD(&ds->dss->consecutive_errors);
+       /* we know ds encountered an error because it failed to send all
+        * its queries: increase consecutive_errors (we take some precautions
+        * to prevent the counter from overflowing since it is atomically
+        * updated)
+        */
+       while (consecutive_errors < DNS_MAX_DSS_CONSECUTIVE_ERRORS &&
+              !HA_ATOMIC_CAS(&ds->dss->consecutive_errors,
+                             &consecutive_errors, consecutive_errors + 1) &&
+              __ha_cpu_relax());
+
        /* here the ofs and the attached counter
         * are kept unchanged
         */

        if (dss->maxconn && (dss->maxconn <= dss->cur_conns))
                return NULL;

+       if (HA_ATOMIC_LOAD(&dss->consecutive_errors) >= DNS_MAX_DSS_CONSECUTIVE_ERRORS)
+               return NULL;
+
        ds = pool_zalloc(dns_session_pool);
        if (!ds)
                return NULL;