Repositories - haproxy-2.3.git/log

BUG/MEDIUM: mux-fcgi: Fix locking of idle_conns lock in the FCGI I/O callback

When the commit e388f2fbc ("MEDIUM: muxes: mark idle conns tasklets with
TASK_F_USR1") was backported (commit id b360bb88 on the 2.3), a call to
HA_SPIN_UNLOCK() was missed and not moved in the right code block. Thus it
is possible to unlock the idle_conns lock while it was never acquired.

This patch should fix the issue #1191. It is 2.3 specific, thus there is no
upstream commid id. No backport is needed except if commit b360bb88 is
backported.

BUG/MINOR: freq_ctr/threads: make use of the last updated global time

The freq counters were using the thread's own time as the start of the
current period. The problem is that in case of contention, it was
occasionally possible to perform non-monotonic updates on the edge of
the next second, because if the upfront thread updates a counter first,
it causes a rotation, then the second thread loses the race from its
older time, and tries again, and detects a different time again, but
in the past so it only updates the counter, then a third thread on the
new date would detect a change again, thus provoking a rotation again.

The effect was triple:
  - rare loss of stored values during certain transitions from one
    period to the next one, causing counters to report 0
  - half of the threads forced to go through the slow path every second
  - difficult convergence when using many threads where the CAS can fail
    a lot and we can observe N(N-1) attempts for N threads to complete

This patch fixes this issue in two ways:
  - first, it now makes use og the monotonic global_now value which also
    happens to be volatile and to carry the latest known time; this way
    time will never jump backwards anymore and only the first thread
    updates it on transition, the other ones do not need to.

  - second, re-read the time in the loop after each failure, because
    if the date changed in the counter, it means that one thread knows
    a more recent one and we need to update. In this case if it matches
    the new current second, the fast path is usable.

This patch relies on previous patch "MINOR: time: export the global_now
variable" and must be backported as far as 1.8.

(cherry picked from commit a1ecbca0a50ccb273bb5cdc9f031408d782a7bcf)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: time: export the global_now variable

This is the process-wide monotonic time that is used to update each
thread's own time. It may be required at a few places where a strictly
monotonic clock is required such as freq_ctr. It will be have to be
backported as a dependency of a forthcoming fix.

(cherry picked from commit 650f374f24c4dd14e90d336c72c2370242be1459)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

[RELEASE] Released version 2.3.7

Released version 2.3.7 with the following main changes :
    - BUG/MINOR: backend: fix condition for reuse on mode HTTP
    - BUG/MINOR: hlua: Don't strip last non-LWS char in hlua_pushstrippedstring()
    - BUG/MINOR: ssl: don't truncate the file descriptor to 16 bits in debug mode
    - REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h
    - BUG/MINOR: mt-list: always perform a cpu_relax call on failure
    - MINOR: atomic: add armv8.1-a atomics variant for cas-dw
    - MINOR: atomic: implement a more efficient arm64 __ha_cas_dw() using pairs
    - BUG/MEDIUM: session: NULL dereference possible when accessing the listener
    - MINOR: tasks: refine the default run queue depth
    - MINOR: listener: refine the default MAX_ACCEPT from 64 to 4
    - OPTIM: server: switch the actconn list to an mt-list
    - MINOR: server: move actconns to the per-thread structure
    - MINOR: lb/api: let callers of take_conn/drop_conn tell if they have the lock
    - OPTIM: lb-first: do not take the server lock on take_conn/drop_conn
    - OPTIM: lb-leastconn: do not take the server lock on take_conn/drop_conn
    - OPTIM: lb-leastconn: do not unlink the server if it did not change
    - MINOR: dynbuf: make the buffer wait queue per thread
    - MINOR: dynbuf: use regular lists instead of mt_lists for buffer_wait
    - MINOR: dynbuf: pass offer_buffers() the number of buffers instead of a threshold
    - MINOR: stream: add an "epoch" to figure which streams appeared when
    - MINOR: cli/streams: make "show sess" dump all streams till the new epoch
    - MINOR: streams: use one list per stream instead of a global one
    - MEDIUM: streams: do not use the streams lock anymore
    - MEDIUM: pools: add CONFIG_HAP_NO_GLOBAL_POOLS and CONFIG_HAP_GLOBAL_POOLS
    - MINOR: pools: double the local pool cache size to 1 MB
    - MEDIUM: backend: use a trylock when trying to grab an idle connection
    - MINOR: task: limit the number of subsequent heavy tasks with flag TASK_HEAVY
    - MINOR: ssl: mark the SSL handshake tasklet as heavy
    - BUG/MEDIUM: ssl: properly remove the TASK_HEAVY flag at end of handshake
    - MINOR: task: add an application specific flag to the state: TASK_F_USR1
    - MEDIUM: muxes: mark idle conns tasklets with TASK_F_USR1
    - MINOR: xprt: add new xprt_set_idle and xprt_set_used methods
    - MEDIUM: ssl: implement xprt_set_used and xprt_set_idle to relax context checks
    - MEDIUM: task: remove the tasks_run_queue counter and have one per thread
    - MINOR: task: give the scheduler a bit more flexibility in the runqueue size
    - OPTIM: task: automatically adjust the default runqueue-depth to the threads
    - BUG/MEDIUM: stick-tables: fix ref counter in table entry using multiple http tracksc.
    - BUILD: atomic/arm64: force the register pairs to use in __ha_cas_dw()
    - BUG/MEDIUM: filters: Set CF_FL_ANALYZE on channels when filters are attached
    - BUG/MINOR: tcpcheck: Update .health threshold of agent inside an agent-check
    - BUG/MINOR: proxy/session: Be sure to have a listener to increment its counters
    - BUG/MINOR: session: Add some forgotten tests on session's listener
    - BUG/MINOR: tcpcheck: Fix double free on error path when parsing tcp/http-check
    - CLEANUP: tcp-rules: add missing actions in the tcp-request error message
    - Revert "BUG/MINOR: resolvers: Only renew TTL for SRV records with an additional record"
    - BUG/MINOR: resolvers: Consider server to have no IP on DNS resolution error
    - BUG/MINOR: resolvers: Reset server address on DNS error only on status change
    - BUG/MINOR: resolvers: Unlink DNS resolution to set RMAINT on SRV resolution
    - BUG/MEDIUM: resolvers: Don't set an address-less server as UP
    - BUG/MEDIUM: resolvers: Fix the loop looking for an existing ADD item
    - MINOR: resolvers: new function find_srvrq_answer_record()
    - BUG/MINOR; resolvers: Ignore DNS resolution for expired SRV item
    - BUG/MEDIUM: resolvers: Trigger a DNS resolution if an ADD item is obsolete
    - MINOR: resolvers: Use a function to remove answers attached to a resolution
    - MINOR: resolvers: Purge answer items when a SRV resolution triggers an error
    - MINOR: resolvers: Add function to change the srv status based on SRV resolution
    - MINOR: resolvers: Directly call srvrq_update_srv_state() when possible
    - BUG/MEDIUM: resolvers: Don't release resolution from a requester callbacks
    - BUG/MEDIUM: resolvers: Skip DNS resolution at startup if SRV resolution is set
    - MINOR: resolvers: Use milliseconds for cached items in resolver responses
    - MINOR: resolvers: Don't try to match immediatly renewed ADD items
    - BUG/MINOR: resolvers: Add missing case-insensitive comparisons of DNS hostnames

BUG/MINOR: resolvers: Add missing case-insensitive comparisons of DNS hostnames

DNS hostname comparisons were fixed to be case-insensitive (see b17b88487
"BUG/MEDIUM: dns: Consider the fact that dns answers are
case-insensitive"). However 2 comparisons are still case-sensitive.

This patch must be backported as far as 1.8.

(cherry picked from commit 59b29257338f8e7fb6def6bddda45d08651ffe1c)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Don't try to match immediatly renewed ADD items

The loop looking for existing ADD items to renew their last_seen must ignore
the items already renewed in the same loop. To do so, we rely on the
last_seen time. because it is now based on now_ms, it is safe.

Doing so avoid to match several time the same ADD item when the same IP
address is found in several ADD item. This reduces the number of extra DNS
resolutions.

This patch depends on "MINOR: resolvers: Use milliseconds for cached items
in resolver responses". Both may be backported as far as 2.2 if necessary.

(cherry picked from commit e8674c71840cd783d6ba1cc073c0d074aea09fd8)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Use milliseconds for cached items in resolver responses

The last time when an item was seen in a resolver responses is now stored in
milliseconds instead of seconds. This avoid some corner-cases at the
edges. This also simplifies time comparisons.

(cherry picked from commit 55c1c4053f70bf53adfcd5067fa959e20f2b44fe)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Skip DNS resolution at startup if SRV resolution is set

At startup, if a SRV resolution is set for a server, no DNS resolution is
created. We must wait the first SRV resolution to know if it must be
triggered. It is important to do so for two reasons.

First, during a "classical" startup, a server based on a SRV resolution has
no hostname. Thus the created DNS resolution is useless. Best waiting the
first SRV resolution. It is not really a bug at this stage, it is just
useless.

Second, in the same situation, if the server state is loaded from a file,
its hosname will be set a bit later. Thus, if there is no additionnal record
for this server, because there is already a DNS resolution, it inhibits any
new DNS resolution. But there is no hostname attached to the existing DNS
resolution. So no resolution is performed at all for this server.

To avoid any problem, it is fairly easier to handle this special case during
startup. But this means we must be prepared to have no "resolv_requester"
field for a server at runtime.

This patch must be backported as far as 2.2.

(cherry picked from commit d83a6df5cde5f2e5417e9f995ab966ba01d17501)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Don't release resolution from a requester callbacks

Another way to say it: "Safely unlink requester from a requester callbacks".

Requester callbacks must never try to unlink a requester from a resolution, for
the current requester or another one. First, these callback functions are called
in a loop on a request list, not necessarily safe. Thus unlink resolution at
this place, may be unsafe. And it is useless to try to make these loops safe
because, all this stuff is placed in a loop on a resolution list. Unlink a
requester may lead to release a resolution if it is the last requester.

However, the unkink is necessary because we cannot reset the server state
(hostname and IP) with some pending DNS resolution on it. So, to workaround
this issue, we introduce the "safe" unlink. It is only performed from a
requester callback. In this case, the unlink function never releases the
resolution, it only reset it if necessary. And when a resolution is found
with an empty requester list, it is released.

This patch depends on the following commits :

* MINOR: resolvers: Purge answer items when a SRV resolution triggers an error
* MINOR: resolvers: Use a function to remove answers attached to a resolution
* MINOR: resolvers: Directly call srvrq_update_srv_state() when possible
* MINOR: resolvers: Add function to change the srv status based on SRV resolution

All the series must be backported as far as 2.2. It fixes a regression
introduced by the commit b4badf720 ("BUG/MINOR: resolvers: new callback to
properly handle SRV record errors").

don't release resolution from requester cb

(cherry picked from commit 0efc0993ec5210c8cf5a98eb9c0220014e030fd6)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Directly call srvrq_update_srv_state() when possible

When the server status must be updated from the result of a SRV resolution,
we can directly call srvrq_update_srv_state(). It is simpler and this avoid
a test on the server DNS resolution.

This patch is mandatory for the next commit. It also rely on "MINOR:
resolvers: Directly call srvrq_update_srv_state() when possible".

(cherry picked from commit 6b117aed492d9916a5c656b4fbfcac6159133c0f)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Add function to change the srv status based on SRV resolution

srvrq_update_srv_status() update the server status based on result of SRV
resolution. For now, it is only used from snr_update_srv_status() when
appropriate.

(cherry picked from commit 5efdef24c1753d7a68d1f6c8dc8cb6b4b84a3361)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Purge answer items when a SRV resolution triggers an error

When a SRV request trigger an error, if we decide to handle the error
because last_valid duration is expired, the answer list may be purged. All
items are considered as obsolete.

(cherry picked from commit 51d5e3bda7fb07a5f03cb878dd66f673c3ed1a59)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Use a function to remove answers attached to a resolution

resolv_purge_resolution_answer_records() must be used to removed all answers
attached to a resolution. For now, it is only used when a resolution is
released.

(cherry picked from commit 1dec5c793474027ddffbfca3849e5ca4e9e51083)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Trigger a DNS resolution if an ADD item is obsolete

When a ADD item attached to a SRV item is removed because it is obsolete, we
must trigger a DNS resolution to be sure the hostname still resolves or
not. There is no other way to be the entry is still valid. And we cannot set
the server in RMAINT immediatly, because a DNS server may be inconsitent and
may stop to add some additionnal records.

The opposite is also true. If a valid ADD item is still attached to a SRV
item, any DNS resolution must be stopped. There is no reason to perform
extra resolution in this case.

This patch must be backported as far as 2.2.

(cherry picked from commit 3e0600fbbfee7daf957033505d56bc8dcc874185)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR; resolvers: Ignore DNS resolution for expired SRV item

If no ADD item is found for a SRV item in a SRV response, a DNS resolution
is triggered. When it succeeds, we must be sure the SRV item is still
alive. Otherwise the DNS resolution must be ignored.

This patch depends on the commit "MINOR: resolvers: Move last_seen time of
an ADD into its corresponding SRV item". Both must be backported as far as
2.2.

(cherry picked from commit 49531e8471a1c1ae9e1c7a6c1aeeb00713024beb)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: new function find_srvrq_answer_record()

This function search for a SRV answer item associated to a requester
whose type is server.
This is mainly useful to "link" a server to its SRV record when no
additional record were found to configure the IP address.

This patch is required by a bug fix.

(cherry picked from commit 6a8d11dc8051f4fc40c9ab1266cd58367ea1ca19)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Fix the loop looking for an existing ADD item

For each ADD item found in a SRV response, we try to find a corresponding
ADD item already attached to an existing SRV item. If found, the ADD
last_seen time is updated, otherwise we try to find a SRV item with no ADD
to attached the new one.

However, the loop is buggy. Instead of comparing 2 ADD items, it compares
the new ADD item with the SRV item. Because of this bug, we are unable to
renew last_seen time of existing ADD.

This patch must be backported as far as 2.2.

(cherry picked from commit 77f860699c17426ee6e727fa07a287b32276f1dd)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Don't set an address-less server as UP

when a server status is updated based on a SRV item, it is always set to UP,
regardless it has an IP address defined or not. For instance, if only a SRV
item is received, with no additional record, only the server hostname is
defined. We must wait to have an IP address to set the server as UP.

This patch must be backported as far as 2.2.

(cherry picked from commit ab177ac1f31b5c6c2e4f7d023923dd668cce3076)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: resolvers: Unlink DNS resolution to set RMAINT on SRV resolution

When a server is set in RMAINT becaues of a SRV resolution failure, the
server DNS resolution, if any, must be unlink first. It is mandatory to
handle the change in the context of a SRV resolution.

This patch must be backported as far as 2.2.

(cherry picked from commit bca680ba90787fc57b9711e268f6fb0c74e7e49c)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: resolvers: Reset server address on DNS error only on status change

When a DNS resolution error is detected, in snr_resolution_error_cb(), the
server address must be reset only if the server status has changed. It this
case, it means the server is set to RMAINT. Thus the server address may by
reset.

This patch fixes a bug introduced by commit d127ffa9f ("BUG/MEDIUM:
resolvers: Reset address for unresolved servers"). It must be backported as
far as 2.0.

(cherry picked from commit 5130c21fbb806a28ab4e535b700df872a80798d8)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: resolvers: Consider server to have no IP on DNS resolution error

When an error is received for a DNS resolution, for instance a NXDOMAIN
error, the server must be considered to have no address when its status is
updated, not the opposite.

Concretly, because this parameter is not used on error path in
snr_update_srv_status(), there is no impact.

This patch must be backported as far as 1.8.

(cherry picked from commit bd0227c1096ecc9e11da315c399b178b16ad7a9d)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

Revert "BUG/MINOR: resolvers: Only renew TTL for SRV records with an additional record"

This reverts commit a331a1e8eb2ad4750711a477ca3e22d940495faf.

This commit fixes a real bug, but it also reveals some hidden bugs, mostly
because of some design issues. Thus, in itself, it create more problem than
it solves. So revert it for now. All known bugs will be addressed in next
commits.

This patch should be backported as far as 2.2.

(cherry picked from commit 5037c06d91e0ed748932fcfaae11eeadf5949957)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

CLEANUP: tcp-rules: add missing actions in the tcp-request error message

The tcp-request error message only mentions "accept", "reject" and
track-sc*, but there are a few other ones that were missing, so let's
add them.

This could be backported, though it's not likely that it will help anyone
with an existing config.

(cherry picked from commit 72d012fbd971bfd8b2bf6ab022572a816b759cbd)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: tcpcheck: Fix double free on error path when parsing tcp/http-check

When a "tcp-check" or a "http-check" rule is parsed, we try to get the
previous rule in the ruleset to get its index. We must take care to reset
the pointer on this rule in case an error is triggered later on the
parsing. Otherwise, the same rule may be released twice. For instance, it
happens with such line :

http-check meth GET uri / ## note there is no "send" parameter

This patch must be backported as far as 2.2.

(cherry picked from commit cd03be73d59ad2c5d9ad43b28a1b0fda9f32526c)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: session: Add some forgotten tests on session's listener

During backport of commit 36119de18 ("BUG/MEDIUM: session: NULL dereference
possible when accessing the listener"), I missed some tests on the session's
listener because of the context changes.

It is specific to the 2.3, thus there is no upstream commit ID. It must
backported with the above commit as far as 1.8.

BUG/MINOR: proxy/session: Be sure to have a listener to increment its counters

It is possible to have a session without a listener. It happens for applets
on the client side. Thus all accesses to the listener info from the session
must be guarded. It was the purpose of the commit 36119de18 ("BUG/MEDIUM:
session: NULL dereference possible when accessing the listener"). However,
some tests on the session's listener existence are missing in proxy_inc_*
functions.

This patch should fix the issues #1171, #1172, #1173, #1174 and #1175. It
must be backported with the above commit as far as 1.8.

(cherry picked from commit 77e376783e891fe5dda4fe4e1198cf8dff0b118d)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: tcpcheck: Update .health threshold of agent inside an agent-check

If an agent-check is configured for a server, When the response is parsed,
the .health threshold of the agent must be updated on up/down/stopped/fail
command and not the threshold of the health-check. Otherwise, the
agent-check will compete with the health-check and may mark a DOWN server as
UP.

This patch should fix the issue #1176. It must be backported as far as 2.2.

(cherry picked from commit 24ec9434271345857b42cc5bd9c6b497ab01a7e4)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: filters: Set CF_FL_ANALYZE on channels when filters are attached

CF_FL_ANALYZE flag is used to know a channel is filtered. It is important to
synchronize request and response channels when the filtering ends.

However, it is possible to call all request analyzers before starting the
filtering on the response channel. This means flt_end_analyze() may be
called for the request channel before flt_start_analyze() on the response
channel. Thus because CF_FL_ANALYZE flag is not set on the response channel,
we consider the filtering is finished on both sides. The consequence is that
flt_end_analyze() is not called for the response and backend filters are
unregistered before their execution on the response channel.

It is possible to encounter this bug on TCP frontend or CONNECT request on
HTTP frontend if the client shutdown is reveiced with the first read.

To fix this bug, CF_FL_ANALYZE is set when filters are attached to the
stream. It means, on the request channel when the stream is created, in
flt_stream_start(). And on both channels when the backend is set, in
flt_set_stream_backend().

This patch must be backported as far as 1.7.

(cherry picked from commit 5647fbacdf822f2a6812a07dbdf27660f436f103)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUILD: atomic/arm64: force the register pairs to use in __ha_cas_dw()

Since commit f8fb4f75f ("MINOR: atomic: implement a more efficient arm64
__ha_cas_dw() using pairs"), on some modern arm64 (armv8.1+) compiled
with -march=armv8.1-a under gcc-7.5.0, a build error may appear on ev_poll.o :

/tmp/ccHD2lN8.s:1771: Error: reg pair must start from even reg at operand 1 -- `casp x27,x28,x22,x23,[x12]'
Makefile:927: recipe for target 'src/ev_poll.o' failed

It appears that the compiler cannot always assign register pairs there
for a structure made of two u64. It was possibly later addressed since
gcc-9.3 never caused this, but there's no trivially available info on
the subject in the changelogs. Unsuprizingly, using a u128 instead does
fix this, but it significantly inflates the code (+4kB for just 6 places,
very likely that it loaded some extra stubs) and the comparison is ugly,
involving two slower conditional jumps instead of a single one and a
conditional comparison. For example, ha_random64() grew from 144 bytes
to 232.

However, simply forcing the base register does work pretty well, and
makes the code even cleaner and more efficient by further reducing it
by about 4.5kB, possibly because it helps the compiler to pick suitable
registers for the pair there. And the perf on 64-cores looks steadily
0.5% above the previous one, so let's do this.

Note that the commit above was backported to 2.3 to fix scalability
issues on AWS Graviton2 platform, so this one will need to be as well.

(cherry picked from commit 3b728a92bbe35b0c3ef34c3c8942ac5541a1713c)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: stick-tables: fix ref counter in table entry using multiple http tracksc.

Setting multiple http-request track-scX rules generates entries
which never expires.

If there was already an entry registered by a previous http rule
'stream_track_stkctr(&s->stkctr[rule->action], t, ts)' didn't
register the new 'ts' into the stkctr. And function is left
with no reference on 'ts' whereas refcount had been increased
by the '_get_entry'

The patch applies the same policy as the one showed on tcp track
rules and if there is successive rules the track counter keep the
first entry registered in the counter and nothing more is computed.

After validation this should be backported in all versions.

(cherry picked from commit 362d25e50708adf2b273bf9ee04f6566f113815e)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

OPTIM: task: automatically adjust the default runqueue-depth to the threads

The recent default runqueue size reduction appeared to have significantly
lowered performance on low-thread count configs. Testing various values
runqueue values on different workloads under thread counts ranging from
1 to 64, it appeared that lower values are more optimal for high thread
counts and conversely. It could even be drawn that the optimal value for
various workloads sits around 280/sqrt(nbthread), and probably has to do
with both the L3 cache usage and how to optimally interlace the threads'
activity to minimize contention. This is much easier to optimally
configure, so let's do this by default now.

(cherry picked from commit 060a7612487c244175fa1dc1e5b224015cbcf503)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: task: give the scheduler a bit more flexibility in the runqueue size

Instead of setting a hard-limit on runqueue-depth and keeping it short
to maintain fairness, let's allow the scheduler to automatically cut
the existing one in two equal halves if its size is between the configured
size and its double. This will allow to increase the default value while
keeping a low latency.

(cherry picked from commit 1691ba3693d4b11cc85aed606d39ffbd23d8533f)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: task: remove the tasks_run_queue counter and have one per thread

This counter is solely used for reporting in the stats and is the hottest
thread contention point to date. Moving it to the scheduler and having a
separate one for the global run queue dramatically improves the performance,
showing a 12% boost on the request rate on 16 threads!

In addition, the thread debugging output which used to rely on rqueue_size
was not totally accurate as it would only report task counts. Now we can
return the exact thread's run queue length.

It is also interesting to note that there are still a few other task/tasklet
counters in the scheduler that are not efficiently updated because some cover
a single area and others cover multiple areas. It looks like having a distinct
counter for each of the following entries would help and would keep the code
a bit cleaner:
  - global run queue (tree)
  - per-thread run queue (tree)
  - per-thread shared tasklets list
  - per-thread local lists

Maybe even splitting the shared tasklets lists between pure tasklets and
tasks instead of having the whole and tasks would simplify the code because
there remain a number of places where several counters have to be updated.

(cherry picked from commit 9c7b8085f4cad284642e7f67d2274f2fb568f243)
[wt: backported as it does have a significant impact on many-core machines;
with 48 threads, the request rate is simply doubled and the response time
halved; There were significant context updates, all verified to be OK; it
does not seem reasonable to backport it to 2.2]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: ssl: implement xprt_set_used and xprt_set_idle to relax context checks

Currently the SSL layer checks the validity of its tasklet's context just
in case it would have been stolen, had the connection been idle. Now it
will be able to be notified by the mux when this situation happens so as
not to have to grab the idle connection lock on each pass. This reuses the
TASK_F_USR1 flag just as the muxes do.

(cherry picked from commit 4149168255b1dc252aece0e5bbab6f55310f1ad5)
[wt: this is the second part of the takeover lock reduction, as the SSL
layer was also hammering this lock and slowing everyone down, especially
during initial handshakes. The changes are the same as for the muxes]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: xprt: add new xprt_set_idle and xprt_set_used methods

These functions are used on the mux layer to indicate that the connection
is becoming idle and that the xprt ought to be careful before checking the
context or that it's not idle anymore and that the context is safe. The
purpose is to allow a mux which is going to release a connection to tell
the xprt to be careful when touching it. At the moment, the xprt are
always careful and that's costly so we want to have the ability to relax
this a bit.

No xprt layer uses this yet.

(cherry picked from commit 4f8cd4397f07389129dd23b07e94018404340731)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: muxes: mark idle conns tasklets with TASK_F_USR1

The muxes are touching the idle_conns_lock all the time now because
they need to be careful that no other thread has stolen their tasklet's
context.

This patch changes this a little bit by setting the TASK_F_USR1 flag on
the tasklet before marking a connection idle, and removing it once it's
not idle anymore. Thanks to this we have the guarantee that a tasklet
without this flag cannot be present in an idle list and does not need
to go through this costly lock. This is especially true for front
connections.

(cherry picked from commit e388f2fbca40197590bd15dce0f4eb4d6cded20a)
[wt: backported as really needed to address the high contention issues
in multi-threaded environments: all I/O tasklets queue up on the
takeover lock as soon as there's some activity on the reuse part,
sometimes causing "reuse always" to be slower than "reuse never"!
The context differs quite a bit due to the changes in tasks and idle
conns in 2.4, but the main principle is to bypass the lock when
TASK_F_USR1 is not set. ]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: task: add an application specific flag to the state: TASK_F_USR1

This flag will be usable by any application. It will be preserved across
wakeups so the application can use it to do various stuff. Some I/O
handlers will soon benefit from this.

(cherry picked from commit 6fa8bcdc785c29cd3a29c14d2038f38b5547eff0)
[wt: needed for upcoming fixes. task->state is 16-bit in 2.3 and before
so we've reused bit 15 which was still available there. Ctx adjusments
since no TASK_F_TASKLET here]
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: ssl: properly remove the TASK_HEAVY flag at end of handshake

Emeric found that SSL+keepalive traffic had dropped quite a bit in the
recent changes, which could be bisected to recent commit 9205ab31d
("MINOR: ssl: mark the SSL handshake tasklet as heavy"). Indeed, a
first incarnation of this commit made use of the TASK_SELF_WAKING
flag but the last version directly used TASK_HEAVY, but it would still
continue to remove the already absent TASK_SELF_WAKING one instead of
TASK_HEAVY. As such, the SSL traffic remained processed with low
granularity.

No backport is needed as this is only 2.4.

(cherry picked from commit 4c48edba4f45bb78f41af7d79d3c176710fe6a90)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: ssl: mark the SSL handshake tasklet as heavy

There's a fairness issue between SSL and clear text. A full end-to-end
cleartext connection can require up to ~7.7 wakeups on average, plus 3.3
for the SSL tasklet, one of which is particularly expensive. So if we
accept to process many handshakes taking 1ms each, we significantly
increase the processing time of regular tasks just by adding an extra
delay between their calls. Ideally in order to be fair we should have a
1:18 call ratio, but this requires a bit more accounting. With very little
effort we can mark the SSL handshake tasklet as TASK_HEAVY until the
handshake completes, and remove it once done.

Doing so reduces from 14 to 3.0 ms the total response time experienced
by HTTP clients running in parallel to 1000 SSL clients doing full
handshakes in loops. Better, when tune.sched.low-latency is set to "on",
the latency further drops to 1.8 ms.

The tasks latency distribution explain pretty well what is happening:

Without the patch:
  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    ssl_sock_io_cb              2785375   19.35m    416.9us   5.401h    6.980ms
    h1_io_cb                    1868949   9.853s    5.271us   4.829h    9.302ms
    process_stream              1864066   7.582s    4.067us   2.058h    3.974ms
    si_cs_io_cb                 1733808   1.932s    1.114us   26.83m    928.5us
    h1_timeout_task              935760      -         -      1.033h    3.975ms
    accept_queue_process         303606   4.627s    15.24us   16.65m    3.291ms
    srv_cleanup_toremove_connections452   64.31ms   142.3us   2.447s    5.415ms
    task_run_applet                  47   5.149ms   109.6us   57.09ms   1.215ms
    srv_cleanup_idle_connections     34   2.210ms   65.00us   87.49ms   2.573ms

With the patch:
  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    ssl_sock_io_cb              3000365   21.08m    421.6us   20.30h    24.36ms
    h1_io_cb                    2031932   9.278s    4.565us   46.70m    1.379ms
    process_stream              2010682   7.391s    3.675us   22.83m    681.2us
    si_cs_io_cb                 1702070   1.571s    922.0ns   8.732m    307.8us
    h1_timeout_task             1009594      -         -      17.63m    1.048ms
    accept_queue_process         339595   4.792s    14.11us   3.714m    656.2us
    srv_cleanup_toremove_connections779   75.42ms   96.81us   438.3ms   562.6us
    srv_cleanup_idle_connections     48   2.498ms   52.05us   178.1us   3.709us
    task_run_applet                  17   1.738ms   102.3us   11.29ms   663.9us
    other                             1   947.8us   947.8us   202.6us   202.6us

  => h1_io_cb() and process_stream() are divided by 6 while ssl_sock_io_cb() is
     multipled by 4

And with low-latency on:
  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    ssl_sock_io_cb              3000565   20.96m    419.1us   20.74h    24.89ms
    h1_io_cb                    2019702   9.294s    4.601us   49.22m    1.462ms
    process_stream              2009755   6.570s    3.269us   1.493m    44.57us
    si_cs_io_cb                 1997820   1.566s    783.0ns   2.985m    89.66us
    h1_timeout_task             1009742      -         -      1.647m    97.86us
    accept_queue_process         494509   4.697s    9.498us   1.240m    150.4us
    srv_cleanup_toremove_connections1120   92.32ms   82.43us   463.0ms   413.4us
    srv_cleanup_idle_connections     70   2.703ms   38.61us   204.5us   2.921us
    task_run_applet                  13   1.303ms   100.3us   85.12us   6.548us

  => process_stream() is divided by 100 while ssl_sock_io_cb() is
     multipled by 4

Interestingly, the total HTTPS response time doesn't increase and even very
slightly decreases, with an overall ~1% higher request rate. The net effect
here is a redistribution of the CPU resources between internal tasks, and
in the case of SSL, handshakes wait bit more but everything after completes
faster.

This was made simple enough to be backportable if it helps some users
suffering from high latencies in mixed traffic.

(cherry picked from commit 9205ab31d2383f33612e27d8ce74ab62a27695fc)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: task: limit the number of subsequent heavy tasks with flag TASK_HEAVY

While the scheduler is priority-aware and class-aware, and consistently
tries to maintain fairness between all classes, it doesn't make use of a
fine execution budget to compensate for high-latency tasks such as TLS
handshakes. This can result in many subsequent calls adding multiple
milliseconds of latency between the various steps of other tasklets that
don't even depend on this.

An ideal solution would be to add a 4th queue, have all tasks announce
their estimated cost upfront and let the scheduler maintain an auto-
refilling budget to pick from the most suitable queue.

But it turns out that a very simplified version of this already provides
impressive gains with very tiny changes and could easily be backported.
The principle is to reserve a new task flag "TASK_HEAVY" that indicates
that a task is expected to take a lot of time without yielding (e.g. an
SSL handshake typically takes 700 microseconds of crypto computation).
When the scheduler sees this flag when queuing a tasklet, it will place
it into the bulk queue. And during dequeuing, we accept only one of
these in a full round. This means that the first one will be accepted,
will not prevent other lower priority tasks from running, but if a new
one arrives, then the queue stops here and goes back to the polling.
This will allow to collect more important updates for other tasks that
will be batched before the next call of a heavy task.

Preliminary tests consisting in placing this flag on the SSL handshake
tasklet show that response times under SSL stress fell from 14 ms
before the patch to 3.0 ms with the patch, and even 1.8 ms if
tune.sched.low-latency is set to "on".

(cherry picked from commit 74dea8caeadfaf3cac262fd4f2c532788c396fb5)
[wt: tasklet_wakeup_on() is in task.h in 2.3]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: backend: use a trylock when trying to grab an idle connection

In conn_backend_get() we can cause some extreme contention due to the
idle_conns_lock. Indeed, even though it's per-thread, it still causes
high contention when running with many threads. The reason is that all
threads which do not have any idle connections are quickly skipped,
till the point where there are still some, so the first reaching that
point will grab the lock and the other ones wait behind. From this
point, all threads are synchronized waiting on the same lock, and
will follow the leader in small jumps, all hindering each other.

Here instead of doing this we're using a trylock. This way when a thread
is already checking a list, other ones will continue to next thread. In
the worst case, a high contention will lead to a few new connections to
be set up, but this may actually be what is required to avoid contention
in the first place. With this change, the contention has mostly
disappeared on this lock (it's still present in muxes and transport
layers due to the takeover).

Surprisingly, checking for emptiness of the tree root before taking
the lock didn't address any contention.

A few improvements are still possible and desirable here. The first
one would be to avoid seeing all threads jump to the next one. We
could have each thread use a different prime number as the increment
so as to spread them across the entire table instead of keeping them
synchronized. The second one is that the lock in the muck layers
shouldn't be needed to check for the tasklet's context availability.

(cherry picked from commit b1adf03df970958e82ad33efadba3c2586ffc02f)
[wt: this is a major cause of contention in highly threaded environments,
hence the backport. Minor ctx adjustments (list vs tree) and lock name]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: pools: double the local pool cache size to 1 MB

The reason is that H2 can already require 32 16kB buffers for the mux
output at once, which will deplete the local cache. Thus it makes sense
to go further to leave some time to other connection to release theirs.
In addition, the L2 cache on modern CPUs is already 1 MB, so this change
is welcome in any case.

(cherry picked from commit f587003fe916315e77e130339e21fc8d3a4a7248)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: pools: add CONFIG_HAP_NO_GLOBAL_POOLS and CONFIG_HAP_GLOBAL_POOLS

We've reached a point where the global pools represent a significant
bottleneck with threads. On a 64-core machine, the performance was
divided by 8 between 32 and 64 H2 connections only because there were
not enough entries in the local caches to avoid picking from the global
pools, and the contention on the list there was very high. It becomes
obvious that we need to have an array of lists, but that will require
more changes.

In parallel, standard memory allocators have improved, with tcmalloc
and jemalloc finding their ways through mainstream systems, and glibc
having upgraded to a thread-aware ptmalloc variant, keeping this level
of contention here isn't justified anymore when we have both the local
per-thread pool caches and a fast process-wide allocator.

For these reasons, this patch introduces a new compile time setting
CONFIG_HAP_NO_GLOBAL_POOLS which is set by default when threads are
enabled with thread local pool caches, and we know we have a fast
thread-aware memory allocator (currently set for glibc>=2.26). In this
case we entirely bypass the global pool and directly use the standard
memory allocator when missing objects from the local pools. It is also
possible to force it at compile time when a good allocator is used with
another setup.

It is still possible to re-enable the global pools using
CONFIG_HAP_GLOBAL_POOLS, if a corner case is discovered regarding the
operating system's default allocator, or when building with a recent
libc but a different allocator which provides other benefits but does
not scale well with threads.

(cherry picked from commit 0bae075928250ba036cb1d96485a6e72bdb6283c)
[wt: backported to 2.3 since it addresses some serious contention in
h2 where streams are allocated in bursts]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: streams: do not use the streams lock anymore

The lock was still used exclusively to deal with the concurrency between
the "show sess" release handler and a stream_new() or stream_free() on
another thread. All other accesses made by "show sess" are already done
under thread isolation. The release handler only requires to unlink its
node when stopping in the middle of a dump (error, timeout etc). Let's
just isolate the thread to deal with this case so that it's compatible
with the dump conditions, and remove all remaining locking on the streams.

This effectively kills the streams lock. The measured gain here is around
1.6% with 4 threads (374krps -> 380k).

(cherry picked from commit 49de68520e9dae3817885397373151aee4901c1b)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: streams: use one list per stream instead of a global one

The global streams list is exclusively used for "show sess", to look up
a stream to shut down, and for the hard-stop. Having all of them in a
single list is extremely expensive in terms of locking when using threads,
with performance losses as high as 7% having been observed just due to
this.

This patch makes the list per-thread, since there's no need to have a
global one in this situation. All call places just iterate over all
threads. The most "invasive" changes was in "show sess" where the end
of list needs to go back to the beginning of next thread's list until
the last thread is seen. For now the lock was maintained to keep the
code auditable but a next commit should get rid of it.

The observed performance gain here with only 4 threads is already 7%
(350krps -> 374krps).

(cherry picked from commit a698eb6739b502d957da773e121ef40833f818dc)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: cli/streams: make "show sess" dump all streams till the new epoch

Instead of placing the current stream at the end of the stream list when
issuing a "show sess" on the CLI as was done in 2.2 with commit c6e7a1b8e
("MINOR: cli: make "show sess" stop at the last known session"), now we
compare the listed stream's epoch with the dumping stream's and stop on
more recent ones.

This way we're certain to always only dump known streams at the moment we
issue the dump command without having to modify the list. In theory we
could miss some streams if more than 2^31 "show sess" requests are issued
while an old stream remains present, but that's 68 years at 1 "show sess"
per second and it's unlikely we'll keep a process, let alone a stream, that
long.

It could be verified that the count of dumped streams still matches the
one before this change.

(cherry picked from commit 5d533e2bad668f10e5b00c24b7593847149a6be2)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: stream: add an "epoch" to figure which streams appeared when

The "show sess" CLI command currently lists all streams and needs to
stop at a given position to avoid dumping forever. Since 2.2 with
commit c6e7a1b8e ("MINOR: cli: make "show sess" stop at the last known
session"), a hack consists in unlinking the stream running the applet
and linking it again at the current end of the list, in order to serve
as a delimiter. But this forces the stream list to be global, which
affects scalability.

This patch introduces an epoch, which is a global 32-bit counter that
is incremented by the "show sess" command, and which is copied by newly
created streams. This way any stream can know whether any other one is
newer or older than itself.

For now it's only stored and not exploited.

(cherry picked from commit b981318c11ce8c1969b931b01d51be38dbbaa4a7)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: dynbuf: pass offer_buffers() the number of buffers instead of a threshold

Historically this function would try to wake the most accurate number of
process_stream() waiters. But since the introduction of filters which could
also require buffers (e.g. for compression), things started not to be as
accurate anymore. Nowadays muxes and transport layers also use buffers, so
the runqueue size has nothing to do anymore with the number of supposed
users to come.

In addition to this, the threshold was compared to the number of free buffer
calculated as allocated minus used, but this didn't work anymore with local
pools since these counts are not updated upon alloc/free!

Let's clean this up and pass the number of released buffers instead, and
consider that each waiter successfully called counts as one buffer. This
is not rocket science and will not suddenly fix everything, but at least
it cannot be as wrong as it is today.

This could have been marked as a bug given that the current situation is
totally broken regarding this, but this probably doesn't completely fix
it, it only goes in a better direction. It is possible however that it
makes sense in the future to backport this as part of a larger series if
the situation significantly improves.

(cherry picked from commit 4d77bbf8560b5d5b32409be131e4975811bfec28)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: dynbuf: use regular lists instead of mt_lists for buffer_wait

There's no point anymore in keeping mt_lists for the buffer_wait and
buffer_wq since it's thread-local now.

(cherry picked from commit 90f366b59556f9e64c927286bfc8e9c06d511744)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: dynbuf: make the buffer wait queue per thread

The buffer wait queue used to be global historically but this doest not
make any sense anymore given that the most common use case is to have
thread-local pools. Thus there's no point waking up waiters of other
threads after releasing an entry, as they won't benefit from it.

Let's move the queue head to the thread_info structure and use
ti->buffer_wq from now on.

(cherry picked from commit e8e5091510f667feeb36f68b8d2f4d9e8cf00dc0)
Signed-off-by: Willy Tarreau <w@1wt.eu>

OPTIM: lb-leastconn: do not unlink the server if it did not change

Due to the two-phase server reservation, there are 3 calls to
fwlc_srv_reposition() per request, one during assign_server() to reserve
the slot, one in connect_server() to commit it, and one in process_stream()
to release it. However only one of the first two will change the key, so
it's needlessly costly to take the lock, remove a server and insert it
again at the same place when we can already figure we ought not to even
have taken the lock.

Furthermore, even when the server needs to move, there can be quite some
contention on the lbprm lock forcing the thread to wait. During this time
the served and nbpend server values might have changed, just like the
lb_node.key itself. Thus we measure the values again under the lock
before updating the tree. Measurements have shown that under contention
with 16 servers and 16 threads, 50% of the updates can be avoided there.

This patch makes the function compute the new key and compare it to
the current one before deciding to move the entry (and does it again
under the lock forthe second test).

This removes between 40 and 50% of the tree updates depending on the
thread contention and the number of servers. The performance gain due
to this (on 16 threads) was:

16 servers: 415 krps -> 440 krps (6%, contention on lbprm)
4 servers: 554 krps -> 714 krps (+29%, little contention)

One point worth thinking about is that it's not logic to update the
tree 2-3 times per request while it's only read once. half to 2/3 of
these updates are not needed. An experiment consisting in subscribing
the server to a list and letting the readers reinsert them on the fly
showed further degradation instead of an improvement.

A better approach would probably consist in avoinding writes to shared
cache lines by having the leastconn nodes distinct from the servers,
with one node per value, and having them hold an mt-list with all the
servers having that number of connections. The connection count tree
would then be read-mostly instead of facing heavy writes, and most
write operations would be performed on 1-3 list heads which are way
cheaper to migrate than a tree node, and do not require updating the
last two updated neighbors' cache lines.

(cherry picked from commit 5064ab6a982a82c275393220716b75c13789de89)
Signed-off-by: Willy Tarreau <w@1wt.eu>

OPTIM: lb-leastconn: do not take the server lock on take_conn/drop_conn

The operations are only an insert and a delete into the LB tree, which
doesn't require the server's lock at all as the lbprm lock is already
held. Let's drop it. Just for the sake of cleanness, given that the
served and nbpend values used to be atomically updated, we'll use an
atomic load to read them.

(cherry picked from commit 85b2fb03580f60d20d20ecbda8c9b63804d8e900)
Signed-off-by: Willy Tarreau <w@1wt.eu>

OPTIM: lb-first: do not take the server lock on take_conn/drop_conn

The operations are only an insert and a delete into the LB tree, which
doesn't require the server's lock at all as the lbprm lock is already
held. Let's drop it.

(cherry picked from commit 6b96e0e9d204bb7e30222964d641cc28a0e4d0c0)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: lb/api: let callers of take_conn/drop_conn tell if they have the lock

The two algos defining these functions (first and leastconn) do not need the
server's lock. However it's already present in pendconn_process_next_strm()
so the API must be updated so that the functions may take it if needed and
that the callers indicate whether they already own it.

As such, the call places (backend.c and stream.c) now do not take it
anymore, queue.c was unchanged since it's already held, and both "first"
and "leastconn" were updated to take it if not already held.

A quick test on the "first" algo showed a jump from 432 to 565k rps by
just dropping the lock in stream.c!

(cherry picked from commit 59b0fecfd9f2fe96f6a9c2d1d4fbe34f71adb6e3)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: server: move actconns to the per-thread structure

The actconns list creates massive contention on low server counts because
it's in fact a list of streams using a server, all threads compete on the
list's head and it's still possible to see some watchdog panics on 48
threads under extreme contention with 47 threads trying to add and one
thread trying to delete.

Moving this list per thread is trivial because it's only used by
srv_shutdown_streams(), which simply required to iterate over the list.

The field was renamed to "streams" as it's really a list of streams
rather than a list of connections.

(cherry picked from commit d4e78d873cecf9938885c90e736f8b761a35fb55)
[wt: for 2.3 and older, this is a simplified version. We don't want to
touch all the server initialization code and sequence with the risks
of causing new trouble, so instead we declare actconns as an array
of the maximum number of supported threads, this will eat a bit more
memory on smaller machines but will remain safe. Those building with
MAX_THREADS close or equal to their target number of threads may even
save some RAM compared to 2.4. The only servers not initialized via
new_server() are the two Lua socket servers, and they've been handled
here]
Signed-off-by: Willy Tarreau <w@1wt.eu>

OPTIM: server: switch the actconn list to an mt-list

The remaining contention on the server lock solely comes from
sess_change_server() which takes the lock to add and remove a
stream from the server's actconn list. This is both expensive
and pointless since we have mt-lists, and this list is only
used by the CLI's "shutdown server sessions" command!

Let's migrate to an mt-list and remove the need for this costly
lock. By doing so, the request rate increased by ~1.8%.

(cherry picked from commit 751153e0f119bec90455cda95166f1b29d8b0326)
[wt: the contention is extreme on high threads counts, thus this
actconn series is backported; ctx adjustments]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: listener: refine the default MAX_ACCEPT from 64 to 4

The maximum number of connections accepted at once by a thread for a single
listener used to default to 64 divided by the number of processes but the
tasklet-based model is much more scalable and benefits from smaller values.
Experimentation has shown that 4 gives the highest accept rate for all
thread values, and that 3 and 5 come very close, as shown below (HTTP/1
connections forwarded per second at multi-accept 4 and 64):

ac\thr|    1     2    4     8     16
------+------------------------------
      4|   80k  106k  168k  270k  336k
     64|   63k   89k  145k  230k  274k

Some tests were also conducted on SSL and absolutely no change was observed.

The value was placed into a define because it used to be spread all over the
code.

It might be useful at some point to backport this to 2.3 and 2.2 to help
those who observed some performance regressions from 1.6.

(cherry picked from commit 66161326fd120652557d419810b1323405d7859d)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: tasks: refine the default run queue depth

Since a lot of internal callbacks were turned to tasklets, the runqueue
depth had not been readjusted from the default 200 which was initially
used to favor batched processing. But nowadays it appears too large
already based on the following tests conducted on a 8c16t machine with
a simple config involving "balance leastconn" and one server. The setup
always involved the two threads of a same CPU core except for 1 thread,
and the client was running over 1000 concurrent H1 connections. The
number of requests per second is reported for each (runqueue-depth,
nbthread) couple:

rq\thr|    1     2     4     8    16
------+------------------------------
     32|  120k  159k  276k  477k  698k
     40|  122k  160k  276k  478k  722k
     48|  121k  159k  274k  482k  720k
     64|  121k  160k  274k  469k  710k
    200|  114k  150k  247k  415k  613k  <-- default

It's possible to save up to about 18% performance by lowering the
default value to 40. One possible explanation to this is that checking
I/Os more frequently allows to flush buffers faster and to smooth the
I/O wait time over multiple operations instead of alternating phases
of processing, waiting for locks and waiting for new I/Os.

The total round trip time also fell from 1.62ms to 1.40ms on average,
among which at least 0.5ms is attributed to the testing tools since
this is the minimum attainable on the loopback.

After some observation it would be nice to backport this to 2.3 and
2.2 which observe similar improvements, since some users have already
observed some perf regressions between 1.6 and 2.2.

(cherry picked from commit 4327d0ac00fd29622e8558a206dfbf23eaeb9fe9)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: session: NULL dereference possible when accessing the listener

When implementing a client applet, a NULL dereference was encountered on
the error path which increment the counters.

Indeed, the counters incremented are the one in the listener which does
not exist in the case of client applets, so in sess->listener->counters,
listener is NULL.

This patch fixes the access to the listener structure when accessing
from a sesssion, most of the access are the counters in error paths.

Must be backported as far as 1.8.

(cherry picked from commit 36119de182154b1f87e0cdf4bd1efba9e2e64113)
[wt: minor ctx adjustments in http_ana and mux_h1]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: atomic: implement a more efficient arm64 __ha_cas_dw() using pairs

There finally is a way to support register pairs on aarch64 assembly
under gcc, it's just undocumented, like many of the options there :-(
As indicated below, it's possible to pass "%H" to mention the high
part of a register pair (e.g. "%H0" to go with "%0"):

https://patchwork.ozlabs.org/project/gcc/patch/59368A74.2060908@foss.arm.com/

By making local variables from pairs of registers via a struct (as
is used in IST for example), we can let gcc choose the correct register
pairs and avoid a few moves in certain situations. The code is now slightly
more efficient than the previous one on AWS' Graviton2 platform, and
noticeably smaller (by 4.5kB approx). A few tests on older releases show
that even Linaro's gcc-4.7 used to support such register pairs and %H,
and by then ATOMICS were not supported so this should not cause build
issues, and as such this patch replaces the earlier implementation.

(cherry picked from commit f8fb4f75f147053fc939ab9bcc456c0e4e5b28a9)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: atomic: add armv8.1-a atomics variant for cas-dw

This variant uses the CASP instruction available on armv8.1-a CPU cores,
which is detected when __ARM_FEATURE_ATOMICS is set (gcc-linaro >= 7,
mainline >= 9). This one was tested on cortex-A55 (S905D3) and on AWS'
Graviton2 CPUs.

The instruction performs way better on high thread counts since it
guarantees some forward progress when facing extreme contention while
the original LL/SC approach is light on low-thread counts but doesn't
guarantee progress.

The implementation is not the most optimal possible. In particular since
the instruction requires to work on register pairs and there doesn't seem
to be a way to force gcc to emit register pairs, we have to decide to force
to use the pair (x0,x1) to store the old value, and (x2,x3) to store the
new one, and this necessarily involves some extra moves. But at least it
does improve the situation with 16 threads and more. See issue #958 for
more context.

Note, a first implementation of this function was making use of an
input/output constraint passed using "+Q"(*(void**)target), which was
resulting in smaller overall code than passing "target" as an input
register only. It turned out that the cause was directly related to
whether the function was inlined or not, hence the "forceinline"
attribute. Any changes to this code should still pay attention to this
important factor.

(cherry picked from commit 46cca8690029dcdd13c86894aefdf2ef2d6f6b07)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: mt-list: always perform a cpu_relax call on failure

On highly threaded machines it is possible to occasionally trigger the
watchdog on certain contended areas like the server's connection list,
because while the mechanism inherently cannot guarantee a constant
progress, it lacks CPU relax calls which are absolutely necessary in
this situation to let a thread finish its job.

The loop's "while (1)" was changed to use a "for" statement calling
__ha_cpu_relax() as its continuation expression. This way the "continue"
statements jump to the unique place containing the pause without
excessively inflating the code.

This was sufficient to definitely fix the problem on 64-core ARM Graviton2
machines. This patch should probably be backported once it's confirmed it
also helps on many-cores x86 machines since some people are facing
contention in these environments. This patch depends on previous commit
"REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h".

An attempt was made to first read the value before exchanging, and it
significantly degraded the performance. It's very likely that this caused
other cores to lose exclusive ownership on their line and slow down their
next xchg operation.

In addition it was found that MT_LIST_ADD is significantly faster than
MT_LIST_ADDQ under high contention, because it fails one step earlier
when conflicting with an adjacent MT_LIST_DEL(). It might be worth
switching some operations' order to favor MT_LIST_ADDQ() instead.

(cherry picked from commit 168fc5332c7b3f43c8841a999fc40a3acef85223)
Signed-off-by: Willy Tarreau <w@1wt.eu>

REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h

There is some confusion here as we need to place some cpu_relax statements
in some loops where it's not easily possible to condition them on the use
of threads. That's what atomic.h already does. So let's take the various
pl_cpu_relax() implementations from there and place them in atomic.h under
the name __ha_cpu_relax() and let them adapt to the presence or absence of
threads and to the architecture (currently only x86 and aarch64 use a barrier
instruction), though it's very likely that arm would work well with a cache
flushing ISB instruction as well).

This time they were implemented as expressions returning 1 rather than
statements, in order to ease their placement as the loop condition or the
continuation expression inside "for" loops. We should probably do the same
with barriers and a few such other ones.

(cherry picked from commit 958ae26c3558f0a5cdcb7a92cc535f1cd1ac9a64)
[wt: will be used by later fixes]
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: ssl: don't truncate the file descriptor to 16 bits in debug mode

Errors reported by ssl_sock_dump_errors() to stderr would only report the
16 lower bits of the file descriptor because it used to be casted to ushort.
This can be backported to all versions but has really no importance in
practice since this is never seen.

(cherry picked from commit 566cebc1fc4f9908a47a1924c80ff32460543a49)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: hlua: Don't strip last non-LWS char in hlua_pushstrippedstring()

hlua_pushstrippedstring() function strips leading and trailing LWS
characters. But the result length it too short by 1 byte. Thus the last
non-LWS character is stripped. Note that a string containing only LWS
characters resulting to a stipped string with an invalid length (-1). This
leads to a lua runtime error.

This bug was reported in the issue #1155. It must be backported as far as
1.7.

(cherry picked from commit 2ec4e3c1acf95bcdc56028bbefe1a355c457b978)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: backend: fix condition for reuse on mode HTTP

This commit is a fix/complement to the following one :
08d87b3f49867440f66aee09173c84bf58cbc859
BUG/MEDIUM: backend: never reuse a connection for tcp mode

It fixes the check for the early insertion of backend connections in
the reuse lists if the backend mode is HTTP.

The impact of this bug seems limited because :
- in tcp mode, no insertion is done in the avail list as mux_pt does not
  support multiple streams.
- in http mode, muxes are also responsible to insert backend connections
  in lists in their detach functions. Prior to this fix the reuse rate
  could be slightly inferior.

It can be backported to 2.3.

(cherry picked from commit 249f0562cf2654488100d83d66c73902a2b1eb6c)
Signed-off-by: Willy Tarreau <w@1wt.eu>

[RELEASE] Released version 2.3.6

Released version 2.3.6 with the following main changes :
    - MINOR: check: do not ignore a connection header for http-check send
    - BUILD: ssl: fix typo in HAVE_SSL_CTX_ADD_SERVER_CUSTOM_EXT macro
    - BUILD: ssl: guard SSL_CTX_add_server_custom_ext with special macro
    - BUILD: ssl: guard SSL_CTX_set_msg_callback with SSL_CTRL_SET_MSG_CALLBACK macro
    - BUG/MINOR: intops: fix mul32hi()'s off-by-one
    - BUG/MINOR: http-ana: Don't increment HTTP error counter on internal errors
    - BUG/MEDIUM: mux-h1: Always set CS_FL_EOI for response in MSG_DONE state
    - BUG/MINOR: server: re-align state file fields number
    - BUG/MINOR: tools: Fix a memory leak on error path in parse_dotted_uints()
    - BUG/MINOR: backend: hold correctly lock when killing idle conn
    - BUG/MINOR: server: Fix server-state-file-name directive
    - CLEANUP: deinit: release global and per-proxy server-state variables on deinit
    - BUG/MEDIUM: config: don't pick unset values from last defaults section
    - BUG/MINOR: stats: revert the change on ST_CONVDONE
    - BUG/MINOR: cfgparse: do not mention "addr:port" as supported on proxy lines
    - BUG/MINOR: server: Don't call fopen() with server-state filepath set to NULL
    - DOC: tune: explain the origin of block size for ssl.cachesize
    - CLEANUP: channel: fix comment in ci_putblk.
    - BUG/MINOR: server: Remove RMAINT from admin state when loading server state
    - BUG/MINOR: session: atomically increment the tracked sessions counter
    - BUG/MINOR: checks: properly handle wrapping time in __health_adjust()
    - BUG/MEDIUM: checks: don't needlessly take the server lock in health_adjust()
    - BUG/MINOR: sample: Always consider zero size string samples as unsafe
    - BUILD: ssl: introduce fine guard for OpenSSL specific SCTL functions
    - DOC: explain the relation between pool-low-conn and tune.idle-pool.shared
    - BUG/MEDIUM: lists: Avoid an infinite loop in MT_LIST_TRY_ADDQ().
    - BUG/MEDIUM: spoe: Resolve the sink if a SPOE logs in a ring buffer
    - BUG/MINOR: http-rules: Always replace the response status on a return action
    - BUG/MINOR: server: Init params before parsing a new server-state line
    - BUG/MINOR: server: Be sure to cut the last parsed field of a server-state line
    - BUG/MEDIUM: mux-h1: Fix handling of responses to CONNECT other than 200-ok
    - BUG/MINOR: ssl/cli: potential null pointer dereference in "set ssl cert"
    - MINOR: Configure the `cpp` userdiff driver for *.[ch] in .gitattributes
    - BUG/MINOR: sample: secure convs that accept base64 string and var name as args
    - BUG/MEDIUM: vars: make functions vars_get_by_{name,desc} thread-safe
    - BUG/MEDIUM: proxy: use thread-safe stream killing on hard-stop
    - BUG/MEDIUM: cli/shutdown sessions: make it thread-safe
    - BUG/MINOR: proxy: wake up all threads when sending the hard-stop signal
    - BUG/MINOR: fd: properly wait for !running_mask in fd_set_running_excl()
    - BUG/MINOR: resolvers: Fix condition to release received ARs if not assigned
    - BUG/MINOR: resolvers: Only renew TTL for SRV records with an additional record
    - BUG/MINOR: resolvers: new callback to properly handle SRV record errors
    - BUG/MEDIUM: resolvers: Reset server address and port for obselete SRV records
    - BUG/MEDIUM: resolvers: Reset address for unresolved servers
    - BUG/MINOR: ssl: potential null pointer dereference in ckchs_dup()
    - CLEANUP: muxes: Remove useless if condition in show_fd function
    - BUG/MINOR: stats: fix compare of no-maint url suffix
    - BUG/MINOR: mux-h1: Immediately report H1C errors from h1_snd_buf()
    - BUG/MINOR: http-ana: Only consider dst address to process originalto option
    - BUG/MINOR: tcp-act: Don't forget to set the original port for IPv4 set-dst rule
    - BUG/MINOR: connection: Use the client's dst family for adressless servers
    - BUG/MEDIUM: spoe: Kill applets if there are pending connections and nbthread > 1
    - DOC: spoe: Add a note about fragmentation support in HAProxy
    - BUG/MINOR: mux-h2: Fix typo in scheme adjustment
    - BUG/MINOR: http-ana: Don't increment HTTP error counter on read error/timeout

BUG/MINOR: http-ana: Don't increment HTTP error counter on read error/timeout

This should have been fixed when the commit "BUG/MINOR: http-ana: Don't
increment HTTP error counter on internal errors" was backported but I forgot
to do so. The HTTP error counter must not be incremented if a read error or
a read timeout is encountered. Parsing error are already reported by the
mux.

This patch must be backported as far as 2.0, on the HTX part only.

BUG/MINOR: mux-h2: Fix typo in scheme adjustment

That comma should've been a semicolon. Fortunately, as it is now there
is no impact thanks to operators precedence, and all expressions are
properly evaluated. But this is troubling and the risk is high to
turn it into an effective bug with a minor change.

Introduced in b8ce8905cf63ecd06b36af39c05103fadf3cc347 which first
appeared in 2.1-dev3. This fix must be backported to 2.1+.

(cherry picked from commit a3298023b04923ba12429d79c559dc7a850ae122)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

DOC: spoe: Add a note about fragmentation support in HAProxy

Add a note in SPOE.txt to make it clear that HAPRoxy does not support the
fragmentation. It can send fragmented frames if an agent supports it but it
cannot receives and handles fragmented frames.

This patch should fix the issue #659. It may be backported as far as 1.8.

(cherry picked from commit 9536ad707f390c8026777ec8e074a5edaa2f7c7a)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: spoe: Kill applets if there are pending connections and nbthread > 1

When the processing stage is finished for a SPOE applet, before returning it
into the idle list, we check if the assigned server appears as full or if
there are some pending connections on the backend or the assigned server. If
yes, it means we reach a maxconn and we close the applet to free a
slot. Otherwise, the applet can be reused. This test is only performed if
there are more than one thread.

It is important to close SPOE applets when there are pending connections for
multithreaded instances because connections with the SPOE agents are
persistent and local to a thread (applets are local to a thread). If a
maxconn is configured, some threads may take all available slots for a
while, leaving remaining threads without any free slot to process SPOE
messages. It is especially true if the maxconn is low.

This patch should fix the issue #705. It must be backported as far as
1.8. However, the code in 1.8 is quite different, a test must be performed
to be sure it works well.

(cherry picked from commit 9e647e5af77faa6a95dd511d78ed8763781e764a)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: connection: Use the client's dst family for adressless servers

When the selected server has no address, the destination address of the
client is used. However, for now, only the address is set, not the
family. Thus depending on how the server is configured and the client's
destination address, the server address family may be wrong.

For instance, with such server :

server srv 0.0.0.0:0

The server address family is AF_INET. The server connection will fail if a
client is asking for an IPv6 destination.

To fix the bug, we take care to set the rigth family, the family of the
client destination address.

This patch should fix the issue #202. It must be backported to all stable
versions.

(cherry picked from commit ae3056157c17b181c754556836a5d8336fb9823f)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: tcp-act: Don't forget to set the original port for IPv4 set-dst rule

If an IPv4 is set via a TCP/HTTP set-dst rule, the original port must be
preserved or set to 0 if the previous family was neither AF_INET nor
AF_INET6. The first case is not an issue because the port remains the
same. But if the previous family was, for instance, AF_UNIX, the port is not
set to 0 and have an undefined value.

This patch must be backported as far as 1.7.

(cherry picked from commit e01ca0fbc9c72de95514816e016a58c5a28ab2a8)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: http-ana: Only consider dst address to process originalto option

When an except parameter is used for originalto option, only the destination
address must be evaluated. Especially, the address family of the destination
must be tested and not the source one.

This patch must be backported to all stable versions. However be careful,
depending the versions the code may be slightly different.

(cherry picked from commit cccded98c7b3f2281fe4481895b1392e78cc5df0)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: mux-h1: Immediately report H1C errors from h1_snd_buf()

In case an H1 stream tries to send while on error occurred on its underlying
H1 connection, we must report the error. The way the stream-interface is
synchronously notified of the error. It seems to only be a problem on the
2.0. Probably because the scheduling has changed in upper versions. On the
2.0, it prevent the stream to be notified of errors, when for instance, a
payload is found in a response to a HEAD request. Not always though.

This patch must be backported as far as 2.0 because, on 2.0, it should fix the
issue #1101. There is no upstream ID for this commit because on the 2.4, this
fix already exists, it is part of non-backportable commit.

BUG/MINOR: stats: fix compare of no-maint url suffix

Only the first 3 characters are compared for ';no-maint' suffix in
http_handle_stats. Fix it by doing a full match over the entire suffix.

As a side effect, the ';norefresh' suffix matched the inaccurate
comparison, so the maintenance servers were always hidden on the stats
page in this case.

no-maint suffix is present since commit
3e320367014c742814ba494594cdb8340b1161f1
MINOR: stats: also support a "no-maint" show stat modifier

It should be backported up to 2.3.

This fixes github issue #1147.

(cherry picked from commit 91e55ea3f3f6277d91c0fd85ea924ea0e444f7f3)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

CLEANUP: muxes: Remove useless if condition in show_fd function

In H1, H2 and FCGI muxes, in the show_fd function, there is duplicated test on
the stream's subs field.

This patch fixes the issue #1142. It may be backported as far as 2.2.

(cherry picked from commit 6c93c4ef089fc79ce4e85fa0d9e7f61720291dba)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: ssl: potential null pointer dereference in ckchs_dup()

A potential null pointer dereference was reported with an old gcc
version (6.5)

    src/ssl_ckch.c: In function 'cli_parse_set_cert':
    src/ssl_ckch.c:844:7: error: potential null pointer dereference [-Werror=null-dereference]
      if (!ssl_sock_copy_cert_key_and_chain(src->ckch, dst->ckch))
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    src/ssl_ckch.c:844:7: error: potential null pointer dereference [-Werror=null-dereference]
    src/ssl_ckch.c: In function 'ckchs_dup':
    src/ssl_ckch.c:844:7: error: potential null pointer dereference [-Werror=null-dereference]
      if (!ssl_sock_copy_cert_key_and_chain(src->ckch, dst->ckch))
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    src/ssl_ckch.c:844:7: error: potential null pointer dereference [-Werror=null-dereference]

This could happen if ckch_store_new() fails to allocate memory and returns NULL.

This patch must be backported with 8f71298 since it was wrongly fixed and
the bug could happen.

Must be backported as far as 2.2.

(cherry picked from commit 6ac61e39c491f6ca5a1843c787da2e92818ee02a)
Signed-off-by: William Lallemand <wlallemand@haproxy.org>

BUG/MEDIUM: resolvers: Reset address for unresolved servers

If the DNS resolution failed for a server, its ip address must be
removed. Otherwise, the server is stopped but keeps its ip. This may be
confusing when the servers state are retrieved on the CLI and it may lead to
undefined behavior if HAproxy is configured to load its servers state from a
file.

This patch should be backported as far as 2.0.

(cherry picked from commit d127ffa9f439a4eb05455f371191694f22dd7a5b)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Reset server address and port for obselete SRV records

When a SRV record expires, the ip/port assigned to the associated server are
now removed. Otherwise, the server is stopped but keeps its ip/port while
the server hostname is removed. It is confusing when the servers state are
retrieve on the CLI and may be a problem if saved in a server-state
file. Because the reload may fail because of this inconsistency.

Here is an example:

* Declare a server template in a backend, using the resolver <dns>

server-template test 2 _http._tcp.example.com resolvers dns check

* 2 SRV records are announced with the corresponding additional
   records. Thus, 2 servers are filled. Here is the "show servers state"
   output :

2 frt 1 test1 192.168.1.1 2 64 0 1 2 15 3 4 6 0 0 0 http1.example.com 8001 _http._tcp.example.com 0 0 - - 0
2 frt 2 test2 192.168.1.2 2 64 0 1 1 15 3 4 6 0 0 0 http2.example.com 8002 _http._tcp.example.com 0 0 - - 0

* Then, one additional record is removed (or a SRV record is removed, the
   result is the same). Here is the new "show servers state" output :

2 frt 1 test1 192.168.1.1 2 64 0 1 38 15 3 4 6 0 0 0 http1.example.com 8001 _http._tcp.example.com 0 0 - - 0
2 frt 2 test2 192.168.1.2 0 96 0 1 19 15 3 0 14 0 0 0 - 8002 _http._tcp.example.com 0 0 - - 0

On reload, if a server-state file is used, this leads to undefined behaviors
depending on the configuration.

This patch should be backported as far as 2.0.

(cherry picked from commit 52d4d3010991e851b8e9e4d9f923ad1f74d30d69)
[cf: Changes applied in src/dns.c]
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: resolvers: new callback to properly handle SRV record errors

When a SRV record was created, it used to register the regular server name
resolution callbacks. That said, SRV records and regular server name
resolution don't work the same way, furthermore on error management.

This patch introduces a new call back to manage DNS errors related to
the SRV queries.

this fixes github issue #50.

Backport status: 2.3, 2.2, 2.1, 2.0

(cherry picked from commit b4badf720ce484001f606011aee7cd216e5ce4e3)
[cf: Changes applied in src/dns.c and structures renamed]
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: resolvers: Only renew TTL for SRV records with an additional record

If no additional record is associated to a SRV record, its TTL must not be
renewed. Otherwise the entry never expires. Thus once announced a first
time, the entry remains blocked on the same IP/port except if a new announce
replaces the old one.

Now, the TTL is updated if a SRV record is received while a matching
existing one is found with an additional record or when an new additional
record is assigned to an existing SRV record.

This patch should be backported as far as 2.2.

(cherry picked from commit a331a1e8eb2ad4750711a477ca3e22d940495faf)
[cf: Changes applied in src/dns.c]
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: resolvers: Fix condition to release received ARs if not assigned

At the end of resolv_validate_dns_response(), if a received additionnal
record is not assigned to an existing server record, it is released. But the
condition to do so is buggy. If "answer_record" (the received AR) is not
assigned, "tmp_record" is not a valid record object. It is just a dummy
record "representing" the head of the record list.

Now, the condition is far cleaner. This patch must be backported as far as
2.2.

(cherry picked from commit 9c246a4b6ce3fa0e70399e0158866d41b8662a7f)
[cf: Changes applied in src/dns.c]
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: fd: properly wait for !running_mask in fd_set_running_excl()

In fd_set_running_excl() we don't reset the old mask in the CAS loop,
so if we fail on the first round, we'll forcefully take the FD on the
next one.

In practice it's used bu fd_insert() and fd_delete() only, none of which
is supposed to be passed an FD which is still in use since in practice,
given that for now only listeners may be enabled on multiple threads at
once.

This can be backported to 2.2 but shouldn't result in fixing any user
visible bug for now.

(cherry picked from commit 5926e384e62d6e47272062f1fbec235bd11cc517)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: proxy: wake up all threads when sending the hard-stop signal

The hard-stop event didn't wake threads up. In the past it wasn't an issue
as the poll timeout was limited to 1 second, but since commit 4f59d3861
("MINOR: time: increase the minimum wakeup interval to 60s") it has become
a problem because old processes can remain live for up to one minute after
the hard-stop-after delay. Let's just wake them up.

This may be backported to older releases, though before 2.4 the extra
delay was only one second.

(cherry picked from commit 0d03825b93cc59a289e838105f9d83d53ccdfc8b)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: cli/shutdown sessions: make it thread-safe

There's no locking around the lookup of a stream nor its shutdown
when issuing "shutdown sessions" over the CLI so the risk of crashing
the process is particularly high.

Let's use a thread_isolate() there which is suitable for this task, and
there are not that many alternatives.

This must be backported to 1.8.

(cherry picked from commit 3f5dd2945ccbed6b4baf23a453e9d1d071a2d835)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: proxy: use thread-safe stream killing on hard-stop

When setting hard-stop-after, hard_stop() is called at the end to kill
last pending streams. Unfortunately there's no locking there while
walking over the streams list nor when shutting them down, so it's
very likely that some old processes have been crashing or gone wild
due to this. Let's use a thread_isolate() call for this as we don't
have much other choice (and it happens once in the process' life,
that's OK).

This must be backported to 1.8.

(cherry picked from commit 92b887e20a995323d3fbd42d2be035733cefd6ba)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: vars: make functions vars_get_by_{name,desc} thread-safe

This patch adds a lock to functions vars_get_by_name() and
vars_get_by_desc() to protect accesses to the list of variables.

After the variable is fetched, a sample data is duplicated by using
smp_dup() because the variable may be modified by another thread.

This should be backported to all versions supporting vars along with
"BUG/MINOR: sample: secure convs that accept base64 string and var name
as args" which this patch depends on.

(cherry picked from commit 14518f2305027dfd537c1be0f88350337b5fba23)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: sample: secure convs that accept base64 string and var name as args

This patch adds a few improvements in order to secure the use of
converters that accept base64 string and variable name as arguments.

The first change is within related function sample_conv_var2smp_str()
which now flags the sample as SMP_F_CONST if the argument is of type
ARGT_STR. This makes the sample more safe for later use.

A new function sample_check_arg_base64() is added. It checks an argument
and fills it with a variable type if the argument string contains a
valid variable name. If failed, it tries to perform a base64 decode
operation on a non-empty string, and fills the argument with the decoded
content which can be used later, without any additional base64dec()
function calls during runtime. This means that haproxy configuration
check may fail if variable lookup fails and an invalid base64 encoded
string is specified as an argument for such converters.

Both converters, "aes_gcm_dec" and "hmac", now use alloc_trash_chunk()
in order to allocate additional buffers for various conversions, and
avoid the use of a pre-allocated trash chunks directly (usually returned
by get_trash_chunk()). The function sample_check_arg_base64() is used
for both converters in order to check their arguments specified within
the haproxy configuration.

This patch should be backported as far as 2.0. However, it is important
to keep in mind a few things. The "hmac" converter is only available
starting with 2.2. In versions prior to 2.2, the "aes_gcm_dec" converter
and sample_conv_var2smp_str() are implemented in src/ssl_sock.c. Thus
the patch will have to be adapted on these versions.

Note that this patch is required for a subsequent, more important fix.

(cherry picked from commit 9e8db138c9e50262f2aae898bbc9b9b0b9a93449)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: Configure the `cpp` userdiff driver for *.[ch] in .gitattributes

This might improve the output of `git diff` in certain cases. Especially
`git diff --word-diff` will be much more useful.

Does not affect the generated code, may be backported for consistency if
desired.

(cherry picked from commit 6bcdc6530a437bb4eca24fb41439e34690376972)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: ssl/cli: potential null pointer dereference in "set ssl cert"

A potential null pointer dereference was reported with an old gcc
version (6.5)

    src/ssl_ckch.c: In function 'cli_parse_set_cert':
    src/ssl_ckch.c:838:7: error: potential null pointer dereference [-Werror=null-dereference]
      if (!ssl_sock_copy_cert_key_and_chain(src->ckch, dst->ckch))
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    src/ssl_ckch.c:838:7: error: potential null pointer dereference [-Werror=null-dereference]
    src/ssl_ckch.c: In function 'ckchs_dup':
    src/ssl_ckch.c:838:7: error: potential null pointer dereference [-Werror=null-dereference]
      if (!ssl_sock_copy_cert_key_and_chain(src->ckch, dst->ckch))
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    src/ssl_ckch.c:838:7: error: potential null pointer dereference [-Werror=null-dereference]
    cc1: all warnings being treated as errors

This case does not actually happen but it's better to fix the ckch API
with a NULL check.

Could be backported as far as 2.1.

(cherry picked from commit 6c0961442c5e19a1bfc706374f96cfbd42feaeb2)
Signed-off-by: William Lallemand <wlallemand@haproxy.org>

BUG/MEDIUM: mux-h1: Fix handling of responses to CONNECT other than 200-ok

For a CONNECT request, if the tunnel establishment is refused by the server,
the connection is always closed on the client side. This happen because we
fail to detect the end of the tunnel. Now, when a reponse other than 200-ok
is received, the request is switch back to MSG_DONE state and the end of the
transaction is handled as a classical request/response exchange.

This patch should fix the issue #1140. It must be backported as far as
2.0. There is no upstream commit ID because tunnel management was already
fixed in a non-backportable way in 2.4.

BUG/MINOR: server: Be sure to cut the last parsed field of a server-state line

If a line of a server-state file has too many fields, the last one is not
cut on the first following space, as all other fileds. It contains all the
end of the line. It is not the expected behavior. So, now, we cut it on the
next following space, if any. The parsing loop was slighly rewritten.

Note that for now there is no error reported if the line is too long.

This patch may be backported at least as far as 2.1. On 2.0 and prior the
code is not the same. The line parsing is inlined in apply_server_state()
function.

(cherry picked from commit 868a5757e584431fafe713546c8ef8e799865476)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: server: Init params before parsing a new server-state line

Same static arrays of parameters are used to parse all server-state
lines. Thus it is important to reinit them to be sure to not get params from
the previous line, eventually from the previous loaded file.

This patch should be backported to all stable branches. However, in 2.0 and
prior, the parsing of server-state lines are inlined in apply_server_state()
function. Thus the patch will have to be adapted on these versions.

(cherry picked from commit 06cd2569786d31ebb15832cb1053be9aaaadc3f7)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: http-rules: Always replace the response status on a return action

When a HTTP return action is triggered, HAProxy is responsible to return the
response, based on the configured status code. On the request side, there is
no problem because there is no server response to replace. But on the
response side, we must take care to override the server response status
code, if any, to be sure to use the rigth status code to get the http reply
message.

In short, we must always set the configured status code of the HTTP return
action before returning the http reply to be sure to get the right reply,
the one base on the http return action status code and not a reply based on
the server response status code..

This patch should fix the issue #1139. It must be backported as far as 2.2.

(cherry picked from commit 2d36df275b1323cfae8105b5a39f6cfb96f50811)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: spoe: Resolve the sink if a SPOE logs in a ring buffer

If a SPOE filter is configured to send its logs to a ring buffer, the
corresponding sink must be resolved during the configuration post
parsing. Otherwise, the sink is undefined when a log message is emitted,
crashing HAProxy.

This patch must be backported as far as 2.2.

(cherry picked from commit 1d7d0f86b8a7f0d74b47e2afe08f87d9b963d9fb)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: lists: Avoid an infinite loop in MT_LIST_TRY_ADDQ().

In MT_LIST_TRY_ADDQ(), deal with the "prev" field of the element before the
"next". If the element is the first in the list, then its next will
already have been locked when we locked list->prev->next, so locking it
again will fail, and we'll start over and over.

This should be backported to 2.3.

(cherry picked from commit 5567f41d0ab61dd6843535edc8081407d599024d)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

DOC: explain the relation between pool-low-conn and tune.idle-pool.shared

Disabling idle-pool sharing can result in awful performance in presence
of a not so high number of threads, because the number of available idle
connections will be shared among threads, resulting in most of them
abandonning their connections after a request is done if there are already
enough total available. This is a case where pool-low-conn ought to be
used to preserve a number of connections for each thread, but this relation
isn't obvious as is. Let's add mentions about this with both keywords.

(cherry picked from commit 0784db8566b5c4eea5b18bf533ef978a1909b97a)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUILD: ssl: introduce fine guard for OpenSSL specific SCTL functions

SCTL (signed certificate timestamp list) specified in RFC6962
was implemented in c74ce24cd22e8c683ba0e5353c0762f8616e597d, let
us introduce macro HAVE_SSL_SCTL for the HAVE_SSL_SCTL sake,
which in turn is based on SN_ct_cert_scts, which comes in the same commit

(cherry picked from commit c47d676bd75e339f92faae430486ee337c45f134)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: sample: Always consider zero size string samples as unsafe

smp_is_safe() function is used to be sure a sample may be safely
modified. For string samples, a test is performed to verify if there is a
null-terminated byte. If not, one is added, if possible. It means if the
sample is not const and if there is some free space in the buffer, after
data. However, we must not try to read the null-terminated byte if the
string sample is too long (data >= size) or if the size is equal to
zero. This last test was not performed. Thus it was possible to consider a
string sample as safe by testing a byte outside the buffer.

Now, a zero size string sample is always considered as unsafe and is
duplicated when smp_make_safe() is called.

This patch must be backported in all stable versions.

(cherry picked from commit 8dd40fbde9d51cf7bf0ee622a5bc5c1f56048d84)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>