Repositories - haproxy-2.3.git/log

BUG/MINOR: payload: Wait for more data if buffer is empty in payload/payload_lv

In payload() and payload_lv() sample fetches, if the buffer is empty, we
must wait for more data by setting SMP_F_MAY_CHANGE flag on the sample.
Otherwise, when it happens in an ACL, nothing is returned (because the
buffer is empty) and the ACL is considered as finished (success or failure
depending on the test).

As a workaround, the buffer length may be tested first. For instance :

    tcp-request inspect-delay 1s
    tcp-request content reject unless { req.len gt 0 } { req.payload(0,0),fix_is_valid }

instead of :

    tcp-request inspect-delay 1s
    tcp-request content reject if ! { req.payload(0,0),fix_is_valid }

This patch must be backported as far as 2.2.

(cherry picked from commit 50623029f8171f7f499529f8002d3093fca12667)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: backend: use a trylock to grab a connection on high FD counts as well

Commit b1adf03df ("MEDIUM: backend: use a trylock when trying to grab an
idle connection") solved a contention issue on the backend under normal
condition, but there is another one further, which only happens when the
number of FDs in use is considered too high, and which obviously causes
random crashes with just 16 threads once the number of FDs is about to be
exhausted.

Like the aforementioned patch, this one should be backported to 2.3.

(cherry picked from commit 9b9f8477f8c751e366a526e2177a8aab34c80f6d)
[backported for the same reason as the first one above -- managed to
crash on a 8c16t Xeon without it; minor ctx adjustments (list vs tree)
and lock name]
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: mux-h1: make h1_shutw_conn() idempotent

In issue #1197, Stéphane Graber reported a rare case of crash that
results from an attempt to close an already closed H1 connection. It
indeed looks like under some circumstances it should be possible to
call the h1_shutw_conn() function more than once, though these
conditions are not very clear.

Without going through a deep analysis of all possibilities, one
potential case seems to be a detach() called with pending output data,
causing H1C_F_ST_SHUTDOWN to be set on the connection, then h1_process()
being immediately called on I/O, causing h1_send() to flush these data
and call h1_shutw_conn(), and finally the upper stream calling cs_shutw()
hence h1_shutw(), which itself will call h1_shutw_conn() again while the
transport and control layers have already been released. But the whole
sequence is not certain as it's not very clear in which case it's
possible to leave h1_send() without the connection anymore (at least
the obuf is empty).

However what is certain is that a shutdown function must be idempotent,
so let's fix h1_shutw_conn() regarding this point. Stéphane reported the
issue as far back as 2.0, so this patch should be backported this far.

(cherry picked from commit 62592ad967d6d24be2aabb664a5e1d594ab35415)
Signed-off-by: Willy Tarreau <w@1wt.eu>

[RELEASE] Released version 2.3.8

Released version 2.3.8 with the following main changes :
    - MINOR: time: export the global_now variable
    - BUG/MINOR: freq_ctr/threads: make use of the last updated global time
    - BUG/MEDIUM: mux-fcgi: Fix locking of idle_conns lock in the FCGI I/O callback
    - MINOR: time: also provide a global, monotonic global_now_ms timer
    - BUG/MEDIUM: freq_ctr/threads: use the global_now_ms variable
    - BUG/MINOR: protocol: add missing support of dgram unix socket.
    - MINOR/BUG: mworker/cli: do not use the unix_bind prefix for the master CLI socket
    - MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua
    - BUG/MEDIUM: debug/lua: Don't dump the lua stack if not dumpable
    - BUG/MINOR: ssl: Prevent disk access when using "add ssl crt-list"
    - BUILD: ssl: guard ecdh functions with SSL_CTX_set_tmp_ecdh macro
    - MINOR: lua: Slightly improve function dumping the lua traceback
    - BUG/MEDIUM: debug/lua: Use internal hlua function to dump the lua traceback
    - BUG/MEDIUM: lua: Always init the lua stack before referencing the context
    - MINOR: fd: make fd_clr_running() return the remaining running mask
    - MINOR: fd: remove the unneeded running bit from fd_insert()
    - BUG/MEDIUM: fd: do not wait on FD removal in fd_delete()
    - CLEANUP: fd: remove unused fd_set_running_excl()
    - BUG/MEDIUM: fd: Take the fd_mig_lock when closing if no DWCAS is available.
    - BUG/MEDIUM: thread: Fix a deadlock if an isolated thread is marked as harmless
    - MINOR: tools: make url2ipv4 return the exact number of bytes parsed
    - BUG/MINOR: http_fetch: make hdr_ip() reject trailing characters

BUG/MINOR: http_fetch: make hdr_ip() reject trailing characters

The hdr_ip() sample fetch function will try to extract IP addresses
from a header field. These IP addresses are parsed using url2ipv4()
and if it fails it will fall back to inet_pton(AF_INET6), otherwise
will fail.

There is a small problem there which is that if a field starts with
an IP address and is immediately followed by some garbage, the IP
address part is still returned. This is a problem with fields such
as x-forwarded-for because it prevents detection of accidental
corruption or bug along the chain. For example, the following string:

   x-forwarded-for: 1.2.3.4; 5.6.7.8

or this one:

   x-forwarded-for: 1.2.3.4O    ( the last one being the letter 'O')

would still return "1.2.3.4" despite the trailing characters. This is
bad because it will silently cover broken code running on intermediary
proxies and may even in some cases allow haproxy to pass improperly
formatted headers after they were apparently validated, for example,
if someone extracts the address from this field to place it into
another one.

This issue would only affect the IPv4 parser, because the IPv6 parser
already uses inet_pton() which fails at the first invalid character and
rejects trailing port numbers.

In strict compliance with RFC7239, let's make sure that if there are any
characters left in the string, the parsing fails and makes hdr_ip()
return nothing. However, a special case has to be handled to support
IPv4 addresses followed by a colon and a valid port number, because till
now the parser used to implicitly accept them and it appears that this
practice, though rare, does exist at least in Azure:
   https://docs.microsoft.com/en-us/azure/application-gateway/how-application-gateway-works

This issue has always been there so the fix may be backported to all
versions. It will need the following commit in order to work as expected:

    MINOR: tools: make url2ipv4 return the exact number of bytes parsed

Many thanks to https://twitter.com/melardev and the BitMEX Security Team
for their detailed report.

(cherry picked from commit 7b0e00d943feaa6320522c376fbbff18965b1345)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: tools: make url2ipv4 return the exact number of bytes parsed

The function's return value is currently used as a boolean but we'll
need it to return the number of bytes parsed. Right now it returns
it minus one, unless the last char doesn't match what is permitted.
Let's update this to make it more usable.

(cherry picked from commit 12e1027aa6c7687b5638c0c9260696266c9e400b)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: thread: Fix a deadlock if an isolated thread is marked as harmless

If an isolated thread is marked as harmless, it will loop forever in
thread_harmless_till_end() waiting no threads are isolated anymore. It never
happens because the current thread is isolated. To fix the bug, we exclude
the current thread for the test. We now wait for all other threads to leave
the rendez-vous point.

This bug only seems to occurr if HAProxy is compiled with DEBUG_UAF, when
pool_gc() is called. pool_gc() isolates the current thread, while
pool_free_area() set the thread as harmless when munmap is called.

This patch must be backported as far as 2.0.

(cherry picked from commit a9a9e9aac93bae4d21d22e02fd436717a5a4a07a)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: fd: Take the fd_mig_lock when closing if no DWCAS is available.

In fd_delete(), if we're running with no double-width cas, take the
fd_mig_lock before setting thread_mask to 0 to make sure that
another thread calling fd_set_running() won't miss the new value of
thread_mask and set its bit in running_mask after we checked it.

This should be backported to 2.2 as part of the series fixing fd_delete().

(cherry picked from commit c23b33764ea88587193472283c3e2d7a88448392)
Signed-off-by: Willy Tarreau <w@1wt.eu>

CLEANUP: fd: remove unused fd_set_running_excl()

This one is no longer used and was the origin of the previously mentioned
deadlock.

(cherry picked from commit 6cf13119e2adeeed2520c66d2fc0cea6f6f23acb)
[wt: not strictly needed but will catch offenders if any out-of-tree
code were to rely on it]
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: fd: do not wait on FD removal in fd_delete()

Christopher discovered an issue mostly affecting 2.2 and to a less extent
2.3 and above, which is that it's possible to deadlock a soft-stop when
several threads are using a same listener:

          thread1                             thread2

      unbind_listener()                   fd_set_running()
          lock(listener)                  listener_accept()
          fd_delete()                          lock(listener)
             while (running_mask);  -----> deadlock
          unlock(listener)

This simple case disappeared from 2.3 due to the removal of some locked
operations at the end of listener_accept() on the regular path, but the
architectural problem is still here and caused by a lock inversion built
around the loop on running_mask in fd_clr_running_excl(), because there
are situations where the caller of fd_delete() may hold a lock that is
preventing other threads from dropping their bit in running_mask.

The real need here is to make sure the last user deletes the FD. We have
all we need to know the last one, it's the one calling fd_clr_running()
last, or entering fd_delete() last, both of which can be summed up as
the last one calling fd_clr_running() if fd_delete() calls fd_clr_running()
at the end. And we can prevent new threads from appearing in running_mask
by removing their bits in thread_mask.

So what this patch does is that it sets the running_mask for the thread
in fd_delete(), clears the thread_mask, thus marking the FD as orphaned,
then clears the running mask again, and completes the deletion if it was
the last one. If it was not, another thread will pass through fd_clr_running
and will complete the deletion of the FD.

The bug is easily reproducible in 2.2 under high connection rates during
soft close. When the old process stops its listener, occasionally two
threads will deadlock and the old process will then be killed by the
watchdog. It's strongly believed that similar situations do exist in 2.3
and 2.4 (e.g. if the removal attempt happens during resume_listener()
called from listener_accept()) but if so, they should be much harder to
trigger.

This should be backported to 2.2 as the issue appeared with the FD
migration. It requires previous patches "fd: make fd_clr_running() return
the remaining running mask" and "MINOR: fd: remove the unneeded running
bit from fd_insert()".

Notes for backport: in 2.2, the fd_dodelete() function requires an extra
argument "do_close" indicating whether we want to remove and close the FD
(fd_delete) or just delete it (fd_remove). While this information is not
conveyed along the chain, we know that late calls always imply do_close=1
become do_close=0 exclusively results from fd_remove() which is only used
by the config parser and the master, both of which are single-threaded,
hence are always the last ones in the running_mask. Thus it is safe to
assume that a postponed FD deletion always implies do_close=1.

Thanks to Olivier for his help in designing this optimal solution.

(cherry picked from commit 2c3f9818e883318688e4bba0c912b305e2e2659c)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: fd: remove the unneeded running bit from fd_insert()

There's no point taking the running bit in fd_insert() since by
definition there will never be more than one thread inserting the FD,
and that fd_insert() may only be done after the fd was allocated by
the system, indicating the end of use by any other thread.

This will need to be backported to 2.2 to fix an issue.

(cherry picked from commit 71bada5ca45cdf6d433e699ac0c7f7d78092ed02)
[wt: minor context adjustment]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: fd: make fd_clr_running() return the remaining running mask

We'll need to know that a thread is the last one to use an fd, so let's
make fd_clr_running() return the remaining bits after removal. Note that
in practice we're only interested in knowing if it's zero but the compiler
doesn't make use of the clags after the AND and emits a CMPXCHG anyway :-/

This will need to be backported to 2.2 to fix an issue.

(cherry picked from commit 6e8e10b4159f4a31e6aaa535e38dc035f8711787)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: lua: Always init the lua stack before referencing the context

When a lua context is allocated, its stack must be initialized to NULL
before attaching it to its owner (task, stream or applet). Otherwise, if
the watchdog is fired before the stack is really created, that may lead to a
segfault because we try to dump the traceback of an uninitialized lua stack.

It is easy to trigger this bug if a lua script do a blocking call while
another thread try to initialize a new lua context. Because of the global
lua lock, the init is blocked before the stack creation. Of course, it only
happens if the script is executed in the shared global context.

This patch must be backported as far as 2.0.

(cherry picked from commit 1e8433f594de4b860e5205fdd6cb40d91ff58f17)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: debug/lua: Use internal hlua function to dump the lua traceback

The commit reverts following commits:
* 83926a04 BUG/MEDIUM: debug/lua: Don't dump the lua stack if not dumpable
* a61789a1 MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua

Instead of relying on a Lua function to print the lua traceback into the
debugger, we are now using our own internal function (hlua_traceback()).
This one does not allocate memory and use a chunk instead. This avoids any
issue with a possible deadlock in the memory allocator because the thread
processing was interrupted during a memory allocation.

This patch relies on the commit "BUG/MEDIUM: debug/lua: Use internal hlua
function to dump the lua traceback". Both must be backported wherever the
patches above are backported, thus as far as 2.0

(cherry picked from commit cc2c4f8f4c1d8613b481d1b346e083a9d2462811)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: lua: Slightly improve function dumping the lua traceback

The separator string is now configurable, passing it as parameter when the
function is called. In addition, the message have been slightly changed to
be a bit more readable.

(cherry picked from commit d09cc519bd0c04a0d1ec5bf23ed053c1e596851a)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUILD: ssl: guard ecdh functions with SSL_CTX_set_tmp_ecdh macro

let us use feature macro SSL_CTX_set_tmp_ecdh instead of comparing openssl
version

(cherry picked from commit a0fd35b05476b45d8a10a299a6b32c8cca0264d9)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: ssl: Prevent disk access when using "add ssl crt-list"

If an unknown CA file was first mentioned in an "add ssl crt-list" CLI
command, it would result in a call to X509_STORE_load_locations which
performs a disk access which is forbidden during runtime. The same would
happen if a "ca-verify-file" or "crl-file" was specified. This was due
to the fact that the crt-list file parsing and the crt-list related CLI
commands parsing use the same functions.
The patch simply adds a new parameter to all the ssl_bind parsing
functions so that they know if the call is made during init or by the
CLI, and the ssl_store_load_locations function can then reject any new
cafile_entry creation coming from a CLI call.

It can be backported as far as 2.2.

(cherry picked from commit fb00f31af4ba67c69a12807729514a2bdcd47efa)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: debug/lua: Don't dump the lua stack if not dumpable

When we try to dump the stack of a lua context, if it is not dumpable,
nothing is performed and a message is emitted instead. This happens when a
lua execution was interrupted inside a non-reentrant part.

This patch depends on following commit :

* MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua

Thanks to this patch, we avoid a possible deadllock if the lua is
interrupted by the watchdog in the lua memory allocator, because realloc()
is not async-signal-safe.

Both patches must be backported as far as 2.0.

(cherry picked from commit 83926a04febe94f332c6f45c98773286678346d3)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: lua: Use a per-thread counter to track some non-reentrant parts of lua

Some parts of the Lua are non-reentrant. We must be sure to carefully track
these parts to not dump the lua stack when it is interrupted inside such
parts. For now, we only identified the custom lua allocator. If the thread
is interrupted during the memory allocation, we must not try to print the
lua stack wich also allocate memory. Indeed, realloc() is not
async-signal-safe.

In this patch we introduce a thread-local counter. It is incremented before
entering in a non-reentrant part and decremented when exiting. It is only
performed in hlua_alloc() for now.

(cherry picked from commit a61789a1d62fd71c751189faf5371740dd375f33)
[wt: the code uses malloc/realloc/free depending on the inputs in 2.3,
let's place the counter around these 3 calls]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR/BUG: mworker/cli: do not use the unix_bind prefix for the master CLI socket

If the configuration file contains a 'unix-bind prefix' directive, and
if we use the -S option and specify a UNIX socket path, the path of the
socket will be prepended with the value of the unix-bind prefix.

For instance, if we have 'unix-bind prefix /tmp/sockets/' and we use
'-S /tmp/master-socket' on the command line, we will get this error:

Starting proxy MASTER:
cannot bind UNIX socket (No such file or directory) [/tmp/sockets/tmp/master-socket]

So this patch adds an exception, and will ignore the unix-bind prefix
for the master CLI socket.

This patch can be backported as far as 1.9.

(cherry picked from commit 1b8dacc858d7cb43488fb2ffca053274c571da1d)
[wt: cli_fe was called stats_fe in 2.3 and older]
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: protocol: add missing support of dgram unix socket.

The proto "uxdg" (UNIX DGRAM) was not declared, causing an error trying
to put a socket unix on "dgram-bind" into a log-forward section.

This patch introduces the missing "uxdg" protocol by adding proto_uxdg.c
which was fully created based on the code available for the other
protocols.

This patch should be backported to version 2.3 and above.

(cherry picked from commit 8af3bb0abf1b808fde2c944d768431377e6a1526)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: freq_ctr/threads: use the global_now_ms variable

In commit a1ecbca0a ("BUG/MINOR: freq_ctr/threads: make use of the last
updated global time"), for period-based counters, the millisecond part
of the global_now variable was used as the date for the new period. But
it's wrong, it only works with sub-second periods as it wraps every
second, and for other periods the counters never rotate anymore.

Let's make use of the newly introduced global_now_ms variable instead,
which contains the global monotonic time expressed in milliseconds.

This patch needs to be backported wherever the patch above is backported.
It depends on previous commit "MINOR: time: also provide a global,
monotonic global_now_ms timer".

(cherry picked from commit 8cc586c73fefd96f4be1f7820e38a1263f6252ca)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: time: also provide a global, monotonic global_now_ms timer

The period-based freq counters need the global date in milliseconds,
so better calculate it and expose it rather than letting all call
places incorrectly retrieve it.

Here what we do is that we maintain a new globally monotonic timer,
global_now_ms, which ought to be very close to the global_now one,
but maintains the monotonic approach of now_ms between all threads
in that global_now_ms is always ahead of any now_ms.

This patch is made simple to ease backporting (it will be needed for
a subsequent fix), but it also opens the way to some simplifications
on the time handling: instead of computing the local time and trying
to force it to the global one, we should soon be able to proceed in
the opposite way, that is computing the new global time an making the
local one just the latest snapshot of it. This will bring the benefit
of making sure that the global time is always ahead of the local one.

(cherry picked from commit 6064b34be0e761de881a5bfca287d01f69122602)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: mux-fcgi: Fix locking of idle_conns lock in the FCGI I/O callback

When the commit e388f2fbc ("MEDIUM: muxes: mark idle conns tasklets with
TASK_F_USR1") was backported (commit id b360bb88 on the 2.3), a call to
HA_SPIN_UNLOCK() was missed and not moved in the right code block. Thus it
is possible to unlock the idle_conns lock while it was never acquired.

This patch should fix the issue #1191. It is 2.3 specific, thus there is no
upstream commid id. No backport is needed except if commit b360bb88 is
backported.

BUG/MINOR: freq_ctr/threads: make use of the last updated global time

The freq counters were using the thread's own time as the start of the
current period. The problem is that in case of contention, it was
occasionally possible to perform non-monotonic updates on the edge of
the next second, because if the upfront thread updates a counter first,
it causes a rotation, then the second thread loses the race from its
older time, and tries again, and detects a different time again, but
in the past so it only updates the counter, then a third thread on the
new date would detect a change again, thus provoking a rotation again.

The effect was triple:
  - rare loss of stored values during certain transitions from one
    period to the next one, causing counters to report 0
  - half of the threads forced to go through the slow path every second
  - difficult convergence when using many threads where the CAS can fail
    a lot and we can observe N(N-1) attempts for N threads to complete

This patch fixes this issue in two ways:
  - first, it now makes use og the monotonic global_now value which also
    happens to be volatile and to carry the latest known time; this way
    time will never jump backwards anymore and only the first thread
    updates it on transition, the other ones do not need to.

  - second, re-read the time in the loop after each failure, because
    if the date changed in the counter, it means that one thread knows
    a more recent one and we need to update. In this case if it matches
    the new current second, the fast path is usable.

This patch relies on previous patch "MINOR: time: export the global_now
variable" and must be backported as far as 1.8.

(cherry picked from commit a1ecbca0a50ccb273bb5cdc9f031408d782a7bcf)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: time: export the global_now variable

This is the process-wide monotonic time that is used to update each
thread's own time. It may be required at a few places where a strictly
monotonic clock is required such as freq_ctr. It will be have to be
backported as a dependency of a forthcoming fix.

(cherry picked from commit 650f374f24c4dd14e90d336c72c2370242be1459)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

[RELEASE] Released version 2.3.7

Released version 2.3.7 with the following main changes :
    - BUG/MINOR: backend: fix condition for reuse on mode HTTP
    - BUG/MINOR: hlua: Don't strip last non-LWS char in hlua_pushstrippedstring()
    - BUG/MINOR: ssl: don't truncate the file descriptor to 16 bits in debug mode
    - REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h
    - BUG/MINOR: mt-list: always perform a cpu_relax call on failure
    - MINOR: atomic: add armv8.1-a atomics variant for cas-dw
    - MINOR: atomic: implement a more efficient arm64 __ha_cas_dw() using pairs
    - BUG/MEDIUM: session: NULL dereference possible when accessing the listener
    - MINOR: tasks: refine the default run queue depth
    - MINOR: listener: refine the default MAX_ACCEPT from 64 to 4
    - OPTIM: server: switch the actconn list to an mt-list
    - MINOR: server: move actconns to the per-thread structure
    - MINOR: lb/api: let callers of take_conn/drop_conn tell if they have the lock
    - OPTIM: lb-first: do not take the server lock on take_conn/drop_conn
    - OPTIM: lb-leastconn: do not take the server lock on take_conn/drop_conn
    - OPTIM: lb-leastconn: do not unlink the server if it did not change
    - MINOR: dynbuf: make the buffer wait queue per thread
    - MINOR: dynbuf: use regular lists instead of mt_lists for buffer_wait
    - MINOR: dynbuf: pass offer_buffers() the number of buffers instead of a threshold
    - MINOR: stream: add an "epoch" to figure which streams appeared when
    - MINOR: cli/streams: make "show sess" dump all streams till the new epoch
    - MINOR: streams: use one list per stream instead of a global one
    - MEDIUM: streams: do not use the streams lock anymore
    - MEDIUM: pools: add CONFIG_HAP_NO_GLOBAL_POOLS and CONFIG_HAP_GLOBAL_POOLS
    - MINOR: pools: double the local pool cache size to 1 MB
    - MEDIUM: backend: use a trylock when trying to grab an idle connection
    - MINOR: task: limit the number of subsequent heavy tasks with flag TASK_HEAVY
    - MINOR: ssl: mark the SSL handshake tasklet as heavy
    - BUG/MEDIUM: ssl: properly remove the TASK_HEAVY flag at end of handshake
    - MINOR: task: add an application specific flag to the state: TASK_F_USR1
    - MEDIUM: muxes: mark idle conns tasklets with TASK_F_USR1
    - MINOR: xprt: add new xprt_set_idle and xprt_set_used methods
    - MEDIUM: ssl: implement xprt_set_used and xprt_set_idle to relax context checks
    - MEDIUM: task: remove the tasks_run_queue counter and have one per thread
    - MINOR: task: give the scheduler a bit more flexibility in the runqueue size
    - OPTIM: task: automatically adjust the default runqueue-depth to the threads
    - BUG/MEDIUM: stick-tables: fix ref counter in table entry using multiple http tracksc.
    - BUILD: atomic/arm64: force the register pairs to use in __ha_cas_dw()
    - BUG/MEDIUM: filters: Set CF_FL_ANALYZE on channels when filters are attached
    - BUG/MINOR: tcpcheck: Update .health threshold of agent inside an agent-check
    - BUG/MINOR: proxy/session: Be sure to have a listener to increment its counters
    - BUG/MINOR: session: Add some forgotten tests on session's listener
    - BUG/MINOR: tcpcheck: Fix double free on error path when parsing tcp/http-check
    - CLEANUP: tcp-rules: add missing actions in the tcp-request error message
    - Revert "BUG/MINOR: resolvers: Only renew TTL for SRV records with an additional record"
    - BUG/MINOR: resolvers: Consider server to have no IP on DNS resolution error
    - BUG/MINOR: resolvers: Reset server address on DNS error only on status change
    - BUG/MINOR: resolvers: Unlink DNS resolution to set RMAINT on SRV resolution
    - BUG/MEDIUM: resolvers: Don't set an address-less server as UP
    - BUG/MEDIUM: resolvers: Fix the loop looking for an existing ADD item
    - MINOR: resolvers: new function find_srvrq_answer_record()
    - BUG/MINOR; resolvers: Ignore DNS resolution for expired SRV item
    - BUG/MEDIUM: resolvers: Trigger a DNS resolution if an ADD item is obsolete
    - MINOR: resolvers: Use a function to remove answers attached to a resolution
    - MINOR: resolvers: Purge answer items when a SRV resolution triggers an error
    - MINOR: resolvers: Add function to change the srv status based on SRV resolution
    - MINOR: resolvers: Directly call srvrq_update_srv_state() when possible
    - BUG/MEDIUM: resolvers: Don't release resolution from a requester callbacks
    - BUG/MEDIUM: resolvers: Skip DNS resolution at startup if SRV resolution is set
    - MINOR: resolvers: Use milliseconds for cached items in resolver responses
    - MINOR: resolvers: Don't try to match immediatly renewed ADD items
    - BUG/MINOR: resolvers: Add missing case-insensitive comparisons of DNS hostnames

BUG/MINOR: resolvers: Add missing case-insensitive comparisons of DNS hostnames

DNS hostname comparisons were fixed to be case-insensitive (see b17b88487
"BUG/MEDIUM: dns: Consider the fact that dns answers are
case-insensitive"). However 2 comparisons are still case-sensitive.

This patch must be backported as far as 1.8.

(cherry picked from commit 59b29257338f8e7fb6def6bddda45d08651ffe1c)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Don't try to match immediatly renewed ADD items

The loop looking for existing ADD items to renew their last_seen must ignore
the items already renewed in the same loop. To do so, we rely on the
last_seen time. because it is now based on now_ms, it is safe.

Doing so avoid to match several time the same ADD item when the same IP
address is found in several ADD item. This reduces the number of extra DNS
resolutions.

This patch depends on "MINOR: resolvers: Use milliseconds for cached items
in resolver responses". Both may be backported as far as 2.2 if necessary.

(cherry picked from commit e8674c71840cd783d6ba1cc073c0d074aea09fd8)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Use milliseconds for cached items in resolver responses

The last time when an item was seen in a resolver responses is now stored in
milliseconds instead of seconds. This avoid some corner-cases at the
edges. This also simplifies time comparisons.

(cherry picked from commit 55c1c4053f70bf53adfcd5067fa959e20f2b44fe)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Skip DNS resolution at startup if SRV resolution is set

At startup, if a SRV resolution is set for a server, no DNS resolution is
created. We must wait the first SRV resolution to know if it must be
triggered. It is important to do so for two reasons.

First, during a "classical" startup, a server based on a SRV resolution has
no hostname. Thus the created DNS resolution is useless. Best waiting the
first SRV resolution. It is not really a bug at this stage, it is just
useless.

Second, in the same situation, if the server state is loaded from a file,
its hosname will be set a bit later. Thus, if there is no additionnal record
for this server, because there is already a DNS resolution, it inhibits any
new DNS resolution. But there is no hostname attached to the existing DNS
resolution. So no resolution is performed at all for this server.

To avoid any problem, it is fairly easier to handle this special case during
startup. But this means we must be prepared to have no "resolv_requester"
field for a server at runtime.

This patch must be backported as far as 2.2.

(cherry picked from commit d83a6df5cde5f2e5417e9f995ab966ba01d17501)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Don't release resolution from a requester callbacks

Another way to say it: "Safely unlink requester from a requester callbacks".

Requester callbacks must never try to unlink a requester from a resolution, for
the current requester or another one. First, these callback functions are called
in a loop on a request list, not necessarily safe. Thus unlink resolution at
this place, may be unsafe. And it is useless to try to make these loops safe
because, all this stuff is placed in a loop on a resolution list. Unlink a
requester may lead to release a resolution if it is the last requester.

However, the unkink is necessary because we cannot reset the server state
(hostname and IP) with some pending DNS resolution on it. So, to workaround
this issue, we introduce the "safe" unlink. It is only performed from a
requester callback. In this case, the unlink function never releases the
resolution, it only reset it if necessary. And when a resolution is found
with an empty requester list, it is released.

This patch depends on the following commits :

* MINOR: resolvers: Purge answer items when a SRV resolution triggers an error
* MINOR: resolvers: Use a function to remove answers attached to a resolution
* MINOR: resolvers: Directly call srvrq_update_srv_state() when possible
* MINOR: resolvers: Add function to change the srv status based on SRV resolution

All the series must be backported as far as 2.2. It fixes a regression
introduced by the commit b4badf720 ("BUG/MINOR: resolvers: new callback to
properly handle SRV record errors").

don't release resolution from requester cb

(cherry picked from commit 0efc0993ec5210c8cf5a98eb9c0220014e030fd6)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Directly call srvrq_update_srv_state() when possible

When the server status must be updated from the result of a SRV resolution,
we can directly call srvrq_update_srv_state(). It is simpler and this avoid
a test on the server DNS resolution.

This patch is mandatory for the next commit. It also rely on "MINOR:
resolvers: Directly call srvrq_update_srv_state() when possible".

(cherry picked from commit 6b117aed492d9916a5c656b4fbfcac6159133c0f)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Add function to change the srv status based on SRV resolution

srvrq_update_srv_status() update the server status based on result of SRV
resolution. For now, it is only used from snr_update_srv_status() when
appropriate.

(cherry picked from commit 5efdef24c1753d7a68d1f6c8dc8cb6b4b84a3361)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Purge answer items when a SRV resolution triggers an error

When a SRV request trigger an error, if we decide to handle the error
because last_valid duration is expired, the answer list may be purged. All
items are considered as obsolete.

(cherry picked from commit 51d5e3bda7fb07a5f03cb878dd66f673c3ed1a59)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: Use a function to remove answers attached to a resolution

resolv_purge_resolution_answer_records() must be used to removed all answers
attached to a resolution. For now, it is only used when a resolution is
released.

(cherry picked from commit 1dec5c793474027ddffbfca3849e5ca4e9e51083)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Trigger a DNS resolution if an ADD item is obsolete

When a ADD item attached to a SRV item is removed because it is obsolete, we
must trigger a DNS resolution to be sure the hostname still resolves or
not. There is no other way to be the entry is still valid. And we cannot set
the server in RMAINT immediatly, because a DNS server may be inconsitent and
may stop to add some additionnal records.

The opposite is also true. If a valid ADD item is still attached to a SRV
item, any DNS resolution must be stopped. There is no reason to perform
extra resolution in this case.

This patch must be backported as far as 2.2.

(cherry picked from commit 3e0600fbbfee7daf957033505d56bc8dcc874185)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR; resolvers: Ignore DNS resolution for expired SRV item

If no ADD item is found for a SRV item in a SRV response, a DNS resolution
is triggered. When it succeeds, we must be sure the SRV item is still
alive. Otherwise the DNS resolution must be ignored.

This patch depends on the commit "MINOR: resolvers: Move last_seen time of
an ADD into its corresponding SRV item". Both must be backported as far as
2.2.

(cherry picked from commit 49531e8471a1c1ae9e1c7a6c1aeeb00713024beb)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: resolvers: new function find_srvrq_answer_record()

This function search for a SRV answer item associated to a requester
whose type is server.
This is mainly useful to "link" a server to its SRV record when no
additional record were found to configure the IP address.

This patch is required by a bug fix.

(cherry picked from commit 6a8d11dc8051f4fc40c9ab1266cd58367ea1ca19)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Fix the loop looking for an existing ADD item

For each ADD item found in a SRV response, we try to find a corresponding
ADD item already attached to an existing SRV item. If found, the ADD
last_seen time is updated, otherwise we try to find a SRV item with no ADD
to attached the new one.

However, the loop is buggy. Instead of comparing 2 ADD items, it compares
the new ADD item with the SRV item. Because of this bug, we are unable to
renew last_seen time of existing ADD.

This patch must be backported as far as 2.2.

(cherry picked from commit 77f860699c17426ee6e727fa07a287b32276f1dd)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: resolvers: Don't set an address-less server as UP

when a server status is updated based on a SRV item, it is always set to UP,
regardless it has an IP address defined or not. For instance, if only a SRV
item is received, with no additional record, only the server hostname is
defined. We must wait to have an IP address to set the server as UP.

This patch must be backported as far as 2.2.

(cherry picked from commit ab177ac1f31b5c6c2e4f7d023923dd668cce3076)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: resolvers: Unlink DNS resolution to set RMAINT on SRV resolution

When a server is set in RMAINT becaues of a SRV resolution failure, the
server DNS resolution, if any, must be unlink first. It is mandatory to
handle the change in the context of a SRV resolution.

This patch must be backported as far as 2.2.

(cherry picked from commit bca680ba90787fc57b9711e268f6fb0c74e7e49c)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: resolvers: Reset server address on DNS error only on status change

When a DNS resolution error is detected, in snr_resolution_error_cb(), the
server address must be reset only if the server status has changed. It this
case, it means the server is set to RMAINT. Thus the server address may by
reset.

This patch fixes a bug introduced by commit d127ffa9f ("BUG/MEDIUM:
resolvers: Reset address for unresolved servers"). It must be backported as
far as 2.0.

(cherry picked from commit 5130c21fbb806a28ab4e535b700df872a80798d8)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: resolvers: Consider server to have no IP on DNS resolution error

When an error is received for a DNS resolution, for instance a NXDOMAIN
error, the server must be considered to have no address when its status is
updated, not the opposite.

Concretly, because this parameter is not used on error path in
snr_update_srv_status(), there is no impact.

This patch must be backported as far as 1.8.

(cherry picked from commit bd0227c1096ecc9e11da315c399b178b16ad7a9d)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

Revert "BUG/MINOR: resolvers: Only renew TTL for SRV records with an additional record"

This reverts commit a331a1e8eb2ad4750711a477ca3e22d940495faf.

This commit fixes a real bug, but it also reveals some hidden bugs, mostly
because of some design issues. Thus, in itself, it create more problem than
it solves. So revert it for now. All known bugs will be addressed in next
commits.

This patch should be backported as far as 2.2.

(cherry picked from commit 5037c06d91e0ed748932fcfaae11eeadf5949957)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

CLEANUP: tcp-rules: add missing actions in the tcp-request error message

The tcp-request error message only mentions "accept", "reject" and
track-sc*, but there are a few other ones that were missing, so let's
add them.

This could be backported, though it's not likely that it will help anyone
with an existing config.

(cherry picked from commit 72d012fbd971bfd8b2bf6ab022572a816b759cbd)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: tcpcheck: Fix double free on error path when parsing tcp/http-check

When a "tcp-check" or a "http-check" rule is parsed, we try to get the
previous rule in the ruleset to get its index. We must take care to reset
the pointer on this rule in case an error is triggered later on the
parsing. Otherwise, the same rule may be released twice. For instance, it
happens with such line :

http-check meth GET uri / ## note there is no "send" parameter

This patch must be backported as far as 2.2.

(cherry picked from commit cd03be73d59ad2c5d9ad43b28a1b0fda9f32526c)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: session: Add some forgotten tests on session's listener

During backport of commit 36119de18 ("BUG/MEDIUM: session: NULL dereference
possible when accessing the listener"), I missed some tests on the session's
listener because of the context changes.

It is specific to the 2.3, thus there is no upstream commit ID. It must
backported with the above commit as far as 1.8.

BUG/MINOR: proxy/session: Be sure to have a listener to increment its counters

It is possible to have a session without a listener. It happens for applets
on the client side. Thus all accesses to the listener info from the session
must be guarded. It was the purpose of the commit 36119de18 ("BUG/MEDIUM:
session: NULL dereference possible when accessing the listener"). However,
some tests on the session's listener existence are missing in proxy_inc_*
functions.

This patch should fix the issues #1171, #1172, #1173, #1174 and #1175. It
must be backported with the above commit as far as 1.8.

(cherry picked from commit 77e376783e891fe5dda4fe4e1198cf8dff0b118d)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: tcpcheck: Update .health threshold of agent inside an agent-check

If an agent-check is configured for a server, When the response is parsed,
the .health threshold of the agent must be updated on up/down/stopped/fail
command and not the threshold of the health-check. Otherwise, the
agent-check will compete with the health-check and may mark a DOWN server as
UP.

This patch should fix the issue #1176. It must be backported as far as 2.2.

(cherry picked from commit 24ec9434271345857b42cc5bd9c6b497ab01a7e4)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: filters: Set CF_FL_ANALYZE on channels when filters are attached

CF_FL_ANALYZE flag is used to know a channel is filtered. It is important to
synchronize request and response channels when the filtering ends.

However, it is possible to call all request analyzers before starting the
filtering on the response channel. This means flt_end_analyze() may be
called for the request channel before flt_start_analyze() on the response
channel. Thus because CF_FL_ANALYZE flag is not set on the response channel,
we consider the filtering is finished on both sides. The consequence is that
flt_end_analyze() is not called for the response and backend filters are
unregistered before their execution on the response channel.

It is possible to encounter this bug on TCP frontend or CONNECT request on
HTTP frontend if the client shutdown is reveiced with the first read.

To fix this bug, CF_FL_ANALYZE is set when filters are attached to the
stream. It means, on the request channel when the stream is created, in
flt_stream_start(). And on both channels when the backend is set, in
flt_set_stream_backend().

This patch must be backported as far as 1.7.

(cherry picked from commit 5647fbacdf822f2a6812a07dbdf27660f436f103)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUILD: atomic/arm64: force the register pairs to use in __ha_cas_dw()

Since commit f8fb4f75f ("MINOR: atomic: implement a more efficient arm64
__ha_cas_dw() using pairs"), on some modern arm64 (armv8.1+) compiled
with -march=armv8.1-a under gcc-7.5.0, a build error may appear on ev_poll.o :

/tmp/ccHD2lN8.s:1771: Error: reg pair must start from even reg at operand 1 -- `casp x27,x28,x22,x23,[x12]'
Makefile:927: recipe for target 'src/ev_poll.o' failed

It appears that the compiler cannot always assign register pairs there
for a structure made of two u64. It was possibly later addressed since
gcc-9.3 never caused this, but there's no trivially available info on
the subject in the changelogs. Unsuprizingly, using a u128 instead does
fix this, but it significantly inflates the code (+4kB for just 6 places,
very likely that it loaded some extra stubs) and the comparison is ugly,
involving two slower conditional jumps instead of a single one and a
conditional comparison. For example, ha_random64() grew from 144 bytes
to 232.

However, simply forcing the base register does work pretty well, and
makes the code even cleaner and more efficient by further reducing it
by about 4.5kB, possibly because it helps the compiler to pick suitable
registers for the pair there. And the perf on 64-cores looks steadily
0.5% above the previous one, so let's do this.

Note that the commit above was backported to 2.3 to fix scalability
issues on AWS Graviton2 platform, so this one will need to be as well.

(cherry picked from commit 3b728a92bbe35b0c3ef34c3c8942ac5541a1713c)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: stick-tables: fix ref counter in table entry using multiple http tracksc.

Setting multiple http-request track-scX rules generates entries
which never expires.

If there was already an entry registered by a previous http rule
'stream_track_stkctr(&s->stkctr[rule->action], t, ts)' didn't
register the new 'ts' into the stkctr. And function is left
with no reference on 'ts' whereas refcount had been increased
by the '_get_entry'

The patch applies the same policy as the one showed on tcp track
rules and if there is successive rules the track counter keep the
first entry registered in the counter and nothing more is computed.

After validation this should be backported in all versions.

(cherry picked from commit 362d25e50708adf2b273bf9ee04f6566f113815e)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

OPTIM: task: automatically adjust the default runqueue-depth to the threads

The recent default runqueue size reduction appeared to have significantly
lowered performance on low-thread count configs. Testing various values
runqueue values on different workloads under thread counts ranging from
1 to 64, it appeared that lower values are more optimal for high thread
counts and conversely. It could even be drawn that the optimal value for
various workloads sits around 280/sqrt(nbthread), and probably has to do
with both the L3 cache usage and how to optimally interlace the threads'
activity to minimize contention. This is much easier to optimally
configure, so let's do this by default now.

(cherry picked from commit 060a7612487c244175fa1dc1e5b224015cbcf503)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: task: give the scheduler a bit more flexibility in the runqueue size

Instead of setting a hard-limit on runqueue-depth and keeping it short
to maintain fairness, let's allow the scheduler to automatically cut
the existing one in two equal halves if its size is between the configured
size and its double. This will allow to increase the default value while
keeping a low latency.

(cherry picked from commit 1691ba3693d4b11cc85aed606d39ffbd23d8533f)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: task: remove the tasks_run_queue counter and have one per thread

This counter is solely used for reporting in the stats and is the hottest
thread contention point to date. Moving it to the scheduler and having a
separate one for the global run queue dramatically improves the performance,
showing a 12% boost on the request rate on 16 threads!

In addition, the thread debugging output which used to rely on rqueue_size
was not totally accurate as it would only report task counts. Now we can
return the exact thread's run queue length.

It is also interesting to note that there are still a few other task/tasklet
counters in the scheduler that are not efficiently updated because some cover
a single area and others cover multiple areas. It looks like having a distinct
counter for each of the following entries would help and would keep the code
a bit cleaner:
  - global run queue (tree)
  - per-thread run queue (tree)
  - per-thread shared tasklets list
  - per-thread local lists

Maybe even splitting the shared tasklets lists between pure tasklets and
tasks instead of having the whole and tasks would simplify the code because
there remain a number of places where several counters have to be updated.

(cherry picked from commit 9c7b8085f4cad284642e7f67d2274f2fb568f243)
[wt: backported as it does have a significant impact on many-core machines;
with 48 threads, the request rate is simply doubled and the response time
halved; There were significant context updates, all verified to be OK; it
does not seem reasonable to backport it to 2.2]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: ssl: implement xprt_set_used and xprt_set_idle to relax context checks

Currently the SSL layer checks the validity of its tasklet's context just
in case it would have been stolen, had the connection been idle. Now it
will be able to be notified by the mux when this situation happens so as
not to have to grab the idle connection lock on each pass. This reuses the
TASK_F_USR1 flag just as the muxes do.

(cherry picked from commit 4149168255b1dc252aece0e5bbab6f55310f1ad5)
[wt: this is the second part of the takeover lock reduction, as the SSL
layer was also hammering this lock and slowing everyone down, especially
during initial handshakes. The changes are the same as for the muxes]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: xprt: add new xprt_set_idle and xprt_set_used methods

These functions are used on the mux layer to indicate that the connection
is becoming idle and that the xprt ought to be careful before checking the
context or that it's not idle anymore and that the context is safe. The
purpose is to allow a mux which is going to release a connection to tell
the xprt to be careful when touching it. At the moment, the xprt are
always careful and that's costly so we want to have the ability to relax
this a bit.

No xprt layer uses this yet.

(cherry picked from commit 4f8cd4397f07389129dd23b07e94018404340731)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: muxes: mark idle conns tasklets with TASK_F_USR1

The muxes are touching the idle_conns_lock all the time now because
they need to be careful that no other thread has stolen their tasklet's
context.

This patch changes this a little bit by setting the TASK_F_USR1 flag on
the tasklet before marking a connection idle, and removing it once it's
not idle anymore. Thanks to this we have the guarantee that a tasklet
without this flag cannot be present in an idle list and does not need
to go through this costly lock. This is especially true for front
connections.

(cherry picked from commit e388f2fbca40197590bd15dce0f4eb4d6cded20a)
[wt: backported as really needed to address the high contention issues
in multi-threaded environments: all I/O tasklets queue up on the
takeover lock as soon as there's some activity on the reuse part,
sometimes causing "reuse always" to be slower than "reuse never"!
The context differs quite a bit due to the changes in tasks and idle
conns in 2.4, but the main principle is to bypass the lock when
TASK_F_USR1 is not set. ]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: task: add an application specific flag to the state: TASK_F_USR1

This flag will be usable by any application. It will be preserved across
wakeups so the application can use it to do various stuff. Some I/O
handlers will soon benefit from this.

(cherry picked from commit 6fa8bcdc785c29cd3a29c14d2038f38b5547eff0)
[wt: needed for upcoming fixes. task->state is 16-bit in 2.3 and before
so we've reused bit 15 which was still available there. Ctx adjusments
since no TASK_F_TASKLET here]
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: ssl: properly remove the TASK_HEAVY flag at end of handshake

Emeric found that SSL+keepalive traffic had dropped quite a bit in the
recent changes, which could be bisected to recent commit 9205ab31d
("MINOR: ssl: mark the SSL handshake tasklet as heavy"). Indeed, a
first incarnation of this commit made use of the TASK_SELF_WAKING
flag but the last version directly used TASK_HEAVY, but it would still
continue to remove the already absent TASK_SELF_WAKING one instead of
TASK_HEAVY. As such, the SSL traffic remained processed with low
granularity.

No backport is needed as this is only 2.4.

(cherry picked from commit 4c48edba4f45bb78f41af7d79d3c176710fe6a90)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: ssl: mark the SSL handshake tasklet as heavy

There's a fairness issue between SSL and clear text. A full end-to-end
cleartext connection can require up to ~7.7 wakeups on average, plus 3.3
for the SSL tasklet, one of which is particularly expensive. So if we
accept to process many handshakes taking 1ms each, we significantly
increase the processing time of regular tasks just by adding an extra
delay between their calls. Ideally in order to be fair we should have a
1:18 call ratio, but this requires a bit more accounting. With very little
effort we can mark the SSL handshake tasklet as TASK_HEAVY until the
handshake completes, and remove it once done.

Doing so reduces from 14 to 3.0 ms the total response time experienced
by HTTP clients running in parallel to 1000 SSL clients doing full
handshakes in loops. Better, when tune.sched.low-latency is set to "on",
the latency further drops to 1.8 ms.

The tasks latency distribution explain pretty well what is happening:

Without the patch:
  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    ssl_sock_io_cb              2785375   19.35m    416.9us   5.401h    6.980ms
    h1_io_cb                    1868949   9.853s    5.271us   4.829h    9.302ms
    process_stream              1864066   7.582s    4.067us   2.058h    3.974ms
    si_cs_io_cb                 1733808   1.932s    1.114us   26.83m    928.5us
    h1_timeout_task              935760      -         -      1.033h    3.975ms
    accept_queue_process         303606   4.627s    15.24us   16.65m    3.291ms
    srv_cleanup_toremove_connections452   64.31ms   142.3us   2.447s    5.415ms
    task_run_applet                  47   5.149ms   109.6us   57.09ms   1.215ms
    srv_cleanup_idle_connections     34   2.210ms   65.00us   87.49ms   2.573ms

With the patch:
  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    ssl_sock_io_cb              3000365   21.08m    421.6us   20.30h    24.36ms
    h1_io_cb                    2031932   9.278s    4.565us   46.70m    1.379ms
    process_stream              2010682   7.391s    3.675us   22.83m    681.2us
    si_cs_io_cb                 1702070   1.571s    922.0ns   8.732m    307.8us
    h1_timeout_task             1009594      -         -      17.63m    1.048ms
    accept_queue_process         339595   4.792s    14.11us   3.714m    656.2us
    srv_cleanup_toremove_connections779   75.42ms   96.81us   438.3ms   562.6us
    srv_cleanup_idle_connections     48   2.498ms   52.05us   178.1us   3.709us
    task_run_applet                  17   1.738ms   102.3us   11.29ms   663.9us
    other                             1   947.8us   947.8us   202.6us   202.6us

  => h1_io_cb() and process_stream() are divided by 6 while ssl_sock_io_cb() is
     multipled by 4

And with low-latency on:
  $ socat - /tmp/sock1 <<< "show profiling"
  Per-task CPU profiling              : on      # set profiling tasks {on|auto|off}
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    ssl_sock_io_cb              3000565   20.96m    419.1us   20.74h    24.89ms
    h1_io_cb                    2019702   9.294s    4.601us   49.22m    1.462ms
    process_stream              2009755   6.570s    3.269us   1.493m    44.57us
    si_cs_io_cb                 1997820   1.566s    783.0ns   2.985m    89.66us
    h1_timeout_task             1009742      -         -      1.647m    97.86us
    accept_queue_process         494509   4.697s    9.498us   1.240m    150.4us
    srv_cleanup_toremove_connections1120   92.32ms   82.43us   463.0ms   413.4us
    srv_cleanup_idle_connections     70   2.703ms   38.61us   204.5us   2.921us
    task_run_applet                  13   1.303ms   100.3us   85.12us   6.548us

  => process_stream() is divided by 100 while ssl_sock_io_cb() is
     multipled by 4

Interestingly, the total HTTPS response time doesn't increase and even very
slightly decreases, with an overall ~1% higher request rate. The net effect
here is a redistribution of the CPU resources between internal tasks, and
in the case of SSL, handshakes wait bit more but everything after completes
faster.

This was made simple enough to be backportable if it helps some users
suffering from high latencies in mixed traffic.

(cherry picked from commit 9205ab31d2383f33612e27d8ce74ab62a27695fc)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: task: limit the number of subsequent heavy tasks with flag TASK_HEAVY

While the scheduler is priority-aware and class-aware, and consistently
tries to maintain fairness between all classes, it doesn't make use of a
fine execution budget to compensate for high-latency tasks such as TLS
handshakes. This can result in many subsequent calls adding multiple
milliseconds of latency between the various steps of other tasklets that
don't even depend on this.

An ideal solution would be to add a 4th queue, have all tasks announce
their estimated cost upfront and let the scheduler maintain an auto-
refilling budget to pick from the most suitable queue.

But it turns out that a very simplified version of this already provides
impressive gains with very tiny changes and could easily be backported.
The principle is to reserve a new task flag "TASK_HEAVY" that indicates
that a task is expected to take a lot of time without yielding (e.g. an
SSL handshake typically takes 700 microseconds of crypto computation).
When the scheduler sees this flag when queuing a tasklet, it will place
it into the bulk queue. And during dequeuing, we accept only one of
these in a full round. This means that the first one will be accepted,
will not prevent other lower priority tasks from running, but if a new
one arrives, then the queue stops here and goes back to the polling.
This will allow to collect more important updates for other tasks that
will be batched before the next call of a heavy task.

Preliminary tests consisting in placing this flag on the SSL handshake
tasklet show that response times under SSL stress fell from 14 ms
before the patch to 3.0 ms with the patch, and even 1.8 ms if
tune.sched.low-latency is set to "on".

(cherry picked from commit 74dea8caeadfaf3cac262fd4f2c532788c396fb5)
[wt: tasklet_wakeup_on() is in task.h in 2.3]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: backend: use a trylock when trying to grab an idle connection

In conn_backend_get() we can cause some extreme contention due to the
idle_conns_lock. Indeed, even though it's per-thread, it still causes
high contention when running with many threads. The reason is that all
threads which do not have any idle connections are quickly skipped,
till the point where there are still some, so the first reaching that
point will grab the lock and the other ones wait behind. From this
point, all threads are synchronized waiting on the same lock, and
will follow the leader in small jumps, all hindering each other.

Here instead of doing this we're using a trylock. This way when a thread
is already checking a list, other ones will continue to next thread. In
the worst case, a high contention will lead to a few new connections to
be set up, but this may actually be what is required to avoid contention
in the first place. With this change, the contention has mostly
disappeared on this lock (it's still present in muxes and transport
layers due to the takeover).

Surprisingly, checking for emptiness of the tree root before taking
the lock didn't address any contention.

A few improvements are still possible and desirable here. The first
one would be to avoid seeing all threads jump to the next one. We
could have each thread use a different prime number as the increment
so as to spread them across the entire table instead of keeping them
synchronized. The second one is that the lock in the muck layers
shouldn't be needed to check for the tasklet's context availability.

(cherry picked from commit b1adf03df970958e82ad33efadba3c2586ffc02f)
[wt: this is a major cause of contention in highly threaded environments,
hence the backport. Minor ctx adjustments (list vs tree) and lock name]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: pools: double the local pool cache size to 1 MB

The reason is that H2 can already require 32 16kB buffers for the mux
output at once, which will deplete the local cache. Thus it makes sense
to go further to leave some time to other connection to release theirs.
In addition, the L2 cache on modern CPUs is already 1 MB, so this change
is welcome in any case.

(cherry picked from commit f587003fe916315e77e130339e21fc8d3a4a7248)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: pools: add CONFIG_HAP_NO_GLOBAL_POOLS and CONFIG_HAP_GLOBAL_POOLS

We've reached a point where the global pools represent a significant
bottleneck with threads. On a 64-core machine, the performance was
divided by 8 between 32 and 64 H2 connections only because there were
not enough entries in the local caches to avoid picking from the global
pools, and the contention on the list there was very high. It becomes
obvious that we need to have an array of lists, but that will require
more changes.

In parallel, standard memory allocators have improved, with tcmalloc
and jemalloc finding their ways through mainstream systems, and glibc
having upgraded to a thread-aware ptmalloc variant, keeping this level
of contention here isn't justified anymore when we have both the local
per-thread pool caches and a fast process-wide allocator.

For these reasons, this patch introduces a new compile time setting
CONFIG_HAP_NO_GLOBAL_POOLS which is set by default when threads are
enabled with thread local pool caches, and we know we have a fast
thread-aware memory allocator (currently set for glibc>=2.26). In this
case we entirely bypass the global pool and directly use the standard
memory allocator when missing objects from the local pools. It is also
possible to force it at compile time when a good allocator is used with
another setup.

It is still possible to re-enable the global pools using
CONFIG_HAP_GLOBAL_POOLS, if a corner case is discovered regarding the
operating system's default allocator, or when building with a recent
libc but a different allocator which provides other benefits but does
not scale well with threads.

(cherry picked from commit 0bae075928250ba036cb1d96485a6e72bdb6283c)
[wt: backported to 2.3 since it addresses some serious contention in
h2 where streams are allocated in bursts]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MEDIUM: streams: do not use the streams lock anymore

The lock was still used exclusively to deal with the concurrency between
the "show sess" release handler and a stream_new() or stream_free() on
another thread. All other accesses made by "show sess" are already done
under thread isolation. The release handler only requires to unlink its
node when stopping in the middle of a dump (error, timeout etc). Let's
just isolate the thread to deal with this case so that it's compatible
with the dump conditions, and remove all remaining locking on the streams.

This effectively kills the streams lock. The measured gain here is around
1.6% with 4 threads (374krps -> 380k).

(cherry picked from commit 49de68520e9dae3817885397373151aee4901c1b)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: streams: use one list per stream instead of a global one

The global streams list is exclusively used for "show sess", to look up
a stream to shut down, and for the hard-stop. Having all of them in a
single list is extremely expensive in terms of locking when using threads,
with performance losses as high as 7% having been observed just due to
this.

This patch makes the list per-thread, since there's no need to have a
global one in this situation. All call places just iterate over all
threads. The most "invasive" changes was in "show sess" where the end
of list needs to go back to the beginning of next thread's list until
the last thread is seen. For now the lock was maintained to keep the
code auditable but a next commit should get rid of it.

The observed performance gain here with only 4 threads is already 7%
(350krps -> 374krps).

(cherry picked from commit a698eb6739b502d957da773e121ef40833f818dc)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: cli/streams: make "show sess" dump all streams till the new epoch

Instead of placing the current stream at the end of the stream list when
issuing a "show sess" on the CLI as was done in 2.2 with commit c6e7a1b8e
("MINOR: cli: make "show sess" stop at the last known session"), now we
compare the listed stream's epoch with the dumping stream's and stop on
more recent ones.

This way we're certain to always only dump known streams at the moment we
issue the dump command without having to modify the list. In theory we
could miss some streams if more than 2^31 "show sess" requests are issued
while an old stream remains present, but that's 68 years at 1 "show sess"
per second and it's unlikely we'll keep a process, let alone a stream, that
long.

It could be verified that the count of dumped streams still matches the
one before this change.

(cherry picked from commit 5d533e2bad668f10e5b00c24b7593847149a6be2)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: stream: add an "epoch" to figure which streams appeared when

The "show sess" CLI command currently lists all streams and needs to
stop at a given position to avoid dumping forever. Since 2.2 with
commit c6e7a1b8e ("MINOR: cli: make "show sess" stop at the last known
session"), a hack consists in unlinking the stream running the applet
and linking it again at the current end of the list, in order to serve
as a delimiter. But this forces the stream list to be global, which
affects scalability.

This patch introduces an epoch, which is a global 32-bit counter that
is incremented by the "show sess" command, and which is copied by newly
created streams. This way any stream can know whether any other one is
newer or older than itself.

For now it's only stored and not exploited.

(cherry picked from commit b981318c11ce8c1969b931b01d51be38dbbaa4a7)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: dynbuf: pass offer_buffers() the number of buffers instead of a threshold

Historically this function would try to wake the most accurate number of
process_stream() waiters. But since the introduction of filters which could
also require buffers (e.g. for compression), things started not to be as
accurate anymore. Nowadays muxes and transport layers also use buffers, so
the runqueue size has nothing to do anymore with the number of supposed
users to come.

In addition to this, the threshold was compared to the number of free buffer
calculated as allocated minus used, but this didn't work anymore with local
pools since these counts are not updated upon alloc/free!

Let's clean this up and pass the number of released buffers instead, and
consider that each waiter successfully called counts as one buffer. This
is not rocket science and will not suddenly fix everything, but at least
it cannot be as wrong as it is today.

This could have been marked as a bug given that the current situation is
totally broken regarding this, but this probably doesn't completely fix
it, it only goes in a better direction. It is possible however that it
makes sense in the future to backport this as part of a larger series if
the situation significantly improves.

(cherry picked from commit 4d77bbf8560b5d5b32409be131e4975811bfec28)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: dynbuf: use regular lists instead of mt_lists for buffer_wait

There's no point anymore in keeping mt_lists for the buffer_wait and
buffer_wq since it's thread-local now.

(cherry picked from commit 90f366b59556f9e64c927286bfc8e9c06d511744)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: dynbuf: make the buffer wait queue per thread

The buffer wait queue used to be global historically but this doest not
make any sense anymore given that the most common use case is to have
thread-local pools. Thus there's no point waking up waiters of other
threads after releasing an entry, as they won't benefit from it.

Let's move the queue head to the thread_info structure and use
ti->buffer_wq from now on.

(cherry picked from commit e8e5091510f667feeb36f68b8d2f4d9e8cf00dc0)
Signed-off-by: Willy Tarreau <w@1wt.eu>

OPTIM: lb-leastconn: do not unlink the server if it did not change

Due to the two-phase server reservation, there are 3 calls to
fwlc_srv_reposition() per request, one during assign_server() to reserve
the slot, one in connect_server() to commit it, and one in process_stream()
to release it. However only one of the first two will change the key, so
it's needlessly costly to take the lock, remove a server and insert it
again at the same place when we can already figure we ought not to even
have taken the lock.

Furthermore, even when the server needs to move, there can be quite some
contention on the lbprm lock forcing the thread to wait. During this time
the served and nbpend server values might have changed, just like the
lb_node.key itself. Thus we measure the values again under the lock
before updating the tree. Measurements have shown that under contention
with 16 servers and 16 threads, 50% of the updates can be avoided there.

This patch makes the function compute the new key and compare it to
the current one before deciding to move the entry (and does it again
under the lock forthe second test).

This removes between 40 and 50% of the tree updates depending on the
thread contention and the number of servers. The performance gain due
to this (on 16 threads) was:

16 servers: 415 krps -> 440 krps (6%, contention on lbprm)
4 servers: 554 krps -> 714 krps (+29%, little contention)

One point worth thinking about is that it's not logic to update the
tree 2-3 times per request while it's only read once. half to 2/3 of
these updates are not needed. An experiment consisting in subscribing
the server to a list and letting the readers reinsert them on the fly
showed further degradation instead of an improvement.

A better approach would probably consist in avoinding writes to shared
cache lines by having the leastconn nodes distinct from the servers,
with one node per value, and having them hold an mt-list with all the
servers having that number of connections. The connection count tree
would then be read-mostly instead of facing heavy writes, and most
write operations would be performed on 1-3 list heads which are way
cheaper to migrate than a tree node, and do not require updating the
last two updated neighbors' cache lines.

(cherry picked from commit 5064ab6a982a82c275393220716b75c13789de89)
Signed-off-by: Willy Tarreau <w@1wt.eu>

OPTIM: lb-leastconn: do not take the server lock on take_conn/drop_conn

The operations are only an insert and a delete into the LB tree, which
doesn't require the server's lock at all as the lbprm lock is already
held. Let's drop it. Just for the sake of cleanness, given that the
served and nbpend values used to be atomically updated, we'll use an
atomic load to read them.

(cherry picked from commit 85b2fb03580f60d20d20ecbda8c9b63804d8e900)
Signed-off-by: Willy Tarreau <w@1wt.eu>

OPTIM: lb-first: do not take the server lock on take_conn/drop_conn

The operations are only an insert and a delete into the LB tree, which
doesn't require the server's lock at all as the lbprm lock is already
held. Let's drop it.

(cherry picked from commit 6b96e0e9d204bb7e30222964d641cc28a0e4d0c0)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: lb/api: let callers of take_conn/drop_conn tell if they have the lock

The two algos defining these functions (first and leastconn) do not need the
server's lock. However it's already present in pendconn_process_next_strm()
so the API must be updated so that the functions may take it if needed and
that the callers indicate whether they already own it.

As such, the call places (backend.c and stream.c) now do not take it
anymore, queue.c was unchanged since it's already held, and both "first"
and "leastconn" were updated to take it if not already held.

A quick test on the "first" algo showed a jump from 432 to 565k rps by
just dropping the lock in stream.c!

(cherry picked from commit 59b0fecfd9f2fe96f6a9c2d1d4fbe34f71adb6e3)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: server: move actconns to the per-thread structure

The actconns list creates massive contention on low server counts because
it's in fact a list of streams using a server, all threads compete on the
list's head and it's still possible to see some watchdog panics on 48
threads under extreme contention with 47 threads trying to add and one
thread trying to delete.

Moving this list per thread is trivial because it's only used by
srv_shutdown_streams(), which simply required to iterate over the list.

The field was renamed to "streams" as it's really a list of streams
rather than a list of connections.

(cherry picked from commit d4e78d873cecf9938885c90e736f8b761a35fb55)
[wt: for 2.3 and older, this is a simplified version. We don't want to
touch all the server initialization code and sequence with the risks
of causing new trouble, so instead we declare actconns as an array
of the maximum number of supported threads, this will eat a bit more
memory on smaller machines but will remain safe. Those building with
MAX_THREADS close or equal to their target number of threads may even
save some RAM compared to 2.4. The only servers not initialized via
new_server() are the two Lua socket servers, and they've been handled
here]
Signed-off-by: Willy Tarreau <w@1wt.eu>

OPTIM: server: switch the actconn list to an mt-list

The remaining contention on the server lock solely comes from
sess_change_server() which takes the lock to add and remove a
stream from the server's actconn list. This is both expensive
and pointless since we have mt-lists, and this list is only
used by the CLI's "shutdown server sessions" command!

Let's migrate to an mt-list and remove the need for this costly
lock. By doing so, the request rate increased by ~1.8%.

(cherry picked from commit 751153e0f119bec90455cda95166f1b29d8b0326)
[wt: the contention is extreme on high threads counts, thus this
actconn series is backported; ctx adjustments]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: listener: refine the default MAX_ACCEPT from 64 to 4

The maximum number of connections accepted at once by a thread for a single
listener used to default to 64 divided by the number of processes but the
tasklet-based model is much more scalable and benefits from smaller values.
Experimentation has shown that 4 gives the highest accept rate for all
thread values, and that 3 and 5 come very close, as shown below (HTTP/1
connections forwarded per second at multi-accept 4 and 64):

ac\thr|    1     2    4     8     16
------+------------------------------
      4|   80k  106k  168k  270k  336k
     64|   63k   89k  145k  230k  274k

Some tests were also conducted on SSL and absolutely no change was observed.

The value was placed into a define because it used to be spread all over the
code.

It might be useful at some point to backport this to 2.3 and 2.2 to help
those who observed some performance regressions from 1.6.

(cherry picked from commit 66161326fd120652557d419810b1323405d7859d)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: tasks: refine the default run queue depth

Since a lot of internal callbacks were turned to tasklets, the runqueue
depth had not been readjusted from the default 200 which was initially
used to favor batched processing. But nowadays it appears too large
already based on the following tests conducted on a 8c16t machine with
a simple config involving "balance leastconn" and one server. The setup
always involved the two threads of a same CPU core except for 1 thread,
and the client was running over 1000 concurrent H1 connections. The
number of requests per second is reported for each (runqueue-depth,
nbthread) couple:

rq\thr|    1     2     4     8    16
------+------------------------------
     32|  120k  159k  276k  477k  698k
     40|  122k  160k  276k  478k  722k
     48|  121k  159k  274k  482k  720k
     64|  121k  160k  274k  469k  710k
    200|  114k  150k  247k  415k  613k  <-- default

It's possible to save up to about 18% performance by lowering the
default value to 40. One possible explanation to this is that checking
I/Os more frequently allows to flush buffers faster and to smooth the
I/O wait time over multiple operations instead of alternating phases
of processing, waiting for locks and waiting for new I/Os.

The total round trip time also fell from 1.62ms to 1.40ms on average,
among which at least 0.5ms is attributed to the testing tools since
this is the minimum attainable on the loopback.

After some observation it would be nice to backport this to 2.3 and
2.2 which observe similar improvements, since some users have already
observed some perf regressions between 1.6 and 2.2.

(cherry picked from commit 4327d0ac00fd29622e8558a206dfbf23eaeb9fe9)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: session: NULL dereference possible when accessing the listener

When implementing a client applet, a NULL dereference was encountered on
the error path which increment the counters.

Indeed, the counters incremented are the one in the listener which does
not exist in the case of client applets, so in sess->listener->counters,
listener is NULL.

This patch fixes the access to the listener structure when accessing
from a sesssion, most of the access are the counters in error paths.

Must be backported as far as 1.8.

(cherry picked from commit 36119de182154b1f87e0cdf4bd1efba9e2e64113)
[wt: minor ctx adjustments in http_ana and mux_h1]
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: atomic: implement a more efficient arm64 __ha_cas_dw() using pairs

There finally is a way to support register pairs on aarch64 assembly
under gcc, it's just undocumented, like many of the options there :-(
As indicated below, it's possible to pass "%H" to mention the high
part of a register pair (e.g. "%H0" to go with "%0"):

https://patchwork.ozlabs.org/project/gcc/patch/59368A74.2060908@foss.arm.com/

By making local variables from pairs of registers via a struct (as
is used in IST for example), we can let gcc choose the correct register
pairs and avoid a few moves in certain situations. The code is now slightly
more efficient than the previous one on AWS' Graviton2 platform, and
noticeably smaller (by 4.5kB approx). A few tests on older releases show
that even Linaro's gcc-4.7 used to support such register pairs and %H,
and by then ATOMICS were not supported so this should not cause build
issues, and as such this patch replaces the earlier implementation.

(cherry picked from commit f8fb4f75f147053fc939ab9bcc456c0e4e5b28a9)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: atomic: add armv8.1-a atomics variant for cas-dw

This variant uses the CASP instruction available on armv8.1-a CPU cores,
which is detected when __ARM_FEATURE_ATOMICS is set (gcc-linaro >= 7,
mainline >= 9). This one was tested on cortex-A55 (S905D3) and on AWS'
Graviton2 CPUs.

The instruction performs way better on high thread counts since it
guarantees some forward progress when facing extreme contention while
the original LL/SC approach is light on low-thread counts but doesn't
guarantee progress.

The implementation is not the most optimal possible. In particular since
the instruction requires to work on register pairs and there doesn't seem
to be a way to force gcc to emit register pairs, we have to decide to force
to use the pair (x0,x1) to store the old value, and (x2,x3) to store the
new one, and this necessarily involves some extra moves. But at least it
does improve the situation with 16 threads and more. See issue #958 for
more context.

Note, a first implementation of this function was making use of an
input/output constraint passed using "+Q"(*(void**)target), which was
resulting in smaller overall code than passing "target" as an input
register only. It turned out that the cause was directly related to
whether the function was inlined or not, hence the "forceinline"
attribute. Any changes to this code should still pay attention to this
important factor.

(cherry picked from commit 46cca8690029dcdd13c86894aefdf2ef2d6f6b07)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: mt-list: always perform a cpu_relax call on failure

On highly threaded machines it is possible to occasionally trigger the
watchdog on certain contended areas like the server's connection list,
because while the mechanism inherently cannot guarantee a constant
progress, it lacks CPU relax calls which are absolutely necessary in
this situation to let a thread finish its job.

The loop's "while (1)" was changed to use a "for" statement calling
__ha_cpu_relax() as its continuation expression. This way the "continue"
statements jump to the unique place containing the pause without
excessively inflating the code.

This was sufficient to definitely fix the problem on 64-core ARM Graviton2
machines. This patch should probably be backported once it's confirmed it
also helps on many-cores x86 machines since some people are facing
contention in these environments. This patch depends on previous commit
"REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h".

An attempt was made to first read the value before exchanging, and it
significantly degraded the performance. It's very likely that this caused
other cores to lose exclusive ownership on their line and slow down their
next xchg operation.

In addition it was found that MT_LIST_ADD is significantly faster than
MT_LIST_ADDQ under high contention, because it fails one step earlier
when conflicting with an adjacent MT_LIST_DEL(). It might be worth
switching some operations' order to favor MT_LIST_ADDQ() instead.

(cherry picked from commit 168fc5332c7b3f43c8841a999fc40a3acef85223)
Signed-off-by: Willy Tarreau <w@1wt.eu>

REORG: atomic: reimplement pl_cpu_relax() from atomic-ops.h

There is some confusion here as we need to place some cpu_relax statements
in some loops where it's not easily possible to condition them on the use
of threads. That's what atomic.h already does. So let's take the various
pl_cpu_relax() implementations from there and place them in atomic.h under
the name __ha_cpu_relax() and let them adapt to the presence or absence of
threads and to the architecture (currently only x86 and aarch64 use a barrier
instruction), though it's very likely that arm would work well with a cache
flushing ISB instruction as well).

This time they were implemented as expressions returning 1 rather than
statements, in order to ease their placement as the loop condition or the
continuation expression inside "for" loops. We should probably do the same
with barriers and a few such other ones.

(cherry picked from commit 958ae26c3558f0a5cdcb7a92cc535f1cd1ac9a64)
[wt: will be used by later fixes]
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: ssl: don't truncate the file descriptor to 16 bits in debug mode

Errors reported by ssl_sock_dump_errors() to stderr would only report the
16 lower bits of the file descriptor because it used to be casted to ushort.
This can be backported to all versions but has really no importance in
practice since this is never seen.

(cherry picked from commit 566cebc1fc4f9908a47a1924c80ff32460543a49)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: hlua: Don't strip last non-LWS char in hlua_pushstrippedstring()

hlua_pushstrippedstring() function strips leading and trailing LWS
characters. But the result length it too short by 1 byte. Thus the last
non-LWS character is stripped. Note that a string containing only LWS
characters resulting to a stipped string with an invalid length (-1). This
leads to a lua runtime error.

This bug was reported in the issue #1155. It must be backported as far as
1.7.

(cherry picked from commit 2ec4e3c1acf95bcdc56028bbefe1a355c457b978)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: backend: fix condition for reuse on mode HTTP

This commit is a fix/complement to the following one :
08d87b3f49867440f66aee09173c84bf58cbc859
BUG/MEDIUM: backend: never reuse a connection for tcp mode

It fixes the check for the early insertion of backend connections in
the reuse lists if the backend mode is HTTP.

The impact of this bug seems limited because :
- in tcp mode, no insertion is done in the avail list as mux_pt does not
  support multiple streams.
- in http mode, muxes are also responsible to insert backend connections
  in lists in their detach functions. Prior to this fix the reuse rate
  could be slightly inferior.

It can be backported to 2.3.

(cherry picked from commit 249f0562cf2654488100d83d66c73902a2b1eb6c)
Signed-off-by: Willy Tarreau <w@1wt.eu>

[RELEASE] Released version 2.3.6

Released version 2.3.6 with the following main changes :
    - MINOR: check: do not ignore a connection header for http-check send
    - BUILD: ssl: fix typo in HAVE_SSL_CTX_ADD_SERVER_CUSTOM_EXT macro
    - BUILD: ssl: guard SSL_CTX_add_server_custom_ext with special macro
    - BUILD: ssl: guard SSL_CTX_set_msg_callback with SSL_CTRL_SET_MSG_CALLBACK macro
    - BUG/MINOR: intops: fix mul32hi()'s off-by-one
    - BUG/MINOR: http-ana: Don't increment HTTP error counter on internal errors
    - BUG/MEDIUM: mux-h1: Always set CS_FL_EOI for response in MSG_DONE state
    - BUG/MINOR: server: re-align state file fields number
    - BUG/MINOR: tools: Fix a memory leak on error path in parse_dotted_uints()
    - BUG/MINOR: backend: hold correctly lock when killing idle conn
    - BUG/MINOR: server: Fix server-state-file-name directive
    - CLEANUP: deinit: release global and per-proxy server-state variables on deinit
    - BUG/MEDIUM: config: don't pick unset values from last defaults section
    - BUG/MINOR: stats: revert the change on ST_CONVDONE
    - BUG/MINOR: cfgparse: do not mention "addr:port" as supported on proxy lines
    - BUG/MINOR: server: Don't call fopen() with server-state filepath set to NULL
    - DOC: tune: explain the origin of block size for ssl.cachesize
    - CLEANUP: channel: fix comment in ci_putblk.
    - BUG/MINOR: server: Remove RMAINT from admin state when loading server state
    - BUG/MINOR: session: atomically increment the tracked sessions counter
    - BUG/MINOR: checks: properly handle wrapping time in __health_adjust()
    - BUG/MEDIUM: checks: don't needlessly take the server lock in health_adjust()
    - BUG/MINOR: sample: Always consider zero size string samples as unsafe
    - BUILD: ssl: introduce fine guard for OpenSSL specific SCTL functions
    - DOC: explain the relation between pool-low-conn and tune.idle-pool.shared
    - BUG/MEDIUM: lists: Avoid an infinite loop in MT_LIST_TRY_ADDQ().
    - BUG/MEDIUM: spoe: Resolve the sink if a SPOE logs in a ring buffer
    - BUG/MINOR: http-rules: Always replace the response status on a return action
    - BUG/MINOR: server: Init params before parsing a new server-state line
    - BUG/MINOR: server: Be sure to cut the last parsed field of a server-state line
    - BUG/MEDIUM: mux-h1: Fix handling of responses to CONNECT other than 200-ok
    - BUG/MINOR: ssl/cli: potential null pointer dereference in "set ssl cert"
    - MINOR: Configure the `cpp` userdiff driver for *.[ch] in .gitattributes
    - BUG/MINOR: sample: secure convs that accept base64 string and var name as args
    - BUG/MEDIUM: vars: make functions vars_get_by_{name,desc} thread-safe
    - BUG/MEDIUM: proxy: use thread-safe stream killing on hard-stop
    - BUG/MEDIUM: cli/shutdown sessions: make it thread-safe
    - BUG/MINOR: proxy: wake up all threads when sending the hard-stop signal
    - BUG/MINOR: fd: properly wait for !running_mask in fd_set_running_excl()
    - BUG/MINOR: resolvers: Fix condition to release received ARs if not assigned
    - BUG/MINOR: resolvers: Only renew TTL for SRV records with an additional record
    - BUG/MINOR: resolvers: new callback to properly handle SRV record errors
    - BUG/MEDIUM: resolvers: Reset server address and port for obselete SRV records
    - BUG/MEDIUM: resolvers: Reset address for unresolved servers
    - BUG/MINOR: ssl: potential null pointer dereference in ckchs_dup()
    - CLEANUP: muxes: Remove useless if condition in show_fd function
    - BUG/MINOR: stats: fix compare of no-maint url suffix
    - BUG/MINOR: mux-h1: Immediately report H1C errors from h1_snd_buf()
    - BUG/MINOR: http-ana: Only consider dst address to process originalto option
    - BUG/MINOR: tcp-act: Don't forget to set the original port for IPv4 set-dst rule
    - BUG/MINOR: connection: Use the client's dst family for adressless servers
    - BUG/MEDIUM: spoe: Kill applets if there are pending connections and nbthread > 1
    - DOC: spoe: Add a note about fragmentation support in HAProxy
    - BUG/MINOR: mux-h2: Fix typo in scheme adjustment
    - BUG/MINOR: http-ana: Don't increment HTTP error counter on read error/timeout

BUG/MINOR: http-ana: Don't increment HTTP error counter on read error/timeout

This should have been fixed when the commit "BUG/MINOR: http-ana: Don't
increment HTTP error counter on internal errors" was backported but I forgot
to do so. The HTTP error counter must not be incremented if a read error or
a read timeout is encountered. Parsing error are already reported by the
mux.

This patch must be backported as far as 2.0, on the HTX part only.

BUG/MINOR: mux-h2: Fix typo in scheme adjustment

That comma should've been a semicolon. Fortunately, as it is now there
is no impact thanks to operators precedence, and all expressions are
properly evaluated. But this is troubling and the risk is high to
turn it into an effective bug with a minor change.

Introduced in b8ce8905cf63ecd06b36af39c05103fadf3cc347 which first
appeared in 2.1-dev3. This fix must be backported to 2.1+.

(cherry picked from commit a3298023b04923ba12429d79c559dc7a850ae122)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

DOC: spoe: Add a note about fragmentation support in HAProxy

Add a note in SPOE.txt to make it clear that HAPRoxy does not support the
fragmentation. It can send fragmented frames if an agent supports it but it
cannot receives and handles fragmented frames.

This patch should fix the issue #659. It may be backported as far as 1.8.

(cherry picked from commit 9536ad707f390c8026777ec8e074a5edaa2f7c7a)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MEDIUM: spoe: Kill applets if there are pending connections and nbthread > 1

When the processing stage is finished for a SPOE applet, before returning it
into the idle list, we check if the assigned server appears as full or if
there are some pending connections on the backend or the assigned server. If
yes, it means we reach a maxconn and we close the applet to free a
slot. Otherwise, the applet can be reused. This test is only performed if
there are more than one thread.

It is important to close SPOE applets when there are pending connections for
multithreaded instances because connections with the SPOE agents are
persistent and local to a thread (applets are local to a thread). If a
maxconn is configured, some threads may take all available slots for a
while, leaving remaining threads without any free slot to process SPOE
messages. It is especially true if the maxconn is low.

This patch should fix the issue #705. It must be backported as far as
1.8. However, the code in 1.8 is quite different, a test must be performed
to be sure it works well.

(cherry picked from commit 9e647e5af77faa6a95dd511d78ed8763781e764a)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: connection: Use the client's dst family for adressless servers

When the selected server has no address, the destination address of the
client is used. However, for now, only the address is set, not the
family. Thus depending on how the server is configured and the client's
destination address, the server address family may be wrong.

For instance, with such server :

server srv 0.0.0.0:0

The server address family is AF_INET. The server connection will fail if a
client is asking for an IPv6 destination.

To fix the bug, we take care to set the rigth family, the family of the
client destination address.

This patch should fix the issue #202. It must be backported to all stable
versions.

(cherry picked from commit ae3056157c17b181c754556836a5d8336fb9823f)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: tcp-act: Don't forget to set the original port for IPv4 set-dst rule

If an IPv4 is set via a TCP/HTTP set-dst rule, the original port must be
preserved or set to 0 if the previous family was neither AF_INET nor
AF_INET6. The first case is not an issue because the port remains the
same. But if the previous family was, for instance, AF_UNIX, the port is not
set to 0 and have an undefined value.

This patch must be backported as far as 1.7.

(cherry picked from commit e01ca0fbc9c72de95514816e016a58c5a28ab2a8)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: http-ana: Only consider dst address to process originalto option

When an except parameter is used for originalto option, only the destination
address must be evaluated. Especially, the address family of the destination
must be tested and not the source one.

This patch must be backported to all stable versions. However be careful,
depending the versions the code may be slightly different.

(cherry picked from commit cccded98c7b3f2281fe4481895b1392e78cc5df0)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: mux-h1: Immediately report H1C errors from h1_snd_buf()

In case an H1 stream tries to send while on error occurred on its underlying
H1 connection, we must report the error. The way the stream-interface is
synchronously notified of the error. It seems to only be a problem on the
2.0. Probably because the scheduling has changed in upper versions. On the
2.0, it prevent the stream to be notified of errors, when for instance, a
payload is found in a response to a HEAD request. Not always though.

This patch must be backported as far as 2.0 because, on 2.0, it should fix the
issue #1101. There is no upstream ID for this commit because on the 2.4, this
fix already exists, it is part of non-backportable commit.

BUG/MINOR: stats: fix compare of no-maint url suffix

Only the first 3 characters are compared for ';no-maint' suffix in
http_handle_stats. Fix it by doing a full match over the entire suffix.

As a side effect, the ';norefresh' suffix matched the inaccurate
comparison, so the maintenance servers were always hidden on the stats
page in this case.

no-maint suffix is present since commit
3e320367014c742814ba494594cdb8340b1161f1
MINOR: stats: also support a "no-maint" show stat modifier

It should be backported up to 2.3.

This fixes github issue #1147.

(cherry picked from commit 91e55ea3f3f6277d91c0fd85ea924ea0e444f7f3)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

CLEANUP: muxes: Remove useless if condition in show_fd function

In H1, H2 and FCGI muxes, in the show_fd function, there is duplicated test on
the stream's subs field.

This patch fixes the issue #1142. It may be backported as far as 2.2.

(cherry picked from commit 6c93c4ef089fc79ce4e85fa0d9e7f61720291dba)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>