Repositories - haproxy-3.1.git/log

CLEANUP: quic: Rename some BBR functions in relation with bw probing

Rename bbr_is_probing_bw() to bbr_is_in_a_probe_state() and
bbr_is_accelerating_probing_bw() to bbr_is_probing_bw() to match
the function names of the BBR v3 internet draft.

Must be backported to 3.1 to ease any further backport to come.

(cherry picked from commit 949bc18f660d860b9102360d2b98d0c96e60266c)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: quic: missing Startup accelerating probing bw states

Startup state is also a probing with acceleration bandwidth state.
This modification should have come with this previous one:

BUG/MINOR: quic: reduce packet losses at least during ProbeBW_CRUISE (BBR)

Must be backported to 3.1.

(cherry picked from commit 0dc0c890ea20f3af76ae4f4f7422718bcd8f6605)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

REGTESTS: ssl: add a PEM with mix of LF and CRLF line endings

User tried to update a PEM, generated automatically. Part of this PEM has LF
line endings, and another part (CA certificate), added by some API, has CRLF
line endings. This has revealed a bug in cli_snd_buf(), see more
details in issue GitHUB #2818. So, let's add an example of such PEM in our
SSL regtest.

(cherry picked from commit ea4a148a7d9ead2132757bffd4699146e80da2df)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: cli: cli_snd_buf: preserve \r\n for payload lines

cli_snd_buf() analyzez input line by line. Before this patch it has always
scanned a given line for the presence of '\r' followed by '\n'.

This is only needed for strings, that contain the commands itself like
"show ssl cert\n", "set ssl cert test.pem <<\n".

In case of strings, which contain the command's payload, like
"-----BEGIN CERTIFICATE-----\r\n", '\r\n' should be preserved
as is.

This patch fixes the GitHub issue #2818.

This patch should be backported in v3.1 and in v3.0.

(cherry picked from commit d60c893991699af1103bcdcd33c915885c056559)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: quic: too permissive exit condition for high loss detection in Startup (BBR)

This bug fixes the 3rd condition used by bbr_check_startup_high_loss() to decide
it has detected some high loss as mentioned by the BBR v3 RFC draft:

   4.3.1.3. Exiting Startup Based on Packet Loss
   ...
   There are at least BBRStartupFullLossCnt=6 discontiguous sequence ranges lost in that round trip.

where a <= operator was used in place of <.

Must be backported to 3.1.

(cherry picked from commit 178109f608ae1e446d9ea898bf5c83e4e8313bb5)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: quic: fix the wrong tracked recovery start time value

bbr_congestion_event() role is to track the start time of recovery periods.
This was done using <ts> passed as parameter. But this parameter is the
time the newest lost packet has been sent.
The timestamp value to store in ->recovery_start_ts is <now_ms>.

Must be backported to 3.1.

(cherry picked from commit e61b4189078f97d036952371c7fc9e5ceefdfff0)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

CLEANUP: quic: remove a wrong comment about ->app_limited (drs)

->app_limited quic_drs struct member is not a boolean. This is
the index of the last transmitted packet marked as application-limited, or 0 if
the connection is not currently application-limited (see C.app_limited
definition in BBR v3 draft).

(cherry picked from commit e1d25cdbddfa344ceee7dcecd6a84eff4a209122)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: quic: reduce the private data size of QUIC cc algos

After these commits:

BUG/MINOR: quic: remove max_bw filter from delivery rate sampling
BUG/MINOR: quic: fix BBB max bandwidth oscillation issue

where some members were removed from bbr struct, the private data
size of QUIC cc algorithms may be reduced from 160 to 144 uint32_t.

Should be easily backported to 3.1 alonside the commits mentioned above.

(cherry picked from commit eeaeb412dc978bdc9663061d9d2542dc1f7310cb)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: quic: reduce packet losses at least during ProbeBW_CRUISE (BBR)

Upon congestion events (for a instance packet loss),
bbr_adapt_lower_bounds_from_congestion() role is to adapt some BBR internal
variables in relation with the estimated bandwidth (BBR.bw).

According to the BBR v3 draft, this function should do nothing
if BBRIsProbingBW() pseudo-code returns true. That said, this function
is not defined by the BBR v3 draft. But according to this part mentioned before
defining the pseudo-code for BBRAdaptLowerBoundsFromCongestion():

4.5.10.3. When not Probing for Bandwidth
When not explicitly accelerating to probe for bandwidth (Drain, ProbeRTT,
ProbeBW_DOWN, ProbeBW_CRUISE), BBR responds to loss by slowing down to some extent.
This is because loss suggests that the available bandwidth and safe volume of
in-flight data may have decreased recently, and the flow needs to adapt, slowing
down toward the latest delivery process. BBR flows implement this response by
reducing the short-term model parameters, BBR.bw_lo and BBR.inflight_lo.

BBRIsProbingBW() should concern the accelerating probe for bandwidth states
which are BBR_ST_PROBE_BW_REFILL and BBR_ST_PROBE_BW_UP.

Adapt the code to match this latter assumption. At least this reduce
drastically the packet loss volumes at least during ProbeBW_CRUISE.

As an example, on a 100MBits/s internet link with ~94ms as RTT, before
this patch, 4329640 sent packets were needed with 1617119 lost packets (!!!) to
download a 3GB object. After this patch, 2843952 sent packets vs 144134 lost packets
are needed. There may be some packet loss issue. I suspect the maximum bandwidth
which may be overestimated. More this is the case, more the packet loss is big.
That said, at this time, it remains below 5% depending on the size of the objects,
5% being for more than 2GB objects.

Must be backported to 3.1.

(cherry picked from commit 9813de0537882355514794eed7b738bed1806954)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: quic: underflow issue for bbr_inflight_hi_from_lost_packet()

Add a test to ensure that values of a local variable used by
bbr_inflight_hi_from_lost_packet() is not be impacted by underflow issues
when subtracting too big numbers and make this function return a correct value.

Must be backported to 3.1.

(cherry picked from commit ebfc301d5d8dbae0c07a9d937bef18d5c3f841c6)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: quic: remove max_bw filter from delivery rate sampling

This filter is no more needed after this commit:

BUG/MINOR: quic: fix BBB max bandwidth oscillation issue.

Indeed, one added this filter at delivery rate sampling level to filter
the BBR max bandwidth estimations and was inspired from ngtcp2 code source when
trying to fix the oscillation issue. But this BBR max bandwidth oscillation issue
was fixed by the aforementioned commit.

Furthermore this code tends to always increment the BBR max bandwidth. From my point
of view, this is not a good idea at all.

Must be backported to 3.1.

(cherry picked from commit 22ab45a3a84d0a9a61cc90c6efedd893dd1cbe4a)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: quic: wrong bbr_target_inflight() implementation

This bug arrived with this commit:

6404b7a18a BUG/MINOR: quic: fix bbr_inflight() calls with wrong gain value

This patch partially reverts after having checked the BBR v3 draft.
This bug was invisible when testing long BBR flows.

Must be backported to 3.1.

(cherry picked from commit 2bcd5b4cba4e9f123860560de0807312b909b306)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: quic: fix BBB max bandwidth oscillation issue.

Remove the code in relation with BBR.ack_phase as per this commit:
https://github.com/ietf-wg-ccwg/draft-ietf-ccwg-bbr/pull/5/commits/ee98c12ad6f0e93153656218a7df1b1ef92618d7

I do now kwow at this time why such a request was pushed on GH for the BBR v3 draft
pseudo-code. That said, the use of such an ack phase seemed confusing, adding much
more information about a BBR flow state than needed. Indeed, the ack phase
state is modified several times in the BBR draft pseudo-code but only used to
decide if the max bandwidth filter virtual clock had to be incremented by
BBRAdvanceMaxBwFilter().

In addition to this, when discussing about haproxy BBR implementation with
Neal Cardwell on the BBR development google group about an oscillation issue
of the max bandwidth (BBR.max_bw), I concluded that this was due to the fact
that its filter virutal clock was too often update, due to the ack phase wich
was stalled in BBR_ACK_PHASE_ACKS_PROBE_STOPPING state for too long. This is
where Neal asked me to test the aforementioned commit. This definitively
makes the max bandwidth (BBR.max_bw) oscillation issue disappear.

Another solution would have been to add a new ack phase enum afer
BBR_ACK_PHASE_ACKS_PROBE_STOPPING. BBR_ACK_PHASE_ACKS_PROBE_STOPPED
would have been a good candidate.

Remove the code in relation with BBR.ack_phase.

Must be backported to 3.1.

(cherry picked from commit b47e1e65df1b468bcd84f02fbc1eb4e1fd7e2bc9)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

BUG/MINOR: quic: wrong logical statement in in_recovery_period() (BBR)

A && logical operator was badly replaced by a || in this function which decides
if BBR is in a recovery period.

Must be backported to 3.1.

(cherry picked from commit 1dbf6b8bedadd5d287ab44e1a387a89e2fc6c70b)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

MINOR: window_filter: rely on the time to update the filter samples (QUIC/BBR)

The windowed filters are used only the BBR implementation for QUIC to filter
the maximum bandwidth samples for its estimation over a virtual time interval
tracked by counting the cyclical progression through ProbeBW cycles. ngtcp2
and quiche use such windowed filters in their BBR implementation. But in a
slightly different way. When updating the 2nd or 3rd filter samples, this
is done based on their values in place of the time they have been sampled.
It seems more logical to rely on the sample timestamps even if this has no
implication because when a sample is updated using another sample because it
has the same value, they have both the same timestamps!

This patch modifies two statements which compare two consecutive filter samples
based on their values (smp[]->v) by statements which compare them based on the
virtual time they have been sampled (smp[]->t). This fully complies which the
code used by the Linux kernel in lib/win_minmax.c.

Alo take the opportunity of this patch to shorten some statements using <smp>
local variable value to update smp[2] sample in place of initializing its two
members with the <smp> member values.

This patch SHOULD be easily backported to 3.1 where BBR was first implemented.

(cherry picked from commit a9a2f98f86a93dbefca636959234df161b572c21)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

[RELEASE] Released version 3.1.1

Released version 3.1.1 with the following main changes :
    - BUG/MEDIUM: sock: Remove FD_POLL_HUP during connect() if FD_POLL_ERR is not set
    - MINOR: proxy: Add support of 421-Misdirected-Request in retry-on status
    - BUG/MINOR: log: fix lf_text() behavior with empty string
    - BUG/MINOR: improve BBR throughput on very fast links
    - BUG/MEDIUM: event_hdl: fix uninitialized value in async mode when no data is provided
    - BUG/MEDIUM: quic: prevent stream freeze on pacing
    - BUG/MEDIUM: http-ana: Reset request flag about data sent to perform a L7 retry
    - BUG/MINOR: h1-htx: Use default reason if not set when formatting the response
    - BUILD: quic: fix a build error about an non initialized timestamp
    - BUG/MINOR: signal: register default handler for SIGINT in signal_init()
    - BUG/MINOR: startup: close pidfd and free global.pidfile in handle_pidfile()
    - BUG/MINOR: startup: fix pidfile creation
    - BUG/MINOR: quic: fix bbr_inflight() calls with wrong gain value
    - BUG/MEDIUM: init: make sure only daemonized processes change their session
    - BUG/MINOR: init: do not call fork_poller() for non-forked processes
    - BUG/MEDIUM: mux-quic: remove pacing status when everything is sent
    - BUG/MINOR: quic: remove startup alert if conn socket-owner unsupported
    - BUG/MINOR: quic: remove startup alert if GSO unsupported
    - BUG/MEDIUM: mux-h2: make sure not to touch dummy streams when sending WU
    - BUG/MINOR: config: Fix parsing of accept-invalid-http-{request,response}
    - DOC: config: fix confusing init-state examples
    - BUG/MINOR: debug: COUNT_IF() should return true/false
    - BUILD: debug: fix build issues in COUNT_IF() with -Wunused-value
    - MINOR: mux-h2/traces: add a missing trace on negative initial window size
    - CLEANUP: mux-h2/traces: reword certain ambiguous traces
    - MINOR: mux-h2/glitches: add a description to the H2 glitches
    - BUG/MINOR: mux-h2: fix expression when detecting excess of CONTINUATION frames
    - BUG/MINOR: mworker: don't save program PIDs in oldpids
    - BUG/MINOR: mworker: fix -D -W -sf/-st modes
    - BUG/MINOR: startup: fix error path for master, if can't open pidfile
    - BUG/MEDIUM: startup: don't daemonize if started with -c
    - BUG/MEDIUM: startup: report status if daemonized process fails
    - BUG/MEDIUM: mworker: report status, if daemonized master fails
    - BUG/MINOR: namespace: handle a possible strdup() failure
    - BUG/MINOR: ssl_crtlist: handle a possible strdup() failure
    - BUG/MINOR: resolvers: handle a possible strdup() failure
    - BUG/MINOR: stats: decrement srv refcount on stats-file release
    - MINOR: list: define a watcher type
    - BUG/MEDIUM: stats/server: use watcher to track server during stats dump
    - BUG/MINOR: http-fetch: Ignore empty argument string for query()
    - BUG/MINOR: server-state: Fix expiration date of srvrq_check tasks
    - BUG/MINOR: hlua_fcn: restore server pairs iterator pointer consistency

BUG/MINOR: hlua_fcn: restore server pairs iterator pointer consistency

Since 9c91b30 ("MINOR: server: remove prev_deleted server list"), hlua
server pair iterator may use and return invalid (stale) server pointer
if multiple servers were deleted between two iterations.

Indeed, the server refcount mechanism (using srv_take()) is no longer
sufficient as the prev_deleted mitigation was removed.

To ensure server pointer consistency between two yields, the new watcher
mechanism must be used (as it already the case for stats dumping).

Thus in this patch we slightly change the server iteration logic:
hlua_server_list_iterator_context struct now stores the next valid server
pointer, and a watcher is added to ensure this pointer is never stale.

Then in hlua_listable_servers_pairs_iterator(), this next pointer is used
to create the Lua server object, and the next valid pointer is obtained by
leveraging watcher_next().

No backport needed unless 9c91b30 ("MINOR: server: remove prev_deleted
server list") is. Please note that dynamic servers were not supported in
Lua prior to 2.8, so it doesn't make sense to backport this patch further
than 2.8.

(cherry picked from commit 358166ae6a3e98d36b378c7eeab1673cc8b4a4dd)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: server-state: Fix expiration date of srvrq_check tasks

"hold.timeout" was used as expiration date for srvrq_check tasks. But it is
not accurrate. The expiration date must be based on the resolution timeouts
instead (resolve and retry).

The purpose of srvrq_check task is to clean up the server resolution status
when outdated info are inherited from the state file. Using "hold.timeout"
is not accurrate here because hold timeouts concern the resolution response
items not the resolution status of servers. It may be set to a huge value or
0. The expiration date of these tasks must be based on the resolution
timeouts instead.

So now the ("timeout resolve" + resolve_retries * "timeout retry") value is
used.

This patch should fix the issue #2816. It must be backported to all stable
versions.

(cherry picked from commit 647a2906626e9c3d9c3349d338a35798325496f2)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: http-fetch: Ignore empty argument string for query()

query() sample fetch function takes an optional argument string. During
configuration parsing, empty string must be ignored. It is especially
important when the sample is used with empty parenthesis. The argument is
optional and it is a list of options to configure the behavior of the sample
fetch. So it is logical to ignore empty strings.

This patch should fix the issue #2815. It must be backported to 3.1.

(cherry picked from commit e1525e7b8fa15c59c1c6fc4063b5218d134b7afb)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: stats/server: use watcher to track server during stats dump

If a server A is deleted while a stats dump is currently on it, deletion
is delayed thanks to reference counting. Server A is nonetheless removed
from the proxy list. However, this list is a single linked list. If the
next server B is deleted and freed immediately, server A would still
point to it. This problem has been solved by the prev_deleted list in
servers.

This model seems correct, but it is difficult to ensure completely its
validity. In particular, it implies when stats dump is resumed, server A
elements will be accessed despite the server being in a half-deleted
state.

Thus, it has been decided to completely ditch the refcount mechanism for
stats dump. Instead, use the watcher element to register every stats
dump currently tracking a server instance. Each time a server is deleted
on the CLI, each stats dump element which may points to it are updated
to access the next server instance, or NULL if this is the last server.
This ensures that a server which was deleted via CLI but not completely
freed is never accessed on stats dump resumption.

Currently, no race condition related to dynamic servers and stats dump
is known. However, as described above, the previous model is deemed too
fragile, as such this patch is labelled as bug-fix. It should be
backported up to 2.6, after a reasonable period of observation. It
relies on the following patch :
MINOR: list: define a watcher type

(cherry picked from commit 071ae8ce3d1a318d2227fad2ebf63e78a05815f0)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: list: define a watcher type

Define a new watcher type into list module. This type is similar to bref
and can be used to register an element which is currently tracking a
dynamic target. Contrary to bref, if the target is freed, every watcher
element are updated to point to a next valid entry or NULL.

This type will simplify handling of dynamic servers deletion, in
particular while stats dump are performed.

This patch is not a bug-fix. However, it is mandatory to fix a race
condition in dynamic servers. Thus, it should be backported along the
next commit up to 2.6.

(cherry picked from commit eafa8a32bb19ee567cb0a60a4042a8ef436e1e94)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: stats: decrement srv refcount on stats-file release

Servers instance may be removed at runtime. This can occurs during a
stat dump which currently references this server instance. This case is
protected by server refcount to prevent the server immediate release.

CLI output may be interrupted prior to stats dump completion, for
example if client CLI has been disconnected before the full response
transfer. As such, srv_drop() must be called in every stats dump release
callback.

srv_drop() was missing for stats-file dump release callback. This could
cause a race condition which would prevent a server instance to be fully
removed. Fix this by adding srv_drop() invokation into
cli_io_handler_release_dump_stat_file().

This should be backported up to 3.0.

(cherry picked from commit 2199179461957c86f57e39c4d97c37f3e4d10935)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: resolvers: handle a possible strdup() failure

This defect was found by the coccinelle script "unchecked-strdup.cocci".
It can be backported to all supported branches.

(cherry picked from commit 193c94a53988f77b827e5f8bb5121d3dc3b2b5ed)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: ssl_crtlist: handle a possible strdup() failure

This defect was found by the coccinelle script "unchecked-strdup.cocci".
It can be backported to all supported branches.

(cherry picked from commit ce30bc17305e4ac2dec7d641eedcb301a237d863)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: namespace: handle a possible strdup() failure

This defect was found by the coccinelle script "unchecked-strdup.cocci".
It can be backported to all supported branches.

(cherry picked from commit abee5468509a77537015a7e5dc37268d4d564e03)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: mworker: report status, if daemonized master fails

As daemonization fork happens now very early and before the master-worker
fork, if master or worker processes fail during the initialization, some
critical errors can't be reported to stdout. The launching (parent) process in
such cases exits with 0. This makes an impression, that master and his worker
have successfully started at background, which really complicates the
operations.

In the previous commit a pipe was added to make daemonized child communicate
with his parent. Let's add the same logic to master-worker mode. Up to
receiving the READY message from the worker, master will "forward" it via the
pipe to the launching process. Launching process can obtain master's exit
status, if the master fails to start and nothing has been written in the pipe.

This fix should be backported only in 3.1.

(cherry picked from commit 97aaf7671654ceba16727f8375104064538c40e6)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: startup: report status if daemonized process fails

Due to master-worker rework, daemonization fork happens now before parsing
and applying the configuration. This makes impossible to report correctly all
warnings and alerts to shell's stdout. Daemonzied process fails, while being
already in background, exit code reported by shell via '$?' equals to 0, as
it's the exit code of his parent.

To fix this, let's create a pipe between parent and daemonized child. The
child will send into this pipe a "READY" message, when it finishes his
initialization. The parent will wait on the "read" end of the pipe until
receiving something. If read() fails, parent obtains the status of the
exited child with waitpid(). So, the parent can correctly report the error to
the stdout and he can exit with child's exitcode.

This fix should be backported only in 3.1.

(cherry picked from commit 663d75e7a00a9c50185cca322cbc64de934b6631)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: startup: don't daemonize if started with -c

Due to master-worker refactoring, daemonization fork happens now very early,
before parsing and verifying the configuration. For the moment there is no
any specific syntax, which needs for the daemon mode to be really applied in
order to perform the tests.

So, it's better not to do the daemonization fork, if 'daemon' keyword is
presented in the config (or -D option), when we started with -c (MODE_CHECK).
Like this, during the config verification, the process will always stay in
foreground and all warning or errors will be delivered to the stdout.

This fix should be backported only in 3.1.

(cherry picked from commit 5f94e98d8958ff0003bfbb736427070d16270d6e)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: startup: fix error path for master, if can't open pidfile

If master process can't open a pidfile, there is no sense to send SIGTTIN to
oldpids, as it will exit. So, old workers will terminate as well. It's better
to send the last alert to the log about unrecoverable error, because master is
already in its polling loop.

For the standalone mode we should keep the previous logic in this case: send
SIGTTIN to old process and unbind listeners for the new one. So, it's better
to put this error path in main(), as it's done when other configuration settings
can't be applied.

This patch should be backported only in 3.1.

(cherry picked from commit cd0b58e23e12f12369bc1a6e65c6a97409ee8cd6)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: mworker: fix -D -W -sf/-st modes

When a new master process is launched like below:

./haproxy -W -D -p ha.pid -sf $(cat ha.pid)...

The old master process and its workers do not stop. Since the master-worker
refactoring, the code, which sends USR1/TERM to old pids from -sf, is called
only for the standalone mode. In master-worker mode we should receive the READY
message from the newly forked worker at first, in order to be able to terminate
the previous master.

So, to fix this, let's terminate the previous master in _send_status(), where
we parse the READY message from the newly forked worker. And let's continue to
use oldpids array, as it was in 3.0, in order to stop the workers, launched
before the reload.

This patch should be backported only in 3.1.

(cherry picked from commit ee111d2004ca653387515f340c504d18b71191c3)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: mworker: don't save program PIDs in oldpids

After reload, previously launched programs are stopped explicitly in
mworker_ext_launch_all(). So, there is no longer need to save their PIDs in
oldpids array before the master reexec().

This also prepares the fix of "-D -W -sf/-st" modes, as we will need to
loop over this array in the master process context, in order to stop the
previous master, when the new one is ready.

This patch should be backported only in 3.1.

(cherry picked from commit 1fead6c0ca8f8ad2f354fb4dfab241ec5281737c)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: mux-h2: fix expression when detecting excess of CONTINUATION frames

Latest commit f0eca8fe7 ("MINOR: mux-h2/glitches: add a description to
the H2 glitches") misplaced the optional glitch description field, with
it appearing at the end of the if () condition and always reporting
an excess of CONTINUATION frames from the first exceeding one.

This needs to be backported along with that commit once it gets backported.

(cherry picked from commit cb21db04c748a77122c118a3d526fdc058ca564f)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: mux-h2/glitches: add a description to the H2 glitches

Since we can now list them using "debug counters" and now support a
description, better add the description to all glitches. This patch may
be backported to 3.1, but before this the following patches must also
be picked:

86823c828 MINOR: mux-h2/traces: add a missing trace on negative initial window size
7c8e9420a CLEANUP: mux-h2/traces: reword certain ambiguous traces

(cherry picked from commit f0eca8fe73b5ddfbb642d39bdb8c4d67c200575b)
Signed-off-by: Willy Tarreau <w@1wt.eu>

CLEANUP: mux-h2/traces: reword certain ambiguous traces

Some h2 traces were not very clear, let's reword them a bit.

(cherry picked from commit 7c8e9420a23584c7f366aacbfeb308d949f5c7b3)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: mux-h2/traces: add a missing trace on negative initial window size

When a negative initial windows size is reported, we're going to close
the connection, so it's important to report a trace to explain why!
This should be backported at least to 3.1 and possibly 3.0 (adapting the
context since there's no glitches there).

(cherry picked from commit 86823c828f983bd986b150542c7a6482d60b291d)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUILD: debug: fix build issues in COUNT_IF() with -Wunused-value

Commit 7f64bb79fd ("BUG/MINOR: debug: COUNT_IF() should return true/false")
allowed the COUNT_IF() macro to return the evaluated value. This is handy
to place it in "if ()" conditions and count them at the same time. When
glitches are disabled, the condition is just returned as-is, but most call
places do not use the result, making some compilers complain. In addition,
while reviewing this, it was noticed that when DEBUG_STRICT=0, the macro
would still be replaced by a "do { } while (0)" statement, which not only
does not evaluate the expression, but also cannot return anything. Ditto
for COUNT_IF_HOT().

Let's make sure both are always properly evaluated now.

(cherry picked from commit d6dc8120c0b13f896bf2628e611517701aff7a38)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: debug: COUNT_IF() should return true/false

The COUNT_IF() macro was initially meant to return true/false to be used
in if() conditions but had an extra do { } while(0) that prevents it from
doing so. Let's get rid of the do { } while(0) before the code generalizes
to too many places. There's no impact on existing code, but may have to be
backported if future fixes rely on it.

(cherry picked from commit 7f64bb79fd35f22537ce5d05f59838e49db5cb10)
Signed-off-by: Willy Tarreau <w@1wt.eu>

DOC: config: fix confusing init-state examples

in 50322dff ("MEDIUM: server: add init-state"), some examples on how to
use init-state server keyword were added alongside with the keyword
documentation.

However, as reported by Nick Ramirez, there was an error because the
example that stated that haproxy will pass the traffic to the server after
3 successful health checks used the "init-state down" instead of the
"init-state fully-down". Thus the behavior wouldn't match what the
comment said (only 1 successful health check was required).

Here we fix the example in itself to match with the comment. Also the
following example ("# or") was also affected, but it is kind of
redundant as the main purpose of the examples are to illustrate the
feature in itself and not how to use server-template directive, so we
remove it.

This should be backported in 3.1 with 50322dff

(cherry picked from commit 7934eef25d1207392a1b4a915d21889babfb1fb0)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: config: Fix parsing of accept-invalid-http-{request,response}

These options are now deprectated, but the proxy capabilities are not
properly checked during the configuration parsing leading to always ignore
these options. This is now fixed by checking the frontend capability for
"accept-invalid-http-request" option and the backend capability for
"accept-invalid-http-response" option.

In addition, the messages about the deprecation of these options are now
emitted with ha_warning() instead of ha_alert() because they are only
warnings and not errors.

This patch should fix the issue #2806. It must be backported to 3.1.

(cherry picked from commit bc453c51066beb8ed05baa6636fcb471b9d103ca)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: mux-h2: make sure not to touch dummy streams when sending WU

Since commit 1cc851d9f2 ("MEDIUM: mux-h2: start to update stream when
sending WU") we started storing stream offsets in the h2s struct. These
offsets are updated at a few points, where it's safe to write to the
stream, and in h2c_send_strm_wu(), where the h2s->h2c was not performed.

Due to this, nothing protects the h2s from being updated when sending a
WU for a closed stream, which might only happen when acknowledging a
frame after resetting that stream, which is quite unlikely. In any case
if this happens, it will crash as in issue #2793 since the closed streams
are purposely read-only to catch such bugs.

The fix is trivial, just check h2s->h2c before deciding to update the
stream.

Thanks to @Wahnes for reporting this, and Christopher for spotting the
cause. This needs to be backported to 3.1 only.

(cherry picked from commit d649278fce9729415b31e873a3b91d6d0b259d16)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: quic: remove startup alert if GSO unsupported

This patch is similar to the previous one, but for GSO support. Remove
alert level message to a diag report only visible with argument -dD.

This must be backported up to 3.1.

(cherry picked from commit 3c239b2f80d80a744711e18640bbcc90bae4eea5)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: quic: remove startup alert if conn socket-owner unsupported

QUIC relies on several advanced network API features from the kernel to
perform optimally. Checks are performed during startup to ensure that
these features are supported. A fallback is automatically performed for
every incompatible feature.

Besides the automatic fallback mechanism, a message is also reported to
the user at the same time. Previously, alert level was used, but it is
incorrect as it is reserved for unrecoverable errors which should
prevent haproxy to start. Warning level could be used, but this can
annoy users running with zero-warning mode.

This patch removes the alert message when 'socket-owner connection' mode
cannot be activated. Convert the message to a diag level. This allows
users to start without forcing configuration modification to hide a
warning. Besides, several feature fallback such as the polling mechanism
does not emit any warning either, so it's better to adopt a similar
behavior for QUIC features.

This must be backported up to 2.8.

(cherry picked from commit 6fed219fd786f3fdca155f686cf2fa2f9e572697)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: mux-quic: remove pacing status when everything is sent

TASK_F_USR1 is used by MUX tasklet when emission has been interrupted
due to pacing. When the tasklet runs again, only qcc_purge_sending()
will be called as an optimization.

Pacing status is only removed via qcc_wakeup(). Until then, TASK_F_USR1
is not cleared. This causes an issue after emission with pacing
completion if the MUX tasklet is woken up for a recv subscribe, as
qcc_wakeup() is not used by quic-conn layer. The tasklet will
incorrectly run only for pacing emission, without handling reception
process. Worst, a crash will occur if QCC tx frames list is empty, due
to a BUG_ON() in qcc_purge_sending().

Recv subscribe is only used for 0-RTT, when QUIC MUX is instantiated
before quic-conn handshake completion. Thus, this bug can only be
reproduced with 0-rtt. Furthermore, MUX must already have emitted at
least a few response bytes with pacing, before QUIC handshake
completion. It cannot easily be reproduced, at least with CLI clients
where the handshake is always already completed before MUX exchanges.

To fix this, remove TASK_F_USR1 when pacing emission has been completed.
At least, this prevents BUG_ON() on qcc_purge_sending() as it won't be
called with an empty QCC Tx frame list anymore. However, this bug has
revealed that MUX tasklet architecture is not suitable when both
handling reception and emission part. This will be improved in a future
serie of patches.

This should fix github issue #2796.

This must be backported up to 3.1.

(cherry picked from commit 08f557f0c4e37ad69b6d55a247cbc9f89ddee995)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: init: do not call fork_poller() for non-forked processes

In 3.1-dev10, commit 8dd4efe42f ("MAJOR: mworker: move master-worker
fork in init()") made the fork_poller() code unconditional, while it
is only desirable for processes that have been forked from a parent
(standalone daemon mode) or from a master (master-worker mode). The
call can be expensive in some cases as it will create a new poller,
scan and try to migrate to it all existing FDs till the highest known
one. With very high numbers of FDs, this can take several seconds to
start.

This should be backported to 3.1.

(cherry picked from commit 8b16b725411d72d3bab184bcd0c0253dddeedeb1)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: init: make sure only daemonized processes change their session

Commit 8dd4efe42f ("MAJOR: mworker: move master-worker fork in init()")
introduced some sensitive changes to the startup code (which was
expected), and one sensitive change is that the second call to setsid()
was accidentally made unconditional. As such it even applies to foreground
processes, resulting in foreground processes being detached from the
terminal and no longer responding to Ctrl-C nor Ctrl-Z. An example of
this simply consists in start haproxy -db under sudo. Then a new shell
is required to stop it.

This patch removes this second setsid(), as it is already done in
apply_daemon_mode().

This must be backported to 3.1.

(cherry picked from commit 70e4938aec671c5bd1d69fee4ca1c8eea9f72bd3)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: quic: fix bbr_inflight() calls with wrong gain value

This patch fixes two wrong calls to bbr_inflight().

bbr_target_inflight() aim is to compute the number of bytes BBR has to put on
the network as bytes in flight (sent but not acked bytes). It must call
bbr_inflight() with the current window gain value (in place of a wrong fixed 100
gain value here, in percents).

bbr_is_time_to_cruise() also called bbr_inflight() with a wrong gain value
as parameter due to a confusion between the value mentioned by the RFC (1
meaning 100% of the current window) and our implementation which needs value in
percents (so 100 in place of 1 here). Note that bbr_is_time_to_cruise() aim is to
make BBR the decision to leave the probing_bw down state. The bug had as side
effect to make BBR stay in this state during too long periods of time during
which the bottleneck bandwidth is decreasing, leading to big oscillations
between the mininum and maximum bottleneck bandwidth estimations.

This patch must be backported to 3.1 where BBR was first implemented.

(cherry picked from commit 6404b7a18a518e04bf934c1635e8a0b34628149a)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: startup: fix pidfile creation

Pidfile should be created at the latest initialization stage, when we are
sure, that process is able to start successfully, otherwise PID value, written
in this file is no longer valid.

So, for the standalone mode, let's move the block, which opens the pidfile and
let's put it just before applying "chroot". In master-worker mode, master
doesn't perform chroot. So it creates the pidfile, only when the "READY"
message from the newly forked worker is received.

This should be backported only in 3.1

(cherry picked from commit 295071007b1bf83eb0d66ad88b05d6aef05c69eb)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: startup: close pidfd and free global.pidfile in handle_pidfile()

After master-worker mode refactoring, global.pidfile is only used in
handle_pidfile(), which opens the provided file and writes the PID into it. So,
it's more appropriate to perform the close(pidfd) and ha_free(&global.pidfile)
also in this function.

This commit prepares the fix of the pidfile creation, as it's created now very
early, when we are not sure, that process has successfully started. In
master-worker mode handle_pidfile() can be called in the master process context.
So, let's make it accessible from other compilation units via global.h.

This should be backported only in 3.1.

(cherry picked from commit a33977da48978299d2d71403593846b1b21ec768)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: signal: register default handler for SIGINT in signal_init()

When haproxy is launched in a background and in a subshell (see example below),
according to POSIX standard (2.11. Signals and Error Handling), it inherits
from the subshell SIG_IGN signal handler for SIGINT and SIGQUIT.

$ (./haproxy -f env4.cfg &)

So, when haproxy is lanched like this, it doesn't stop upon receiving
the SIGINT. This can be a root cause of some unexpected timeouts, when haproxy
is started under VTest, as VTest sends to the process SIGINT in order to
terminate it. To fix this, let's explicitly register the default signal
handler for the SIGINT in signal_init() initcall.

This should be backported in all stable versions.

(cherry picked from commit d3c20b02469dea6f46369bb91965d8b4924bb2b7)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUILD: quic: fix a build error about an non initialized timestamp

This is to please a non identified compilers which complains about an hypothetic
<time_ns> variable which would be not initialized even if this is the case only
when it is not used.

This build issue arrived with this commit:
BUG/MINOR: improve BBR throughput on very fast links

Should be backported to 3.1 with this previous commit.

(cherry picked from commit 7868dc9c45aeb29774baac98711f563755b610d6)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: h1-htx: Use default reason if not set when formatting the response

When the response status line is formatted before sending it to the client,
if there is no reason set, HAProxy should add one that matches the status
code, as stated in the configuration manual. However it is not performed.

It is possible to hit this bug when the response comes from a H2 server,
because there is no reason field in HTTP/2 and above.

This patch should fix the issue #2798. It should be backported to all stable
versions.

(cherry picked from commit 37487ada739fc86e3acb46c9949196f4f15cc9b1)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: http-ana: Reset request flag about data sent to perform a L7 retry

It is possible to loose the request after several L7 retries, leading to
crashes, because the request channel flag stating some data were sent is not
properly reset.

When a L7 retry is performed, some flags on different entities must be reset
to be sure a new connection will be properly retried, just like it was the
first one, mainly because there was no connection establishment failure. One
of them, on the request channel, is not reset. The flag stating some data
were already sent. It is annoying because this flag is used during the
connection establishment to know if an error is triggered at the connection
level or at the data level. In the last case, the error must be handled by
the HTTP response analyzer, to eventually perform another L7 retry.

Because CF_WROTE_DATA flag is not removed when a L7 retry is performed, a
subsequent connection establishment error may be handled as a L7 error while
in fact the request was never sent. It also means the request was never
saved in the buffer used to performed L7 retries. Thus, on the next L7
retires, the request is just lost. This forecefully leads to a bunch of
undefined behavior. One of them is a crash, when the request is used to
perform the load-balancing.

This patch should fix issue #2793. It must be backported to all stable
versions.

(cherry picked from commit 62f37801c881f68060cedb7a74b5b8cb5fcfec81)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: quic: prevent stream freeze on pacing

On snd_buf completion, QUIC MUX tasklet is scheduled if newly data has
been transferred from the stream layer. Thanks to qcc_wakeup(), pacing
status is removed from tasklet, which ensure next emission will reset Tx
frames and use the new data.

Tasklet is not scheduled if MUX is already subscribed on send due to a
previous blocking condition. This is an optimization to prevent an
unneeded IO handler execution. However, this causes a bug if an emission
is currently delayed due to pacing. As pacing status is not removed on
snd_buf, next emission process will continue emission with older data
without refreshing the newly transferred one.

This causes a transfer freeze. Unless there is some activity on the
connection, the transfer will be eventually aborted due to idle timeout.

To fix this, remove TASK_F_USR1 if tasklet wakeup is not called due to
send subscription. Note that this code is also duplicated in done_ff for
zero-copy transfer.

This must be backported up to 3.1.

(cherry picked from commit 9d4c26ebaa5c54714eae27a39422e0d47970c628)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: event_hdl: fix uninitialized value in async mode when no data is provided

In _event_hdl_publish(), when we prepare the asynchronous event and no
<data> was provided (set to NULL), we forgot to initialize the _data
event_hdl_async_event struct member to NULL, which leads to uninitialized
reads in event_hdl_async_free_event() when the event is freed:

==1002331== Conditional jump or move depends on uninitialised value(s)
==1002331==    at 0x35D9D1: event_hdl_async_free_event (event_hdl.c:224)
==1002331==    by 0x1CC8EC: hlua_event_runner (hlua.c:9917)
==1002331==    by 0x39AD3F: run_tasks_from_lists (task.c:641)
==1002331==    by 0x39B7B4: process_runnable_tasks (task.c:883)
==1002331==    by 0x314B48: run_poll_loop (haproxy.c:2976)
==1002331==    by 0x315218: run_thread_poll_loop (haproxy.c:3190)
==1002331==    by 0x18061D: main (haproxy.c:3747)

The bug severity was set to MEDIUM because of its nature, and it's best
if this patch can be backported up to 2.8. But in practise it can only be
triggered with events that don't provide optional data: since PAT_REF
events are the first native events making use of this feature, this bug
shouldn't be an issue before f72a66e ("MINOR: pattern: publish event_hdl
events on pat_ref updates")

(cherry picked from commit dd56616067d19060425940f6906cefe6efcd1955)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: improve BBR throughput on very fast links

This patch fixes the loss of information when computing the delivery rate
(quic_cc_drs.c) on links with very low latency due to usage of 32bits
variables with the millisecond as precision.

Initialize the quic_conn task with TASK_F_WANTS_TIME flag ask it to ask
the scheduler to update the call date of this task. This allows this task to get
a nanosecond resolution on the call date calling task_mono_time(). This is enabled
only for congestion control algorithms with delivery rate estimation support
(BBR only at this time).

Store the send date with nanosecond precision of each TX packet into
->time_sent_ns new quic_tx_packet struct member to store the date a packet was
sent in nanoseconds thanks to task_mono_time().

Make use of this new timestamp by the delivery rate estimation algorithm (quic_cc_drs.c).

Rename current ->time_sent member from quic_tx_packet struct to ->time_sent_ms to
distinguish the unit used by this variable (millisecond) and update the code which
uses this variable. The logic found in quic_loss.c is not modified at all.

Must be backported to 3.1.

(cherry picked from commit f8b697c19b3d8dfe9e12e0c03b0069357c013f20)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MINOR: log: fix lf_text() behavior with empty string

As reported by Baptiste in GH #2797, if a logformat alias leveraging
lf_text() ends up printing nothing (empty string), the whole logformat
evaluation stops, leading garbage log message.

This bug was introduced during 3.0 cycle in fcb7e4b ("MINOR: log: add
lf_rawtext{_len}() functions"). At that time I genuinely thought that
if strlcpy2() returned 0, it was due to a lack of space, actually
forgetting that the function may simply be called with an empty string.

Because of that, lf_text() would return NULL if called with an empty
string, and since all lf_*() helpers are expected to return NULL on
error, this explains why the logformat evaluation immediately stops in
this case.

To fix the issue, let's simply consider that strlcpy2() returning 0 is
not an error, like it was already the case before.

It should be backported in 3.1 and 3.0 with fcb7e4b.

(cherry picked from commit 3e470471b7e0ec113807f6981699fda9538e7ffc)
Signed-off-by: Willy Tarreau <w@1wt.eu>

MINOR: proxy: Add support of 421-Misdirected-Request in retry-on status

The "421" status can now be specified on retry-on directives. PR_RE_* flags
were updated to remains sorted.

This patch should fix the issue #2794. It is quite simple so it may safely
be backported to 3.1 if necessary.

(cherry picked from commit bc66d3198597c5731c25cab303d306850e79a346)
Signed-off-by: Willy Tarreau <w@1wt.eu>

BUG/MEDIUM: sock: Remove FD_POLL_HUP during connect() if FD_POLL_ERR is not set

epoll_wait() may return EPOLLUP and/or EPOLLRDHUP after an asynchronous
connect(), to indicate that the peer accepted the connection then
immediately closed before epoll_wait() returned. When this happens,
sock_conn_check() is called to check whether or not the connection correctly
established, and after that the receive channel of the socket is assumed to
already be closed. This lets haproxy send the request at best (if RDHUP and
not HUP) then immediately close.

Over the last two years, there were a few reports about this spuriously
happening on connections where network captures proved that the server had
not closed at all (and sometimes even received the request and responded to
it after haproxy had closed). The logs show that a successful connection is
immediately reported on error after the request was sent. After
investigations, it appeared that a EPOLLUP, or eventually a EPOLLRDHUP, can
be reported by epool_wait() during the connect() but in sock_conn_check(),
the connect() reports a success. So the connection is validated but the HUP
is handled on the first receive and an error is reported.

The same behavior could be observed on health-checks, leading HAProxy to
consider the server as DOWN while it is not.

The only explanation at this point is that it is a kernel bug, notably
because it does not even match the documentation for connect() nor epoll. In
addition for now it was only observed with Ubuntu kernels 5.4 and 5.15 and
was never reproduced on any other one.

We have no reproducer but here is the typical strace observed:

socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 114
fcntl(114, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(114, SOL_TCP, TCP_NODELAY, [1], 4) = 0
connect(114, {sa_family=AF_INET, sin_port=htons(11000), sin_addr=inet_addr("A.B.C.D")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(19, EPOLL_CTL_ADD, 114, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP, data={u32=114, u64=114}}) = 0
epoll_wait(19, [{events=EPOLLIN, data={u32=15, u64=15}}, {events=EPOLLIN, data={u32=151, u64=151}}, {events=EPOLLIN, data={u32=59, u64=59}}, {events=EPOLLIN|EPOLLRDHUP, data={u32=114, u64=114}}], 200, 0) = 4
epoll_ctl(19, EPOLL_CTL_MOD, 114, {events=EPOLLOUT, data={u32=114, u64=114}}) = 0
epoll_wait(19, [{events=EPOLLOUT, data={u32=114, u64=114}}, {events=EPOLLIN, data={u32=15, u64=15}}, {events=EPOLLIN, data={u32=10, u64=10}}, {events=EPOLLIN, data={u32=165, u64=165}}], 200, 0) = 4
connect(114, {sa_family=AF_INET, sin_port=htons(11000), sin_addr=inet_addr("A.B.C.D")}, 16) = 0
sendto(114, "POST "..., 1009, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 1009
close(114)                              = 0

Some ressources about this issue:
  - https://www.spinics.net/lists/netdev/msg876470.html
  - https://github.com/haproxy/haproxy/issues/1863
  - https://github.com/haproxy/haproxy/issues/2368

So, to workaround the issue, we have decided to remove FD_POLL_HUP flag on
the FD during the connection establishement if FD_POLL_ERR is not reported
too in sock_conn_check(). This way, the call to connect() is able to
validate or reject the connection. At the end, if the HUP or RDHUP flags
were valid, either connect() would report the error itself, or the next
recv() would return 0 confirming the closure that the poller tried to
report. EPOLL_RDHUP is only an optimization to save a syscall anyway, and
this pattern is so rare that nobody will ever notice the extra call to
recv().

Please note that at least one reporter confirmed that using poll() instead
of epoll() also addressed the problem, so that can also be a temporary
workaround for those discovering the problem without the ability to
immediately upgrade.

The event is accounted via a COUNT_IF(), to be able to spot it in future
issue. Just in case.

This patch should fix the issue #1863 and #2368. It may be related
to #2751. It should be backported as far as 2.4. In 3.0 and below, the
COUNT_IF() must be removed.

(cherry picked from commit 7262433183f590377ace31ff96b1fafa4525b7c2)
Signed-off-by: Christopher Faulet <cfaulet@haproxy.com>

[RELEASE] Released version 3.1.0

Released version 3.1.0 with the following main changes :
    - BUG/MAJOR: mux-h1: Properly handle wrapping on obuf when dumping the first-line
    - BUILD: activity/memprofile: fix a build warning in the posix_memalign handler
    - BUG/MINOR: quic: Avoid BUG_ON() on ->on_pkt_lost() BBR callback call
    - CI: update to the latest AWS-LC version
    - CI: update to the latest WolfSSL version
    - DOC: ot: mention planned deprecation of the OT filter
    - Revert "CI: update to the latest WolfSSL version"
    - CI: github: add a WolfSSL job which tries the latest version
    - BUILD: systemd: fix usage of reserved name "sun" in the address field
    - BUILD: init: use the more portable FD_CLOEXEC for /dev/null
    - CI: github: improve the Wolfssl job
    - CI: github: improve the AWS-LC job
    - BUG/MINOR: mux-quic: fix show quic report of QCS prepared bytes
    - BUG/MEDIUM: quic: fix sending performance due to qc_prep_pkts() return
    - MINOR: mux-quic: use sched call time for pacing
    - CI: github: allow to run the Illumos job manually
    - BUILD: tcp_sample: var_fc_counter defined but not used
    - CI: github: add 'workflow_dispatch' on remaining build jobs
    - DOC: config: refine a little bit the text on QUIC pacing
    - MINOR: proto_sockpair: send_fd_uxst: init iobuf, cmsghdr, cmsgbuf to zeros
    - MINOR: startup: rename on_new_child_failure to mworker_on_new_child_failure
    - REORG: startup: move on_new_child_failure in mworker.c
    - MINOR: startup: prefix prepare_master and run_master with mworker_*
    - REORG: startup: move mworker_prepare_master in mworker.c
    - MINOR: startup: keep updating verbosity modes only in haproxy.c
    - REORG: startup: move mworker_run_master and mworker_loop in mworker.c
    - REORG: startup: move mworker_reexec and mworker_reload in mworker.c
    - MINOR: startup: prefix apply_master_worker_mode with mworker_*
    - REORG: startup: move mworker_apply_master_worker_mode in mworker.c
    - MINOR: cfgparse-quic: strengthen quic-cc-algo parsing
    - BUG/MAJOR: quic: fix wrong packet building due to already acked frames
    - DEV: lags/show-sess-to-flags: Properly handle fd state on server side
    - BUG/MEDIUM: http-ana: Don't release too early the L7 buffer
    - MINOR: quic: make bbr consider the max window size setting
    - DOC: quic: Amend the pacing information about BBR.
    - BUG/MEDIUM: quic: prevent EMSGSIZE with GSO for larger bufsize
    - MINOR: cli: Add a "help" keyword to show sess
    - MINOR: cli/quic: Add a "help" keyword to show quic
    - DOC: management: mention "show sess help" and "show quic help"
    - DOC: install: update the list of supported versions
    - MINOR: version: mention that 3.1 is stable now

MINOR: version: mention that 3.1 is stable now

This version will be maintained up to around Q1 2026. The INSTALL file
also mentions it.

DOC: install: update the list of supported versions

OpenSSL up to 3.4 was tested, and gcc up to 14 was tested, so let's
reflect this in the install doc.

DOC: management: mention "show sess help" and "show quic help"

These ones were recently added but we forgot to update the doc.

MINOR: cli/quic: Add a "help" keyword to show quic

Add a help keyword to show quic, that will provide a longer explanation
of all the available options than what is provided by the command "help".

MINOR: cli: Add a "help" keyword to show sess

Add a help keyword to show sess, that will provide a longer explanation of
all the available options than what is provided by the command "help".

BUG/MEDIUM: quic: prevent EMSGSIZE with GSO for larger bufsize

A UDP datagram cannot be greater than 65535 bytes, as UDP length header
field is encoded on 2 bytes. As such, sendmsg() will reject a bigger
input with error EMSGSIZE. By default, this does not cause any issue as
QUIC datagrams are limited to 1.252 bytes and sent individually.

However, with GSO support, value bigger than 1.252 bytes are specified
on sendmsg(). If using a bufsize equal to or greater than 65535, syscall
could reject the input buffer with EMSGSIZE. As this value is not
expected, the connection is immediately closed by haproxy and the
transfer is interrupted.

This bug can easily reproduced by requesting a large object on loopback
interface and using a bufsize of 65535 bytes. In fact, the limit is
slightly less than 65535, as extra room is also needed for IP + UDP
headers.

Fix this by reducing the count of datagrams encoded in a single GSO
invokation via qc_prep_pkts(). Previously, it was set to 64 as specified
by man 7 udp. However, with 1252 datagrams, this is still too many.
Reduce it to a value of 52. Input to sendmsg will thus be restricted to
at most 65.104 bytes if last datagram is full.

If there is still data available for encoding in qc_prep_pkts(), they
will be written in a separate batch of datagrams. qc_send_ppkts() will
then loop over the whole QUIC Tx buffer and call sendmsg() for each
series of at most 52 datagrams.

This does not need to be backported.

DOC: quic: Amend the pacing information about BBR.

BBR handles itself its own burst size (mentioned as send_quantum in BBR RFC).

MINOR: quic: make bbr consider the max window size setting

Limit the BBR congestion control window size as this is done for all the others
congestion control algorithms with tune.quic.frontend.default-max-window-size
or as first argument passed to "bbr" option for "quic-cc-algo".

BUG/MEDIUM: http-ana: Don't release too early the L7 buffer

In some cases, the buffer used to store the request to be able to perform a
L7 retry is released released too early, leading to a crash because a retry
is performed with an empty request.

First, there is a test on invalid 101 responses that may be caught by the
"junk-response" retry policy. Then, it is possible to get an error
(empty-response, bad status code...) after an interim response. In both
cases, the L7 buffer is already released while it should not.

To fix the issue, the L7 buffer is now released at the end of the
AN_RES_WAIT_HTTP analyser, but only when a response was successfully
received and processed. In all error cases, the stream is quickly released,
with the L7 buffer. So there is no leak and it is safer this way.

This patch may fix the issue #2793. It must be as far as 2.4.

DEV: lags/show-sess-to-flags: Properly handle fd state on server side

It must be handled as an hexadecimal value.

BUG/MAJOR: quic: fix wrong packet building due to already acked frames

If a packet build was asked to probe the peer with frames which have just
been acked, the frames build run by qc_build_frms() could be cancelled  by
qc_stream_frm_is_acked() whose aim is to check that current frames to
be built have not been already acknowledged. In this case the packet build run
by qc_do_build_pkt() is not interrupted, leading to the build of an empty packet
which should be ack-eliciting.

This is a bug detected by the BUG_ON() statement in qc_do_build_pk():

    BUG_ON(qel->pktns->tx.pto_probe &&
           !(pkt->flags & QUIC_FL_TX_PACKET_ACK_ELICITING));

Thank you to @Tristan971 for having reported this issue in GH #2709

This is an old bug which must be backported as far as 2.6.

MINOR: cfgparse-quic: strengthen quic-cc-algo parsing

quic-cc-algo is a bind keyword which is used to specify the congestion
control algorithm. It is parsed via function bind_parse_quic_cc_algo().

The parsing function was too laxed as it used strncmp for algo token
matching. This could cause surprise if specifying an invalid algorithm
but starting identically to another entry. Especially if extra
parameters are specified in parenthesis, as in this case parameters
value will be completely ignored and default value used instead.

To fix this, convert algo argument to ist. Then, use istsplit() to
extract algo token from the optional extra arguments and compare the
whole value with isteq().

REORG: startup: move mworker_apply_master_worker_mode in mworker.c

mworker_apply_master_worker_mode() is called only in master-worker mode, so
let's move it mworker.c

MINOR: startup: prefix apply_master_worker_mode with mworker_*

This patch prepares the move of apply_master_worker_mode in mworker.c. So,
let's at first rename it to mworker_apply_master_worker_mode.

REORG: startup: move mworker_reexec and mworker_reload in mworker.c

Let's move mworker_reexec() and mworker_reload() in mworker.c. mworker_reload()
is called only within the functions, which are already in mworker.c. So, this
reorganization allows to declare mworker_reload() as a static.

REORG: startup: move mworker_run_master and mworker_loop in mworker.c

mworker_run_master() is called only in master mode. mworker_loop() is static
and called only in mworker_run_master(). So let's move these both functions in
mworker.c.

We also need here to make run_thread_poll_loop() accessible from other units,
as it's used in mworker_loop().

MINOR: startup: keep updating verbosity modes only in haproxy.c

This commit prepares the move of mworker_run_master() in mworker.c.

Let's remove from it's definition the code, which adjusts verbosity in
dependency of other global run time modes (daemon or foreground). This part
should stay in main(), where all verbosity modes are handeled for
different mode combinations.

REORG: startup: move mworker_prepare_master in mworker.c

mworker_prepare_master() performs some preparation routines for the new worker
process, which will be forked during the startup. It's called only in
master-worker mode, so let's move it in mworker.c.

MINOR: startup: prefix prepare_master and run_master with mworker_*

This patch prepares the move of prepare_master() and run_master() definitions
into mworker.c. So, let's at first prefix its names with mworker_*.

REORG: startup: move on_new_child_failure in mworker.c

mworker_on_new_child_failure() performs some routines for the worker process,
if it has failed the reload. As it's called only in mworker_catch_sigchld()
from mworker.c, let's move mworker_on_new_child_failure() in mworker.c as well.
Like this it could also be declared as a static.

MINOR: startup: rename on_new_child_failure to mworker_on_new_child_failure

This patch prepares the moving of on_new_child_failure definition into
mworker.c. So, let's rename it accordingly and let's also update its
description.

MINOR: proto_sockpair: send_fd_uxst: init iobuf, cmsghdr, cmsgbuf to zeros

In master-worker mode, worker process uses now send_fd_uxst() to send
'_send_status' command to master. Since refactoring, this started to trigger
the following Valgrind reports:

==810584== Syscall param sendmsg(msg.msg_iov[0]) points to uninitialised byte(s)
==810584==    at 0x4AAC99D: __libc_sendmsg (sendmsg.c:28)
==810584==    by 0x4AAC99D: sendmsg (sendmsg.c:25)
==810584==    by 0x56350F: send_fd_uxst (proto_sockpair.c:271)
==810584==    by 0x3AA25C: main (haproxy.c:4151)
==810584==  Address 0x1ffefffbfe is on thread 1's stack
==810584==  in frame #1, created by send_fd_uxst (proto_sockpair.c:241)
==810584==
==810584== Syscall param sendmsg(msg.msg_control) points to uninitialised byte(s)
==810584==    at 0x4AAC99D: __libc_sendmsg (sendmsg.c:28)
==810584==    by 0x4AAC99D: sendmsg (sendmsg.c:25)
==810584==    by 0x56350F: send_fd_uxst (proto_sockpair.c:271)
==810584==    by 0x3AA25C: main (haproxy.c:4151)
==810584==  Address 0x1ffefffc14 is on thread 1's stack
==810584==  in frame #1, created by send_fd_uxst (proto_sockpair.c:241)
==810584==

So, let's initialize with zeros all buffers, which are passed to sendmsg
syscall(), used in send_fd_uxst() to avoid these Valgrind messages. They
increase Valgrind output and could make unnoticeable some other, more important
reports.

DOC: config: refine a little bit the text on QUIC pacing

The QUIC pacing options changed a few times during their development.
For example the unit is now in datagrams not bytes. Also a few
sentences were slightly ambiguous so let's reword this.

No backport is needed.

CI: github: add 'workflow_dispatch' on remaining build jobs

Add 'workflow_dispatch' on the remaining scheduled build jobs that does
not have it.

This keyword allows to start manually a job from the "Actions" interface
in github.

BUILD: tcp_sample: var_fc_counter defined but not used

var_fc_counter is not used on Illumos and emit a warning

  src/tcp_sample.c:291:12: warning: ‘var_fc_counter’ defined but not used [-Wunused-function]
    291 | static int var_fc_counter(struct arg *args, char **err)
        |            ^~~~~~~~~~~~~~

Let's add an ifdef to build it.

CI: github: allow to run the Illumos job manually

Add the "workflow_dispatch" option to the Illumos CI so it can be run
manually from the github actions page.

MINOR: mux-quic: use sched call time for pacing

QUIC pacing was recently implemented to limit burst and improve overall
bandwidth. This is used only for MUX STREAM emission. Pacing requires
nanosecond resolution. As such, it used now_cpu_time() which relies on
clock_gettime() syscall.

The usage of clock_gettime() has several drawbacks :
* it is a syscall and thus requires a context-switch which may hurt
  performance
* it is not be available on all systems
* timestamp is retrieved multiple times during a single task execution,
  thus yielding different values which may tamper pacing calculation

Improve this by using task_mono_time() instead. This returns task call
time from the scheduler thread context. It requires the flag
TASK_F_WANTS_TIME on QUIC MUX tasklet to force the scheduler to update
call time with now_mono_time(). This solves every limitations listed
above :
* syscall invokation is only performed once before tasklet execution,
  thus reducing context-switch impact
* on non compatible system, a millisecond timer is used as a fallback
  which should ensure that pacing works decently for them
* timer value is now guaranteed to be fixed duing task execution

BUG/MEDIUM: quic: fix sending performance due to qc_prep_pkts() return

qc_prep_pkts() is a QUIC transport level function which encodes one or
several datagrams in a buffer before sending them. It returns the number
of encoded datagram. This is especially important when pacing is used to
limit packet bursts.

This datagram accounting was not trivial as qc_prep_pkts() used several
code paths depending on the condition of the current encoded packet.
Thus, there were several places were the local variable dgram_cnt could
have been incremented. This was implemented by the following commit :

commit 5cb8f8a6224db96f4386277c41ddae4a29a4130d
MINOR: quic: support a max number of built packet per send iteration

However, there is a bug due to a missing increment when all frames from
the current QEL have been encoded. In this case, the encoding continue
in the same datagram to coalesce a futur packet. However, if this is the
last QEL, encoding loop will then break. As first_pkt is not NULL,
qc_txb_store() is called outside but dgram_cnt is yet not incremented.

In particular, this causes qc_prep_pkts() to return 0 when there is only
small STREAM frames to emit for application QEL. In qc_send(), this is
interpreted as a value which prevents further emission for the current
invokation. Thus, it may hurts performance, both without and with
pacing.

To fix this, removing multiple dgram_cnt increment. Now, it is modified
only in a single place which should cover every case, and render the
code easier to validate.

The most notable case where the bug is visible is when using cubic with
pacing without any burst, with quic-cc-algo cubic(,1). First, transfer
bandwidth in average was suboptimal, with significant variation. Worst,
it could sometimes fall dramatically for a particular stream without
recovering before returning to an expected level on the next one.

No need to backport.

BUG/MINOR: mux-quic: fix show quic report of QCS prepared bytes

On show quic, each MUX streams are listed with their various indicator
for buffering on Rx and Tx. In particular, txoff displays in parenthesis
the current level of data prepared by the upper stream instance not yet
emitted by QUIC transport layer.

This value is only accessible after a substract operation. However,
there was a typo which caused the result to be always 0. Fix this by
reusing the correct offsets in the calculation.

This should be backported up to 3.0.

CI: github: improve the AWS-LC job

Like the WolfSSL job, improve the AWS-LC job by adding the socat command
so all SSL reg-tests can be run.
Also add gdb and output of corefiles.

CI: github: improve the Wolfssl job

Improve the WolfSSL job by adding the missing socat command.
Also add gdb and output corefiles like it's done on the VTest job.

BUILD: init: use the more portable FD_CLOEXEC for /dev/null

In 3.1-dev10, commit 8dd4efe42f ("MAJOR: mworker: move master-worker
fork in init()"), the FD associated to /dev/null was made CLOEXEC
using O_CLOEXEC. Unfortunately this is not portable on older OSes,
doesn't build on Solaris for example, and was even reported as breaking
moderately old Linux OSes for other projects. Better not use it unless
absolutely certain it will work (currently we only use it for Linux
namespaces, which are optional), and use the conventional FD_CLOEXEC
instead.

No backport is needed.

BUILD: systemd: fix usage of reserved name "sun" in the address field

systemd.c doesn't build on Solaris / Illumos because it uses "sun" as
the field name in a structure, while "sun" is the name of the macro
used to detect Solaris:

  src/systemd.c: In function 'sd_notify':
  src/systemd.c:43:22: error: expected identifier or '(' before numeric constant
     struct sockaddr_un sun;
                        ^
  src/systemd.c:44:2: warning: no semicolon at end of struct or union
    } socket_addr = {
    ^

Admittedly, the OS could have instead defined "sun" to itself to avoid
this. Any other name will work, let's just use "ux" for the short form
of "unix".

The problem appeared in 3.0-dev with commit aa3632962f ("MEDIUM:
mworker: get rid of libsystemd"), though by then this file was only
built when USE_SYSTEMD was set, which was not the case for non-linux
platforms. However since 3.1-dev14 with commit 15845247db ("MEDIUM:
mworker: remove USE_SYSTEMD requirement for -Ws"), all platforms
now build this file.

No backport is needed even though it will not hurt to have it in 3.0
for completeness.

CI: github: add a WolfSSL job which tries the latest version

Like the AWS-LC job, add a CI job which looks for the latest WolfSSL
version and tries to build it.

The patch adds a function which determines the latest version of WolfSSL
from the github tag, and the yml which describes the job.

Revert "CI: update to the latest WolfSSL version"

This reverts commit 03f57fcf94dae61906b56d10d1fb21f7afaae4fc.

Looks like the 5.7.4 version is broke with HAProxy, let's revert the CI
for now.

DOC: ot: mention planned deprecation of the OT filter

Miroslav mentioned below that he's currently working on an OpenTelemetry
replacement for the OpenTracing filter since OpenTracing itself is no
longer maintained nor supported:

https://github.com/haproxy/haproxy/issues/2782#issuecomment-2493576327

Given that he aims for 3.2, let's already settle on an upcoming deprecation
of the filter for 3.3 with a removal for 3.5. This will leave time to finish
the development and permit users to switch smoothly. At this point no warning
is emitted (since the users have no alternative) but better mention this plan
in the doc to make them aware of future changes.

CI: update to the latest WolfSSL version

Update the CI to the 5.7.4 WolfSSL version.

CI: update to the latest AWS-LC version

Update the CI to the 1.39.0 AWS-LC version.

BUG/MINOR: quic: Avoid BUG_ON() on ->on_pkt_lost() BBR callback call

The per-packet delivery rate sample is applied to ack-eliciting packet only
calling ->drs_on_transmit() BBR callback. So, ->on_pkt_lost() which inspects the
delivery rate sampling information during packet loss detection must not be
called for non ack-eliciting packet. If not, it would be facing with non
initialized variables with big chance to trigger a BUG_ON().

As BBR is implemented in the current developement version, there is
no need to backport this patch.

BUILD: activity/memprofile: fix a build warning in the posix_memalign handler

A "return NULL" statement was placed for error handling in the
posix_memalign() handler instead of an int errno value, by recent
commit 5ddc8b3ad4 ("MINOR: activity/memprofile: monitor non-portable
calls as well"). Surprisingly the warning only triggered on gcc-4.8.
Let's use ENOMEM instead. No backport needed.

BUG/MAJOR: mux-h1: Properly handle wrapping on obuf when dumping the first-line

The formatting of the first-line, for a request or a response, does not
properly handle the wrapping of the output buffer. This may lead to a data
corruption for the current response or eventually for the previous one.

Utility functions used to format the first-line of the request or the
response rely on the chunk API. So it is not expected to pass a buffer that
wraps. Unfortunatly, because of a change performed during the 2.9 dev cycle,
the output buffer was direclty used instead of a non-wrapping buffer created
from it with b_make() function. It is not an issue for the request because
its start-line is always the first block formatted in the output buffer. But
for the response, the output may be not empty and may wrap. In that case,
the response start-line is dumped at a random position in the buffer,
corrupting data. AFAIK, it is only an issue if the HTTP request pipelining
is used.

To fix the issue, we now take care to create a non-wapping buffer from the
output buffer.

This patch should fix issues #2779 and #2996. It must be backported as far as
2.9.