MAJOR: ring: implement a waiting queue in front of the ring
The queue-based approach consists in forcing threads to wait away from
the work area so as not to disturb the current writer, and to prepare
the work by grouping them in a queue. The last arrived takes the head
of the queue by placing its preinitialized ring cell there, becomes the
queue's leader, informs itself about the amount of previously accumulated
bytes so that when its turn comes, it immediately knows how much room is
needed to be released.
It can then take the whole queue with it, leaving an empty one for new
threads to come while it's releasing the room needed to copy everything.
By doing so we're cascading contention areas so that multiple parts can
work in parallel.
Note that we must never leave a write counter set to 0xFF at tail, and
this happens when a message cannot fit and we give up, because in this
case we're writing back tail_ofs, and only later we restore the counter.
The solution here is to make a special case when we're going to drop
the messages, and to write the readers count before restoring tail.
This already shows a tremendous performance gain on ARM (385k -> 4.8M),
thanks to the fact that now all waiting threads wait on the queue's
head instead of polluting the tail lock. On x86_64, the EPYC sees a big
boost at 24C48T (1.88M -> 3.82M) and a slowdown at 3C6T (6.0->4.45)
though this one is much less of a concern as so few threads need less
bandwidth than bigger counts.