GitHub Link: ThreadSafeQueueLib
Currently What I Am Doing
I realized that to properly understand the C++ memory model, I need a stronger OS background. So right now I am learning from Operating Systems: Three Easy Pieces (OSTEP), and honestly this book is elite. I started with the goal of learning the concurrency piece, but the first piece on abstractions was so interesting that I got distracted in the best possible way.
I am also reading Chapters 5, 6, and 7 from C++ Concurrency in Action. But at this point I can clearly feel that OS foundations make the C++ concurrency ideas much easier to digest.
Motivation
I am building this project under the guidance and mentorship of Toshit Jain Bhaiya. He and his brother Tushar are pretty cracked in C++.
I wanted to understand how concurrent data structures guarantee progress and correctness without locks, beyond textbook explanations. The standard library's std::queue is fundamentally unsuited for concurrent access—it provides no atomicity guarantees and breaks under contention.
High-frequency trading systems and message-passing architectures rely heavily on lock-free queues. I chose to implement one from scratch to understand the actual constraints: memory ordering, progress guarantees, and the tradeoffs between throughput and correctness.
Problem Definition
Design and implement a family of thread-safe queues in C++ that support multiple modes (SPSC, MPSC, MPMC) and configurations (bounded vs unbounded, blocking vs non-blocking), while:
- Avoiding data races and undefined behavior
- Providing progress guarantees (lock-free or wait-free)
- Supporting both bounded (fixed-capacity ring buffer) and unbounded modes
- Maintaining reasonable throughput under contention
The difficulty lies in the interaction between interface design and concurrency. A standard queue's empty() check followed by front() is a textbook check-then-act race: another thread can pop() between the two calls, leaving front() to read from an empty queue. This forces a fundamentally different API design.
Design Decisions
Why lock-free?
Mutex-based queues are easier to implement and reason about. However, they suffer from priority inversion, context-switch overhead, and poor scalability under contention. In latency-sensitive systems, a thread holding a lock can block all other threads indefinitely.
Lock-free structures provide a progress guarantee: at least one thread will make progress in a finite number of steps, even if other threads are delayed or preempted. This matters in real-time and high-throughput scenarios.
Bounded vs unbounded
Bounded queues use a fixed-size ring buffer. Memory is pre-allocated, index arithmetic is fast (modulo operations), and memory usage is predictable. The tradeoff is that producers must handle the full case—either block, spin, or return failure.
Unbounded queues grow dynamically, typically using a linked list of nodes. This avoids the full-queue problem but introduces allocation overhead and potential memory fragmentation.
Template metaprogramming for compile-time configuration
Rather than runtime polymorphism with virtual functions, I used template parameters to select queue characteristics (SPSC/MPSC/MPMC, bounded/unbounded, blocking/non-blocking). This lets the compiler generate specialized code for each configuration, eliminating branches and enabling aggressive inlining at compile time.
Core Technical Challenges
The ABA problem
In lock-free structures using compare-and-swap (CAS), a thread may read value A, get preempted, and another thread may change the value to B and back to A. The first thread's CAS succeeds even though the data structure has changed.
In queues, ABA can cause a node to be recycled and reused while another thread still holds a pointer to it.
Memory ordering choices
C++ provides a spectrum of memory orderings, from relaxed (no synchronization) to seq_cst (full sequential consistency). Using seq_cst everywhere is safe but slow: it inserts memory barriers that prevent hardware reordering optimizations.
The underlying issue is that both the compiler and the CPU are free to reorder operations as long as single-threaded behavior is preserved; memory orderings are how you constrain that reordering across threads.
I analyzed each atomic operation to determine the minimum ordering required. Producer writes to the buffer need release semantics; consumer reads need acquire semantics. Index updates can often use relaxed ordering when combined with acquire-release on the data itself.
Getting this wrong is dangerous. My development machine is ARM, which has a weaker memory model than the TSO (Total Store Ordering) model of x86, so ordering bugs that stay invisible on x86 can surface on ARM.
False sharing
When producer and consumer indices share a cache line, writing to one invalidates the other thread's cache, even though they're accessing different variables. This thrashing destroys performance.
To understand what is really happening, I had to learn cache coherence protocols, specifically the MESI protocol (Modified, Exclusive, Shared, Invalid).
I aligned indices to separate cache lines using alignas(HARDWARE_CACHE_LINE_SIZE). This simple change improved throughput by ~6x in my benchmarks.
Debugging concurrency issues
Concurrency bugs are non-deterministic. A test might pass 10,000 times and fail once. I used ThreadSanitizer extensively during development—it instruments memory accesses and reports data races. I also wrote stress tests that spawn many threads and run millions of operations, checking invariants after each run.
It is not trivial to prove that your code is correct and will never suffer a data race. Tooling helps: building with -fsanitize=thread (plus -g for readable stack traces) is what enables ThreadSanitizer's race detection at runtime.
Results & Evaluation
WILL ADD THIS LATER
What I Learned
WILL ADD After I learn...