In performance-critical domains like high-frequency trading, latency is the difference between making and losing capital.
In this blog, we will optimize a textbook parser from processing ~414M ticks per second to processing ~25B ticks per second.
Version 0: The Naive, Idiomatic C++ Processor
A standard textbook feed handler models each tick as a polymorphic Message: we define a base class, derive specific tick types from it, and manage their lifetimes on the heap.
#pragma once
#include <vector>
#include <memory>
#include <cstdint>
// Generic market processing interface
class MarketMessage {
public:
virtual ~MarketMessage() = default;
virtual void process() = 0;
};
// Specific implementation for adding an order
class AddOrderMessage : public MarketMessage {
uint64_t timestamp;
uint64_t order_id;
uint32_t price;
uint32_t quantity;
char side; // Buy / Sell
public:
AddOrderMessage(uint64_t ts, uint64_t id, uint32_t p, uint32_t q, char s) :
timestamp(ts), order_id(id), price(p), quantity(q), side(s) {}
void process() override {
// Dummy state machine update logic
asm volatile("" : : "g"(price), "g"(quantity) : "memory");
}
};
void process_stream(const std::vector<std::shared_ptr<MarketMessage>>& stream) {
for (const auto& msg : stream) {
msg->process(); // Latency trap
}
}
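To make the pathology concrete, here is a minimal sketch (message count and field values are hypothetical) of how such a stream gets built. Every message is its own heap allocation, so consecutive elements land at unrelated addresses and the loop above degenerates into pointer chasing:
#include <memory>
#include <vector>
std::vector<std::shared_ptr<MarketMessage>> build_stream(size_t n) {
    std::vector<std::shared_ptr<MarketMessage>> stream;
    stream.reserve(n);
    for (size_t i = 0; i < n; i++) {
        // Each make_shared is a separate allocation scattered across the heap
        stream.push_back(std::make_shared<AddOrderMessage>(i, i, 100, 10, 'B'));
    }
    return stream;
}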
This code is clean, but it is fundamentally hostile to modern CPU architectures. The Google Benchmark and perf counter results below show it.
--------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------
Naive/10K 8095 ns 8067 ns 64906 items_per_second=1.23966G/s
Naive/100K 80437 ns 80079 ns 8692 items_per_second=1.24877G/s
Naive/1M 2424241 ns 2413241 ns 290 items_per_second=414.381M/s
Performance counter stats:
78,743,674 cache-misses:u
5,359,705,947 cache-references:u
We are hitting a cache cliff: at 10K and 100K messages the working set still fits comfortably in cache and throughput holds around ~1.24G items/s, but at 1M messages it collapses to ~414M/s. The CPU spends more time waiting on main memory and chasing vtable pointers than it does calculating anything.
Version 1: Data Oriented Design
To paraphrase Martin Thompson: to achieve mechanical sympathy, we must stop thinking about 'Objects' and start thinking about 'Data'. The CPU prefetcher loves predictable, contiguous memory. We need to kill the heap allocations, remove the vtables, and pack our data tightly so it aligns with the 64-byte L1 cache lines.
#include <cstdint>
#include <vector>
enum class MsgType : uint8_t {
ADD,
CANCEL,
MODIFY
};
// Flat contiguous POD struct
struct MarketMessagePod {
uint64_t timestamp;
uint64_t order_id;
uint32_t price;
uint32_t quantity;
MsgType type;
char side;
};
void process_stream(const std::vector<MarketMessagePod>& stream) {
for (const auto& msg : stream) {
// Switch-case replaces virtual dispatch
switch (msg.type) {
case MsgType::ADD:
// Dummy state machine update
asm volatile("" : : "g"(msg.price), "g"(msg.quantity) : "memory");
break;
// ...
}
}
}
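Before benchmarking, it is worth verifying the packing claim at compile time: the fields above total 26 bytes of payload, which the compiler pads to 32 bytes for the 8-byte members, so exactly two messages fit in one 64-byte cache line. A quick sketch (the exact padding assumes a mainstream x86-64 ABI):
static_assert(sizeof(MarketMessagePod) == 32,
              "expected two messages per 64-byte cache line");
static_assert(alignof(MarketMessagePod) == 8,
              "alignment is driven by the uint64_t fields");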
After switching to an array of structs (AoS), the results are immediate:
------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------
DoD/10K 2760 ns 2752 ns 253335 items_per_second=3.63315G/s
DoD/100K 31076 ns 30974 ns 22501 items_per_second=3.22853G/s
DoD/1M 1074340 ns 1068376 ns 650 items_per_second=936M/s
Performance counter stats:
99,429,596 cache-misses:u
7,491,399,026 cache-references:u
We can now process more than double the ticks, at ~936M per second. The absolute cache-miss and cache-reference counts both went up, but the metric that matters is the miss rate: $$ \frac{cache\_misses}{cache\_references} \times 100 \% $$
Plugging in the counters: $$ \frac{78{,}743{,}674}{5{,}359{,}705{,}947} \times 100\% \approx 1.47\% \qquad \frac{99{,}429{,}596}{7{,}491{,}399{,}026} \times 100\% \approx 1.33\% $$
So the cache-miss rate drops from $1.47\%$ for the naive Version 0 to $1.33\%$ for the data-oriented design.
Version 2: Static Resolution, CRTP and Branch Hinting
The memory layout is now right, but we face a control-flow bottleneck: the switch statement introduces conditional branches, risking pipeline flushes on mispredictions. We use the Curiously Recurring Template Pattern (CRTP) to resolve dispatch statically, plus branch-hinting attributes to steer the predictor. We reuse the MarketMessagePod from the V1 data-oriented design.
#include "data_oriented.hpp"
template <typename Derived>
class MessageHandler {
public:
inline void handle(const MarketMessagePod& msg) {
static_cast<Derived*>(this)->process_impl(msg);
}
};
class FastOrderBook : public MessageHandler<FastOrderBook> {
public:
inline void process_impl(const MarketMessagePod& msg) {
if (msg.type == MsgType::ADD) [[likely]] {
asm volatile("" : : "g"(msg.price), "g"(msg.quantity) : "memory");
}
else [[unlikely]] {
asm volatile("" : : "g"(msg.order_id) : "memory");
}
}
};
void process_stream(const std::vector<MarketMessagePod>& stream, FastOrderBook& book) {
const size_t size = stream.size();
const MarketMessagePod* __restrict__ data = stream.data();
#pragma GCC unroll 4
for (size_t i = 0; i < size; i++) {
book.handle(data[i]);
}
}
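A minimal driver (message count and field values are hypothetical) shows the payoff: handle() is a plain inline call, so the compiler can inline process_impl straight into the loop, with no vtable loads and no indirect calls:
#include <vector>
int main() {
    FastOrderBook book;
    std::vector<MarketMessagePod> stream(
        1'000'000, MarketMessagePod{0, 0, 100, 10, MsgType::ADD, 'B'});
    process_stream(stream, book); // fully devirtualized hot loop
    return 0;
}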
The throughput now jumps to 1.15 billion ticks per second.
------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------
SR_CRTP_BH/10K 3206 ns 3198 ns 185542 items_per_second=3.12661G/s
SR_CRTP_BH/100K 32419 ns 32318 ns 21525 items_per_second=3.09423G/s
SR_CRTP_BH/1M 874833 ns 869572 ns 795 items_per_second=1.14999G/s
Performance counter stats:
85,902,211 cache-misses:u
6,403,681,500 cache-references:u
Version 3: Silicon Symphony: SIMD Vectorization
Writing custom deep learning libraries teaches you very quickly that token generation throughput lives or dies by how well you pack the CPU's wide registers; processing one scalar at a time is a waste of silicon. We can apply the exact same SIMD treatment to our market data parser.
To use AVX2 intrinsics, we must flip our architecture inside out to a Struct of Arrays (SoA), so the 256-bit registers are fed uniform, contiguous data.
#include <immintrin.h> // AVX/SSE intrinsics
#include <cstdint>
#include <vector>
struct MarketDataSoA {
alignas(64) std::vector<uint64_t> timestamps;
alignas(64) std::vector<uint64_t> order_ids;
alignas(64) std::vector<uint32_t> prices;
alignas(64) std::vector<uint32_t> quantities;
alignas(64) std::vector<uint8_t> types;
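// Note: alignas(64) on a std::vector member aligns the vector object itself,
// not the heap buffer it manages; that is why the loop below sticks to
// unaligned loads (_mm256_loadu_si256).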
void reserve(size_t size) {
timestamps.reserve(size); order_ids.reserve(size);
prices.reserve(size); quantities.reserve(size); types.reserve(size);
}
};
inline void process_stream_simd(const MarketDataSoA& book, size_t size) {
const uint32_t* __restrict__ qty_ptr = book.quantities.data();
size_t i = 0;
__m256i vector_accumulator = _mm256_setzero_si256();
for (; i + 8 <= size; i += 8) {
__m256i q_vec = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(qty_ptr + i));
vector_accumulator = _mm256_add_epi32(vector_accumulator, q_vec);
}
for (; i < size; i++) {
asm volatile("" : : "g"(qty_ptr[i]) : "memory");
}
asm volatile("" : : "x"(vector_accumulator) : "memory");
}
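In the benchmark the accumulator is only kept alive by the asm statement so the compiler cannot delete the loop. A real handler would reduce it to a scalar at the end; here is a sketch of the standard AVX2 horizontal sum (a helper of our own, not part of the benchmark):
#include <immintrin.h>
#include <cstdint>
// Reduce the eight packed 32-bit lanes of an __m256i to one scalar sum.
inline uint32_t hsum_epi32(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);      // lanes 0..3
    __m128i hi = _mm256_extracti128_si256(v, 1); // lanes 4..7
    __m128i s  = _mm_add_epi32(lo, hi);          // four partial sums
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return static_cast<uint32_t>(_mm_cvtsi128_si32(s)); // total in lane 0
}
The benchmark results: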
------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------
SIMD/10K 298 ns 297 ns 2340635 items_per_second=33.6261G/s
SIMD/100K 3374 ns 3364 ns 207721 items_per_second=29.7253G/s
SIMD/1M 38100 ns 37976 ns 18422 items_per_second=26.3322G/s
Performance counter stats:
186,998,149 cache-misses:u
9,390,307,057 cache-references:u
26 billion ticks per second: a staggering 60x-plus gain over the naive baseline. But raw throughput alone is still not enough for low-latency trading.
Version 4: Kernel Bypass
A single Linux scheduler tick takes about 3 to 5 microseconds. In that tiny window, our Version 3 SIMD loop could have processed over 100,000 messages. If the OS decides to pause our thread, we lose the trade.
We must guarantee determinism by isolating our thread. A custom kernel lets us tune the boot parameters (isolcpus=2 nohz_full=2 rcu_nocbs=2) to completely evict the OS from a specific core.
Then, we bind our execution strictly to that physical silicon:
#pragma once
#include <pthread.h>
#include <sched.h>
#include <iostream>
#include "simd_vectorized.hpp" // Reusing our peak SIMD logic
class HardwareIsolator {
public:
static void pin_thread_to_core(int core_id) {
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
pthread_t current_thread = pthread_self();
if (pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset) != 0) {
std::cerr << "WARNING: Failed to pin thread to core " << core_id << "\n";
}
}
};
// The V4 processing function simply wraps V3 but ensures we are pinned
inline void process_stream(const MarketDataSoA& book, size_t size, int target_core) {
HardwareIsolator::pin_thread_to_core(target_core);
// Execute the hyper-optimized SIMD loop
process_stream_simd(book, size);
}
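Tying it together, a minimal driver (sizes and values are hypothetical; the core index must match the isolcpus= boot parameter):
int main() {
    MarketDataSoA book;
    constexpr size_t N = 1'000'000;
    book.reserve(N);
    for (size_t i = 0; i < N; i++)
        book.quantities.push_back(10); // only quantities feed the SIMD loop
    process_stream(book, N, /*target_core=*/2); // 2 == the isolated core
    return 0;
}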
Running the benchmark with FIFO real-time priority (sudo chrt -f 99), we get:
----------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------
Metal/10K 1222 ns 1219 ns 578139 items_per_second=8.20443G/s
Metal/100K 4604 ns 4592 ns 142760 items_per_second=21.7761G/s
Metal/1M 39963 ns 39890 ns 17527 items_per_second=25.0692G/s
Performance counter stats:
128,716,735 cache-misses:u
6,947,745,302 cache-references:u
Bypassing the OS slightly decreased our peak throughput from ~26 G/s to ~25 G/s. This is because we have hit the physical limits of the silicon's memory bandwidth.
But in low latency arenas, peak throughput is a vanity metric; jitter is the only metric that pays. We traded a tiny fraction of theoretical throughput for absolute, unwavering determinism.
The Numbers
| Method | Cache Misses | Cache References | Miss Rate | Throughput |
|---|---|---|---|---|
| Naive | 78,743,674 | 5,359,705,947 | 1.47% | 414.381M/s |
| Data Oriented Design | 99,429,596 | 7,491,399,026 | 1.33% | 936M/s |
| Static Resolution + CRTP | 85,902,211 | 6,403,681,500 | 1.34% | 1.15G/s |
| SIMD Vectorization | 186,998,149 | 9,390,307,057 | 1.99% | 26.3322G/s |
| Metal (Isolated Core) | 128,716,735 | 6,947,745,302 | 1.85% | 25.0692G/s |
Source
NOTE: The test was done on a Ryzen 7 7435HS with
sudo cpupower frequency-set -g performance
The code files for this implementation are available here and are free to use under the Boost Software License 1.0.