The problem with Redis
Redis executes every command on a single thread. This was a brilliant design choice in 2009: no locks, no race conditions, dead-simple reasoning. But modern servers have 16, 32, or 64+ cores, and most of them sit idle while Redis saturates one.
The Redis answer is clustering: shard your data across multiple Redis instances. This works, but it adds complexity (slot migration, cross-slot errors, resharding), latency (network hops between nodes), and cost (multiple processes, multiple memory spaces).
Lux asks: what if a single process could use all your cores safely?
Sharded architecture
struct Store {
    shards: Box<[RwLock<Shard>]>,
}
shard_count = num_cpus * 16
shard_index = fx_hash(key) % shard_count
Lux splits its keyspace into N shards, each protected by a reader-writer lock. The shard count auto-tunes based on CPU count: 64 shards on a 4-core machine, 256 on a 16-core. Each shard is cache-line aligned (128 bytes) to prevent false sharing between cores.
Reader-writer locks let multiple GET requests to the same shard run in parallel without blocking one another. Only writes require exclusive access, and only to the specific shard being written to. With 256 shards and uniformly hashed keys, the probability that two concurrent writes hit the same shard is 1/256, or just under 0.4%.
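A minimal sketch of the sharded layout described above, not Lux's actual code. It assumes byte-vector keys and values, takes the CPU count as a parameter, and uses the standard library's DefaultHasher as a stand-in for fx_hash; the get and set method names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// One shard. The 128-byte alignment keeps neighbouring shards (and the
/// locks wrapping them) off the same cache line.
#[repr(align(128))]
#[derive(Default)]
struct Shard {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

struct Store {
    shards: Box<[RwLock<Shard>]>,
}

impl Store {
    /// `num_cpus` is passed in by the caller; 16 shards are created per core.
    fn new(num_cpus: usize) -> Self {
        let shard_count = num_cpus * 16;
        let shards: Box<[RwLock<Shard>]> = (0..shard_count)
            .map(|_| RwLock::new(Shard::default()))
            .collect();
        Store { shards }
    }

    /// Assumption: DefaultHasher stands in for fx_hash here.
    fn shard_index(&self, key: &[u8]) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.shards.len()
    }

    /// Readers take a shared lock, so GETs on one shard can overlap.
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        let shard = self.shards[self.shard_index(key)].read().unwrap();
        shard.map.get(key).cloned()
    }

    /// Writers take an exclusive lock on a single shard only.
    fn set(&self, key: &[u8], value: &[u8]) {
        let mut shard = self.shards[self.shard_index(key)].write().unwrap();
        shard.map.insert(key.to_vec(), value.to_vec());
    }
}
```

With this shape, Store::new(4) builds 64 shards, and a write to one key leaves the other 63 shards free for readers and writers.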
Zero-copy RESP parser
A typical Redis command like SET mykey myvalue arrives as bytes on a TCP socket. Most parsers allocate a new String for each argument. For a 64-command pipeline, that is 192+ heap allocations before a single command executes.
Lux's parser returns &[u8] slices that point directly into the read buffer. Zero allocations. The parsed command is a pointer and a length into memory you already have. Strings are only allocated at the moment of insertion into the store.
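To make the borrowing concrete, here is a small sketch of a parser that returns slices into the read buffer. It is an illustration under assumptions, not Lux's parser: it handles only a RESP array of bulk strings, the function names are hypothetical, and the Vec that holds the slices is itself one small allocation, while the argument bytes are never copied.

```rust
/// Reads a decimal length terminated by CRLF, starting at `pos`.
/// Returns the number and the position just past the CRLF.
fn read_len(buf: &[u8], mut pos: usize) -> Option<(usize, usize)> {
    let start = pos;
    let mut n: usize = 0;
    while pos < buf.len() && buf[pos].is_ascii_digit() {
        n = n.checked_mul(10)?.checked_add((buf[pos] - b'0') as usize)?;
        pos += 1;
    }
    if pos == start || !buf[pos..].starts_with(b"\r\n") {
        return None;
    }
    Some((n, pos + 2))
}

/// Parses one RESP array of bulk strings. Every argument is a slice
/// borrowed from `buf`. Returns the arguments and the number of bytes
/// consumed, or None if the buffer is incomplete or malformed.
fn parse_command(buf: &[u8]) -> Option<(Vec<&[u8]>, usize)> {
    if buf.first() != Some(&b'*') {
        return None;
    }
    let (argc, mut pos) = read_len(buf, 1)?;
    // Cap the initial capacity so a bogus count cannot force a huge allocation.
    let mut args = Vec::with_capacity(argc.min(16));
    for _ in 0..argc {
        if buf.get(pos) != Some(&b'$') {
            return None;
        }
        let (len, start) = read_len(buf, pos + 1)?;
        let end = start.checked_add(len)?;
        let arg = buf.get(start..end)?; // borrowed, not copied
        if !buf[end..].starts_with(b"\r\n") {
            return None;
        }
        args.push(arg);
        pos = end + 2;
    }
    Some((args, pos))
}
```

Because the returned slices carry the buffer's lifetime, the compiler rejects any attempt to reuse or free the buffer while parsed arguments are still in use.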
before:
fn parse(buf: &[u8]) -> Vec<String> // 192 heap allocs per pipeline
after:
fn parse(buf: &[u8]) -> Vec<&[u8]> // 0 per-argument heap allocs
Pipeline batching by shard
When a client sends a pipeline of 64 commands, the naive approach processes them one at a time: lock shard, execute, unlock, lock next shard, execute, unlock. That is 64 lock acquisitions.
Lux sorts the pipeline by shard. Commands hitting the same shard are grouped together and executed under a single lock acquisition. For a 64-command pipeline hitting 20 unique shards, that is 20 lock acquisitions instead of 64.
For single-key commands (SET, GET, INCR, LPUSH, RPUSH, LPOP, RPOP, SADD, HSET, ZADD, ZPOPMIN, and more), Lux uses a fast path that bypasses the full command dispatcher and operates directly on the already-locked shard data. No redundant lock acquisition, no command re-parsing. This is why LPUSH hits 6.5M ops/sec and ZPOPMIN hits 11.5M ops/sec at pipeline=64.
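The grouping step can be sketched as follows. This is a simplified illustration, not Lux's implementation: it supports only GET and SET, takes a write lock for every group, uses a small FNV-1a function as a stand-in for fx_hash, and returns owned responses instead of writing to a scratch buffer. All names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

type Shard = HashMap<Vec<u8>, Vec<u8>>;

enum Cmd<'a> {
    Get(&'a [u8]),
    Set(&'a [u8], &'a [u8]),
}

impl<'a> Cmd<'a> {
    fn key(&self) -> &'a [u8] {
        match *self {
            Cmd::Get(k) | Cmd::Set(k, _) => k,
        }
    }
}

/// Assumption: FNV-1a stands in for fx_hash.
fn hash_key(key: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in key {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Runs a pipeline with one lock acquisition per distinct shard.
/// Returns responses in original command order plus the lock count.
fn run_pipeline(shards: &[RwLock<Shard>], cmds: &[Cmd<'_>]) -> (Vec<Option<Vec<u8>>>, usize) {
    // (shard index, original position) for every command.
    let mut order: Vec<(usize, usize)> = cmds
        .iter()
        .enumerate()
        .map(|(i, c)| ((hash_key(c.key()) % shards.len() as u64) as usize, i))
        .collect();
    // Sorting by shard, then by position, keeps per-key command order intact.
    order.sort();

    let mut responses: Vec<Option<Vec<u8>>> = vec![None; cmds.len()];
    let mut locks = 0;
    let mut i = 0;
    while i < order.len() {
        let shard_idx = order[i].0;
        let mut shard = shards[shard_idx].write().unwrap();
        locks += 1;
        // Execute every command for this shard under the same guard.
        while i < order.len() && order[i].0 == shard_idx {
            let pos = order[i].1;
            responses[pos] = match cmds[pos] {
                Cmd::Get(k) => shard.get(k).cloned(),
                Cmd::Set(k, v) => {
                    shard.insert(k.to_vec(), v.to_vec());
                    Some(b"OK".to_vec())
                }
            };
            i += 1;
        }
    }
    (responses, locks)
}
```

Storing each response at its original position is what lets the batch execute in shard order while the client still sees replies in the order it sent the commands.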
64-command pipeline, 20 unique shards:
Redis: 64 commands processed sequentially
Lux: 20 lock acquisitions, ~3.2 commands per lock
Lux: responses written to single scratch buffer
Lux: reordered to original command order on output
Eliminating hidden costs
Cached clock
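A naive key expiration check calls Instant::now() every time, and each call reads the OS clock (a system call on some platforms, a cheaper vDSO call on others). On ARM that costs 100-200ns per call. Lux captures the timestamp once per batch of commands read from the socket and passes it to every command in the batch. A 64-command pipeline does 1 clock read instead of 64.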
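A short sketch of the idea, assuming a hypothetical Entry type with an optional expiry time. The clock is read once and the resulting timestamp is passed into every expiry check.

```rust
use std::collections::HashMap;
use std::time::Instant;

struct Entry {
    value: Vec<u8>,
    expires_at: Option<Instant>,
}

impl Entry {
    /// Expiry check against a caller-supplied timestamp; no clock read here.
    fn is_expired(&self, now: Instant) -> bool {
        self.expires_at.map_or(false, |deadline| now >= deadline)
    }
}

/// Looks up every key in the batch using a single clock read.
fn get_batch<'a>(map: &'a HashMap<Vec<u8>, Entry>, keys: &[&[u8]]) -> Vec<Option<&'a [u8]>> {
    let now = Instant::now(); // one read for the whole batch
    keys.iter()
        .map(|k| match map.get(*k) {
            Some(entry) if !entry.is_expired(now) => Some(entry.value.as_slice()),
            _ => None,
        })
        .collect()
}
```

The tradeoff is that all commands in a batch share one slightly stale timestamp, which is usually acceptable because a batch executes in far less time than typical TTL granularity.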
Pre-computed hashing
A sharded store built on a standard HashMap hashes each key twice: once to find the shard, once for the internal lookup. Lux uses hashbrown's raw_entry() API to compute the hash once and reuse it for both shard selection and HashMap lookup. This also allows byte-slice lookups against String keys without UTF-8 conversion overhead.
Byte-level command dispatch
Redis converts command names to uppercase strings for matching. Lux compares raw bytes with inline case-insensitive comparison. No allocation, no conversion. cmd_eq(args[0], b"SET") compiles down to a handful of CPU instructions.
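One plausible shape for cmd_eq is sketched below, using the standard library's eq_ignore_ascii_case on byte slices. The body, the Command enum, and the dispatch function are assumptions for illustration; only the cmd_eq(args[0], b"SET") call shape comes from the text above.

```rust
/// ASCII case-insensitive comparison on raw bytes; no allocation,
/// no uppercase copy of the input.
#[inline]
fn cmd_eq(input: &[u8], name: &[u8]) -> bool {
    input.eq_ignore_ascii_case(name)
}

#[derive(Debug, PartialEq)]
enum Command {
    Get,
    Set,
    Unknown,
}

/// Picks a command from the first argument of a parsed request.
fn dispatch(args: &[&[u8]]) -> Command {
    match args.first() {
        Some(&name) if cmd_eq(name, b"GET") => Command::Get,
        Some(&name) if cmd_eq(name, b"SET") => Command::Set,
        _ => Command::Unknown,
    }
}
```

Since the lengths are compared before any bytes, a mismatched command name such as SETX is rejected without scanning it.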
Zero-alloc response assembly
Batched commands write responses into a single pre-allocated scratch buffer. Offset pairs track where each response starts and ends. At the end, responses are copied to the output buffer in original command order. One allocation per pipeline instead of one per command.
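The scratch-buffer bookkeeping might look like the sketch below. The ResponseBatch type and its methods are hypothetical, and this version allocates its buffers in new(); a server would presumably keep them per connection and reuse them across pipelines.

```rust
/// Collects responses in execution order, then emits them in
/// original command order.
struct ResponseBatch {
    scratch: Vec<u8>,
    /// spans[i] is the (start, end) range in `scratch` for original command i.
    spans: Vec<(usize, usize)>,
}

impl ResponseBatch {
    fn new(command_count: usize) -> Self {
        ResponseBatch {
            // 16 bytes per response is an arbitrary starting guess.
            scratch: Vec::with_capacity(command_count * 16),
            spans: vec![(0, 0); command_count],
        }
    }

    /// Appends one response and records where it lives in the scratch buffer.
    fn push(&mut self, original_index: usize, response: &[u8]) {
        let start = self.scratch.len();
        self.scratch.extend_from_slice(response);
        self.spans[original_index] = (start, self.scratch.len());
    }

    /// Copies every response to `out` in original command order.
    fn flush_into(&self, out: &mut Vec<u8>) {
        out.reserve(self.scratch.len());
        for &(start, end) in &self.spans {
            out.extend_from_slice(&self.scratch[start..end]);
        }
    }
}
```

Commands can then run in shard order and call push with their original position, while flush_into restores the order the client expects.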
Why Rust?
This architecture is only possible because of Rust. The borrow checker guarantees at compile time that our zero-copy slices cannot outlive the read buffer. The type system prevents data races between threads accessing shared shards. The ownership model means we can hold a read lock, write a RESP response directly from borrowed data, and the compiler proves it is safe.
In C, this level of concurrency would require careful manual memory management and years of debugging race conditions. In Go or Java, the garbage collector would add unpredictable latency spikes. Rust gives us C-level performance with memory safety guarantees, and that combination is what makes Lux possible.
The result
12M ops/sec peak
Pipeline advantage grows with depth
~2MB total binary
Every optimization compounds. Zero-copy parsing saves allocations. Shard batching reduces lock overhead. Cached clocks eliminate repeated clock reads. Pre-computed hashes skip redundant work. And Rust makes the concurrency model tractable. At pipeline=1, Lux and Redis are roughly equal in the published benchmark because both are mostly network-bound. As pipeline depth rises, Lux's same-shard batching and multi-core execution create a wider advantage on single-key workloads. Multi-key commands have different tradeoffs and should be benchmarked against your actual command mix.