Over the last few weeks, I decided to build Kache, a Redis-like in-memory KV store for cache servers, entirely from scratch in C++. The goal wasn’t just to clone Redis, but to deeply understand how real cache servers work under the hood: how they store data efficiently, how they handle multiple clients without blocking, and how they manage time-based expiration.
At a high level, Kache is a single-threaded, event-driven TCP server with an in-memory data store and a small background worker pool. All client commands execute on the event loop; anything that might block is deferred.
This blog post is a deep dive into Kache’s internals: what I built, how it works, and what I learned along the way. I’ll walk through the architecture, core data structures, command execution, timers, background threads, and the performance characteristics of the system.
You can check out the code on GitHub: sakkshm/kache.
This project was less about “shipping a product” and more about learning systems programming by building something.
Why Build Kache?
Before diving into the technical details, here’s why I decided to take this project on.
- Learn by doing: Reading about Redis or Memcached only takes you so far. Implementing a cache server forces you to understand the real trade-offs: latency vs complexity, memory layout, blocking vs non-blocking code, and correctness under concurrency.
- Understand performance bottlenecks: In-memory caches are all about low latency. I wanted to see how design choices like event loops, intrusive data structures, and balanced trees actually affect performance.
- Experiment with system design ideas: Redis is extremely optimized and mature. By building a smaller version, I could experiment freely with ideas like TTL heaps, thread pools, and binary protocols.
- Hands-on C++ systems programming: Working with raw pointers, custom allocators, intrusive nodes, and manual memory management gave me a much deeper appreciation for how data layout impacts speed.
High-Level Architecture
Kache’s architecture is inspired by Redis but intentionally simplified. Clients connect to the server over TCP and send binary-encoded requests. These requests are handled by a single-threaded event loop built on poll, which is responsible for reading incoming data, parsing commands, executing them, and managing timers such as key expirations.
All data lives in an in-memory store that implements the core data structures, including hashmaps and sorted sets (ZSets). To keep the event loop fast and predictable, any work that could block or take unpredictably long, such as deferred cleanup, expiration handling, or other maintenance tasks, is offloaded to a background thread pool. This design keeps the hot path minimal while still allowing the system to grow in functionality without compromising latency.
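To make this concrete, here is roughly the per-connection state an event-driven design like this needs. This is a sketch rather than Kache’s exact struct; the field names (like read_buf, which also appears in the parsing snippet later in this post) are illustrative:

```cpp
#include <string>

// Sketch of per-connection state; field names are illustrative.
struct Conn {
    int fd = -1;              // non-blocking client socket
    bool want_read = true;    // interested in POLLIN events
    bool want_write = false;  // interested in POLLOUT events
    bool want_close = false;  // tear down once the response is flushed
    std::string read_buf;     // bytes received but not yet parsed
    std::string write_buf;    // response bytes not yet written out
};
```

Everything the loop needs to resume a half-read request or a half-written response lives in these buffers.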
The Event Loop
The event loop is the heart of Kache. It uses poll() to monitor all active client sockets and processes timers in the same loop.
// Simplified code
while (true) {
    poll(fds, nfds, timeout);
    for (auto &fd : fds) {
        if (fd.revents & POLLIN) handle_read(fd);
        if (fd.revents & POLLOUT) handle_write(fd);
    }
    process_timers();
}

A few key design choices here:
- Non-blocking IO: All sockets are non-blocking. The event loop never waits on a slow client (a minimal example follows at the end of this section).
- Single-threaded execution: All command execution happens on one thread. This avoids locks entirely on the data structures.
- Timers in the same loop: TTL expiration is checked during each iteration, ensuring expired keys are removed promptly.
As long as the loop stays tight, Kache can handle thousands of concurrent clients efficiently.
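Making a socket non-blocking is a single fcntl() call per file descriptor. Here’s the standard POSIX pattern, shown as a generic sketch (error handling omitted) rather than Kache’s exact code:

```cpp
#include <fcntl.h>

// Switch a socket to non-blocking mode so read()/write() never stall the loop.
static void set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```

Once every socket is in this mode, a read that would block fails with EAGAIN instead of stalling, and the loop simply moves on to the next ready descriptor.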
Protocol Design
Kache uses a simple length-prefixed binary protocol in the Tag-Length-Value style. Every request looks like this:
[length: 4 bytes][command payload]
Command payload looks like:
[n strings] (no. of strings) (4 bytes)
[len 1] (len of str 1) (4 bytes)
[str 1] (string 1)
[len 2] (len of str 2) (4 bytes)
[str 2] (string 2)
.
.
.
[len n] (len of str n) (4 bytes)
[str n]
Response looks like:
[4 bytes] resp. length
[4 bytes] status code
[n bytes] response/data
This makes parsing deterministic and avoids the ambiguity of line-based or delimiter-based protocols.
if (conn.read_buf.size() >= 4) {
    uint32_t len = read_u32(conn.read_buf);
    if (conn.read_buf.size() >= 4 + len) {
        parse_command(conn.read_buf.substr(4, len));
    }
}

The choice of binary over text was deliberate:
- Faster parsing
- No string tokenization overhead
- Easy to reason about correctness
For a learning project, this keeps things clean and predictable.
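To make the framing concrete, here’s roughly what encoding a request like set key value looks like on the client side. This is a sketch of the layout above; the helper names (append_u32, encode_request) are made up for illustration, and the byte order is simply whatever the client and server agree on:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Append a 32-bit value as 4 raw bytes (host byte order, for simplicity).
static void append_u32(std::string &out, uint32_t v) {
    out.append(reinterpret_cast<const char *>(&v), 4);
}

// Frame {"set", "key", "value"} as
// [total length][n strings][len 1][str 1] ... [len n][str n].
static std::string encode_request(const std::vector<std::string> &args) {
    std::string payload;
    append_u32(payload, static_cast<uint32_t>(args.size()));
    for (const std::string &s : args) {
        append_u32(payload, static_cast<uint32_t>(s.size()));
        payload += s;
    }
    std::string msg;
    append_u32(msg, static_cast<uint32_t>(payload.size()));
    msg += payload;
    return msg;
}
```

Decoding a response is the mirror image: read 4 bytes of length, then 4 bytes of status, then treat the remaining bytes as the data.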
Hashmap Internals
The backbone of Kache is an intrusive hash table. Instead of std::unordered_map, each object embeds its own hash node:
struct HNode {
    HNode* next;
    uint64_t hcode;
};

Key points:
- Intrusive design reduces allocations
- Linked-list chaining handles collisions
- Power-of-two buckets allow fast masking instead of modulo
Lookup is straightforward:
size_t pos = key->hcode & table.mask;
for (HNode* n = table.tab[pos]; n; n = n->next) {
    if (eq(n, key)) return n;
}

This design mirrors how Redis internally manages many of its core structures.
Sorted Sets (ZSets)
Sorted sets were one of the most interesting parts to implement.
Each ZSet combines:
- Hashmap → O(1) lookup by name
- AVL tree → ordered traversal by (score, name)
Insertion looks like this:
ZNode *node = znode_new(name, len, score);
hm_insert(&zset->hmap, &node->hmapNode);
tree_insert(zset, node);

And queries support offsets and limits:
zquery(score, name, offset, limit);

This hybrid structure is exactly what makes ZSets so powerful: they allow fast point lookups and ordered iteration at the same time.
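Under the hood, the node type is what ties the two indexes together: each entry embeds both an intrusive hash node and an intrusive tree node, so a single allocation is reachable from the hashmap and from the AVL tree. A sketch of the idea (not Kache’s exact layout; AVLNode here is just a stand-in):

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in AVL node; a real one also tracks height/balance information.
struct AVLNode {
    AVLNode* left = nullptr;
    AVLNode* right = nullptr;
    AVLNode* parent = nullptr;
};

// One sorted-set entry, indexed by name (hashmap) and by (score, name) (AVL tree).
struct ZNode {
    HNode hmapNode;    // intrusive hook into the ZSet's hashmap
    AVLNode treeNode;  // intrusive hook into the ZSet's AVL tree
    double score = 0;
    size_t len = 0;    // length of the name
    char name[1];      // name bytes stored inline after the struct
};
```

Removal is then just a matter of unlinking the node from both structures and freeing it once; there is no separate key object to track.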
Expiration and Timers
Kache supports TTL using a min-heap of expiration times:
struct HeapItem {
    uint64_t expire_at;
    size_t* ref;
};

How it works:
- Earliest expiration sits at the top
- process_timers() pops expired entries
- Updates adjust heap positions efficiently
This avoids periodic scans of all keys and keeps expiration costs proportional to actual expirations.
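Conceptually, the timer pass at the end of each loop iteration looks something like this. It’s a simplified sketch built on std::pop_heap; Kache’s real heap also maintains the ref back-pointers so a TTL update can re-position an item in place:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// std::*_heap builds a max-heap by default, so invert the comparison
// to keep the earliest expiration at the front.
static bool expires_later(const HeapItem& a, const HeapItem& b) {
    return a.expire_at > b.expire_at;
}

// Pop every entry whose deadline has passed. The cost is proportional to
// the number of actual expirations, not the number of keys.
static void process_timers(std::vector<HeapItem>& heap, uint64_t now_ms) {
    while (!heap.empty() && heap.front().expire_at <= now_ms) {
        // Kache would delete the key this item refers to here; the sketch
        // just drops the heap entry to stay self-contained.
        std::pop_heap(heap.begin(), heap.end(), expires_later);
        heap.pop_back();
    }
}
```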
Background Thread Pool
One subtle issue with caches is memory deallocation latency. Freeing large structures can block the main loop.
To avoid this, Kache uses a small thread pool:
thread_pool.enqueue([obj] {
    delete obj;
});

This ensures:
- Event loop stays responsive
- Deletions don’t cause latency spikes
- Cleanup work happens asynchronously
It’s a simple idea, but it makes a big difference under load.
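A pool like this is just a queue of callbacks, a mutex, and a condition variable that workers sleep on. Here’s a minimal generic sketch, not Kache’s exact implementation, with shutdown handling left out for brevity:

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Minimal fixed-size thread pool. The pool is assumed to live for the
// lifetime of the process, so shutdown/join logic is omitted.
class ThreadPool {
public:
    explicit ThreadPool(size_t num_threads) {
        for (size_t i = 0; i < num_threads; i++) {
            std::thread([this] { worker_loop(); }).detach();
        }
    }

    void enqueue(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mu);
            jobs.push_back(std::move(job));
        }
        cv.notify_one();  // wake one sleeping worker
    }

private:
    void worker_loop() {
        while (true) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu);
                cv.wait(lock, [this] { return !jobs.empty(); });
                job = std::move(jobs.front());
                jobs.pop_front();
            }
            job();  // heavy work (e.g. freeing a large object) runs off the event loop
        }
    }

    std::mutex mu;
    std::condition_variable cv;
    std::deque<std::function<void()>> jobs;
};
```

With this in place, the enqueue call above simply hands the deletion to whichever worker wakes up first, and the event loop never pays for it.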
Supported Commands
Kache supports a minimal but useful command set.
| Command | Description |
|---|---|
| `set <key> <value>` | Set a value for a key |
| `get <key>` | Retrieve the value of a key |
| `del <key>` | Delete a key and its value |
| `expire <key> <time>` | Set a TTL for a key (time in milliseconds) |
| `persist <key>` | Remove the TTL from a key |
| `zadd <key> <score> <name>` | Add a (name, score) pair to a sorted set |
| `zrem <key> <name>` | Remove an entry from the sorted set |
| `zscore <key> <name>` | Get the score associated with a name |
| `zquery <key> <score> <name> <offset> <limit>` | Query a sorted set with ordering and pagination |
Each command is designed to execute in predictable time with minimal allocations.
What I Learned
Building Kache turned into a crash course in systems programming:
- Event-driven architecture
- Avoiding locks through single-threaded design
- Memory layout and cache locality
- TTL scheduling with heaps
- Thread pool fundamentals
- Protocol design and framing
It felt like building a miniature Redis engine, and that made the internals of real-world systems click in a way books never quite managed.
Benchmarks
I ran a production-mode benchmark with -O3, logging disabled, and a mixed workload over persistent TCP connections. All tests were run on a single machine (CPU model), with the benchmark clients on the same host.
Results
| Metric | Value |
|---|---|
| Total requests | 1,220,000 |
| Throughput | 116,541 req/s |
| Average latency | 0.0296 ms |
| Min latency | 0.0146 ms |
| Max latency | 0.5257 ms |
| p95 latency | 0.0496 ms |
| p99 latency | 0.0546 ms |
Observations
- Sub-0.03 ms average latency suggests that most operations stay in CPU cache.
- Tight p95 and p99 values indicate low jitter and predictable execution.
- Rare latency spikes were tied to bulk cleanup or timer bursts.
- Single-thread throughput crossed 100k req/s, validating the event-loop design.
For a from-scratch C++ cache, this was extremely satisfying to see.
Closing Thoughts
Kache taught me more than I expected:
- How high-performance caches really work
- Why data structure choices matter so much
- How memory management impacts latency
- How real systems balance simplicity and performance
It’s not production-ready, and that’s fine. The real value was in building it end-to-end and understanding each piece deeply.