Skip to main content

High-Performance Map Container Selection Guide

I. Containers and Their Underlying Technical Characteristics

The core value of ordered containers lies in supporting range queries and ordered traversal. The technical differences between the two mainstream containers stem from their underlying storage structures, which directly determine their performance boundaries and applicable scenarios.

turbo::btree_map/set is implemented based on a variant of B+ tree. Compared with the red-black tree of std::map, its node structure has been specially optimized—non-leaf nodes only store key-value indexes instead of complete data, while leaf nodes are linked through a doubly linked list, with all data concentrated in the leaf layer. This structure allows batch loading of consecutive leaf nodes during a single range query, significantly reducing the number of CPU memory accesses, and the cache hit rate is over 30% higher than that of red-black trees. Its expansion adopts a batch progressive migration strategy, which avoids business blocking caused by a large number of node rotations like red-black trees, and can maintain more stable write performance in scenarios with millions of Key insertions. In terms of C++17 compatibility, it natively supports std::string_view as the Key, eliminating the need for intermediate copying via std::string, and is compatible with structured binding syntax, so there's no need to manually call first and second when traversing key-value pairs. The core reason for the permanent stability of iterators is that insertion and deletion operations only adjust node pointers without moving the memory addresses of existing data. However, the index structure of non-leaf nodes brings about a 10% additional memory overhead, which accounts for a higher proportion in small dataset scenarios (fewer than 10,000 Keys), making it difficult to reflect its advantages.

tsl::ordered_map/set adopts a hybrid architecture of hash table + doubly linked list. The hash table provides O(1)-level fast lookup capabilities, while the doubly linked list maintains the insertion order of data to achieve the requirement of "unordered insertion and ordered traversal". Its hash buckets use chaining to resolve conflicts, but the linked list nodes are optimized with cache line padding—each node's size is aligned to 64 bytes to avoid false sharing caused by multiple nodes being squeezed into the same cache line. This makes its lookup performance about 20% higher than traditional chaining implementations. However, this container has an obvious technical shortcoming: it needs to rebuild the mapping relationship between hash buckets and linked list nodes during expansion. This process will cause temporary performance jitter, which is more obvious in high-frequency insertion scenarios with tens of thousands of operations per second. Therefore, it is more suitable for mixed scenarios with moderate insertion frequency and requirements for traversal order.

Unordered hash containers are the mainstream choice in projects. The differences in conflict resolution mechanisms and memory layouts among containers with different algorithms are the core factors determining their performance and applicable scenarios.

turbo::flat_hash_map/set with flat algorithm adopts open addressing + quadratic probing strategy. All key-value pairs are stored in a continuous memory array, eliminating the pointer overhead of linked list nodes in traditional chaining. It chooses quadratic probing instead of linear probing as the probing strategy, aiming to avoid the "clustering effect" of linear probing—where conflicting elements cluster together, leading to a sharp increase in probing steps. The step size of quadratic probing is (where i is the number of probes), which can distribute conflicting elements more evenly in the array. In terms of memory layout, it strictly follows 64-byte cache line alignment, and the size of each data block is adapted to the CPU's cache reading granularity, making the cache hit rate of small objects close to 100%, far higher than that of containers with linked list structures. In terms of C++17 feature support, its emplace operation implements perfect forwarding, which can directly construct objects in the memory array, avoiding the redundant step of "constructing temporary objects first and then copying" in the Abseil version. As a result, the insertion performance of small objects is improved by more than 15%. It should be noted that the invalidation of its iterators stems from memory reallocation during expansion—the address of the new array is different from the original array, and the original memory address pointed to by the iterator becomes invalid. Therefore, it is necessary to call reserve to preallocate sufficient space before batch insertion. In addition, its hash seed is stored in Thread Local Storage (TLS), and the TLS spaces of different processes are isolated from each other, leading to inconsistent hash values across processes and making it impossible to be directly used in serialization scenarios.

The core technology of F14 algorithm containers (folly::F14Map/Set, melon::F14Map/Set) is the hybrid addressing mechanism, whose core design goal is to balance cache friendliness at low load and conflict resolution capabilities at high load. When the load factor is lower than 0.5, it uses open addressing; when the load factor exceeds 0.5, it automatically switches to chaining. This switching threshold is verified through a large number of actual tests—0.5 is the critical point of the cache advantage of open addressing, and the conflict probability will increase significantly beyond this point. It supports a high load factor of up to 0.9, which benefits from the segmented hash bucket design: the entire hash table is divided into multiple independent segments, and each segment handles conflicts independently, avoiding performance degradation caused by global conflicts. In terms of serialization support, it has a built-in fixed hash seed, which is stored in the metadata area of the container and can be directly reused during cross-process deserialization, fundamentally solving the problem of hash stability. During project implementation, targeted fixes have been made for the high-concurrency deadlock problem of the Folly version—in the original implementation, when multiple threads lock different segments at the same time, lock order inversion may occur (thread 1 locks segment A first and then segment B, while thread 2 locks segment B first and then segment A). After the fix, all threads are forced to lock in the order of "segment number from smallest to largest", completely eliminating deadlocks. The melon version has streamlined the metadata of Folly, removing non-core functions such as statistical monitoring, reducing the metadata memory usage by 20%, and is more suitable for scenarios with tight memory resources such as edge nodes.

The core competitiveness of tsl::robin_map/set lies in the Robin Hood hashing conflict resolution mechanism, which is an optimization for open addressing. In traditional open addressing, conflicting elements are placed in the next free bucket, causing elements to move further away from their ideal hash buckets and leading to a sharp increase in the number of probing steps during lookup. The core logic of Robin Hood hashing is "fair competition"—when inserting an element, calculate its "distance" from the ideal bucket. If this distance is greater than the distance of the existing element in the current bucket, swap their positions to keep the element with the greater distance in the current bucket. This swapping logic can make all elements as close to their ideal buckets as possible, significantly reducing the number of probing steps during lookup. In scenarios with long string Keys, due to the higher probability of hash conflicts, the effect of this optimization is more obvious—the number of probing steps is reduced by 40% compared with traditional chaining, and the conflict resolution efficiency is improved by 30%. Its hash algorithm adopts fixed FarmHash without random seeds, natively supporting cross-process serialization. At the same time, it is natively compatible with std::string_view, avoiding data copying without additional conversion, and has significant advantages in scenarios with a high proportion of string Keys.

turbo::node_hash_map/set adopts a structure of chaining + independent node storage. Each node contains a key-value pair and two pointers (prev and next). Insertion and deletion operations only need to adjust pointer directions without changing the memory address of the node, which is the core technical reason for the permanent stability of its iterators. The advantage of this structure is that it supports dynamic insertion and deletion of data during traversal, without the need to collect Keys first and then process them in batches like the flat algorithm. However, it has obvious shortcomings—scattered node storage will lead to a significant decrease in CPU cache hit rate. Because the memory addresses of nodes are discontinuous, after the CPU reads one node each time, the next node is likely not in the current cache line, requiring a new memory access. Actual test data shows that in small object scenarios, its cache miss rate is 13 percentage points higher than that of flat algorithm containers, and the performance is therefore about 30% lower, making it only suitable for scenarios with rigid requirements for iterator stability.

Atomic lock-free containers (folly::AtomicHashMap/Set, melon::AtomicHashMap/Set) are the optimal solution for high-concurrency read scenarios. Their core technology is fine-grained atomic operations + double-word CAS mechanism. Their hash table is divided into multiple segments, and the head of each segment is stored through an atomic variable. Read operations do not require locking and directly obtain the head pointer through atomic_load. In high-concurrency read scenarios with 32 threads, the QPS is more than 50% higher than that of locked containers. To address the classic ABA problem of atomic operations, it adopts a double-word CAS scheme of "pointer + version number"—each node's pointer is accompanied by a 32-bit version number, which increments by 1 each time the node is modified. When a thread performs CAS, it not only compares the pointer value but also the version number, completely avoiding the problem where "thread 1 reads A, thread 2 modifies A to B and then back to A, and thread 1 mistakenly judges that A has not been modified". On this basis, the melon version has made compressed storage optimization: using the free high-order space of 64-bit addresses (most systems only use 48-bit addresses), it packs a 16-bit version number and a 48-bit pointer into a 64-bit variable. Compared with Folly's "64-bit pointer + 64-bit version number" scheme, the metadata memory usage is reduced by 50%, and the single-thread performance loss is only 3%, making it more suitable for resource-constrained scenarios such as edge nodes. It should be clearly stated that such containers are only applicable to read-heavy and write-light scenarios with a read-write ratio greater than 20:1. In high-concurrency write scenarios, multiple threads performing CAS on the head of the same segment will lead to a large number of retries and a sharp performance drop.

II. Scenario-Based Selection and Technical Basis

For small dataset scenarios (such as function parameter passing and temporary data storage), turbo::flat_hash_map is preferred. The core technical basis is that its continuous memory layout has high memory utilization and does not require frequent node memory allocation during construction. In scenarios with 10,000 Keys, the construction speed is 30% faster than that of std::unordered_map. If no additional dependencies are desired, std::unordered_map can be used directly, as the performance difference between the two is negligible for small datasets.

For large dataset scenarios storing small objects, turbo::flat_hash_map is still the first choice. The technical reason is that its cache line alignment design can maximize CPU cache hit rate and reduce memory access overhead. During implementation, it is necessary to call reserve to preallocate space in advance. The recommended preallocation size is 1.2 times the actual data volume, which is a verified optimal threshold—it can avoid expansion while not wasting too much memory. If memory resources are tight and the data load factor is higher than 0.7, folly::F14Map can be selected because its hybrid addressing mechanism has more stable performance under high load.

For high-concurrency read and low-concurrency write scenarios (such as configuration caching and dictionary query), turbo::flat_hash_map is selected for single-machine environments based on the technical basis of its excellent lookup performance and low memory usage. In concurrent environments, if the Map data is completely independent, atomic lock-free containers are selected because their lock-free read feature can maximize concurrent performance. If the Map data is strongly bound to business data (such as counters and user status), the slot hash sharding scheme must be adopted. The technical reason is that the built-in locks of the container cannot be linked with business locks; forced use will lead to lock order inversion and deadlocks. Manual sharding can strictly control the order of "locking business data first and then shard locks", balancing performance and data consistency.

For low-concurrency read and high-concurrency write scenarios (such as log burying points and data collection), folly::F14Map is selected for single-machine environments. The core technical basis is that its hybrid addressing mechanism has stable insertion and deletion performance under high load—with a load factor of 0.86, the average probing length is only 1.275, and the performance attenuation is less than 5%. If dynamic insertion and deletion of data during traversal are required, turbo::node_hash_map is selected because its iterator stability feature eliminates the need for additional Key processing. In concurrent environments, if the Map data is independent, pmap containers (if with built-in shard locks) are selected. The granularity of shard locks is finer than that of global locks, which can reduce lock contention. If the Map data is bound to business, slot hash sharding is selected, and the number of shards is recommended to be equal to the number of CPU cores to maximize multi-core parallel efficiency.

The selection for scenarios with special requirements must be closely linked to technical characteristics: when std::string_view is needed as a non-copy Key, tsl::robin_map is preferred because it natively supports it and has high hash conflict resolution efficiency. If turbo::flat_hash_map is used, a custom hash function based on std::string_view is required to avoid temporary copying. When stable iterators are needed for traversal and modification, turbo::node_hash_map is selected for single-machine environments, and pmap::flat_node_map is selected for concurrent independent Maps. For serialization or cross-process transmission, only F14, Robin, or atomic containers can be selected. The technical reason is that they support fixed hash seeds, ensuring cross-process hash consistency. For memory-sensitive embedded/edge node scenarios, turbo::flat_hash_map is preferred. If high load support is required, melon::F14Map is selected, benefiting from its streamlined metadata design.

III. Coding and Integration Technical Points

In terms of iterator usage, the iterators of flat, F14, and Robin series containers will be invalidated due to expansion, with the core reason being memory address reallocation. Therefore, direct insertion and deletion during traversal are prohibited, and the Keys to be operated must be collected first before batch processing. Node series iterators are stable, but locks must be added in concurrent scenarios; otherwise, data races will occur, leading to node pointer confusion.

In terms of concurrent coding, for scenarios where Map is bound to business data, the number of shards in slot hash sharding must be equal to the number of CPU cores, and the locks and data of each shard must be aligned to 64-byte cache lines to avoid false sharing. The lock order must strictly follow "business lock → shard lock", which is a necessary condition to avoid deadlocks. For scenarios where Map data is independent, atomic lock-free containers are selected for high-concurrency reads, and pmap containers with built-in shard locks are selected for high-concurrency writes. Encapsulating single-machine containers with std::mutex is prohibited. Actual tests show that the concurrent QPS of this scheme is only 60% of that of mature containers.

In terms of serialization coding, containers supporting fixed hash seeds are preferred, and the seed must be explicitly passed as a template parameter in the code. If turbo/Abseil containers must be used, a custom stateless hash function is required, such as one based on CityHash with a fixed seed passed in, to avoid the impact of process-local random seeds.

In terms of integration, turbo containers must enable C++17 compilation options (such as GCC's -std=c++17), otherwise std::string_view and structured binding cannot be supported. For Folly containers, the lock order inversion problem must be fixed first; for melon containers, it is necessary to confirm that the ABA problem has been fixed. For tsl::robin_map, the hash seed configuration must be verified to ensure stability in serialization scenarios. Self-developed pmap containers only support independent Map scenarios, and slot hash sharding must be used in binding scenarios.

IV. Quick Selection Reference Table

Business ScenarioRecommended ContainerCore Technical Basis
Single-machine small/large dataset, performance + memory sensitivityturbo::flat_hash_mapContinuous memory + cache line alignment, high memory utilization and cache hit rate
Insert/delete data during traversal (single-machine)turbo::node_hash_mapChaining with independent nodes, permanent iterator stability
Insert/delete data during traversal (concurrent independent Map)pmap::flat_node_mapBuilt-in shard locks + stable iterators, balancing concurrency and dynamic operations
Data serialization/cross-process transmissionfolly::F14Map/tsl::robin_mapFixed hash seed, cross-process hash consistency
High-concurrency read (independent Map)melon::AtomicHashMapLock-free read + compressed storage, excellent high-concurrency performance and low memory usage
High-concurrency write (independent Map)pmap::flat_hash_map (concurrent version)Built-in shard locks, fine lock granularity and low contention
Map bound to external data (concurrent)Slot hash sharding (turbo::flat_hash_map + shard locks)Controllable lock order, avoiding deadlocks and data inconsistency