Collision mitigation strategy

A project log for CLEFS - Custom Lightweight Embedded File System

A "Data Store" for embedded and read-intensive applications.

yann-guidon-ygdesYann Guidon / YGDES 09/02/2018 at 21:070 Comments

CLEFS is a sort of database, split into 2 parts : the storage and the index. The index is a large hash table that contains fixed-size information and critical fields, the rest (the variable sized data) is stored in the storage (I should maybe call this the vault).

Hash tables have well known characteristics. The hash algorithms are well studied but what bothers me is the collisions.

I don't want to use linked lists when a collision occurs. The index is actually a sparse, monotonicly increasing list of hashes so the index is derived from the hash either by truncation (of the LSB), or by a rule-of-three if the index size is not a power of 2 (for example with a compacted, condensed, read-only file system). Since the keys are sorted and also stored in full, one can find the actual location of the key with a linear scan of the list, backwards or forwards, if the location is not reached immediately.

The index should be able to grow or shrink at will, as it contains more or fewer keys. A power-of-two size is excellent for dynamic resizing.

The first method to deal with collisions is to "push" enough entries around. With a fill of 50%, it's pretty easy, there is enough free room after and before the target location. The dichotomy search algorithm will still find the entry after a small linear scan.

Today I have come up with a different method, and I still ignore if it is used in the wild. Instead of using linked lists or pushing other entries, the key itself could be ... rotated. This should reshuffle the bits nicely if the hash is of good quality. This is potentially slower than the previous method because consecutive accesses are almost free with today's technology. However, there is no need to "push" elements. The only additional requirement is to store a "rotation count" alongside with the key, as well as a reference counter (so the lookup code can know if the hash has a collision and how many items collided). I have to figure out a resizing (shrink/expansion of the index) strategy...