alexeyfv


Hash collision counting in FrozenDictionary

Tags: FrozenDictionary, C#, Benchmarks, Algorithms

This is the second short post about implementation details of FrozenDictionary that didn’t make it into the main article. Today we’ll look at the hash collision counting algorithm.

In FrozenDictionary, all key hash codes are known in advance. This allows choosing a number of buckets such that the collision rate does not exceed a certain threshold — currently 5%. Here’s how it works:

  1. A bucket count n is chosen from a predefined range. This range was tuned heuristically by the .NET team.
  2. For each hash code, the remainder of division by n is calculated, and duplicates are counted.
  3. If there are too many collisions, n is increased.
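Putting the three steps together, the outer selection loop might look roughly like this. This is a simplified sketch based on the description above, not the actual .NET source: the doubling candidate sequence, the `ChooseBucketCount` name, and the `CountCollisions` helper are all illustrative assumptions (the real implementation picks candidates from its own tuned table).

```csharp
using System;

class BucketCountSketch
{
    // Sketch of the selection loop: try larger and larger bucket counts
    // until the collision rate drops to the threshold (5% by default).
    // Assumes the hash codes are mostly distinct, so the rate eventually falls.
    static int ChooseBucketCount(int[] hashCodes, double maxCollisionRate = 0.05)
    {
        for (var n = Math.Max(1, hashCodes.Length); ; n *= 2)
        {
            if (CountCollisions(hashCodes, n) <= hashCodes.Length * maxCollisionRate)
                return n;
        }
    }

    // Count how many hash codes land in an already-occupied bucket.
    static int CountCollisions(int[] hashCodes, int bucketsCount)
    {
        var seenBuckets = new bool[bucketsCount];
        var collisionsCount = 0;
        foreach (var hash in hashCodes)
        {
            // Cast to uint so negative hash codes can't produce a negative index.
            var bucketIndex = (uint)hash % (uint)bucketsCount;
            if (seenBuckets[bucketIndex]) collisionsCount++;
            else seenBuckets[bucketIndex] = true;
        }
        return collisionsCount;
    }
}
```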

Here’s a simple example using a bool[] array:

int[] hashCodes = [/* some hash codes */];
var seenBuckets = new bool[bucketsCount];
var collisionsCount = 0;

foreach (var hash in hashCodes) {
    // Cast to uint so negative hash codes can't produce a negative index.
    var bucketIndex = (uint)hash % (uint)bucketsCount;
    if (seenBuckets[bucketIndex]) {
        collisionsCount++;
    } else {
        seenBuckets[bucketIndex] = true;
    }
}

But in .NET, it’s done differently, using an int[] array as a bit set:

int[] hashCodes = [/* some hash codes */];
// One bit per bucket: each int tracks 32 buckets.
var seenBuckets = new int[(bucketsCount + 31) / 32];
var collisionsCount = 0;

foreach (var hash in hashCodes) {
    // Cast to uint so negative hash codes can't produce a negative index.
    var bucketIndex = (uint)hash % (uint)bucketsCount;
    var i = bucketIndex / 32; // which int holds this bucket's bit
    // C# masks the shift count to its low 5 bits, so this equals 1 << (bucketIndex % 32).
    var bit = 1 << (int)bucketIndex;
    if ((seenBuckets[i] & bit) != 0) {
        collisionsCount++;
    } else {
        seenBuckets[i] |= bit;
    }
}

This approach greatly reduces the array size (by 20%–80%), and the time to create the array drops by about 70% on average. However, reads and writes become slower: reads by about 50% on average, and writes by up to 198% (see the graphs below).
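Ignoring object headers and allocator overhead, the raw payload savings are easy to estimate: a bool[] stores one byte per bucket, while the packed int[] stores one bit per bucket, rounded up to whole 32-bit words. This is a back-of-the-envelope sketch, not the benchmark code:

```csharp
// Rough payload-size comparison of the two approaches (object headers ignored).
static (long boolBytes, long packedBytes) EstimateSizes(int bucketsCount)
{
    long boolBytes = bucketsCount;                   // bool[]: 1 byte per bucket
    long packedBytes = (bucketsCount + 31) / 32 * 4; // int[] bit set: 1 bit per bucket
    return (boolBytes, packedBytes);
}
// For 100_000 buckets: bool[] = 100_000 bytes, packed int[] = 12_500 bytes.
```

The ideal ratio is 8×; the measured savings vary because real allocations also carry runtime overhead.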

[Figure: Array creation time: bool[] vs bit-shifted int[]]

[Figure: Memory usage comparison: bool[] vs bit-shifted int[]]

[Figure: Read performance: bool[] vs bit-shifted int[]]

[Figure: Write performance: bool[] vs bit-shifted int[]]

So in the end, using integers and bitwise shifts is a tradeoff. The .NET team seems to have decided that the smaller memory footprint is worth the slower reads and writes.