Do You Track Hash Table Efficiency?

OneMoreTime
Posts: 5
Joined: Wed Mar 28, 2018 7:50 pm
Real Name: Thomas Dillinger

Do You Track Hash Table Efficiency?

Post by OneMoreTime » Sat Apr 07, 2018 5:33 pm

I recently got around to adding a hash table to a program I am writing from scratch. The program is still slow in terms of time to depth, but the nodes per second are decent. Right now I only use the hash table for transposition detection; it stores the best move, but does not use it in any way yet. I wanted to see how efficient the hash table probing code was, so I added a variable that tracks the greatest number of probes the table needs to make in order to return a result. The result is either "found" or "not in here" (obviously). The best-case scenario is a one-probe determination. The worst case is a linear scan of the entire table where the very last slot you look in is empty.

I am not using a power of two for the slot count. I use a member of a prime quadruplet (four primes whose digits all match except for the last, which is 1, 3, 7, or 9). I generate a 64-bit hash (not using Zobrist) and take it modulo the large prime; the remainder is the slot index I examine first. If that slot is not empty, I begin a linear scan toward lower slot indices. Should it bottom out at slot[0], I set the index to (huge prime - 1) and continue scanning top-down. Every time I take such a linear-scan step, I increment the probe counter, and if the current count exceeds the global variable that records the longest run, I update that global to reflect the new longest run.
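
In code, the probing loop looks roughly like this (a minimal sketch: the entry layout and names like TABLE_SIZE and longestRun are placeholders, not my actual code):

Code:

uint64_t index = hash % TABLE_SIZE;  /* TABLE_SIZE is the large prime */
int probes = 1;                      /* the initial probe counts too */
while (table[index].key != 0 && table[index].key != hash) {
    if (index == 0) index = TABLE_SIZE - 1; /* bottomed out at slot[0]: wrap to the top */
    else index--;
    probes++;                        /* one increment per linear-scan step */
}
/* table[index].key == hash -> "found"; table[index].key == 0 -> "not in here" */
if (probes > longestRun) longestRun = probes; /* record the longest run */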

So far, at 250 million nodes searched, the longest probe run is averaging about 4.

The position I am looking at right now has been searched through 2.5 billion nodes, with the longest probe run at 8.

I have to say, these seem like encouraging results. Do any of you track the hash table probing efficiency in a similar manner?

H.G.Muller
Posts: 190
Joined: Sun Jul 14, 2013 10:00 am
Real Name: H.G. Muller

Re: Do You Track Hash Table Efficiency?

Post by H.G.Muller » Sun Apr 08, 2018 11:55 am

No, I don't, because my hash tables in general work in a different way. The number of probes is limited, usually to 3. In micro-Max it is even limited to 1. As the access scheme for probing and storing is the same, the position you are seeking can only be in this limited set of slots that you probe.

Typically I try to design the table layout in such a way that all probed slots fit into the same 'cache line' (the 64-byte memory block that is read by the CPU in one operation), so that you do not need multiple RAM accesses. (Such accesses are rather slow, taking as much time as a couple of hundred normal instructions.) E.g. when my slots require 12 bytes, I fit 5 of them in a cache line (plus 4 filler bytes). One of those is the 'overflow slot' that will always be probed; the other 4 then form two pairs, and the hash key of a position determines in which pair it could be stored. It then probes the two members of this pair if the overflow slot was no match.

Note that modulo operations are excessively slow; Bob Hyatt once tried that in Crafty, and measured a slowdown of 33% just because of a single modulo operation per node. As hash size affects search speed only very little, like a factor 1.06 per hash-size doubling, you would be better off rounding down the hash size to a power of 2, unless your program is already so slow that the slowdown from the modulo is less than 6%. Also note there are alternative ways to map a number into a range that is not a power of 2 which do not require modulo: do a 64-bit multiply of a 32-bit part of the key by the size of your table, and shift the result right by 32 bits.
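
In code that mapping is just the following (a sketch; 'key' is the 64-bit hash and 'size' the number of slots, both illustrative names):

Code:

uint32_t keyPart = (uint32_t)key;                          /* 32-bit part of the key */
uint32_t index = (uint32_t)((uint64_t)keyPart * size >> 32); /* 0 <= index < size, no modulo */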

OneMoreTime
Posts: 5
Joined: Wed Mar 28, 2018 7:50 pm
Real Name: Thomas Dillinger

Re: Do You Track Hash Table Efficiency?

Post by OneMoreTime » Sun Apr 08, 2018 7:50 pm

H.G.Muller wrote:No, I don't, because my hash tables in general work in a different way. The number of probes is limited, usually to 3. In micro-Max it is even limited to 1. As the access scheme for probing and storing is the same, the position you are seeking can only be in this limited set of slots that you probe.

Typically I try to design the table layout in such a way that all probed slots fit into the same 'cache line' (the 64-byte memory block that is read by the CPU in one operation), so that you do not need multiple RAM accesses. (Such accesses are rather slow, taking as much time as a couple of hundred normal instructions.) E.g. when my slots require 12 bytes, I fit 5 of them in a cache line (plus 4 filler bytes). One of those is the 'overflow slot' that will always be probed; the other 4 then form two pairs, and the hash key of a position determines in which pair it could be stored. It then probes the two members of this pair if the overflow slot was no match.

Note that modulo operations are excessively slow; Bob Hyatt once tried that in Crafty, and measured a slowdown of 33% just because of a single modulo operation per node. As hash size affects search speed only very little, like a factor 1.06 per hash-size doubling, you would be better off rounding down the hash size to a power of 2, unless your program is already so slow that the slowdown from the modulo is less than 6%. Also note there are alternative ways to map a number into a range that is not a power of 2 which do not require modulo: do a 64-bit multiply of a 32-bit part of the key by the size of your table, and shift the result right by 32 bits.
I like your idea of limiting the hash probe count. But 3 seems like a very strict requirement. I will have to experiment.

I tested the modulo-divide time on my machine; that 33% figure must be from ancient history. The slowdown in nodes per second was less than 10% for me. Meanwhile, the time-to-depth ratio without the hash table versus with it was 4.7 : 1 at larger depths, so the hash table savings are well worth the per-node slowdown.

The prime-modulo combo is producing very good slot randomness. The next step is to find the break-even probe count and stop probing beyond it. My replacement scheme is pretty good, I think: I count the number of times a slot is accessed and produces a cutoff, so each time the stored value is retrieved and causes a cutoff, the counter increments by 1. When the table is full I compute the average access count over all occupied slots, delete those below that average, then re-hash the entire table. This happens maybe once every 2 hours. The table ends up filled with entries searched deeply along the PV and second-best alternatives, branching out far and wide, and the first search after the hash table reorganization is always very, very fast.

H.G.Muller
Posts: 190
Joined: Sun Jul 14, 2013 10:00 am
Real Name: H.G. Muller

Re: Do You Track Hash Table Efficiency?

Post by H.G.Muller » Sun Apr 08, 2018 8:15 pm

You measured that on Crafty? I am just relaying what Bob reported, not so very long ago. Even then, 10% speedup, basically for free, is a very good deal.

You should not compare the time-to-depth between using a hash table and using no hash table at all, but between a hash table and a hash table of half the size, because you never lose more than a factor 2 by rounding down to the nearest power of two. On average you lose only a factor 1.4 (the loss factor ranges from 1 to 2, and its geometric mean is sqrt(2), about 1.41).

In my hashing scheme the pair is used to hold the two largest depths that mapped to that group of slots (the 'bucket'). If a depth is to be written that is lower than both, it will be written in the overflow slot. You can combine that with 'aging': consider slots that were not accessed in any recent search as having depth = 0, and thus preferably overwrite those. The 4 filler bytes could then be used as aging counters. The overflow slot does not need an aging counter.

OneMoreTime
Posts: 5
Joined: Wed Mar 28, 2018 7:50 pm
Real Name: Thomas Dillinger

Re: Do You Track Hash Table Efficiency?

Post by OneMoreTime » Mon Apr 09, 2018 1:50 pm

H.G.Muller wrote:You measured that on Crafty? I am just relaying what Bob reported, not so very long ago. Even then, 10% speedup, basically for free, is a very good deal.

You should not compare the time-to-depth between using a hash table and using no hash table at all, but between a hash table and a hash table of half the size, because you never lose more than a factor 2 by rounding down to the nearest power of two. On average you lose only a factor 1.4 (the loss factor ranges from 1 to 2, and its geometric mean is sqrt(2), about 1.41).

In my hashing scheme the pair is used to hold the two largest depths that mapped to that group of slots (the 'bucket'). If a depth is to be written that is lower than both, it will be written in the overflow slot. You can combine that with 'aging': consider slots that were not accessed in any recent search as having depth = 0, and thus preferably overwrite those. The 4 filler bytes could then be used as aging counters. The overflow slot does not need an aging counter.
I used the microsecond timer on my hardware and benchmarked the time taken by the modulo divides, since that was what you indicated caused the Crafty slowdown per node. I find it hard to believe that a single modulo divide accounts for 33% of all the computation time for a node. Maybe his evaluation function is more detailed than mine and takes longer to compute, but I got something like 9.885% as my slowdown for modulo vs. non-modulo, with all other factors left unchanged.

My hash table is of a single dimension and I do not perform bucket expansion. I modulo divide to get the remainder, and that remainder is the index. I visit the hash table at that index. If it is occupied, and the hash numbers do not match, I merely decrement the index and try again. If I find a slot with a "hash number = 0" then I know it is not in the table, and that slot becomes the slot for the new data. If I find a slot with matching hash numbers, I retrieve that data.

I am not sure what a data structure with bucket expansion looks like. Is it a linked list with nodes pointing to the parent slot?

My hash table has an "aging" data field that is a counter, initialized to zero when the slot is first filled. Then, whenever a probe hits that slot, the stored depth is sufficient for the entry to be used, and that use produces a cutoff, the counter is incremented. When the hash table is about 2/3rds full, I scan the entire table, add up all the non-zero counters, and divide by the occupied-slot count. This gives me the average utilization of the hash entries. I scan the table a second time, clearing all entries that fall below this average. Then I perform a third pass, reorganizing the table, since some slots were previously assigned after linear-scan misses on the first hash-modulo probe. This lets the program keep its valuable hash entries for future searches.
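
A sketch of what that could look like (the entry layout and helper names are illustrative, not my actual code; I merged the second and third passes through a scratch copy for brevity):

Code:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {      /* illustrative entry layout */
    uint64_t key;     /* 0 = empty slot */
    uint32_t hits;    /* the "aging" cutoff counter */
    /* move, score, depth, ... */
} Entry;

/* Downward linear-probe insert, matching the probing scheme described above. */
static void insert(Entry *table, uint64_t size, Entry e) {
    uint64_t i = e.key % size;
    while (table[i].key != 0)
        i = (i == 0) ? size - 1 : i - 1;
    table[i] = e;
}

void reorganize(Entry *table, uint64_t size) {
    uint64_t total = 0, occupied = 0;
    for (uint64_t i = 0; i < size; i++)        /* pass 1: average utilization */
        if (table[i].key) { total += table[i].hits; occupied++; }
    if (occupied == 0) return;
    uint64_t avg = total / occupied;
    Entry *old = malloc(size * sizeof(Entry)); /* scratch copy for the re-hash */
    memcpy(old, table, size * sizeof(Entry));
    memset(table, 0, size * sizeof(Entry));
    for (uint64_t i = 0; i < size; i++)        /* passes 2+3: drop below-average */
        if (old[i].key && old[i].hits >= avg)  /* entries, re-hash the survivors */
            insert(table, size, old[i]);
    free(old);
}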

H.G.Muller
Posts: 190
Joined: Sun Jul 14, 2013 10:00 am
Real Name: H.G. Muller

Re: Do You Track Hash Table Efficiency?

Post by H.G.Muller » Mon Apr 09, 2018 9:54 pm

It could also be that Crafty is 3.3 times faster than your engine, so that an overhead that is 33% for Crafty is only 10% for you. Still, there seems to be no reason to pass up those 10%.

The bucket entry could look like this:

Code:

struct _HashBucket { // 64 bytes = 1 cache line
  t_int32 signature[5]; // upper 32 bits of the hash key (20 bytes)
  t_int32 move[5];      // best move                     (20 bytes)
  t_int16 score[5];     // search score                  (10 bytes)
  t_uint8 depth[5];     // remaining search depth         (5 bytes)
  t_uint8 flags[5];     // bound type etc.                (5 bytes)
  t_int8 age[4];        // aging counters; the overflow
                        // slot (entry 4) needs none      (4 bytes)
};
Then I can derive the bucket number by masking off some lowest bits of the 64-bit key (because my number of buckets is a power of 2), use the highest 32 bits of that key as the signature, and use bit number 31 (which is in neither the index nor the signature) to decide whether I should probe (after entry 4) entries 0 and 1 of the bucket, or entries 2 and 3. If neither of those has a matching signature, the result of the search will overwrite the lowest depth of (0,1) or (2,3) if it has a higher depth than that entry, and otherwise entry 4. Since the overwhelming majority of nodes are of low depth, they almost always end up in slot 4, so if there is a match, it will almost always be slot 4 that matches. And because I test that one first, the other entries are tested virtually only in case of a miss.

If I use the aging counters, they contain the number of the search that last matched the signature of that entry. If searchNr - age > 1, the entry has not been used in the previous search, and it gets overwritten no matter what depth it had. Usually I only test this for one of the two entries in the pair; to that end I derive the entry number k from 2 bits of the hash key, so that it is 0-3, and then age entry k (by setting its depth to 0 if it was stale) and probe k and k^1. This way I don't need extra passes through memory for aging (which can cause significant overhead at fast TC and a large hash table), but only have to work on data that was already fetched from memory anyway (which is typically some 100 times faster).
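
In terms of the bucket struct above, that would look something like this (the choice of key bits is illustrative):

Code:

int k = key >> 30 & 3;                        /* 2 key bits -> entry number 0-3 */
if(searchNr - b->age[k] > 1) b->depth[k] = 0; /* stale: age it while we are here */
/* then probe the overflow slot 4 and entries k and k^1 as usual */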

H.G.Muller
Posts: 190
Joined: Sun Jul 14, 2013 10:00 am
Real Name: H.G. Muller

Re: Do You Track Hash Table Efficiency?

Post by H.G.Muller » Tue Jul 31, 2018 8:19 pm

I have a new favorite scheme, which uses 'undercut replacement' for some slots. Each position can be stored in 3 locations. If it is in one of the 3 already, it stays there (hit). If it isn't in any of the 3 (miss), it goes, in order of priority:
1) into the 3rd slot if its depth is larger than or equal to the depth stored there, or the stored entry is 'stale' (depth preferred with aging).
2) into the 2nd slot if its depth is larger than or equal to the stored depth, or, in one out of 8 cases, when it is equal to the stored depth minus one (undercut).
3) into the 1st slot (always replace)
An entry is stale if it wasn't hit in the search with the current or previous search number. An analysis session counts as a single search.
The slot numbering refers to probe order, i.e. the always-replace slot is probed first (as the overwhelming majority of hits will be on low-depth entries).

Code:

int index = key & hashMask;
int sig = key >> 32;
int nr = 4;
HashBucket *b = hashTable + index;
if(b->signature[nr] == sig || b->signature[nr -= key >> 30 & 2 | 1] == sig || b->signature[--nr] == sig) { // hit in {4,3,2} or {4,1,0}
  ...
} else { // miss; nr indicates the depth-preferred slot (0 or 2) at this point
  if(searchNr - b->age[nr] > 1 || depth >= b->depth[nr]) b->age[nr] = searchNr; // entry is stale, or new depth is better
  else if(depth + 1 == b->depth[++nr] && ++b->age[nr] == 8 || depth + 1 > b->depth[nr]) b->age[nr] = 0; // 8th undercut, or new depth is better
  else nr = 4; // none of the above: use the always-replace slot
  b->signature[nr] = sig; // allocate entry
}
The survival time of an entry in an undercut slot is roughly proportional to its depth, except for d=0 entries (which cannot be undercut). That we accept only every 8th undercut is an attempt to correct for that by artificially lowering the undercut rate. The population of the always-replace slot will be dominated by d=0 entries, and by sharing that slot between the two 'replacement groups' the number of such entries can be kept at a reasonable level compared to the other depths.
