Googles appar
Huvudmeny

Post a Comment On: cbloom rants

"11-19-10 - Hashes and Cache Tables"

11 Comments -

1 – 11 of 11
Blogger won3d said...

So, I have some data on using cuckoo for hash/cache tables (I too like that name).

A few note on cuckoo. You obviously don't want to use the original straight-up cuckoo stuff. You want each item to hash to 2 buckets (not slots), and each bucket should be cache-aligned, and you want to do linear probing within those buckets. Use power-of-2 table sizes.

For rehashing, I had a primary hash which was just the low bits (& mask), and the secondary hash folded in the higher bits depending on how large the table was. You might be able to just get away with >> 16 though. The problem is that you can get self-collisions, where an item will rehash to the same bucket as the hash. I think the solution to that is to do mod table_size-1, and use this as an offset from the original bucket. That way, you never have a collision, and the mod does an excellent job of mixing in the higher bits.

Anyway, there are lots of tricks like this. LMK if you want to know more.

I think you're being a bit hard on the strchr benchmark, I mean some of the things you're talking about are difficult to factor out. Like what would be a word-sized version of FNV? Or isn't it an advantage that Fletcher/Adler is more easily parallelizable? So it isn't perfect (what benchmark is) but I don't think it is egregiously so. I agree that the biggest issue is probably the test corpus.

November 19, 2010 at 2:35 PM

Blogger cbloom said...

"
I think you're being a bit hard on the strchr benchmark, I mean some of the things you're talking about are difficult to factor out."

Meh, I just think it's depressing that they've done a lot of promising work, and the result is hardly any useful information. I mean seriously, what can you actually learn from that page that's trustworthy information? If you look at how close all the results are to each other, they are within the margin of platform variation and optimization variation, therefore you can say absolutely nothing about which one is best.

" Like what would be a word-sized version of FNV?"

They do have one, it's called "Whiz" on that page. It is not a correct word-sized FNV, in fact it's much weaker, but they find it to be very good (on short english text). But that's partly because their "Whiz" and "Jester" have been highly tweaked while the others haven't.

"Or isn't it an advantage that Fletcher/Adler is more easily parallelizable?"

It *is* an advantage, but they aren't using it. To compare speed without taking advantage of that makes the speed comparison bogus. The only fair way to test speed is to test the best version of each algorithm.

It bothers me a lot when the only real conclusion you can draw from a test is "highly tweaked code snippet beats completely untweaked code snippet!" .

November 19, 2010 at 2:55 PM

Blogger cbloom said...

"and each bucket should be cache-aligned, and you want to do linear probing within those buckets. Use power-of-2 table sizes."

BTW PAQ does something similar with cache-aligned buckets and then reprobing around in the bucket.

I believe Sean was working on a thorough study of the issue of cache-line-alignment for hash tables, but maybe hasn't gotten around to it (?)

November 19, 2010 at 3:38 PM

Blogger cbloom said...

Meh. Maybe I'm just cranky today and being overly critical.

But they're also calling their hash functions through *function pointers* which is a pretty disastrous mistake. Combine that with the cache misses from chaining, and cache misses for the strcmp, and you can see why the speed results show such little difference.

(for example, in a real test Murmur2 should not be so competitive with FNV on short words - FNV is way way faster than Murmur2 for short word hash tables)

November 19, 2010 at 4:13 PM

Anonymous Anonymous said...

By the way, I do still have my how to structure a hash table research coming.

I was hoping to just use a decent hash function and some trivial input to not worry about the hash function side of it, but experiments showed some hash functions were exploiting my trivial input, so I'm going to end up using random input in an attempt to factor the hash function out of things (but obviously any real hash function might then work worse than what I measure, sigh).

November 19, 2010 at 4:46 PM

Blogger cbloom said...

I've looked at Murmur3 a bit. It's actually quite a bit slower than Murmur2 on short items, and even Murmur2 is un-competitively slow on short items. The throughput is very high for long items, particularly on x64. It might be a good "file hash" but it's not a good "hash table" hash.

BTW another thing missing from the strchr test is that when hashing strings, you generally want the strlen() call to be integrated with the hash function, which they don't do. Again, some hash functions are more amenable to that. In fact this tends to bias you back towards preferring byte-at-a-time hash functions rather than big-chunk unrolled ones. (you can use the nice trick for checking "any zero bytes in a dword" though)

November 19, 2010 at 5:01 PM

Blogger won3d said...

Oh, the Whiz thing is new. I should look at it, maybe, although maybe it's just terrible. I didn't know there had been updates to the strchr thing. Anyway, just post something to that board and he'll probably make those updates.

The way my hash function works is that I have a short inlined loop that deals with the last (up to) 16 bytes of the string, but the rest is not inlined and calls the big unrolled job. Actually, it is a bit more complicated than that, but the point is that it is a good idea to partially inline a string hash function so that you have a simple, small bit (FNV-1a with a stronger final mix) and a big fast bit (Murmur, lookup3, something with SSE2).

Some more details on my bucketing. I think I wrote something a little misleading in my first post: items DO hash into a slot, but that slot is part of a cache-aligned bucket of slots. Linear probing occurs within the bucket, not necessarily from the first slot in the bucket, wrapping around until you return to the original slot. I'm not sure whether it matters to vary what direction you probe in (I don't think so).

So check if the first bucket has space, otherwise check if the second bucket has space. If both are full, this means that your hash table is likely in need of rebalancing. One thing to do is be proactive, and go through the elements of both buckets and try to rehash them. Move them if they achieve better balance in their new home. I never implemented the full recursive cuckoo. I figure, if you've reached this crisis, then your table is hopelessly full and you might as well rehash into a bigger table.

In a cache table, another option is to evict something. But you don't have that many options. Effectively, your replacement strategy is going to be among a few buckets worth of elements rather than the whole table.

Anyway, with this cuckoo-like implementation, I took various pow2 sized tables (1-16.7M), and tried to insert as many elements as there were slots. I plotted when the first "crisis" occurred, and how full the tables eventually got.

Up to 128 elements, there was never a crisis. After that, I lost ~1% every doubling. The tables could get incredibly full; they almost never dropped below 98.5%. And the real joy is that reading those tables is still damn fast.

November 19, 2010 at 6:01 PM

Blogger cbloom said...

Ah, yeah, hashing to arbitrary spot and then linear looping around the cache line is one of the cache-aware-hashing strategies we've discussed.

So to look up, you have to look in two different buckets and linear probe around both?

"In a cache table, another option is to evict something. But you don't have that many options. Effectively, your replacement strategy is going to be among a few buckets worth of elements rather than the whole table. "

This is a big improvement over most cache tables which don't get many choices at all about what to evict (and you probably can't afford to consider too many things).

November 19, 2010 at 6:26 PM

Blogger won3d said...

Yeah, you run around both buckets, but that's as far as the probing ever goes. In other open addressing scheme (dense uses quadratic), you can get some nasty behavior. It is nice that you can bound things to 2x bucket cache misses for every read at maximum load (which is the steady state for a cache, of course).

OK, so cuckoo-ish cache tables sound sane, but how much work can you afford in cache maintenance? I imagine in something like LZP where the cache is in the inner loops, you can't afford to do something like cuckoo, and you just go with direct mapped. In something like a disk cache, you're so much faster than the backing medium that it makes sense to doing something like full LRU or ARC.

November 19, 2010 at 6:47 PM

Blogger Tyler Streeter said...

It seems like "cache tables" might be related to bloom filters: probabilistic results (e.g. some small chance of false positives) and fixed storage space. (Simple bloom filters are just sets, but bloomier filters, also described on that wikipedia page, are associative arrays.)

November 23, 2010 at 9:29 AM

Blogger cbloom said...

Mmm yeah; I think probabilistic data structures in general are a very exciting area of open research.

November 23, 2010 at 11:13 AM

You can use some HTML tags, such as <b>, <i>, <a>

This blog does not allow anonymous comments.

Comment moderation has been enabled. All comments must be approved by the blog author.

You will be asked to sign in after submitting your comment.