Googles appar
Huvudmeny

Post a Comment On: cbloom rants

"03-25-14 - deduper"

10 Comments -

1 – 10 of 10
Blogger AJ said...

1/2 the files that are of the same size are coincidental, so a hash of a file's first and last sizeof(1 disk sector read) is sufficient to uniquely identify it. Files that maybe the same, say a number of same pixel width/height .BMP files, may only differ in any meta data, which is often kept in the first and/or last blocks too. If that fails to unique-ify the file, then reading the entire thing, and hashing would be required.

March 30, 2014 at 7:40 PM

Blogger cbloom said...

Good point. Todo.

Only for large files though. Small files (< 64k certainly) you may as well go ahead and read the whole thing.

March 30, 2014 at 9:12 PM

Blogger cbloom said...

I think the more general optimization is like this :

Compute hashes in 64k chunks.

So you're building a "signature" for each file.

Compute the hashes in a "dynamic programming sense". That is, lazy evaluate them only as needed to differentiate files.

So you only compute the hash of the second 64k chunk when there are other files that match the first 64k hash.

March 31, 2014 at 9:40 AM

Blogger Unknown said...

I wrote this sort of tool in Python, and I got excellent performance comparing file content directly without hashing.
My tool immediately split same-size groups into common-prefix groups whenever a difference was found.
This excludes files from comparison as early as possible, but most of the performance likely came from reading files in 1MB chunks, limiting and amortizing I/O calls.

What's a good upper limit for the number of simultaneously open files? If there are many files of the same size closing files and reopening them before reading another chunk might be necessary.

April 8, 2014 at 12:32 AM

Blogger MH said...

Fun! I wrote one of these. I sorted the list on filesize, then opened the file and read the first 4k (or all of it if it was smaller), and used that as a quick filter, then did full file compares.

My unverified hope was that a first-4k read off a disk was fast.

Though, now I want to do a fast deduper of SIFT on images to catch resized or otherwise modified images.

April 9, 2014 at 5:17 PM

Blogger cbloom said...

Image search is something I would love to work on.

The current "reverse image searches" are all ridiculously bad. You want to be able to find crops and re-compresses and so on.

Doesn't seem that hard.

April 10, 2014 at 6:27 AM

Anonymous Anonymous said...

Google reverse image search seems to do all that. http://twitter.com/nothings/status/421590856617824256

April 17, 2014 at 3:46 PM

Blogger cbloom said...

When I've tried it in the past it's always massively failed. Like it couldn't even find minor crops.

April 17, 2014 at 5:34 PM

Anonymous Anonymous said...

Maybe I forgot to delete the tags from the image, and so it still said 'NSA' in it, and I totally misinterpreted it?

April 24, 2014 at 10:40 PM

Blogger cbloom said...

Maybe they special case NSA images. I dunno. Personally I've always had much more success with TinEye's reverse image search. But even that is very easily fooled by crops and resizes.

April 25, 2014 at 4:22 PM

You can use some HTML tags, such as <b>, <i>, <a>

This blog does not allow anonymous comments.

Comment moderation has been enabled. All comments must be approved by the blog author.

You will be asked to sign in after submitting your comment.