Wednesday, October 21, 2009

Wolfram|Alpha use case: Calculating probability of hash collisions

A collegue was discussing with me today whether seven hex digits (what Git commit names are usually abbreviated to) is sufficient to avoid collisions. Usually four-five just works, but the engineer answer is as usual "I can calculate that!". Wikipedia states regarding the birthday problem:

"Given n random integers drawn from a discrete uniform distribution with range [1,d], what is the probability p(n;d) that at least two numbers are the same?"

So, quickly plugging this into Wolfram|Alpha, assuming seven hex digits gives d=16^7, and a reasonable set of git objects, say n=10000 :
{ n = 10000, d = 16^7,
  p = 1-exp(-(n (n-1))/(2 d)),
  q = 1-((d-1)/d)^n }

Which indeed gives that p = 17% probability that there for this setup would exist at least one collision in the whole set, but on the other hand you only use the abbreviation manually and the probability to actually use one of these is only q = 0.0037% . Q.E.D.

See also my first and second tweet about this. I have other entries about git, wolframalpha or just plain nerdy.

No comments: