Django: Is Base64 of md5 hash of email address under 30 characters?

Full question
https://stackoverflow.com/questions/1108...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #django

ACCEPTED ANSWER

Score 4

You can do it like this:

>>> import base64
>>> from hashlib import md5
>>> h = base64.b64encode(md5(b'email@example.com').digest()).decode('ascii')
>>> h
'Vlj/zO5/Dr/aKyJiOLHrbg=='
>>> len(h)
24

On Python 3, base64.b64encode adds no trailing newline, so nothing needs to be trimmed (the old Python 2 'base64' codec appended one, which had to be dropped). The chance of collision is the same as for the MD5 hash itself, because you don't lose information when you encode in Base64:

>>> original = md5(b'email@example.com').digest()
>>> encoded = base64.b64encode(original)
>>> original == base64.b64decode(encoded)
True

ANSWER 2

Score 2

MD5 hashes are always 16 bytes long, and Base64 encodes each group of 3 bytes as 4 characters; thus 16 / 3 rounded up gives 6 groups, and 6 times 4 = 24 characters for an MD5 hash encoded in Base64.
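
The arithmetic above generalizes: padded Base64 output is 4 * ceil(n / 3) characters for n input bytes. A minimal sketch that checks this against a real digest (the helper name b64_len is made up for illustration):

```python
import base64
from hashlib import md5


def b64_len(n_bytes):
    """Length of padded Base64 output for n_bytes of input."""
    # (n_bytes + 2) // 3 is integer ceiling division by 3.
    return 4 * ((n_bytes + 2) // 3)


digest = md5(b'foo@bar.com').digest()
encoded = base64.b64encode(digest)

print(len(digest))           # 16 bytes for any MD5 digest
print(b64_len(len(digest)))  # 24 by the formula
print(len(encoded))          # 24 as actually encoded
```

Two of those 24 characters are always '=' padding for a 16-byte input, which is why stripping the padding leaves 22.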

However, note that the Wikipedia page on MD5 states:

However, it has since been shown that MD5 is not collision resistant.

So you cannot count on this method giving you unique usernames from email addresses. Producing the encoded hash itself is very easy with the help of the hashlib module:

>>> import base64
>>> from hashlib import md5
>>> base64.b64encode(md5(b'foo@bar.com').digest()).decode('ascii')
'862kBc6JC2+CBAlN6xLYqA=='

ANSWER 3

Score 2

A UUID is 128 bits, so you can apply Base64 to it directly to get a 22-character string (after removing the fixed '==' padding, as Gumbo suggests in the comments on the question):

>>> import base64
>>> import uuid
>>> len(base64.urlsafe_b64encode(uuid.uuid4().bytes).rstrip(b'='))
22

Here, urlsafe_b64encode and the stripping of '=' are used to avoid characters that the User.username field does not accept, including '/', '+' and '='.
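
The same encoding can be applied to the MD5 digest from the question, since it is also 128 bits. A minimal sketch, assuming a hypothetical helper named email_to_username; stripping and lowercasing the address first is an assumption of this example, not something the answers prescribe:

```python
import base64
from hashlib import md5


def email_to_username(email):
    """Derive a 22-character, URL-safe username from an email address."""
    # Normalizing the address is an assumption made for this sketch.
    normalized = email.strip().lower().encode('utf-8')
    digest = md5(normalized).digest()  # always 16 bytes
    # URL-safe alphabet uses '-' and '_' instead of '+' and '/';
    # the two '=' padding characters are fixed and can be dropped.
    return base64.urlsafe_b64encode(digest).rstrip(b'=').decode('ascii')


name = email_to_username('foo@bar.com')
print(name, len(name))
```

The result fits in 30 characters, but as noted above, uniqueness is only as good as MD5's collision behavior.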

Also, a UUID has two fixed bits, '10' (hence the 17th character of the hex representation is always 8, 9, a or b), and four version bits; check the wiki.
Thus you could throw away those 4 + 2 = 6 fixed bits, along with 2 effective bits, to get a 30-character hex string:

>>> s = uuid.uuid4().hex
>>> len(s[:12] + s[13:16] + s[17:])
30

This way you throw away only 2 effective bits, instead of the 8 lost by simply slicing s with s[:30], so you can expect better uniqueness (the result covers at most 1/4 of the UUID coding space).
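
To see which characters are being dropped, here is a small sketch that checks the fixed positions before slicing them out. The function name compact_uuid4_hex is made up for illustration, and the checks assume a version 4 UUID as produced by uuid.uuid4():

```python
import uuid


def compact_uuid4_hex():
    """Return a 30-character hex string built from a random UUID."""
    s = uuid.uuid4().hex        # 32 lowercase hex characters
    # 13th character (index 12) is the version nibble, always '4' for uuid4.
    assert s[12] == '4'
    # 17th character (index 16) holds the variant: its top two bits are '10'.
    assert s[16] in '89ab'
    # Drop both characters: 6 fixed bits plus 2 random bits.
    return s[:12] + s[13:16] + s[17:]


token = compact_uuid4_hex()
print(len(token))  # 30
```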