Why is dill much faster and more disk-efficient than pickle for numpy arrays

Full question
https://stackoverflow.com/questions/4469...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #numpy #serialization #pickle #dill

ANSWER 1

Score 22

I'm the dill author. dill is an extension of pickle, but it does add some alternate pickling methods for numpy and other objects. For example, dill leverages the numpy methods for the pickling of arrays.

Additionally, I believe dill uses DEFAULT_PROTOCOL by default (not HIGHEST_PROTOCOL) on Python 3, whereas on Python 2 it uses HIGHEST_PROTOCOL by default.

ACCEPTED ANSWER

Score 17

This ought to be a comment, but I don't have enough reputation... My guess is that this is due to the pickle protocol used.

On Python 2, the default protocol is 0 and the highest supported protocol is 2. On Python 3, the default protocol is 3 and the highest supported protocol is 4 (as of Python 3.6; from Python 3.8 onward the default is 4 and the highest is 5).
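Because these numbers depend on the interpreter version, it may be worth checking what the running interpreter reports. A minimal sketch, assuming Python 3 and using only the standard library pickle module (the variable names are illustrative):

```python
import pickle
import sys

# Protocol used when no `protocol` argument is passed to dump()/dumps().
default_protocol = pickle.DEFAULT_PROTOCOL
# Newest protocol this interpreter can read and write.
highest_protocol = pickle.HIGHEST_PROTOCOL

print("Python", sys.version_info[:2])
print("default:", default_protocol, "highest:", highest_protocol)
```

If the two values differ, a dump that relies on the default is not using the newest protocol available.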

Each protocol version improves on the previous one, but protocol 0 is especially slow for largish objects. It should be avoided in most cases, except if you need to be able to read your pickles using extremely old versions of Python. Protocol 2 is already much better.
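The effect of the protocol on output size and time can be measured directly. The sketch below is only an illustration: it uses a plain list of floats as a stand-in for the array in the question (an assumption, so that it runs with the standard library alone), and `measure` is a hypothetical helper name. Timings will vary by machine.

```python
import pickle
import time

# Stand-in payload: a plain list of floats (not the array from the question).
data = [i / 7 for i in range(100_000)]

def measure(protocol):
    """Return (size in bytes, seconds taken) for one pickle.dumps call."""
    start = time.perf_counter()
    blob = pickle.dumps(data, protocol=protocol)
    elapsed = time.perf_counter() - start
    return len(blob), elapsed

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    size, elapsed = measure(protocol)
    sizes[protocol] = size
    print(f"protocol {protocol}: {size} bytes, {elapsed:.4f} s")
```

Protocol 0 writes each float as text, while the binary protocols store it in a fixed 8 bytes, so the protocol 0 output should be noticeably larger for this payload.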

Now, I suppose dill uses pickle.HIGHEST_PROTOCOL by default, and if that is indeed the case, it would probably account for a good deal of the speed difference. You could try passing protocol=pickle.HIGHEST_PROTOCOL to both libraries to see whether dill and standard pickle then perform similarly.

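import pickle
import dill
# B is the numpy array from the question; both dumps use the same protocol.
with open('dill', 'wb') as fp:
    dill.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)
with open('pickle', 'wb') as fp:
    pickle.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)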