The Python Oracle

Why is dill much faster and more disk-efficient than pickle for numpy arrays?

--

Full question
https://stackoverflow.com/questions/4469...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #numpy #serialization #pickle #dill




ANSWER 1

Score 22


I'm the dill author. dill is an extension of pickle, but it does add some alternate pickling methods for numpy and other objects. For example, dill leverages the numpy methods for the pickling of arrays.

Additionally, (I believe) dill uses DEFAULT_PROTOCOL by default on Python 3 (not HIGHEST_PROTOCOL), whereas on Python 2 it uses HIGHEST_PROTOCOL by default.
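One way to check which protocols are actually in play on a given installation is to print them. The sketch below is only illustrative: protocol_report is a hypothetical helper name, and it assumes dill exposes its default protocol through the dill.settings dict. If dill is not installed, only the pickle values are reported.

```python
import pickle


def protocol_report():
    """Collect default/highest protocol numbers for pickle and, if installed, dill."""
    report = {
        "pickle_default": pickle.DEFAULT_PROTOCOL,
        "pickle_highest": pickle.HIGHEST_PROTOCOL,
    }
    try:
        import dill  # third-party and optional for this sketch
    except ImportError:
        return report
    # Assumption: dill stores its default protocol under dill.settings["protocol"].
    settings = getattr(dill, "settings", {})
    report["dill_default"] = settings.get("protocol")
    return report


report = protocol_report()
for name, value in sorted(report.items()):
    print(name, value)
```

The numbers depend on the Python and dill versions installed, so they are worth checking locally rather than taking from memory.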




ACCEPTED ANSWER

Score 17


This ought to be a comment, but I do not have enough reputation... My guess is that this is due to the pickle protocol used.

On Python 2, the default protocol is 0 and the highest supported protocol is 2. On Python 3, the default protocol is 3 and the highest supported protocol is 4 (as of Python 3.6).

Each protocol version improves on the previous one, but protocol 0 is especially slow for largish objects. It should be avoided in most cases, except if you need to be able to read your pickles using extremely old versions of Python. Protocol 2 is already much better.
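To see the effect of the protocol on output size, here is a rough sketch using only the standard library. A plain list of floats stands in for the numpy array from the question (that substitution is an assumption, so the exact numbers will differ), and pickled_sizes is a hypothetical helper name.

```python
import pickle
import random


def pickled_sizes(obj):
    """Map each supported protocol number to the pickled size of obj in bytes."""
    return {
        protocol: len(pickle.dumps(obj, protocol=protocol))
        for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
    }


# Stand-in data: 10,000 pseudo-random floats (not the array from the question).
rng = random.Random(0)
data = [rng.random() for _ in range(10_000)]

sizes = pickled_sizes(data)
for protocol, size in sizes.items():
    print(f"protocol {protocol}: {size} bytes")
```

Protocol 0 writes each float as text, while the binary protocols store it in a fixed 8 bytes plus an opcode, which is where most of the size difference in this sketch comes from.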

Now, I suppose dill uses pickle.HIGHEST_PROTOCOL by default, and if that is indeed the case, it would probably be the cause of a good deal of the speed difference. You could try using pickle.HIGHEST_PROTOCOL to see if you get similar performance using dill and standard pickle.

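import pickle
import dill

# B is the numpy array from the question; both libraries get the same protocol.
with open('dill', 'wb') as fp:
    dill.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)
with open('pickle', 'wb') as fp:
    pickle.dump(B, fp, protocol=pickle.HIGHEST_PROTOCOL)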
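
To compare speed as well as file size, a small timing helper along these lines may be useful. This is a sketch, not a rigorous benchmark: best_dumps_time is a hypothetical name, and the list of floats is a stand-in for B so that the example runs without numpy.

```python
import pickle
import time


def best_dumps_time(dumps, obj, protocol, repeats=5):
    """Return the fastest of several timed dumps(obj, protocol=protocol) calls, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dumps(obj, protocol=protocol)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-in for B; replace with the real array when testing.
payload = [float(i) for i in range(50_000)]

timings = {
    protocol: best_dumps_time(pickle.dumps, payload, protocol)
    for protocol in (0, pickle.HIGHEST_PROTOCOL)
}
for protocol, seconds in timings.items():
    print(f"pickle protocol {protocol}: {seconds:.4f} s")
```

Passing dill.dumps in place of pickle.dumps should give the matching dill timings, since dill.dumps also accepts a protocol keyword argument.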