Save classifier to disk in scikit-learn
Full question
https://stackoverflow.com/questions/1059...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #machinelearning #scikitlearn #classification
ACCEPTED ANSWER
Score 255
Classifiers are just objects that can be pickled and dumped like any other. To continue your example:
import pickle  # Python 3; on Python 2 the faster equivalent was cPickle
# save the classifier
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(gnb, fid)
# load it again
with open('my_dumped_classifier.pkl', 'rb') as fid:
    gnb_loaded = pickle.load(fid)
ANSWER 2
Score 248
You can also use joblib.dump and joblib.load, which are much more efficient at handling numerical arrays than the default Python pickler.
joblib is installed as a dependency of scikit-learn:
>>> import joblib
>>> from sklearn.datasets import load_digits
>>> from sklearn.linear_model import SGDClassifier
>>> digits = load_digits()
>>> clf = SGDClassifier().fit(digits.data, digits.target)
>>> clf.score(digits.data, digits.target)  # mean accuracy on the training data
0.9526989426822482
>>> filename = '/tmp/digits_classifier.joblib.pkl'
>>> _ = joblib.dump(clf, filename, compress=9)
>>> clf2 = joblib.load(filename)
>>> clf2
SGDClassifier(alpha=0.0001, class_weight=None, epsilon=0.1, eta0=0.0,
       fit_intercept=True, learning_rate='optimal', loss='hinge', n_iter=5,
       n_jobs=1, penalty='l2', power_t=0.5, rho=0.85, seed=0,
       shuffle=False, verbose=0, warm_start=False)
>>> clf2.score(digits.data, digits.target)
0.9526989426822482
Edit: in Python 3.8+ it is now possible to use pickle for efficient pickling of objects that have large numerical arrays as attributes, provided you use pickle protocol 5 (which is not the default in Python 3.8).
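As a minimal sketch of that edit, the protocol can be requested explicitly through the protocol argument of pickle.dump. The dictionary below is a hypothetical stand-in for a fitted estimator (a real one would hold NumPy arrays), and the file name is an assumption made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fitted estimator's state; a real
# classifier would hold NumPy arrays instead of a bytearray.
model_state = {"coef": bytearray(1_000_000), "intercept": 0.5}

path = Path(tempfile.mkdtemp()) / "classifier.pkl"

# Ask for protocol 5 explicitly (requires Python 3.8 or newer).
with open(path, "wb") as f:
    pickle.dump(model_state, f, protocol=5)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The second byte of a protocol 2+ pickle stores the protocol number.
protocol_used = path.read_bytes()[1]
```

With a real classifier the call would presumably look the same, with clf passed in place of model_state.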
ANSWER 3
Score 136
What you are looking for is called model persistence in scikit-learn terminology, and it is documented in the introduction and in the model persistence section of the documentation.
Suppose you have initialized your classifier and trained it for a long time with:
clf = some.classifier()
clf.fit(X, y)
After this you have two options:
1) Using Pickle
import pickle
# now you can save it to a file
with open('filename.pkl', 'wb') as f:
    pickle.dump(clf, f)
# and later you can load it
with open('filename.pkl', 'rb') as f:
    clf = pickle.load(f)
2) Using joblib
import joblib  # sklearn.externals.joblib no longer exists in current scikit-learn releases
# now you can save it to a file
joblib.dump(clf, 'filename.pkl')
# and later you can load it
clf = joblib.load('filename.pkl')
Once again, it is helpful to read the documentation sections mentioned above.
ANSWER 4
Score 37
In many cases, particularly with text classification, it is not enough to store just the classifier; you also need to store the vectorizer so that you can vectorize your input in the future.
import pickle
with open('model.pkl', 'wb') as fout:
    pickle.dump((vectorizer, clf), fout)
Later, to use the saved model:
with open('model.pkl', 'rb') as fin:
    vectorizer, clf = pickle.load(fin)
X_new = vectorizer.transform(new_samples)
X_new_preds = clf.predict(X_new)
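To see why both pieces have to travel together, here is a small standard-library sketch. Everything in it is a hypothetical stand-in: a vocabulary dictionary plays the role of the fitted vectorizer, a weight list plays the role of the fitted classifier, and vectorize and predict are toy helpers, not scikit-learn functions.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-ins: a vocabulary plays the role of the fitted
# vectorizer, a weight list plays the role of the fitted classifier.
vocabulary = {"good": 0, "bad": 1, "movie": 2}
weights = [1.0, -1.0, 0.0]

# Store both under explicit names so they cannot be mixed up on load.
bundle = {"vectorizer": vocabulary, "clf": weights}
path = Path(tempfile.mkdtemp()) / "model.pkl"
with open(path, "wb") as fout:
    pickle.dump(bundle, fout)

with open(path, "rb") as fin:
    loaded = pickle.load(fin)


def vectorize(text, vocab):
    """Count occurrences of each known word; unknown words are ignored."""
    counts = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            counts[vocab[word]] += 1
    return counts


def predict(features, w):
    """Return 1 if the weighted sum is positive, else 0."""
    return 1 if sum(f * wi for f, wi in zip(features, w)) > 0 else 0


# New text can only be turned into features with the saved vocabulary.
x_new = vectorize("A good good movie", loaded["vectorizer"])
label = predict(x_new, loaded["clf"])
```

Without the saved vocabulary, the column order of the features would be unknown and the stored weights would be meaningless.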
Before dumping the vectorizer, you can drop its stop_words_ attribute to make the dump smaller:
vectorizer.stop_words_ = None
Also, if your classifier's parameters are sparse (as in most text classification examples), you can convert them from a dense to a sparse representation, which makes a large difference in memory consumption and in loading and dumping time. Sparsify the model with:
clf.sparsify()
This works directly for SGDClassifier. If you know some other model is sparse (lots of zeros in clf.coef_), you can manually convert clf.coef_ into a SciPy CSR sparse matrix:
import scipy.sparse
clf.coef_ = scipy.sparse.csr_matrix(clf.coef_)
and then store it more efficiently.
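The saving from keeping only non-zero coefficients can be illustrated with the standard library alone. This is a rough sketch, not SciPy's CSR format: the coefficient row and its values are made up, and a plain dictionary of index-to-weight pairs stands in for the sparse matrix.

```python
import pickle

# Hypothetical dense coefficient row: 10,000 weights, only 3 non-zero.
dense = [0.0] * 10_000
dense[7] = 1.5
dense[420] = -0.25
dense[9001] = 3.0

# Sparse stand-in: keep only the non-zero entries, keyed by index.
sparse = {i: w for i, w in enumerate(dense) if w != 0.0}

dense_bytes = len(pickle.dumps(dense, protocol=pickle.HIGHEST_PROTOCOL))
sparse_bytes = len(pickle.dumps(sparse, protocol=pickle.HIGHEST_PROTOCOL))

# Rebuild the dense row to confirm that nothing was lost.
rebuilt = [sparse.get(i, 0.0) for i in range(len(dense))]

print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

The sparse pickle is far smaller because the zeros are never written out, which is the same reason a sparsified coef_ dumps and loads faster.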