The Python Oracle

Difference between blank and pretrained models in spacy

--------------------------------------------------
Hire the world's top talent on demand or become one of them at Toptal: https://topt.al/25cXVn
and get $2,000 discount on your first invoice
--------------------------------------------------

Music by Eric Matyas
https://www.soundimage.org
Track title: Sunrise at the Stream

--

Chapters
00:00 Difference Between Blank And Pretrained Models In Spacy
01:35 Accepted Answer Score 7
03:19 Thank you

--

Full question
https://stackoverflow.com/questions/6088...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #spacy #textclassification

#avk47



ACCEPTED ANSWER

Score 7


If you are using spacy's text classifier, then it is fine to start with a blank model. The TextCategorizer doesn't use features from any other pipeline components.
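A minimal sketch of this setup, assuming the spaCy v3 API (`spacy.blank` plus `nlp.add_pipe("textcat")`; the labels are made up for illustration):

```python
import spacy

# Start from a blank English pipeline: tokenizer only,
# no tagger, parser, or NER.
nlp = spacy.blank("en")

# Add a text categorizer; it doesn't depend on any other component.
textcat = nlp.add_pipe("textcat")
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")

# The pipeline contains nothing but the textcat component.
print(nlp.pipe_names)
```

In spaCy v2 the equivalent would be `nlp.add_pipe(nlp.create_pipe("textcat"))`.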

If you're using spacy to preprocess data for another text classifier, then you would need to decide which components make sense for your task. The pretrained models load a tagger, parser, and NER model by default.

The lemmatizer, which isn't implemented as a separate component, is the most complicated part of this. It tries to provide the best results with the available data and models:

  • If you don't have the package spacy-lookups-data installed and you create a blank model, you'll get the lowercase form as a default/dummy lemma.

  • If you have the package spacy-lookups-data installed and you create a blank model, it will automatically load lookup lemmas if they're available for that language.

  • If you load a provided model and the pipeline includes a tagger, the lemmatizer switches to a better rule-based lemmatizer if one is available in spacy for that language (currently: Greek, English, French, Norwegian Bokmål, Dutch, Swedish). The provided models also always include the lookup data for that language so they can be used when the tagger isn't run.

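The first two bullets can be checked directly with a blank pipeline. Note the exact lemmas depend on your spaCy version and on whether spacy-lookups-data is installed, so the snippet prints them rather than asserting specific values:

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
doc = nlp("The Cats Were Running")

# Without spacy-lookups-data you get the lowercase form as a dummy lemma;
# with it installed, lookup lemmas are loaded automatically for that language.
print([t.lemma_ for t in doc])
```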
If you want to get the lookup lemmas from a provided model, you can see them by loading the model without the tagger:

import spacy
nlp = spacy.load("en_core_web_sm", disable=["tagger"])

In general, the lookup lemma quality is not great (there's no information to help with ambiguous cases), and the rule-based lemmas will be a lot better. However, running the tagger takes additional time, so you can choose lookup lemmas to speed things up if the quality is good enough for your task.

And if you're not using the parser or NER model for preprocessing, you can speed things up by disabling them:

import spacy
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])