The Python Oracle

NLTK/NLP building a many-to-many/multi-label subject classifier

Full question
https://stackoverflow.com/questions/7742...

--

Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...

--

Tags
#python #statistics #nlp #machinelearning #nltk

ACCEPTED ANSWER

Score 9


What sort of classifier would be appropriate for this task? Was I wrong, or can a Bayes classifier be used for more than a true/false sort of operation?

You can easily build a multi-label classifier by training a separate binary classifier for each class, each one distinguishing its class from all the others. The combined classifier's output is the set of classes whose binary classifier returns a positive result. You can use Naïve Bayes for this, or any other algorithm. (You could also play tricks with NB's probability output and a threshold value, but NB's probability estimates are notoriously bad; what makes them valuable is only how they rank the classes relative to one another.)
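As a rough illustration of the one-classifier-per-class idea, here is a minimal pure-Python sketch. It is not code from the question or from NLTK: the BinaryNB class, the helper functions, the add-one smoothing, and the toy "tech"/"politics" documents are all assumptions made up for the example.

```python
import math
from collections import Counter


class BinaryNB:
    """Multinomial Naive Bayes for a single yes/no decision (hypothetical helper)."""

    def fit(self, docs, targets):
        # docs: list of token lists; targets: list of booleans (in the class or not)
        self.vocab = {tok for doc in docs for tok in doc}
        self.counts = {True: Counter(), False: Counter()}
        self.n_docs = {True: 0, False: 0}
        for doc, target in zip(docs, targets):
            self.counts[target].update(doc)
            self.n_docs[target] += 1
        return self

    def log_score(self, doc, target):
        # log P(target) + sum of log P(token | target), both with add-one smoothing
        total = sum(self.counts[target].values())
        score = math.log((self.n_docs[target] + 1) / (sum(self.n_docs.values()) + 2))
        for tok in doc:
            if tok in self.vocab:  # tokens never seen in training are ignored
                score += math.log(
                    (self.counts[target][tok] + 1) / (total + len(self.vocab))
                )
        return score

    def predict(self, doc):
        return self.log_score(doc, True) > self.log_score(doc, False)


def train_one_vs_rest(docs, label_sets):
    """Train one binary classifier per label: 'this label' versus 'everything else'."""
    labels = sorted(set().union(*label_sets))
    return {
        label: BinaryNB().fit(docs, [label in ls for ls in label_sets])
        for label in labels
    }


def predict_labels(classifiers, doc):
    """The combined output is every label whose own classifier says yes."""
    return {label for label, clf in classifiers.items() if clf.predict(doc)}


# Toy training data, invented for illustration only.
train_docs = [
    "python code compiler bug".split(),
    "election vote parliament".split(),
    "python election data vote analysis".split(),
    "compiler code software release".split(),
    "parliament vote law".split(),
]
train_labels = [{"tech"}, {"politics"}, {"tech", "politics"}, {"tech"}, {"politics"}]

classifiers = train_one_vs_rest(train_docs, train_labels)
print(predict_labels(classifiers, "python code vote election".split()))
```

Each classifier only compares its own "yes" score against its own "no" score, so the badly calibrated probabilities are never compared across classes, and a document can end up with zero, one, or several labels.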

What feature extraction should I pursue for such a task?

For text classification, tf-idf vectors are known to work well, but you haven't specified what the exact task is. Any metadata on the documents might work as well; try doing some simple statistical analysis. If any feature of the data is more frequently present in some classes than in others, it may be a useful feature.
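For reference, a bare-bones tf-idf computation might look like the sketch below. The weighting shown (relative term frequency times log(N / document frequency)) is only one common variant, chosen here as an assumption; libraries typically add smoothing and normalization. The function name and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each token appears in.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # relative term frequency * inverse document frequency
            tok: (count / len(doc)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        })
    return vectors


# Invented sample documents.
sample_docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "purred"],
]
vectors = tfidf_vectors(sample_docs)
print(vectors[0])
```

A token that occurs in every document ("the" above) gets weight zero, while tokens concentrated in a few documents get the largest weights, which is the property that makes these vectors useful as classifier input.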




ANSWER 2

Score 0


I understand that you have two tasks to solve here. The first is that you want to tag an article based on its topic(?). Since an article can belong to more than one category/class, this is a multi-label classification problem. Several algorithms have been proposed for multi-label classification; please check the literature. I found this paper quite helpful when I was dealing with a similar problem: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.9401

The second problem you want to solve is to tag the paper with authors, gender, and type of document. This is a multi-class problem: each of these attributes can take more than two possible values, but every document has a value for each of them.

I think as a first step it is important to understand the differences between multi-class and multi-label classification.
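One way to see the difference is in the shape of the targets. The sketch below uses invented document types and topics, and a hypothetical helper, to show both encoded as 0/1 rows.

```python
# Multi-class: exactly one value per document for a given attribute.
doc_types = ["article", "review", "article"]

# Multi-label: any number of topics per document, possibly none.
topics = [{"physics", "history"}, {"biology"}, set()]


def to_indicator_rows(label_sets):
    """Turn label sets into 0/1 rows, one column per known label."""
    columns = sorted(set().union(*label_sets))
    rows = [[int(label in labels) for label in columns] for labels in label_sets]
    return columns, rows


# A multi-class target is the special case where every set has exactly one member.
type_columns, type_rows = to_indicator_rows([{t} for t in doc_types])
topic_columns, topic_rows = to_indicator_rows(topics)

print(type_columns, type_rows)
print(topic_columns, topic_rows)
```

In the multi-class rows exactly one column is set per document, whereas in the multi-label rows any number of columns can be set, which is why the second case is usually handled with one binary decision per column.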