n-grams in python, four, five, six grams?
Become part of the top 3% of the developers by applying to Toptal https://topt.al/25cXVn
--
Track title: CC G Dvoks String Quartet No 12 Ame 2
--
Chapters
00:00 Question
00:36 Accepted answer (Score 270)
01:19 Answer 2 (Score 81)
01:39 Answer 3 (Score 23)
03:03 Answer 4 (Score 19)
03:37 Thank you
--
Full question
https://stackoverflow.com/questions/1753...
Accepted answer links:
[ngram module]: http://www.nltk.org/_modules/nltk/model/...
Answer 3 links:
[here]: https://www.nltk.org/api/nltk.html#modul...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #string #nltk #ngram
#avk47
--
Track title: CC G Dvoks String Quartet No 12 Ame 2
--
Chapters
00:00 Question
00:36 Accepted answer (Score 270)
01:19 Answer 2 (Score 81)
01:39 Answer 3 (Score 23)
03:03 Answer 4 (Score 19)
03:37 Thank you
--
Full question
https://stackoverflow.com/questions/1753...
Accepted answer links:
[ngram module]: http://www.nltk.org/_modules/nltk/model/...
Answer 3 links:
[here]: https://www.nltk.org/api/nltk.html#modul...
--
Content licensed under CC BY-SA
https://meta.stackexchange.com/help/lice...
--
Tags
#python #string #nltk #ngram
#avk47
ACCEPTED ANSWER
Score 281
Great native python based answers given by other users. But here's the nltk approach (just in case, the OP gets penalized for reinventing what's already existing in the nltk library).
There is an ngram module that people seldom use in nltk. It's not because it's hard to read ngrams, but training a model base on ngrams where n > 3 will result in much data sparsity.
from nltk import ngrams
sentence = 'this is a foo bar sentences and I want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
print(grams)
ANSWER 2
Score 85
I'm surprised that this hasn't shown up yet:
In [34]: sentence = "I really like python, it's pretty awesome.".split()
In [35]: N = 4
In [36]: grams = [sentence[i: i + N] for i in range(len(sentence) - N + 1)]
In [37]: for gram in grams: print (gram)
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
ANSWER 3
Score 21
Using only nltk tools
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
def get_ngrams(text, n ):
n_grams = ngrams(word_tokenize(text), n)
return [ ' '.join(grams) for grams in n_grams]
Example output
get_ngrams('This is the simplest text i could think of', 3 )
['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']
In order to keep the ngrams in array format just remove ' '.join
ANSWER 4
Score 17
here is another simple way for do n-grams
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]