python - How to provide (or generate) tags for NLTK lemmatizers


I have a set of documents, and I want to transform them into a form that lets me count the tf-idf of the words in those documents (so that each document is represented by a vector of tf-idf numbers).

I thought it would be enough to call WordNetLemmatizer.lemmatize(word) followed by a PorterStemmer, but 'have', 'has', 'had', etc. are not transformed to 'have' by the lemmatizer, and the same goes for other words. I have read that you are supposed to provide a hint to the lemmatizer: a tag representing the type of word, i.e. whether it is a noun, verb, adjective, etc.
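
A minimal sketch of the behavior I mean (assuming NLTK and its WordNet data are installed):

    from nltk.stem.wordnet import WordNetLemmatizer

    lmtzr = WordNetLemmatizer()

    # without a POS hint the lemmatizer assumes a noun, so verb forms survive
    print(lmtzr.lemmatize('has'))       # not 'have'
    print(lmtzr.lemmatize('had'))       # not 'have'

    # with the verb tag it works as expected
    print(lmtzr.lemmatize('has', 'v'))  # 'have'
    print(lmtzr.lemmatize('had', 'v'))  # 'have'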

My question is: how do I get these tags? What am I supposed to execute on my documents to obtain them?

I am using Python 3.4, and I am lemmatizing + stemming a single word at a time. I tried WordNetLemmatizer and EnglishStemmer from NLTK, and stem() from stemming.porter2.

OK, I googled some more and found out how to get these tags. First you have to do some preprocessing, to be sure the file will tokenize well (in my case, removing junk left over after converting PDF to txt).
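
What that cleanup looks like depends entirely on your input; as a purely hypothetical sketch (these patterns are my assumptions, not part of the original code), it could be something like:

    import re

    def clean_pdf_text(text):
        # rejoin words hyphenated across line breaks: 'exam-\nple' -> 'example'
        text = re.sub(r'-\s*\n\s*', '', text)
        # drop non-printable characters that pdf-to-txt conversion tends to leave behind
        text = re.sub(r'[^\x20-\x7E\n]', ' ', text)
        # collapse runs of spaces and tabs
        text = re.sub(r'[ \t]+', ' ', text)
        return text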

Then the file has to be tokenized into sentences, and each sentence into a word array, which can then be tagged by the NLTK tagger. With that, lemmatization can be done, and stemming added on top of it.
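
To see what the tagger produces, a quick check (the tags are Penn Treebank tags; the output should look roughly like the comment below):

    from nltk.tokenize import word_tokenize
    from nltk.tag import pos_tag

    print(pos_tag(word_tokenize("He has had three dogs.")))
    # roughly: [('He', 'PRP'), ('has', 'VBZ'), ('had', 'VBN'),
    #           ('three', 'CD'), ('dogs', 'NNS'), ('.', '.')]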

    from nltk.tokenize import sent_tokenize, word_tokenize
    # use sent_tokenize to split the text into sentences, and word_tokenize
    # to split each sentence into words
    from nltk.tag import pos_tag
    # pos_tag generates an array of tuples (word, tag); the Treebank tag can
    # be translated into a WordNet tag as in [this response][1]
    from nltk.corpus import wordnet
    from nltk.stem.wordnet import WordNetLemmatizer
    from stemming.porter2 import stem

    # code from the response mentioned above
    def get_wordnet_pos(treebank_tag):
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag.startswith('R'):
            return wordnet.ADV
        else:
            return ''

    # myInput holds the path to the converted txt file
    with open(myInput, 'r') as f:
        data = f.read()
        sentences = sent_tokenize(data)
        ignoreTypes = ['TO', 'CD', '.', 'LS', '']  # my choice
        lmtzr = WordNetLemmatizer()
        for sent in sentences:
            words = word_tokenize(sent)
            tags = pos_tag(words)
            for (word, type) in tags:
                if type in ignoreTypes:
                    continue
                tag = get_wordnet_pos(type)
                if tag == '':
                    continue
                lema = lmtzr.lemmatize(word, tag)
                stemW = stem(lema)

And at this point I have, for each word, the stemmed form stemW, which I can write to a file and use to count the tf-idf vectors per document.
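
For the counting step itself I did not show code above; a minimal sketch using scikit-learn's TfidfVectorizer, assuming one preprocessed file per document (doc_paths is a hypothetical list of those file paths), could look like this:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # read each preprocessed (lemmatized + stemmed) document back in
    docs = []
    for path in doc_paths:
        with open(path, 'r') as f:
            docs.append(f.read())

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)  # one row of tf-idf numbers per document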

