python - How to provide (or generate) tags for NLTK lemmatizers
I have a set of documents, and I want to transform them into a form that allows me to count the tf-idf of the words in those documents (so that each document is represented by a vector of tf-idf numbers).

I thought it would be enough to call WordNetLemmatizer.lemmatize(word), and then a PorterStemmer - but 'have', 'has', 'had', etc. are not being transformed to 'have' by the lemmatizer - and the same goes for other words as well. I have read that the lemmatizer is supposed to be given a hint - a tag representing the type of the word - whether it is a noun, verb, adjective, etc.

My question is - how do I get these tags? What am I supposed to execute on my documents to get them?

I am using Python 3.4, and I am lemmatizing + stemming a single word at a time. I tried WordNetLemmatizer and EnglishStemmer from nltk, and stem() from stemming.porter2.
OK, I googled some more and found out how to get these tags. First you have to do some preprocessing, to be sure the file will get tokenized properly (in my case, removing stuff left over after the conversion from pdf to txt).

Then the file has to be tokenized into sentences, and each sentence into a word array, which can then be tagged by the nltk tagger. With that, lemmatization can be done, and stemming can be added on top of it.
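As an illustration of that preprocessing step, here is a minimal sketch of the kind of cleanup I mean. The function name and the exact cleanup rules are my own assumptions - adjust them to whatever debris your pdf-to-txt converter actually leaves behind:

```python
import re

def clean_pdf_text(raw):
    """Strip artifacts commonly left behind by pdf-to-txt conversion.

    These rules are an assumption, not a complete solution: inspect
    your own converted files and extend them as needed.
    """
    text = raw.replace('\x0c', ' ')              # form feeds (page breaks)
    text = re.sub(r'[^\x20-\x7e\n]', ' ', text)  # non-printable / non-ASCII bytes
    text = re.sub(r'-\n(\w)', r'\1', text)       # rejoin words hyphenated at line ends
    text = re.sub(r'[ \t]+', ' ', text)          # collapse runs of spaces and tabs
    return text.strip()
```
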
```python
from nltk.tokenize import sent_tokenize, word_tokenize
# use sent_tokenize to split the text into sentences, and word_tokenize
# to split the sentences into words
from nltk.tag import pos_tag
# use this to generate an array of tuples (word, tag), where the tag
# can be translated to a wordnet tag as in [this response][1]
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from stemming.porter2 import stem

# code from the response mentioned above
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

with open(myInput, 'r') as f:
    data = f.read()

sentences = sent_tokenize(data)
ignoreTypes = ['TO', 'CD', '.', 'LS', '']  # my choice
lmtzr = WordNetLemmatizer()
for sent in sentences:
    words = word_tokenize(sent)
    tags = pos_tag(words)
    for (word, type) in tags:
        if type in ignoreTypes:
            continue
        tag = get_wordnet_pos(type)
        if tag == '':
            continue
        lema = lmtzr.lemmatize(word, tag)
        stemW = stem(lema)
```
And at this point I have the stemmed word stemW, which I can write to a file, and use these words to count tf-idf vectors per document.
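For that counting step, here is a minimal pure-Python sketch of tf-idf over the per-document token lists. This is my own illustration of the standard weighting (raw term frequency times idf = log(N / df)), not code from the original post - in practice a library such as scikit-learn's TfidfVectorizer would do this for you:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (e.g. the stemmed words saved per document).

    Returns one {term: tf-idf weight} dict per document, using raw term
    frequency and idf = log(N / df) -- one common variant of the formula.
    """
    n_docs = len(docs)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return vectors
```

Note that with this idf variant a term occurring in every document gets weight 0, which is usually what you want for stopword-like terms.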