pandas - Combine Sklearn TFIDF with Additional Data
I am trying to prepare data for supervised learning. I have my TF-IDF data, which was generated from a column in my DataFrame called "merged".
    from sklearn.feature_extraction.text import TfidfVectorizer

    vect = TfidfVectorizer(stop_words='english', use_idf=True, min_df=50, ngram_range=(1, 2))
    x = vect.fit_transform(merged['kws_name_desc'])
    print(x.shape)
    print(type(x))

    (57629, 11947)
    <class 'scipy.sparse.csr.csr_matrix'>
But I also need to add additional columns to this matrix. For each document in the TF-IDF matrix, I have a list of additional numeric features. Each list has length 40 and is comprised of floats.
So, to clarify, I have 57,629 lists of length 40 that I'd like to append onto my TF-IDF result.
Currently, I have this in a DataFrame column, merged["other_data"]. Below is an example row from merged["other_data"]:
0.4329597715,0.3637511039,0.4893141843,0.35840...
How can I append the 57,629 rows of my DataFrame column to the TF-IDF matrix? I don't know where to begin and would appreciate any pointers/guidance.
I figured it out:
First: iterate over the pandas column and create a list of lists.
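    for_np = []
    # Use a loop variable other than x so the TF-IDF matrix is not overwritten.
    for line in merged['other_data']:
        row = line.split(",")
        # list(...) is needed on Python 3, where map returns an iterator.
        row2 = list(map(float, row))
        for_np.append(row2)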
Then create a NumPy array:
    import numpy as np
    n = np.array(for_np)
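The loop and array construction above can also be collapsed into a single vectorized pandas call. This is a minimal sketch, assuming every row of the column holds the same number of comma-separated values; the three-row `merged` frame below is made-up toy data, not the original dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical values, 3 features per row).
merged = pd.DataFrame({"other_data": ["0.1,0.2,0.3", "0.4,0.5,0.6", "0.7,0.8,0.9"]})

# Split each string on commas into separate columns, then cast everything to float.
n = merged["other_data"].str.split(",", expand=True).astype(float).to_numpy()

print(n.shape)  # (3, 3)
print(n.dtype)  # float64
```

If some rows have fewer values than others, the split leaves missing entries that become NaN after the cast, so it may be worth checking for NaN before stacking.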
Then use scipy.sparse.hstack on x (my original TF-IDF sparse matrix) and the new matrix n. I'll probably end up reweighting these 40-d vectors if they do not improve the classification results, but this approach worked!
    import scipy.sparse
    x = scipy.sparse.hstack([x, n])
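For reference, here is a self-contained sketch of the stacking step on toy data. Two things may be worth knowing: depending on the inputs and the SciPy version, `scipy.sparse.hstack` can return a COO matrix, which does not support row slicing, so passing `format="csr"` is a safe choice; and the reweighting mentioned above could be done by scaling the dense block before stacking. The matrix sizes and the `weight` value are illustrative assumptions, not the original settings.

```python
import numpy as np
import scipy.sparse as sp

# Toy stand-ins: 4 documents, 6 TF-IDF terms, 3 extra numeric features.
tfidf = sp.random(4, 6, density=0.5, format="csr", random_state=0)
extra = np.arange(12, dtype=float).reshape(4, 3)

# Hypothetical scaling factor for the dense features.
weight = 0.5

# Wrap the dense block as sparse and stack column-wise, requesting CSR output.
combined = sp.hstack([tfidf, sp.csr_matrix(extra * weight)], format="csr")

print(combined.shape)  # (4, 9)
```

The result has one row per document, with the TF-IDF columns first and the extra features appended at the end.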