From Features To Words Python ("reverse" Bag Of Words)
Using sklearn I've created a BOW with 200 features in Python, which are easily extracted. But, how can I reverse it? That is, go from a vector with 200 0's or 1's to the correspond
Solution 1:
I'm not totally sure what you're going for, but it seems like you're just trying to figure out which column represents which word. For this, there is the handy get_feature_names_out method (called get_feature_names in scikit-learn versions before 1.0).
Let's take a look with the example corpus provided in the docs:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Put into a dataframe
data = pd.DataFrame(corpus, columns=['description'])
# Take a look:
>>> data
                             description
0            This is the first document.
1  This document is the second document.
2             And this is the third one.
3            Is this the first document?
# Initialize CountVectorizer (you can put in your arguments, but for the sake of example, I'm keeping it simple):
vec = CountVectorizer()
# Fit it as you had before:
features = vec.fit_transform(data.loc[:,"description"]).todense()
>>> features
matrix([[0, 1, 1, 1, 0, 0, 1, 0, 1],
        [0, 2, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 1, 0, 1, 1, 1],
        [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
To see which column represents which word, use get_feature_names_out (get_feature_names in scikit-learn versions before 1.0):
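# On scikit-learn older than 1.0, call vec.get_feature_names() instead (it returns a plain list)
>>> vec.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)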
So your first column is 'and', the second is 'document', and so on. For readability, you can stick this in a dataframe:
# On scikit-learn older than 1.0, use vec.get_feature_names() for the column labels
>>> pd.DataFrame(features, columns=vec.get_feature_names_out())
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
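If what you want is the actual "reverse" step, going from one row of 0's and 1's (or counts) back to the words it stands for, it comes down to pairing each column with its feature name and keeping the nonzero ones. Here is a minimal sketch in plain Python, assuming you already have the feature names in column order. The helper name vector_to_words is made up for this example and is not part of scikit-learn.

```python
def vector_to_words(vector, feature_names):
    """Return the words whose column in `vector` is nonzero.

    Hypothetical helper for illustration; `feature_names` must be in the
    same column order the vectorizer used.
    """
    if len(vector) != len(feature_names):
        raise ValueError("vector and feature_names must have the same length")
    return [word for word, count in zip(feature_names, vector) if count]


# Column order as reported by the vectorizer in the example above.
feature_names = ['and', 'document', 'first', 'is', 'one',
                 'second', 'the', 'third', 'this']

# Third row of the feature matrix ("And this is the third one.").
row = [1, 0, 0, 1, 1, 0, 1, 1, 1]

words = vector_to_words(row, feature_names)
print(words)  # ['and', 'is', 'one', 'the', 'third', 'this']
```

Note that this only gives you the set of words present, in vocabulary order. A bag of words throws away word order, so the original sentence can't be rebuilt from the vector. CountVectorizer also has an inverse_transform method that returns the terms with nonzero entries for each row, which may be all you need if the fitted vectorizer is still around.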