How To Prioritize Certain Features With The max_features Parameter In CountVectorizer
I have a working program, but I realized that some important n-grams in the test data were not among the 6500 max_features I allowed in the training data. Is it possible to prioritize certain features so that those n-grams are guaranteed a place in the vocabulary?
Solution 1:
This is hacky, and you probably cannot count on it working in the future, but CountVectorizer primarily relies on the learned attribute vocabulary_, which is a dictionary mapping tokens to feature indices. You can add to that dictionary and everything appears to work as intended; borrowing from the example in the docs:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Learn a bigram vocabulary from the corpus
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(X2.toarray())
## Output:
# [[0 0 1 1 0 0 1 0 0 0 0 1 0]
# [0 1 0 1 0 1 0 1 0 0 1 0 0]
# [1 0 0 1 0 0 0 0 1 1 0 1 0]
# [0 0 1 0 1 0 1 0 0 0 0 0 1]]
# Now we tweak: manually register a new bigram at the next free index
vocab_len = len(vectorizer2.vocabulary_)
vectorizer2.vocabulary_['new token'] = vocab_len  # append to end
print(vectorizer2.transform(["And this document has a new token"]).toarray())
## Output:
# [[1 0 0 0 0 0 0 0 0 0 1 0 0 1]]
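A less fragile alternative, sketched below, is to merge the must-have tokens into a learned vocabulary and pass the result through the documented vocabulary constructor parameter, which fixes the feature set explicitly instead of mutating fitted state. This reuses the corpus defined above; must_have is a hypothetical stand-in for whatever n-grams you want guaranteed a column.

# Fit once just to learn the corpus vocabulary
base = CountVectorizer(analyzer='word', ngram_range=(2, 2)).fit(corpus)

# must_have is a hypothetical list of n-grams you want forced into the features
must_have = ['new token']
merged = sorted(set(base.vocabulary_) | set(must_have))

# vocabulary= accepts an iterable of terms and fixes the feature set up front,
# so no fit is needed before transform
vectorizer3 = CountVectorizer(analyzer='word', ngram_range=(2, 2), vocabulary=merged)
print(vectorizer3.transform(["And this document has a new token"]).toarray())
# Column order differs from the hack above because the merged list is sorted,
# but 'new token' now has a guaranteed column.

Because vocabulary is a documented parameter rather than an internal attribute, this approach should survive library upgrades better than writing into vocabulary_ after fitting.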