
How To Prioritize Certain Features With The max_features Parameter In CountVectorizer

I have a working program, but I realized that some important n-grams in the test data were not among the 6500 max_features I had allowed in the training data. Is it possible to prioritize certain features so that they make it into the vocabulary?

Solution 1:

This is hacky, and you probably cannot count on it working in future versions, but CountVectorizer primarily relies on the learned attribute vocabulary_, a dictionary mapping each token to its feature index. You can add entries to that dictionary and everything appears to work as intended. Borrowing from the example in the docs:

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
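# Fit a bigram-only vectorizer; the learned vocabulary has 13 bigrams (indices 0-12)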
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
print(X2.toarray())

## Output:
# [[0 0 1 1 0 0 1 0 0 0 0 1 0]
#  [0 1 0 1 0 1 0 1 0 0 1 0 0]
#  [1 0 0 1 0 0 0 0 1 1 0 1 0]
#  [0 0 1 0 1 0 1 0 0 0 0 0 1]]

# Now we tweak:
vocab_len = len(vectorizer2.vocabulary_)
vectorizer2.vocabulary_['new token'] = vocab_len  # assign the next free index (13), i.e. append to the end
print(vectorizer2.transform(["And this document has a new token"]).toarray())

## Output:
# [[1 0 0 0 0 0 0 0 0 0 1 0 0 1]]
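
To tie this back to the original question: you can fit with the max_features cap first and then re-insert any must-keep n-grams that the cap dropped. Here is a minimal sketch along those lines, assuming must_have is your own list of important n-grams (the name and the cap of 5 are illustrative, not from the original answer):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
must_have = ['third one', 'second document']  # hypothetical must-keep bigrams

# Fit with a cap (6500 in the question; 5 here so the toy corpus actually drops terms)
vectorizer = CountVectorizer(analyzer='word', ngram_range=(2, 2), max_features=5)
vectorizer.fit(corpus)

# Re-insert any must-keep n-grams the cap excluded, appending fresh indices at the end
for ngram in must_have:
    if ngram not in vectorizer.vocabulary_:
        vectorizer.vocabulary_[ngram] = len(vectorizer.vocabulary_)

X = vectorizer.transform(corpus)
print(X.shape)  # (4, 5 + however many must-haves were re-inserted)

Two notes: the n-grams that max_features cuts end up in the fitted stop_words_ attribute, which is a convenient place to check what was dropped; and if you would rather not mutate a fitted estimator, you can pass the amended dictionary to a fresh CountVectorizer via its vocabulary parameter, which bypasses feature selection entirely.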
