Problems Using A Custom Vocabulary For TfidfVectorizer Scikit-learn
I'm trying to use a custom vocabulary with scikit-learn's TfidfVectorizer for some clustering tasks and I'm getting very weird results: several of the terms I put in the vocabulary never show up as features. The program runs fine when I don't pass a custom vocabulary.
Solution 1:
One thing that strikes me as unusual is that when you create the vectorizer you specify ngram_range=(1, 2). This means you can never get a feature like '21 CFR Part 11' with the standard tokenizer, since it is a 4-gram. I suspect the 'missing' features are n-grams with n > 2. How many of your pre-selected vocabulary items are actually unigrams or bigrams?
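A minimal sketch of this effect (the documents and vocabulary below are assumed for illustration, not the poster's data): with ngram_range=(1, 2) the analyzer only emits unigrams and bigrams, so a 4-gram vocabulary entry can never receive a non-zero count.

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed toy data; the original corpus and vocabulary are not shown in the question.
docs = ["The system complies with 21 CFR Part 11 requirements."]
vocab = ["21 cfr part 11", "cfr part", "requirements"]

# Only unigrams and bigrams are generated, so the column for the
# 4-gram entry '21 cfr part 11' stays all zeros.
vec = TfidfVectorizer(vocabulary=vocab, ngram_range=(1, 2))
print(vec.fit_transform(docs).toarray())

# Widening the range so 4-grams are generated lets the entry match.
vec4 = TfidfVectorizer(vocabulary=vocab, ngram_range=(1, 4))
print(vec4.fit_transform(docs).toarray())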
Solution 2:
I am pretty sure this is caused by the (arguably confusing) default of min_df=2 in the scikit-learn release you are using, which cuts any feature from the vocabulary if it does not occur at least twice in the dataset (newer releases default to min_df=1). Can you please confirm by explicitly setting min_df=1 in your code?
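A small illustration of what min_df does, on an assumed toy corpus rather than the poster's data (note that in recent scikit-learn releases min_df is only applied when the vocabulary is learned from the data, not when an explicit vocabulary is supplied):

from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed toy corpus; 'brown', 'fox' and 'lazy' each occur in only one document.
docs = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog",
]

# min_df=2 (the old default the answer refers to) silently drops rare terms.
strict = TfidfVectorizer(min_df=2)
strict.fit(docs)
print(sorted(strict.vocabulary_))   # ['dog', 'quick', 'the']

# min_df=1 (the current default) keeps every term seen at least once.
loose = TfidfVectorizer(min_df=1)
loose.fit(docs)
print(sorted(loose.vocabulary_))    # ['brown', 'dog', 'fox', 'lazy', 'quick', 'the']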
Solution 3:
In a Python for-in loop, using count += 1 to advance a counter on every iteration will not work if count is re-initialized inside the loop; its value would stay at 1. You could replace it with for i in range(n): instead.
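The original code is not shown, so the following is only a guess at the pattern this answer describes, assuming the counter was being reset inside the loop, alongside the range()-based alternative it suggests:

items = ["a", "b", "c"]

# Pattern the answer seems to warn about: count is reset on every pass,
# so it never gets past 1.
for item in items:
    count = 0
    count += 1
print(count)  # 1

# Index-based loop suggested in the answer.
for i in range(len(items)):
    print(i, items[i])

# Idiomatic alternative: enumerate() supplies the counter directly.
for i, item in enumerate(items):
    print(i, item)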