
Problems Using A Custom Vocabulary For TfidfVectorizer Scikit-learn

I'm trying to use a custom vocabulary in scikit-learn for some clustering tasks and I'm getting very weird results. The program runs ok when not using a custom vocabulary and I'm s

Solution 1:

One thing that strikes me as unusual is that you create the vectorizer with ngram_range=(1,2). With that setting the standard tokenizer only produces unigrams and bigrams, so a feature such as '21 CFR Part 11', which spans four tokens, can never be generated. I suspect the 'missing' features are n-grams with n>2. How many of your pre-selected vocabulary items are unigrams or bigrams?
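To check which vocabulary entries the vectorizer can actually produce, one option is to run its analyzer on a sample string and compare. The sketch below is only illustrative: the vocabulary and the sentence are made up, and the entries are written in lowercase because the default analyzer lowercases its input.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical custom vocabulary: one bigram and one 4-gram.
custom_vocab = ["clinical trial", "21 cfr part 11"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), vocabulary=custom_vocab)

# build_analyzer() returns the callable that lowercases, tokenizes,
# and emits the n-grams allowed by ngram_range.
analyze = vectorizer.build_analyzer()
ngrams = analyze("21 CFR Part 11 applies to every clinical trial")

# Split the vocabulary into entries the analyzer did or did not emit.
reachable = [term for term in custom_vocab if term in ngrams]
unreachable = [term for term in custom_vocab if term not in ngrams]

print("reachable:", reachable)
print("unreachable:", unreachable)
```

Raising the upper bound, for example ngram_range=(1,4), should make the four-token entry reachable as well, at the cost of generating many more candidate n-grams per document.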


Solution 2:

I am fairly sure this is caused by min_df. In older scikit-learn releases it defaulted to 2 (an arguably confusing choice), which drops any feature that does not occur in at least two documents of the dataset; current releases default to 1. Can you please confirm by explicitly setting min_df=1 in your code?
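The effect of min_df is easy to see on a toy corpus. This is a minimal sketch with made-up documents, and it lets the vectorizer learn its own vocabulary; the scikit-learn documentation states that min_df is ignored when a fixed vocabulary is passed, so whether it applies to your case may depend on the version you run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "apple" appears in both documents,
# "banana" and "cherry" in one document each.
docs = ["apple banana", "apple cherry"]

# min_df=2 keeps only terms found in at least two documents.
strict = TfidfVectorizer(min_df=2).fit(docs)

# min_df=1 keeps every term that occurs at all.
lenient = TfidfVectorizer(min_df=1).fit(docs)

print(sorted(strict.vocabulary_))
print(sorted(lenient.vocabulary_))
```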


Solution 3:

In your Python for-in loop, count += 1 does not advance the counter as intended, so count's value stays at 1. This usually means count is being reset inside the loop body. Initialize it once before the loop, or use for i in range(n): and take i as the counter.
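Since the loop from the question is not shown here, the following is only a sketch of two ways to keep a running position in a for-in loop. The list terms and the variable names are hypothetical.

```python
terms = ["alpha", "beta", "gamma"]  # hypothetical items to iterate over

# Option 1: let enumerate() supply the position, starting at 1.
positions = []
for position, term in enumerate(terms, start=1):
    positions.append(position)

# Option 2: a manual counter, initialized once BEFORE the loop.
# Placing "count = 0" inside the loop body would reset it every pass.
count = 0
for term in terms:
    count += 1

print(positions)
print(count)
```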

