Problems Using A Custom Vocabulary For TfidfVectorizer Scikit-learn

April 29, 2023

I'm trying to use a custom vocabulary in scikit-learn for some clustering tasks and I'm getting very weird results. The program runs ok when not using a custom vocabulary and I'm s

Solution 1:

One thing that strikes me as unusual is that when you create the vectorizer you specify ngram_range=(1,2). With that setting the vectorizer only generates unigrams and bigrams, so you can't get a four-token feature such as '21 CFR Part 11' using the standard tokenizer. I suspect the 'missing' features are n-grams for n>2. How many of your pre-selected vocabulary items are unigrams or bigrams?
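A minimal sketch of how to check this, assuming a toy corpus and a three-item vocabulary invented for illustration (they are not the asker's data). It lists the vocabulary entries whose column is all zeros for a given ngram_range. The vocabulary is written in lowercase here because the default lowercase=True lowercases the documents but not the vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus and vocabulary, for illustration only.
docs = [
    "Compliance with 21 CFR Part 11 is required.",
    "The audit trail must satisfy 21 CFR Part 11.",
]
vocabulary = ["audit trail", "compliance", "21 cfr part 11"]


def empty_columns(ngram_range):
    """Return the vocabulary entries that never match any document."""
    vec = TfidfVectorizer(vocabulary=vocabulary, ngram_range=ngram_range)
    X = vec.fit_transform(docs)
    # A list vocabulary keeps its order, so column i belongs to vocabulary[i].
    totals = np.asarray(X.sum(axis=0)).ravel()
    return [term for term, total in zip(vocabulary, totals) if total == 0]


missing_with_bigrams = empty_columns((1, 2))   # 4-gram cannot be generated
missing_with_4grams = empty_columns((1, 4))    # 4-gram is now generated

print(missing_with_bigrams)
print(missing_with_4grams)
```

If the entries reported for (1, 2) are exactly the vocabulary items with more than two tokens, widening ngram_range to cover the longest item is the likely fix.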

Solution 2:

I am pretty sure that this is caused by the (arguably confusing) default value of min_df=2 in older scikit-learn releases (newer releases default to min_df=1), which cuts off any feature from the vocabulary if it does not occur in at least two documents of the dataset. Can you please confirm by explicitly setting min_df=1 in your code?
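To see what min_df does, here is a small sketch on a made-up three-document corpus (an assumption for illustration, not the asker's data). It compares the vocabulary learned with min_df=1 and min_df=2. Note that the current scikit-learn documentation describes min_df as ignored when a vocabulary is passed explicitly, so the effect shown here applies to a learned vocabulary; behavior with a fixed vocabulary in older releases may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: only "validation" appears in more than one document.
docs = ["validation plan", "validation report", "audit trail"]


def learned_terms(min_df):
    """Fit a vectorizer and return the sorted terms it kept."""
    vec = TfidfVectorizer(min_df=min_df)
    vec.fit(docs)
    return sorted(vec.vocabulary_)


terms_min_df_1 = learned_terms(1)  # keeps every term
terms_min_df_2 = learned_terms(2)  # keeps terms found in at least 2 documents

print(terms_min_df_1)
print(terms_min_df_2)
```

Passing min_df=1 explicitly, as suggested above, removes any dependence on which default the installed version uses.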

Solution 3:

In a Python for-in loop, count += 1 may not advance the counter on every iteration as expected: if count is re-initialized inside the loop body, its value stays at 1. You could use for i in range(n): instead and use i as the counter.
