
K-means Using Signature Matrix Generated From Minhash

I have used MinHash on documents and their shingles to generate a signature matrix from these documents. I have verified that the signature matrices are good by comparing Jaccard distances. Does it make sense to use this signature matrix for K-means clustering?

Solution 1:

TL;DR

Short answer: No, it doesn't make sense to use the signature matrix for K-means clustering. At least, not without significant manipulation.

Some explanations

I'm coming at this after a few days of figuring out how to do the same thing (text clustering) myself. I might be wrong, but my perception is that you're making the same mistake I was: using MinHash to build an [n_samples x n_perms] matrix, then using this as a features matrix X on which you run k-means.

I'm guessing you're doing something like:

# THIS CODE IS AN EXAMPLE OF WHAT *NOT* TO DO! DON'T IMPLEMENT!
import numpy as np
from datasketch import MinHash
from nltk import ngrams
from sklearn.cluster import KMeans

# Get your data.
data = get_your_list_of_strings_to_cluster()
n_samples = len(data)

# MinHash all the strings.
n_perms = 128
minhash_values = np.zeros((n_samples, n_perms), dtype='uint64')
for index, string in enumerate(data):
    minhash = MinHash(num_perm=n_perms)
    for gram in ngrams(string, 3):
        minhash.update("".join(gram).encode('utf-8'))
    minhash_values[index, :] = minhash.hashvalues

# Compute clusters.
clusterer = KMeans(n_clusters=8)
clusters = clusterer.fit_predict(minhash_values)

This will behave horribly because of a fatal flaw: the minhash_values array is not a feature matrix. Each row is basically a list of features (hashes) that appear in that sample of text, but they are not column-aligned, so features are scattered into the wrong dimensions.

To turn that into a feature matrix, you'd have to look at all the unique hashes in minhash_values, then create a matrix of shape [n_samples x n_unique_hashes] (where n_unique_hashes is the number of unique features found), setting it to 1 where the text sample contains that feature and 0 elsewhere. Typically this matrix would be large and sparse. You could then cluster on that.
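As a minimal sketch of that conversion (using a tiny hand-made stand-in for a real signature matrix, since the actual data isn't shown here):

```python
# Map each unique hash value to a column index, then set a 1 wherever a
# sample's signature contains that hash. Toy 3x3 "signature matrix":
import numpy as np
from scipy.sparse import lil_matrix

minhash_values = np.array([[11, 42, 7],
                           [42, 7, 99],
                           [11, 42, 7]], dtype='uint64')

unique_hashes = np.unique(minhash_values)            # all distinct hash values
col_of = {h: i for i, h in enumerate(unique_hashes)}  # hash -> column index

n_samples = minhash_values.shape[0]
features = lil_matrix((n_samples, len(unique_hashes)), dtype='uint8')
for row in range(n_samples):
    for h in minhash_values[row]:
        features[row, col_of[h]] = 1                 # this sample has this hash

features = features.tocsr()  # sparse binary feature matrix you could cluster on
```

Note that identical signatures (rows 0 and 2 above) now map to identical feature rows, which is what a distance-based clusterer needs.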

Alternative way of text clustering

What an unbelievable hassle though! Fortunately, scikit-learn is there to help. It provides some very easy-to-use and scalable vectorisers, such as HashingVectorizer and TfidfVectorizer.

So your problem becomes easily solved:

# Imports
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans

# Get your data
data = get_your_list_of_strings_to_cluster()

# Get your feature matrix
text_features = HashingVectorizer(analyzer="word").fit_transform(data)

# Compute clusters
clusterer = KMeans(n_clusters=2)
clusters = clusterer.fit_predict(text_features)

And there you go. From there:

  • Fine-tune your vectoriser (try TfidfVectorizer too, tweak the input params, etc.),
  • Try other clusterers (for example, I find HDBSCAN miles better than k-means: quicker, more robust, more accurate, less tuning).
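A hedged sketch of the first suggestion, swapping in TfidfVectorizer with character n-grams (the toy documents below are made up for illustration; character n-grams are one common tweak, not the only one):

```python
# TfidfVectorizer with character n-grams inside word boundaries, which is
# often more robust to typos and inflections than plain word tokens.
from sklearn.feature_extraction.text import TfidfVectorizer

data = ["the cat sat",
        "a cat sat down",
        "stock markets fell",
        "markets fell today"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), min_df=1)
text_features = vectorizer.fit_transform(data)

# Rows are L2-normalised, so the dot product is cosine similarity:
# similar documents (0 and 1) score higher than dissimilar ones (0 and 2).
sim = (text_features @ text_features.T).toarray()
```

From there you can feed text_features to KMeans exactly as above, or to HDBSCAN (available as sklearn.cluster.HDBSCAN in scikit-learn >= 1.3, or via the separate hdbscan package).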

Hope this helps.

Tom
