K-means Using Signature Matrix Generated From Minhash
Solution 1:
TL;DR
Short answer: No, it doesn't make sense to use the signature matrix for K-means clustering. At least, not without significant manipulation.
Some explanations
I'm coming at this after a few days of figuring out how to do the same thing (text clustering) myself. I might be wrong, but my perception is that you're making the same mistake I was: using MinHash to build an [n_samples x n_perms] matrix, then using this as a feature matrix X on which you run k-means.
I'm guessing you're doing something like:
# THIS CODE IS AN EXAMPLE OF WHAT NOT TO DO - DON'T IMPLEMENT IT!
import numpy as np
from datasketch import MinHash  # assuming the datasketch implementation
from nltk.util import ngrams    # assuming nltk's character n-gram helper
from sklearn.cluster import KMeans

# Get your data.
data = get_your_list_of_strings_to_cluster()
n_samples = len(data)

# MinHash all the strings.
n_perms = 128
minhash_values = np.zeros((n_samples, n_perms), dtype='uint64')
for index, string in enumerate(data):
    minhash = MinHash(num_perm=n_perms)
    for gram in ngrams(string, 3):
        minhash.update("".join(gram).encode('utf-8'))
    minhash_values[index, :] = minhash.hashvalues

# Compute clusters.
clusterer = KMeans(n_clusters=8)
clusters = clusterer.fit_predict(minhash_values)
This will behave horribly because of a fatal flaw: the minhash_values array is not a feature matrix. Each row is basically a list of features (hashes) which appear in that sample of text, but they are not column-aligned, so matching features end up scattered across different dimensions for different samples.
To turn that into a feature matrix, you'd have to look at all the unique hashes in minhash_values and then create an [n_samples x n_unique_hashes] matrix (where n_unique_hashes is the number of unique features found), setting it to 1 where the text sample contains that feature and 0 elsewhere. Typically this matrix would be large and sparse. You could then cluster on that.
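For concreteness, here is a minimal sketch of that construction, assuming the minhash_values array from the snippet above and using scipy for the sparse matrix; the variable names and details are illustrative:

import numpy as np
from scipy.sparse import lil_matrix

# minhash_values is the [n_samples x n_perms] array built in the snippet above.
unique_hashes, inverse = np.unique(minhash_values, return_inverse=True)
n_samples, n_perms = minhash_values.shape
inverse = inverse.reshape(n_samples, n_perms)  # column index of each sample's hashes

# One-hot matrix: 1 where the sample produced that hash, 0 elsewhere.
features = lil_matrix((n_samples, unique_hashes.size), dtype='uint8')
for row in range(n_samples):
    features[row, inverse[row]] = 1
features = features.tocsr()  # [n_samples x n_unique_hashes], large and sparse

You could then run K-means on features, but as noted above it's a hassle compared with the vectoriser route below.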
Alternative way of text clustering
What an unbelievable hassle, though! Fortunately, scikit-learn is there to help. It provides some very easy to use and scalable vectorisers, such as CountVectorizer, TfidfVectorizer and HashingVectorizer.
So your problem becomes easily solved:
# Imports
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans
# Get your data
data = get_your_list_of_strings_to_cluster()
# Get your feature matrix
text_features = HashingVectorizer(analyzer="word").fit_transform(data)
# Compute clusters
clusterer = KMeans(n_clusters=2)
clusters = clusterer.fit_predict(text_features)
And there you go. From there:
- Fine-tune your vectoriser (try TfidfVectorizer too, tweak the input params, etc.),
- Try other clusterers (for example, I find HDBSCAN miles better than k-means: quicker, more robust, more accurate, less tuning); a sketch follows below.
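To make those two suggestions concrete, here is a hedged sketch combining TfidfVectorizer with HDBSCAN (using the scikit-learn >= 1.3 implementation; the standalone hdbscan package works much the same way). The TruncatedSVD step and every parameter value are illustrative assumptions rather than part of the original recipe:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import HDBSCAN

# Get your data.
data = get_your_list_of_strings_to_cluster()

# TF-IDF features, then a reduction step - density-based clusterers tend to
# behave better on a compact dense representation than on the raw sparse matrix.
text_features = TfidfVectorizer(analyzer="word", stop_words="english").fit_transform(data)
reduced = TruncatedSVD(n_components=100).fit_transform(text_features)

# HDBSCAN chooses the number of clusters itself; label -1 marks noise points.
clusterer = HDBSCAN(min_cluster_size=5)
clusters = clusterer.fit_predict(reduced)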
Hope this helps.
Tom