
Passing Term-document Matrix To Gensim Lda Model

My term-document matrix is stored as a numpy matrix, and I have a dictionary that represents the terms of the term-document matrix. Is there an easy way to pass these two into Gensim?

Solution 1:

To treat a 2D numpy array (or even a scipy.sparse.csc matrix) as a gensim corpus, use the built-in matutils.Scipy2Corpus class. It iterates over its input, so each row is read as one document.
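For intuition, a gensim corpus is an iterable of documents, each one a list of (term_id, weight) pairs for the nonzero entries. The sketch below reproduces that shape in plain Python for a small documents-by-terms matrix. The helper `rows_to_bow` and the sample data are made up for illustration, and this is not gensim's own implementation.

```python
def rows_to_bow(matrix):
    """Yield one bag-of-words document per row of a documents-by-terms matrix.

    Each document is a list of (term_id, count) pairs, nonzero entries only.
    """
    for row in matrix:
        yield [(term_id, count) for term_id, count in enumerate(row) if count != 0]


# Hypothetical data: 3 documents (rows) over a 4-term vocabulary (columns).
doc_term = [
    [2, 0, 1, 0],
    [0, 3, 0, 0],
    [1, 1, 0, 4],
]
corpus = list(rows_to_bow(doc_term))
```

If your matrix is laid out terms-by-documents instead, transpose it first so that each row is a document. A corpus in this shape is what gensim's LdaModel takes as its `corpus` argument, with the id-to-word mapping passed as `id2word`.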

Solution 2:

I believe Gensim uses pretty much the same structure to represent a bag-of-words corpus, but I don't think a plain dictionary or numpy array would be compatible as-is. Gensim's API lists a few "corpusreaders" that can accommodate various formats, but those seem to be built for importing data from other toolkits. So in your case the easiest solution may be to reconstruct the documents from your matrix and dictionary as lists of separate token strings, then convert those lists to Gensim's bag-of-words corpus, and finally pass that to LDA as shown in the tutorials.

This approach has the added benefit that you can apply Gensim's preprocessing functions and filter words with low/high frequencies.
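A minimal sketch of the reconstruction step, in plain Python. It assumes the matrix is documents-by-terms with integer counts and that the dictionary maps column index to term string; both are assumptions about the asker's data, and `matrix_to_texts` is a hypothetical helper name.

```python
def matrix_to_texts(matrix, id2term):
    """Rebuild each document as a list of tokens, repeating every term
    as many times as its count in that document's row."""
    texts = []
    for row in matrix:
        tokens = []
        for term_id, count in enumerate(row):
            tokens.extend([id2term[term_id]] * int(count))
        texts.append(tokens)
    return texts


# Hypothetical data: 2 documents over a 3-term vocabulary.
id2term = {0: "apple", 1: "banana", 2: "cherry"}
doc_term = [
    [2, 0, 1],
    [0, 1, 0],
]
texts = matrix_to_texts(doc_term, id2term)
```

Token lists like these are the input that gensim.corpora.Dictionary and its doc2bow method work with, and Dictionary.filter_extremes can then drop very rare or very common terms. The original word order is lost, which does not matter for a bag-of-words model.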

Solution 3:

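Given a numpy array with the document vectors in each row, just use:

import gensim
corpus = gensim.matutils.Dense2Corpus(array, documents_columns=False)

Dense2Corpus treats columns as documents by default, so documents_columns=False is needed when the documents are in rows.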
