Description
LDA Perplexity too big on Wiki dump
Steps/Code to Reproduce
I used gensim WikiCorpus to obtain the bag-of-words representation of each document, then vectorised it using scikit-learn CountVectorizer. All data comes from the English Wikipedia dump dated 2017-04-10. I'm using the latest scikit-learn.
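Roughly, the preprocessing looks like this (the dump file name and the vocabulary cap are illustrative placeholders, not the exact values I used):

from gensim.corpora import WikiCorpus
from sklearn.feature_extraction.text import CountVectorizer

# Stream tokenised articles out of the Wikipedia dump (file name is illustrative)
wiki = WikiCorpus('enwiki-20170410-pages-articles.xml.bz2')
# Tokens may come back as bytes on some gensim versions, so decode defensively
texts = (' '.join(t if isinstance(t, str) else t.decode('utf8') for t in tokens)
         for tokens in wiki.get_texts())

# Build the document-term count matrix (vocabulary cap is illustrative)
vectorizer = CountVectorizer(max_features=100000)
X = vectorizer.fit_transform(texts)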
The LDA used a non-informative prior:
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_topics=20,
    doc_topic_prior=1.0,   # uniform (non-informative) document-topic prior
    topic_word_prior=1.0,  # uniform (non-informative) topic-word prior
    max_iter=1000,
    total_samples=n_docs,
    n_jobs=-1
)
for _ in range(runs):
    lda.partial_fit(training_sample)
    print(np.log(lda.perplexity(test_docs)))
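For completeness, the names used above (n_docs, runs, training_sample, test_docs) are set up roughly as follows; the split ratio, batch size, and use of train_test_split are illustrative placeholders rather than the exact code I ran:

from sklearn.model_selection import train_test_split

# Hold out documents for perplexity evaluation (10% split is illustrative)
X_train, test_docs = train_test_split(X, test_size=0.1, random_state=0)

n_docs = X_train.shape[0]   # passed to LatentDirichletAllocation as total_samples
batch_size = 4096           # illustrative mini-batch size
runs = n_docs // batch_size
# each partial_fit call above is fed one slice of X_train as training_sample,
# e.g. training_sample = X_train[i * batch_size:(i + 1) * batch_size]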
Expected Results
7.x - 8.x, which is what Spark's LDA gives, as does my own implementation. As a baseline, the entropy (log-perplexity) of the mean of all documents is 8.3 (I manually computed the entropy of the M1 of the corpus generated for Python and it gives 8.3; for the corpus generated for Spark it also gives 8.3). Normally, after seeing 20% of the data, it should print out 8.3 or lower.
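That baseline can be computed directly from the count matrix; a minimal sketch of what I mean by the entropy of the mean of all docs (the function name is mine, and X is the count matrix from above):

import numpy as np

def mean_doc_entropy(X):
    # M1: the mean word distribution over the whole corpus
    counts = np.asarray(X.sum(axis=0)).ravel()
    p = counts / counts.sum()
    p = p[p > 0]
    # Entropy in nats, directly comparable to np.log(lda.perplexity(...))
    return -np.sum(p * np.log(p))

print(mean_doc_entropy(X))  # ~8.3 on this corpus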
Actual Results
The log-perplexity stays at 13.8 all the way through training.
Versions
All Python packages are at their latest versions as of the time of reporting.