
LatentDirichletAllocation Perplexity too big on Wiki dump #8943

@jli05

Description

LDA Perplexity too big on Wiki dump

Steps/Code to Reproduce

I used gensim's WikiCorpus to obtain the bag-of-words representation of each document, then vectorised it with scikit-learn's CountVectorizer. All data comes from the English Wikipedia dump dated 2017-04-10. I'm using the latest scikit-learn.
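For reference, a minimal sketch of that preprocessing, assuming gensim's WikiCorpus over a local copy of the dump; the file name and the vocabulary cap are placeholders, not taken from the original report.

from gensim.corpora import WikiCorpus
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical dump file name; substitute the actual 2017-04-10 En Wiki dump.
# Passing an empty dict as `dictionary` skips the (slow) dictionary construction.
wiki = WikiCorpus('enwiki-20170410-pages-articles.xml.bz2', dictionary={})

# get_texts() yields each article as a list of tokens; join them back into
# strings so CountVectorizer can re-tokenise and count them.
texts = (' '.join(tokens) for tokens in wiki.get_texts())

vectorizer = CountVectorizer(max_features=100000)  # vocabulary cap is an assumption
X = vectorizer.fit_transform(texts)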

The LDA used a non-informative prior:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(
    n_topics=20,              # renamed n_components in later scikit-learn releases
    doc_topic_prior=1.0,      # non-informative Dirichlet priors
    topic_word_prior=1.0,
    max_iter=1000,
    total_samples=n_docs,     # total number of documents, used to scale partial_fit
    n_jobs=-1
)

for _ in range(runs):
    lda.partial_fit(training_sample)              # one online (mini-batch) pass
    print(np.log(lda.perplexity(test_docs)))      # log-perplexity on held-out documents
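For context, and not part of the original report: scikit-learn defines perplexity as the exponential of the negative variational bound per word, so the np.log(...) above should recover a per-word cross-entropy in nats. Under that assumption, the same number can be cross-checked from score():

# Hedged cross-check, assuming perplexity == exp(-bound / total word count).
bound = lda.score(test_docs)        # approximate variational log-likelihood bound
n_words = test_docs.sum()           # total token count in the held-out documents
print(-bound / n_words)             # should agree with np.log(lda.perplexity(test_docs))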

Expected Results

7.x to 8.x, which is what Spark's LDA gives, as does a personal implementation. As a baseline, the entropy (log-perplexity) of the mean of all the documents is 8.3 (I manually computed the entropy of M1 of the corpus generated for Python, which gives 8.3; for the corpus generated for Spark it also gave 8.3). Normally, after seeing 20% of the data, it should print 8.3 or lower.
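A sketch of how that baseline can be computed, under my reading of the report and assuming X is the CountVectorizer document-term matrix: collapse the corpus into a single word-frequency distribution and take its entropy in nats.

import numpy as np

word_counts = np.asarray(X.sum(axis=0)).ravel()   # total count of each vocabulary term
p = word_counts / word_counts.sum()               # empirical word distribution of the whole corpus
p = p[p > 0]                                      # drop zero-probability terms before the log
print(-np.sum(p * np.log(p)))                     # entropy in nats; ~8.3 according to the report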

Actual Results

The log-perplexity stays around 13.8 throughout training.

Versions

All Python packages were at their latest versions at the time of reporting.
