LatentDirichletAllocation Perplexity too big on Wiki dump #8943
Could you please provide a minimal working example of this using a smaller dataset (e.g. 20_newsgroups) that would be faster to run?
We first apply

vectorizer = CountVectorizer(stop_words='english', max_features=10000)
vectorizer.fit(texts.data)
docs = vectorizer.transform(texts.data)
docs_test = vectorizer.transform(texts_test.data)

I did the training on the 20_newsgroups training corpus with

lda = LatentDirichletAllocation(n_topics=20, topic_word_prior=1.0, doc_topic_prior=1.0, max_iter=50)
lda.fit(docs)

and ran a simple check on the entropy (log-perplexity) of M1 (average normalised word count vector): 7.84.
Perhaps when the normalisation of components was changed, the perplexity method was not fixed. I wonder if this was an issue in 0.17.
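A quick way to check which scaling components_ uses in a given version (a sketch, reusing the fitted lda from the snippet above; rows summing to 1 would mean the topic-word distributions are already normalised):

import numpy as np

# If each row of components_ sums to 1, the rows are topic-word distributions;
# otherwise they are unnormalised pseudo-counts and must be normalised before
# computing any perplexity by hand.
row_sums = lda.components_.sum(axis=1)
print('components_ row sums:', row_sums[:5])
topic_word = lda.components_ / row_sums[:, np.newaxis]
print('after normalisation:', topic_word.sum(axis=1)[:5])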
…On 1 Jun 2017 8:57 am, "Jencir Lee" ***@***.***> wrote:
I did training on the 20_newsgroups training corpus with
lda = LatentDirichletAllocation(n_topics=20, topic_word_prior=1.0, doc_topic_prior=1.0, max_iter=50)
and ran np.log(lda.perplexity()) on the test corpus, and obtained 27.7.
Simple check on the entropy (log-perplexity) of M1 (average normalised word count vector): 9.09.
@jli05 it would help tremendously if you provided a stand-alone snippet to reproduce the problem, which means something we can just copy and paste in an IPython session. Please read https://stackoverflow.com/help/mcve for more details.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
def contrib_m1(docs, n_docs):
    ''' Compute contribution of the partition to M1

    Parameters
    ----------
    docs : m-by-vocab_size array or csr_matrix
        Current partition of word count vectors.
    n_docs : int
        Total number of documents.

    Returns
    -------
    out : length-V array
        Contribution of `docs` to M1.
    '''
    partition_size, vocab_size = docs.shape
    assert partition_size >= 1 and vocab_size >= 1
    assert n_docs >= partition_size

    # transposed normalised docs
    _docs = docs.T / np.squeeze(docs.sum(axis=1))
    _docs = _docs.T
    assert np.allclose(_docs.sum(axis=1), 1)

    return np.asarray(_docs.mean(axis=0)).squeeze() * (docs.shape[0] / n_docs)
# Download training and test corpuses
texts_train = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
texts_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))
# Vectorize the training and test corpuses
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
vectorizer.fit(texts_train.data)
docs_train = vectorizer.transform(texts_train.data)
docs_test = vectorizer.transform(texts_test.data)
# Keep only documents with positive word counts
docs_train = docs_train[np.asarray(docs_train.sum(axis=1)).squeeze() >= 1]
docs_test = docs_test[np.asarray(docs_test.sum(axis=1)).squeeze() >= 1]
# Print # documents
print('# training docs:', docs_train.shape[0])
print('# test docs:', docs_test.shape[0])
# Run LDA
# Use non-informative priors
lda = LatentDirichletAllocation(n_topics=20,
                                topic_word_prior=1.0,
                                doc_topic_prior=1.0,
                                n_jobs=-1,
                                max_iter=50)
lda.fit(docs_train)
# Compute log-perplexity on test documents
print('Log-perplexity on test docs:', np.log(lda.perplexity(docs_test)))
# Baseline: log-perplexity on M1 (average normalised word count vector)
m1 = contrib_m1(docs_test, docs_test.shape[0])
assert np.isclose(m1.sum(), 1)
print('Baseline log-perplexity on M1:', -m1.dot(np.log(m1 + 1e-12)))

Ran on AWS. Output: …
Please note there's another problem: the code above cannot run with any … (see the error message when we use …). The EBS volume has 33G of free space during the run, so it cannot really be "no space left on device". For this corpus, I think we had better support …
If we restrict the vocabulary size to 2000 (max_features=2000), …
@jli05 Thanks for the detailed test script. Regarding the second error you are getting, it looks like … (run with …).
Thanks for the confirmation. Can we pinpoint the cause of the 2nd problem to LatentDirichletAllocation or to joblib?
@rth are you saying you can or cannot reproduce the "No space left on device" error?
Setting … The "No space left on device" error is very likely related to joblib and memmapping (wild guessing a bit here). I think in this issue we should focus on why the LDA perplexity is bigger than expected.
I can reproduce that exception if I run it on my laptop with only a few GB of free memory. Not sure how it could happen with 16GB of RAM, but I agree that it's not an actual issue, at worst not-optimal multiprocessing in this particular example, and setting …
+1. LDA does seem to have quite a few submitted issues in general: #8943, #9107, #5149, #8020, #6848, #6777.
(I've not checked if that list includes any LinearDiscriminantAnalysis bugs, but there are a few of those too.)
For the 2nd problem, was it trying to pickle? …
You are completely right - most of those issues are regarding LinearDiscriminantAnalysis, I went too fast. The ones relevant to LatentDirichletAllocation are only #8020 and #6777.
As far as I understood the situation, joblib was trying to memmap some data so it could be shared between different processes. On Linux this involves writing to …
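Not part of the thread, but assuming the memmapping temp folder really is the culprit, two possible workarounds under that assumption are pointing joblib's temp folder at a disk-backed location (joblib honours the JOBLIB_TEMP_FOLDER environment variable) or disabling multiprocessing for this estimator:

import os
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder path: any directory on a disk with enough free space.
os.environ['JOBLIB_TEMP_FOLDER'] = '/path/with/enough/space'

# Or avoid multiprocessing entirely for this run.
lda = LatentDirichletAllocation(n_topics=20, topic_word_prior=1.0,
                                doc_topic_prior=1.0, n_jobs=1, max_iter=50)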
Maybe …
I think the perplexity scores issue has to do with this issue I just filed. It points out that lda.perplexity() calls lda.transform(), which retrieves the normalized document-topic matrix. Explicitly passing the unnormalized document-topic matrix as returned by lda._e_step() produces the value of perplexity that @jli05 expects. This bit should produce the consistent results he wants:
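The snippet itself didn't survive this copy; below is a sketch of what it likely looked like, reusing lda and docs_test from the script above and assuming the scikit-learn 0.18-era API in which perplexity() still accepts a doc_topic_distr argument and the private _e_step() returns the variational parameters gamma:

# Unnormalised variational parameters gamma from the E-step (private API).
gamma, _ = lda._e_step(docs_test, cal_sstats=False, random_init=False)

# Log-perplexity computed from the unnormalised gamma ...
print('log-perplexity (gamma):', np.log(lda.perplexity(docs_test, gamma)))

# ... versus the row-normalised doc-topic matrix, which is what transform()
# returns and what perplexity() falls back to when no matrix is passed in.
doc_topic = gamma / gamma.sum(axis=1)[:, np.newaxis]
print('log-perplexity (normalised):', np.log(lda.perplexity(docs_test, doc_topic)))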
I'm not 100% sure that using the unnormalized document-topic matrix is correct, though, since it does not form a proper probability distribution over topics.
If we're talking about the \gamma_{dk} in Eq. 5 of https://mimno.infosci.cornell.edu/info6150/readings/HoffmanBleiBach2010b.pdf, it doesn't have to be "normalised": \gamma_{dk} is just the parameter of the Dirichlet distribution in Eq. 2. It's valid as long as all its entries are positive.
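For reference (a restatement, not part of the original comment): \gamma_d parameterises the per-document Dirichlet in the variational posterior, and the row-normalised doc-topic matrix is just its expectation, so the two differ only by the per-document scale factor:

q(\theta_d) = \mathrm{Dirichlet}(\theta_d \mid \gamma_{d1}, \ldots, \gamma_{dK}),
\qquad
\mathbb{E}_q[\theta_{dk}] = \frac{\gamma_{dk}}{\sum_{j=1}^{K} \gamma_{dj}}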
Description
LDA Perplexity too big on Wiki dump
Steps/Code to Reproduce
I used gensim's WikiCorpus to obtain the bag-of-words for each document, then vectorised it using scikit-learn's CountVectorizer. Everything comes from the English Wikipedia dump dated 2017-04-10, and I'm using the latest scikit-learn. The LDA used non-informative priors.
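The preprocessing code itself isn't included in the report; a minimal sketch of the described pipeline (the dump filename and the vocabulary cap are placeholders):

from gensim.corpora import WikiCorpus
from sklearn.feature_extraction.text import CountVectorizer

# Stream tokenised articles out of the Wikipedia dump (placeholder filename).
wiki = WikiCorpus('enwiki-20170410-pages-articles.xml.bz2')

# get_texts() yields lists of tokens; join them back into strings for the
# vectorizer (older gensim versions yield bytes, hence the decode guard).
texts = (' '.join(t if isinstance(t, str) else t.decode('utf8') for t in tokens)
         for tokens in wiki.get_texts())

# Bag-of-words with a capped vocabulary (placeholder size).
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
docs = vectorizer.fit_transform(texts)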
Expected Results
7.x - 8.x, which is what the Spark LDA gives, and a personal implementation as well. As a baseline, the entropy (log-perplexity) of the mean of all the docs is 8.3 (I manually computed the entropy of M1 of the corpus generated for Python, which gives 8.3; for the corpus generated for Spark it also gives 8.3). Normally, after seeing 20% of the data it should print out 8.3 or lower.
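The M1 baseline mentioned above can be computed directly from the document-term matrix (a sketch; docs is the count matrix from the vectorisation step, with empty documents already removed):

import numpy as np
from sklearn.preprocessing import normalize

# M1: average of the L1-normalised (word-frequency) document vectors.
m1 = np.asarray(normalize(docs, norm='l1').mean(axis=0)).ravel()

# Entropy (log-perplexity) of M1; the epsilon guards against log(0).
print('baseline log-perplexity:', -m1.dot(np.log(m1 + 1e-12)))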
Actual Results
13.8 all the way
Versions
All Python packages are of the latest versions as of reporting.