LatentDirichletAllocation Perplexity too big on Wiki dump · Issue #8943 · scikit-learn/scikit-learn · GitHub
Open

jli05 opened this issue May 27, 2017 · 18 comments

Comments

@jli05
jli05 commented May 27, 2017

Description

LDA Perplexity too big on Wiki dump

Steps/Code to Reproduce

I used gensim WikiCorpus to obtain the bag-of-words for each document, then vectorised it with scikit-learn CountVectorizer. Everything comes from the English Wikipedia dump dated 2017-04-10. I'm using the latest scikit-learn.

The LDA used non-informative priors:

lda = LatentDirichletAllocation(
    n_topics=20,
    doc_topic_prior=1.0,
    topic_word_prior=1.0,
    max_iter=1000,
    total_samples=n_docs,
    n_jobs=-1
)

for _ in range(runs):
    lda.partial_fit(training_sample)
    print(np.log(lda.perplexity(test_docs)))

Expected Results

7.x - 8.x, which is what Spark's LDA gives, as does a personal implementation. As a baseline, the entropy (log-perplexity) of the mean of all the docs is 8.3 (I manually computed the entropy of M1 for the corpus generated for Python, and it gives 8.3; for the corpus generated for Spark, it also gave 8.3). Normally, after seeing 20% of the data, it should print 8.3 or lower.

Actual Results

13.8 all the way

Versions

All Python packages are of the latest versions as of reporting.

@rth
Member
rth commented May 30, 2017

Could you please provide a minimal working example of this using a smaller dataset (e.g. 20_newsgroups) that would be faster to run?

@jli05
Author
jli05 commented May 31, 2017

We first apply CountVectorizer to vectorise the documents over the top-10k vocabulary.

vectorizer = CountVectorizer(stop_words='english', max_features=10000)
vectorizer.fit(texts.data)
docs = vectorizer.transform(texts.data)
docs_test = vectorizer.transform(texts_test.data)

I did training on the 20_newsgroups training corpus with

lda = LatentDirichletAllocation(n_topics=20, topic_word_prior=1.0, doc_topic_prior=1.0, max_iter=50)
lda.fit(docs)

and ran np.log(lda.perplexity(docs_test)), and obtained 22.3.

A simple check on the entropy (log-perplexity) of M1 (the average normalised word-count vector) gives 7.84.
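For reference, that M1 baseline can be sketched on a toy corpus (a minimal illustration, assuming M1 is the mean of the row-normalised word-count vectors):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy word-count matrix: 3 documents over a 4-word vocabulary
docs = csr_matrix(np.array([[2, 0, 1, 1],
                            [0, 3, 0, 1],
                            [1, 1, 1, 1]]))

# M1: the average of the per-document (row-normalised) word distributions
row_sums = np.asarray(docs.sum(axis=1)).ravel()
m1 = np.asarray(docs.multiply(1.0 / row_sums[:, None]).mean(axis=0)).ravel()

# Entropy of M1 = log-perplexity of predicting every word from M1 alone;
# a trained LDA should not do worse than this.
entropy = -m1 @ np.log(m1 + 1e-12)
print(entropy)
```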

@jnothman
Member
jnothman commented Jun 1, 2017 via email

@lesteve
Member
lesteve commented Jun 2, 2017

@jli05 it would help tremendously if you provided a stand-alone snippet to reproduce the problem, which means something we can just copy and paste in an IPython session. Please read https://stackoverflow.com/help/mcve for more details.

@jli05
Author
jli05 commented Jun 12, 2017
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

def contrib_m1(docs, n_docs):
    ''' Compute contribution of the partition to M1

    Parameters
    ----------
    docs : m-by-vocab_size array or csr_matrix
        Current partition of word count vectors.
    n_docs : int
        total number of documents

    Returns
    ----------
    out : length-V array
        contribution of `docs` to M1
    '''
    partition_size, vocab_size = docs.shape
    assert partition_size >= 1 and vocab_size >= 1
    assert n_docs >= partition_size

    # transposed normalised docs
    _docs = docs.T / np.squeeze(docs.sum(axis=1))
    _docs = _docs.T

    assert np.allclose(_docs.sum(axis=1), 1)
    return np.asarray(_docs.mean(axis=0)).squeeze() * (docs.shape[0] / n_docs)

# Download training and test corpuses
texts_train = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
texts_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# Vectorize the training and test corpuses
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
vectorizer.fit(texts_train.data)
docs_train = vectorizer.transform(texts_train.data)
docs_test = vectorizer.transform(texts_test.data)

# Keep only documents with positive word counts
docs_train = docs_train[np.asarray(docs_train.sum(axis=1)).squeeze() >= 1]
docs_test = docs_test[np.asarray(docs_test.sum(axis=1)).squeeze() >= 1]

# Print # documents
print('# training docs:', docs_train.shape[0])
print('# test docs:', docs_test.shape[0])

# Run LDA
# Use non-informative priors
lda = LatentDirichletAllocation(n_topics=20,
                                topic_word_prior=1.0,
                                doc_topic_prior=1.0,
                                n_jobs=-1,
                                max_iter=50)
lda.fit(docs_train)

# Compute log-perplexity on test documents
print('Log-perplexity on test docs:', np.log(lda.perplexity(docs_test)))

# Baseline: log-perplexity on M1 (average normalised word count vector)
m1 = contrib_m1(docs_test, docs_test.shape[0])
assert np.isclose(m1.sum(), 1)
print('Baseline log-perplexity on M1:', - m1.dot(np.log(m1 + 1e-12)))

Ran on AWS c4.2xlarge (8 vCPUs + 15 GB memory) under Ubuntu 16.04.2 LTS, Python 3.5.2, IPython 6.1.0. All OS and Python packages updated as of writing.

Output:

# training docs: 10979
# test docs: 7288
/usr/local/lib/python3.5/dist-packages/sklearn/decomposition/online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
  DeprecationWarning)
Log-perplexity on test docs: 22.0525858739
Baseline log-perplexity on M1: 8.09012219458

@jli05
Author
jli05 commented Jun 12, 2017

Please note there's another problem: in the code above we cannot run with any max_iter over 100 for LatentDirichletAllocation.

Error message when we use max_iter=100:

# training docs: 10979
# test docs: 7288
/usr/local/lib/python3.5/dist-packages/sklearn/decomposition/online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
  DeprecationWarning)
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-1-006fd6a84213> in <module>()
     54                                 n_jobs=-1,
     55                                 max_iter=100)
---> 56 lda.fit(docs_train)
     57 
     58 # Compute log-perplexity on test documents

/usr/local/lib/python3.5/dist-packages/sklearn/decomposition/online_lda.py in fit(self, X, y)
    521                     for idx_slice in gen_batches(n_samples, batch_size):
    522                         self._em_step(X[idx_slice, :], total_samples=n_samples,
--> 523                                       batch_update=False, parallel=parallel)
    524                 else:
    525                     # batch update

/usr/local/lib/python3.5/dist-packages/sklearn/decomposition/online_lda.py in _em_step(self, X, total_samples, batch_update, parallel)
    408         # E-step
    409         _, suff_stats = self._e_step(X, cal_sstats=True, random_init=True,
--> 410                                      parallel=parallel)
    411 
    412         # M-step

/usr/local/lib/python3.5/dist-packages/sklearn/decomposition/online_lda.py in _e_step(self, X, cal_sstats, random_init, parallel)
    361                                               self.mean_change_tol, cal_sstats,
    362                                               random_state)
--> 363             for idx_slice in gen_even_slices(X.shape[0], n_jobs))
    364 
    365         # merge result

/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    766                 # consumption.
    767                 self._iterating = False
--> 768             self.retrieve()
    769             # Make sure that we get a last message telling us we are done
    770             elapsed_time = time.time() - self._start_time

/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
    717                     ensure_ready = self._managed_backend
    718                     backend.abort_everything(ensure_ready=ensure_ready)
--> 719                 raise exception
    720 
    721     def __call__(self, iterable):

/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/parallel.py in retrieve(self)
    680                 # check if timeout supported in backend future implementation
    681                 if 'timeout' in getfullargspec(job.get).args:
--> 682                     self._output.extend(job.get(timeout=self.timeout))
    683                 else:
    684                     self._output.extend(job.get())

/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):

/usr/lib/python3.5/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
    383                         break
    384                     try:
--> 385                         put(task)
    386                     except Exception as e:
    387                         job, ind = task[:2]

/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/pool.py in send(obj)
    369             def send(obj):
    370                 buffer = BytesIO()
--> 371                 CustomizablePickler(buffer, self._reducers).dump(obj)
    372                 self._writer.send_bytes(buffer.getvalue())
    373             self._send = send

/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/pool.py in __call__(self, a)
    238                     print("Memmaping (shape=%r, dtype=%s) to new file %s" % (
    239                         a.shape, a.dtype, filename))
--> 240                 for dumped_filename in dump(a, filename):
    241                     os.chmod(dumped_filename, FILE_PERMISSIONS)
    242 

/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/numpy_pickle.py in dump(value, filename, compress, protocol, cache_size)
    481     elif is_filename:
    482         with open(filename, 'wb') as f:
--> 483             NumpyPickler(f, protocol=protocol).dump(value)
    484     else:
    485         NumpyPickler(filename, protocol=protocol).dump(value)

/usr/lib/python3.5/pickle.py in dump(self, obj)
    406         if self.proto >= 4:
    407             self.framer.start_framing()
--> 408         self.save(obj)
    409         self.write(STOP)
    410         self.framer.end_framing()

/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/numpy_pickle.py in save(self, obj)
    275 
    276             # And then array bytes are written right after the wrapper.
--> 277             wrapper.write_array(obj, self)
    278             return
    279 

/usr/local/lib/python3.5/dist-packages/sklearn/externals/joblib/numpy_pickle.py in write_array(self, array, pickler)
     90                                            buffersize=buffersize,
     91                                            order=self.order):
---> 92                 pickler.file_handle.write(chunk.tostring('C'))
     93 
     94     def read_array(self, unpickler):

OSError: [Errno 28] No space left on device

The EBS volume had 33 GB of free space while running, so it cannot literally be "no space left on device".

For this corpus, I think we had better support max_iter up to 500 for the default mini-batch size (128) to allow ~5 full passes over the entire training corpus.
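As a quick back-of-the-envelope check of that figure (assuming, as the suggestion above does, that max_iter is measured in mini-batch updates):

```python
import math

n_docs = 10979      # training documents reported above
batch_size = 128    # scikit-learn's default mini-batch size for online LDA
n_passes = 5        # desired full passes over the corpus

# Mini-batch updates needed per pass, and for all passes
updates_per_pass = math.ceil(n_docs / batch_size)
total_updates = n_passes * updates_per_pass
print(updates_per_pass, total_updates)  # 86 updates per pass, 430 in total
```

430 updates is indeed close to the max_iter=500 figure suggested above.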

@jli05
Author
jli05 commented Jun 12, 2017

If we restrict the vocabulary size to 2000 (max_features=2000 for CountVectorizer) we can run with max_iter=500 to the end, and it produces:

# training docs: 10944
# test docs: 7252
/usr/local/lib/python3.5/dist-packages/sklearn/decomposition/online_lda.py:508: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
  DeprecationWarning)
Log-perplexity on test docs: 22.1065814407
Baseline log-perplexity on M1: 7.06556698687

@jnothman jnothman added this to the 0.19 milestone Jun 12, 2017
@jnothman jnothman modified the milestones: 0.20, 0.19 Jun 13, 2017
@rth
Member
rth commented Jun 13, 2017

@jli05 Thanks for the detailed test script.

Regarding the second error you are getting, it looks like joblib is failing in multiprocessing because it doesn't have enough RAM (not sure why, though). In my benchmark this script takes only 1.5 GB of RAM on 4 CPUs, and n_jobs=1 is both faster (it still runs some calculations on all cores, via BLAS, I presume) and uses less memory (or it would with 8 CPUs):

[bench2: memory-usage profile plot]

(run with mprof run --include-children -T 0.025 perplexity_example.py). So switching to n_jobs=1 should fix your second issue, although I can't exactly reproduce it.

@jli05
Author
jli05 commented Jun 13, 2017

Thanks for the confirmation. Can we pinpoint the cause of the 2nd problem to LatentDirichletAllocation or to joblib?

@lesteve
Member
lesteve commented Jun 14, 2017

@rth are you saying you can or you can not reproduce the "No space left on device" error?

@lesteve
Member
lesteve commented Jun 14, 2017

Setting n_jobs=1 is a reasonable work-around to investigate the LDA perplexity problem, which is scikit-learn related.

The "No space left on device" is very likely related to joblib and memmapping (wild guessing a bit here). I think we should focus on why the LDA perplexity is bigger than expected in this issue.

@rth
Member
rth commented Jun 14, 2017

The "No space left on device" is very likely related to joblib and memmapping (wild guessing a bit here).

I can reproduce that exception if I run it on my laptop with only a few GB of free memory. Not sure how it could happen with 16 GB of RAM, but I agree that it's not an actual issue; at worst it's sub-optimal multiprocessing in this particular example, and setting n_jobs=1 should fix it.

I think we should focus on why the LDA perplexity is bigger than expected in this issue.

+1. LDA does seem to have quite a few open issues, generally: #8943, #9107, #5149, #8020, #6848, #6777, #6725. I'm not familiar with the implementation, but I'm tempted to check it against the one in gensim (which uses a similar approach, as far as I understand) to see if the results are somewhat similar.

@jnothman
Member

(I've not checked if that list includes any LinearDiscriminantAnalysis bugs, but there are a few of them too.)

@jnothman jnothman changed the title LDA Perplexity too big on Wiki dump LatentDirichletAllocation Perplexity too big on Wiki dump Jun 14, 2017
@jli05
Author
jli05 commented Jun 14, 2017

For the 2nd problem, was it trying to pickle something? ...

@rth
Member
rth commented Jun 14, 2017

(I've not checked if that list includes any LinearDiscriminantAnalysis bugs, but there are a few of them too.)

You are completely right: most of those issues concern LinearDiscriminantAnalysis; I went too fast. The only ones relevant to LatentDirichletAllocation are #8020 and #6777.

For the 2nd problem, it was trying to do pickle? ...

As far as I understood the situation, joblib was trying to memmap some data so it could be shared between different processes. On Linux this involves writing to /dev/shm by default, which is a tmpfs (shared memory allocated in RAM). So in this case, if one doesn't have enough RAM, it'll return a "No space left on device" even though it actually means "not enough memory". I agree the error message could be more informative, but I'm not sure how easy it would be to change it. Here is another example of the same exception: joblib/joblib#168 (comment)
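One possible workaround (an assumption on my part, not something verified in this thread): joblib honours the JOBLIB_TEMP_FOLDER environment variable, so the memmapped scratch files can be redirected from the RAM-backed /dev/shm to a disk-backed directory:

```python
import os

# Assumption: JOBLIB_TEMP_FOLDER redirects joblib's memmapping scratch
# space to the given directory; it must be set before the parallel
# code starts spawning workers.
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'

# Alternatively, avoid worker processes (and hence memmapping) entirely:
# lda = LatentDirichletAllocation(..., n_jobs=1)
```

Setting n_jobs=1, as suggested above, remains the simpler fix.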

@jli05
Author
jli05 commented Jun 14, 2017

Maybe joblib needs to clean up the mmap-ed data from mini-batch to mini-batch? Given the mini-batch size, the vocabulary size, and the number of topics k, the memory use should be fixed, or at least shouldn't grow drastically beyond that?

@nasrallah
nasrallah commented Jun 15, 2017

I think the perplexity scores issue has to do with an issue I just filed: #9134.

It points out that lda.perplexity() calls lda.transform(), which returns the normalized document-topic matrix. Explicitly passing the unnormalized document-topic matrix as returned by lda._e_step() produces the perplexity value that @jli05 expects.

This bit should produce the consistent results he wants:

unnorm_doc_topic_matrix = lda._e_step(docs_test, cal_sstats=False, random_init=False)[0]
print(np.log(lda.perplexity(docs_test, unnorm_doc_topic_matrix)))

I'm not 100% sure that using the unnormalized document-topic matrix is correct, though, since it does not form a proper probability distribution over topics.

@jli05
Author
jli05 commented Jun 15, 2017

If we're talking about the \gamma_{dk} in Eq. 5 of https://mimno.infosci.cornell.edu/info6150/readings/HoffmanBleiBach2010b.pdf, it doesn't have to be "normalised": \gamma_{dk} is just the parameter of the Dirichlet distribution in Eq. 2. It's valid as long as all its entries are positive.
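To illustrate (a minimal numpy sketch, not from the paper): any positive vector gamma_d is a valid Dirichlet parameter as-is; normalising it only matters when one wants the expected topic proportions E[theta_dk] = gamma_dk / sum_k gamma_dk.

```python
import numpy as np

# An (unnormalised) variational Dirichlet parameter for one document,
# e.g. one row of the doc-topic matrix returned by the E-step.
gamma_d = np.array([0.5, 3.0, 1.5])   # any positive values are valid

# Expected topic proportions under Dirichlet(gamma_d):
# E[theta_dk] = gamma_dk / sum_k gamma_dk
theta_d = gamma_d / gamma_d.sum()
print(theta_d)  # [0.1 0.6 0.3]
```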

@glemaitre glemaitre modified the milestones: 0.20, 0.21 Jun 13, 2018
@jnothman jnothman removed this from the 0.21 milestone Apr 10, 2019