We want to cluster beer styles based on ~100K reviews from a popular beer review site. We know the review text and style of beer (approx 31 styles in data). To accomplish this, we will normalize the review text 61EC , create features, and create 2 principle components via SVD and plot the average of the beers in each style.
This runs on Python3.X. Libraries needed: numpy, scipy, sklearn, matplotlib, nltk (with stopword corpus).
Data is available here: https://www.dropbox.com/s/3jlokbq7tjnbyr2/beer_reviews.csv?dl=0
With the python packages: 'nose' and 'coverage' installed, navigate to main directory and run:
nosettests --with-coverage --cover-package=text_clustering_funs
To get the following outputs:
Name Stmts Miss Cover Missing
-------------------------------------------------------
test_beer.py 13 0 100%
text_clustering_funs.py 25 0 100%
-------------------------------------------------------
TOTAL 38 0 100%
-------------------------------------------------------
Rand 3 tests in 1.789s
OK
The following is a graph of the two larges principle components from the text features created.