Closed

Description
Here is the code:

import re
import urllib2

import nltk
from sklearn.feature_extraction.text import CountVectorizer

html = urllib2.urlopen("http://habrahabr.ru/").read()
data = nltk.clean_html(html)                   # strip HTML tags
data = re.sub('&([^;]+);', '', data).lower()   # drop HTML entities
v = CountVectorizer(min_n=2, max_n=4)
X = v.fit_transform([data])
keys = zip(v.inverse_transform(X)[0], X.A[0])
print keys
As a result I only get a few English n-grams, even though the page is in Russian. Changing the encoding didn't help.
UPDATE: another URL you can try is http://vesna.yandex.ru/
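A minimal sketch of one likely cause (an assumption, not confirmed by the report above): if the vectorizer receives a raw byte string, the default word token pattern only matches ASCII word characters, so the Cyrillic words are dropped before any n-grams are built. Feeding it already-decoded unicode text keeps the Russian tokens; the sample words below are made up for illustration, and `ngram_range` is the newer spelling of the `min_n`/`max_n` keywords used above:

```python
# -*- coding: utf-8 -*-
# Sketch under an assumption: "text" stands in for the decoded page text;
# the Russian words are illustrative, not taken from habrahabr.ru.
from sklearn.feature_extraction.text import CountVectorizer

text = u"привет мир привет habr"          # already-decoded unicode text
v = CountVectorizer(ngram_range=(2, 4))   # word n-grams of length 2..4
X = v.fit_transform([text])

# The learned vocabulary now contains Cyrillic n-grams alongside Latin ones.
print(sorted(v.vocabulary_))
```

If the vocabulary still comes out Latin-only on real pages, decoding the HTML explicitly (e.g. `html.decode('utf-8')`) before cleaning is the first thing to check.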