CountVectorizer can't find russian n-grams · Issue #1098 · scikit-learn/scikit-learn
CountVectorizer can't find russian n-grams #1098
Closed
@ghost

Description

Here is the code:

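import re
import urllib2  # Python 2 standard library

import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Download the page and strip the HTML markup.
html = urllib2.urlopen("http://habrahabr.ru/").read()
data = nltk.clean_html(html)
# Drop HTML entities such as &nbsp; and lowercase the text.
data = re.sub('&([^;]+);', '', data).lower()

# Count word n-grams of length 2 to 4.
v = CountVectorizer(min_n=2, max_n=4)
X = v.fit_transform([data])
# Pair each extracted n-gram with its count in the document.
keys = zip(v.inverse_transform(X)[0], X.A[0])

print keys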

As a result I get only a few English n-grams and no Russian ones. Changing the encoding didn't help.

UPDATE: another URL you can try is http://vesna.yandex.ru/
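
One possible explanation, not confirmed anywhere in this report, is that the page content is tokenized as raw bytes rather than as decoded text. The sketch below uses only the Python 3 standard library (not scikit-learn) to show the effect. The sample string and the `word_ngrams` helper are hypothetical stand-ins for the downloaded page and for the vectorizer's n-gram step.

```python
import re
from collections import Counter


def word_ngrams(tokens, min_n=2, max_n=4):
    """Yield space-joined word n-grams for every n in [min_n, max_n]."""
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


# Hypothetical sample standing in for the downloaded page (UTF-8 bytes).
raw = "привет мир это тест hello world again".encode("utf-8")

# Tokenizing the undecoded bytes: \w in a bytes pattern matches ASCII only,
# so the Cyrillic words are skipped entirely.
byte_tokens = [t.decode("ascii") for t in re.findall(rb"\w\w+", raw)]

# Decoding to text first lets \w match Cyrillic letters as well.
text_tokens = re.findall(r"\w\w+", raw.decode("utf-8").lower())

byte_counts = Counter(word_ngrams(byte_tokens))
text_counts = Counter(word_ngrams(text_tokens))

print(sorted(byte_counts))   # English n-grams only
print(len(text_counts))      # Russian and English n-grams together
```

If this is what happens here, decoding the downloaded bytes with the page's actual charset before calling `fit_transform` would be the first thing to try.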
