[PorterStemmer] Remove stem_word from PorterStemmer. Breaks backwards… · ExplodingCabbage/nltk@2000554 · GitHub
Commit 2000554

[PorterStemmer] Remove stem_word from PorterStemmer. Breaks backwards compatibility!
Prior to this change, the public API of the PorterStemmer was a mess. NLTK's version was based on Vivake Gupta's implementation at http://tartarus.org/~martin/PorterStemmer/python.txt, endorsed by Martin Porter himself at http://tartarus.org/~martin/PorterStemmer/. However, Gupta's implementation is a shoddy port of Porter's own C implementation and carried over several vestigial quirks. These include the claim that the stem() method takes a "char pointer" as an argument (there is no such thing in Python) and the need to pass in start and end indexes between which stem() should read the word from the given char array.

At some point in NLTK's history, during or prior to the 2006 commit that added porter.py to the current Git repository (nltk@edf4677), this was "solved" by renaming Gupta's stem() method to stem_word() and creating a wrapper for it called stem() that conformed to the StemmerI interface. This was completely pointless; the right thing to do would have been to remove the unnecessary parts of Gupta's stem() method and thereby achieve conformity to StemmerI directly.

This commit does that, at the cost of breaking backwards compatibility for anyone who was calling stem_word(word) instead of stem(word); those callers will need to adjust their application code when updating to the latest version of NLTK.
1 parent 9d72aa4 commit 2000554

1 file changed: +20 -26 lines changed


nltk/stem/porter.py

Lines changed: 20 additions & 26 deletions
@@ -529,29 +529,6 @@ def _step5(self, word):
 
         return word
 
-    def stem_word(self, p, i=0, j=None):
-        """
-        Returns the stem of p, or, if i and j are given, the stem of p[i:j+1].
-        """
-        ## --NLTK--
-        if j is None and i == 0:
-            word = p
-        else:
-            if j is None:
-                j = len(p) - 1
-            word = p[i:j+1]
-
-        if word in self.pool:
-            return self.pool[word]
-
-        word = self._step1ab(word)
-        word = self._step1c(word)
-        word = self._step2(word)
-        word = self._step3(word)
-        word = self._step4(word)
-        word = self._step5(word)
-        return word
-
     def _adjust_case(self, word, stem):
         lower = word.lower()
 
@@ -583,10 +560,27 @@ def _adjust_case(self, word, stem):
     # ret = ret + separator
     # return ret
 
-    ## --NLTK--
-    ## Define a stem() method that implements the StemmerI interface.
     def stem(self, word):
-        stem = self.stem_word(word.lower(), 0, len(word) - 1)
+        stem = word.lower()
+
+        # --NLTK--
+        if word in self.pool:
+            return self.pool[word]
+
+        if len(word) <= 2:
+            return word # --DEPARTURE--
+        # With this line, strings of length 1 or 2 don't go through the
+        # stemming process, although no mention is made of this in the
+        # published algorithm. Remove the line to match the published
+        # algorithm.
+
+        stem = self._step1ab(stem)
+        stem = self._step1c(stem)
+        stem = self._step2(stem)
+        stem = self._step3(stem)
+        stem = self._step4(stem)
+        stem = self._step5(stem)
+
         return self._adjust_case(word, stem)
 
     def __repr__(self):

0 commit comments
