Updated CoreNLP to v3.4.1 · JHnlp/stanford-corenlp-python@5fc37fa · GitHub

Commit 5fc37fa

Updated CoreNLP to v3.4.1

1 parent 4477bf8 commit 5fc37fa

2 files changed: +69 additions, −46 deletions


README.md

Lines changed: 31 additions & 22 deletions
@@ -1,4 +1,4 @@
-# Python interface to Stanford Core NLP tools v1.3.3
+# Python interface to Stanford Core NLP tools v3.4.1
 
 This is a Python wrapper for Stanford University's NLP group's Java-based [CoreNLP tools](http://nlp.stanford.edu/software/corenlp.shtml). It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server.
 
@@ -8,23 +8,21 @@ This is a Python wrapper for Stanford University's NLP group's Java-based [CoreN
 * Outputs parse trees which can be used by [nltk](http://nltk.googlecode.com/svn/trunk/doc/howto/tree.html).
 
 
-It requires [pexpect](http://www.noah.org/wiki/pexpect) and (optionally) [unidecode](http://pypi.python.org/pypi/Unidecode) to handle non-ASCII text. This script includes and uses code from [jsonrpc](http://www.simple-is-better.org/rpc/) and [python-progressbar](http://code.google.com/p/python-progressbar/).
+It requires [pexpect](http://www.noah.org/wiki/pexpect) and [unidecode](http://pypi.python.org/pypi/Unidecode) to handle non-ASCII text. This script includes and uses code from [jsonrpc](http://www.simple-is-better.org/rpc/) and [python-progressbar](http://code.google.com/p/python-progressbar/).
 
-It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on **Core NLP tools version 1.3.3** released 2012-07-09.
+It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on **Core NLP tools version 3.4.1** released 2014-08-27.
 
-## Download and Usage
+## Download and Usage
 
-To use this program you must [download](http://nlp.stanford.edu/software/corenlp.shtml#Download) and unpack the tgz file containing Stanford's CoreNLP package. By default, `corenlp.py` looks for the Stanford Core NLP folder as a subdirectory of where the script is being run.
+To use this program you must [download](http://nlp.stanford.edu/software/corenlp.shtml#Download) and unpack the compressed file containing Stanford's CoreNLP package. By default, `corenlp.py` looks for the Stanford Core NLP folder as a subdirectory of where the script is being run. In other words:
 
-In other words:
+    sudo pip install pexpect unidecode
+    git clone git://github.com/dasmith/stanford-corenlp-python.git
+    cd stanford-corenlp-python
+    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip
+    unzip stanford-corenlp-full-2014-08-27.zip
 
-    sudo pip install pexpect unidecode  # unidecode is optional
-    git clone git://github.com/dasmith/stanford-corenlp-python.git
-    cd stanford-corenlp-python
-    wget http://nlp.stanford.edu/software/stanford-corenlp-2012-07-09.tgz
-    tar xvfz stanford-corenlp-2012-07-09.tgz
-
-Then, to launch a server:
+Then launch the server:
 
     python corenlp.py
 
@@ -39,7 +37,7 @@ Assuming you are running on port 8080, the code in `client.py` shows an example
     import jsonrpc
     from simplejson import loads
     server = jsonrpc.ServerProxy(jsonrpc.JsonRpc20(),
-                                 jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080)))
+                                jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080)))
 
     result = loads(server.parse("Hello world. It is so beautiful"))
     print "Result", result
@@ -112,8 +110,19 @@ To use it in a regular script or to edit/debug it (because errors via RPC are op
     corenlp = StanfordCoreNLP()  # wait a few minutes...
     corenlp.parse("Parse it")
 
+
+## Coreference Resolution
+
+The library supports [coreference resolution](http://en.wikipedia.org/wiki/Coreference), meaning pronouns can be "dereferenced." If an entry in the `coref` list is `[u'Hello world', 0, 1, 0, 2]`, the numbers mean:
+
+* 0 = The reference appears in the 0th sentence (e.g. "Hello world")
+* 1 = The 2nd token, "world", is the [headword](http://en.wikipedia.org/wiki/Head_%28linguistics%29) of that sentence
+* 0 = 'Hello world' begins at the 0th token in the sentence
+* 2 = 'Hello world' ends before the 2nd token in the sentence.
+
 <!--
 
+
 ## Adding WordNet
 
 Note: wordnet doesn't seem to be supported using this approach. Looks like you'll need Java.
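
The span convention in the new list above follows Python's half-open slicing, so `start` and `end` can be used directly as slice indices. A tiny illustration (the token list is invented for the example):

    # one coref mention entry: [text, sentence index, head index, start, end]
    text, sentence, head, start, end = [u'Hello world', 0, 1, 0, 2]
    tokens = ["Hello", "world", "."]   # tokens of sentence 0 (made up here)
    print tokens[start:end]            # ['Hello', 'world'] -- the mention span
    print tokens[head]                 # 'world' -- the headword
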
@@ -129,23 +138,23 @@ tar xvfz WNprolog-3.0.tar.gz
 **Stanford CoreNLP tools require a large amount of free memory**. Java 5+ uses about 50% more RAM on 64-bit machines than 32-bit machines. 32-bit machine users can lower the memory requirements by changing `-Xmx3g` to `-Xmx2g` or even less.
 If pexpect times out while loading models, check to make sure you have enough memory and can run the server alone without your kernel killing the java process:
 
-    java -cp stanford-corenlp-2012-07-09.jar:stanford-corenlp-2012-07-06-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props default.properties
+    java -cp stanford-corenlp-3.4.1.jar:stanford-corenlp-3.4.1-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props default.properties
 
 You can reach me, Dustin Smith, by sending a message on GitHub or through email (contact information is available [on my webpage](http://web.media.mit.edu/~dustin)).
 
 
-# Contributors
+# License & Contributors
 
 This is free and open source software and has benefited from the contribution and feedback of others. Like Stanford's CoreNLP tools, it is covered under the [GNU General Public License v2+](http://www.gnu.org/licenses/gpl-2.0.html), which in short means that modifications to this program must maintain the same free and open source distribution policy.
 
-This project has benefited from the contributions of:
+I gratefully welcome bug fixes and new features. If you have forked this repository, please submit a [pull request](https://help.github.com/articles/using-pull-requests/) so others can benefit from your contributions. This project has already benefited from contributions from these members of the open source community:
 
-* @jcc Justin Cheng
+* [Emilio Monti](https://github.com/emilmont)
+* [Justin Cheng](https://github.com/jcccf)
 * Abhaya Agarwal
 
-## Related Projects
+*Thank you!*
 
-These two projects are python wrappers for the [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml), which includes the Stanford Parser, although the Stanford Parser is another project.
-- [stanford-parser-python](http://projects.csail.mit.edu/spatial/Stanford_Parser) uses [JPype](http://jpype.sourceforge.net/) (interface to JVM)
-- [stanford-parser-jython](http://blog.gnucom.cc/2010/using-the-stanford-parser-with-jython/) uses Python
+## Related Projects
 
+Maintainers of the Core NLP library at Stanford keep an [updated list of wrappers and extensions](http://nlp.stanford.edu/software/corenlp.shtml#Extensions).

corenlp.py

Lines changed: 38 additions & 24 deletions
@@ -1,7 +1,7 @@
 #!/usr/bin/env python
 #
 # corenlp - Python interface to Stanford Core NLP tools
-# Copyright (c) 2012 Dustin Smith
+# Copyright (c) 2014 Dustin Smith
 # https://github.com/dasmith/stanford-corenlp-python
 #
 # This program is free software; you can redistribute it and/or
@@ -18,16 +18,24 @@
 # along with this program; if not, write to the Free Software
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 
-import json, optparse, os, re, sys, time, traceback
+import json
+import optparse
+import os, re, sys, time, traceback
 import jsonrpc, pexpect
 from progressbar import ProgressBar, Fraction
 from unidecode import unidecode
+import logging
 
 
 VERBOSE = True
+
 STATE_START, STATE_TEXT, STATE_WORDS, STATE_TREE, STATE_DEPENDENCY, STATE_COREFERENCE = 0, 1, 2, 3, 4, 5
 WORD_PATTERN = re.compile('\[([^\]]+)\]')
-CR_PATTERN = re.compile(r"\((\d*),(\d)*,\[(\d*),(\d*)\)\) -> \((\d*),(\d)*,\[(\d*),(\d*)\)\), that is: \"(.*)\" -> \"(.*)\"")
+CR_PATTERN = re.compile(r"\((\d*),(\d)*,\[(\d*),(\d*)\]\) -> \((\d*),(\d)*,\[(\d*),(\d*)\]\), that is: \"(.*)\" -> \"(.*)\"")
+
+# initialize logger
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
 
 
 def remove_id(word):
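
The CR_PATTERN change above is a genuine fix: the old pattern closed each `[start,end` span with `\)`, but the coreference lines CoreNLP prints close it with `]`. A quick check of the corrected pattern against a sample line (the mention text is invented for illustration):

    import re

    CR_PATTERN = re.compile(r"\((\d*),(\d)*,\[(\d*),(\d*)\]\) -> \((\d*),(\d)*,\[(\d*),(\d*)\]\), that is: \"(.*)\" -> \"(.*)\"")
    line = '(2,3,[1,4]) -> (1,1,[1,2]), that is: "the brown dog" -> "the dog"'
    # groups: sentence, head, start, end for the mention and its antecedent, then both texts
    print CR_PATTERN.match(line).groups()
    # ('2', '3', '1', '4', '1', '1', '1', '2', 'the brown dog', 'the dog')
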
@@ -65,7 +73,7 @@ def parse_parser_results(text):
     """
     results = {"sentences": []}
     state = STATE_START
-    for line in unidecode(text).split("\n"):
+    for line in text.encode('utf-8').split("\n"):
        line = line.strip()
 
        if line.startswith("Sentence #"):
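
The swap above changes how non-ASCII text is handled: `unidecode` transliterates characters to their closest ASCII equivalents, while `.encode('utf-8')` keeps them as UTF-8 bytes. A minimal Python 2 illustration:

    # -*- coding: utf-8 -*-
    from unidecode import unidecode

    s = u"naïve café"
    print unidecode(s)        # prints "naive cafe" -- old behavior, ASCII transliteration
    print s.encode('utf-8')   # prints "naïve café" on a UTF-8 terminal -- new behavior
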
@@ -120,19 +128,21 @@ class StanfordCoreNLP(object):
     Command-line interaction with Stanford's CoreNLP java utilities.
     Can be run as a JSON-RPC server or imported as a module.
     """
-    def __init__(self):
+    def __init__(self, corenlp_path=None):
         """
         Checks the location of the jar files.
         Spawns the server as a process.
         """
-        jars = ["stanford-corenlp-2012-07-09.jar",
-                "stanford-corenlp-2012-07-06-models.jar",
+        jars = ["stanford-corenlp-3.4.1.jar",
+                "stanford-corenlp-3.4.1-models.jar",
                 "joda-time.jar",
-                "xom.jar"]
+                "xom.jar",
+                "jollyday.jar"]
 
         # if CoreNLP libraries are in a different directory,
         # change the corenlp_path variable to point to them
-        corenlp_path = "stanford-corenlp-2012-07-09/"
+        if not corenlp_path:
+            corenlp_path = "./stanford-corenlp-full-2014-08-27/"
 
         java_path = "java"
         classname = "edu.stanford.nlp.pipeline.StanfordCoreNLP"
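
Since `__init__` now accepts an optional `corenlp_path`, the jars no longer have to sit in the default subdirectory. A minimal sketch (the install location below is an assumption; note the trailing slash, since jar paths are built by plain string concatenation):

    from corenlp import StanfordCoreNLP

    # point the wrapper at wherever the 3.4.1 distribution was unzipped
    nlp = StanfordCoreNLP(corenlp_path="/opt/stanford-corenlp-full-2014-08-27/")
    print nlp.parse("Hello world.")
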
@@ -144,12 +154,13 @@ def __init__(self):
         jars = [corenlp_path + jar for jar in jars]
         for jar in jars:
             if not os.path.exists(jar):
-                print "Error! Cannot locate %s" % jar
+                logger.error("Error! Cannot locate %s" % jar)
                 sys.exit(1)
 
         # spawn the server
         start_corenlp = "%s -Xmx1800m -cp %s %s %s" % (java_path, ':'.join(jars), classname, props)
-        if VERBOSE: print start_corenlp
+        if VERBOSE:
+            logger.debug(start_corenlp)
         self.corenlp = pexpect.spawn(start_corenlp)
 
         # show progress bar while loading the models
@@ -189,32 +200,33 @@ def _parse(self, text):
         # function of the text's length.
         # anything longer than 5 seconds requires that you also
         # increase timeout=5 in jsonrpc.py
-        max_expected_time = min(5, 3 + len(text) / 20.0)
+        max_expected_time = min(40, 3 + len(text) / 20.0)
         end_time = time.time() + max_expected_time
-
+
         incoming = ""
         while True:
             # Time left, read more data
             try:
                 incoming += self.corenlp.read_nonblocking(2000, 1)
-                if "\nNLP>" in incoming: break
+                if "\nNLP>" in incoming:
+                    break
                 time.sleep(0.0001)
             except pexpect.TIMEOUT:
                 if end_time - time.time() < 0:
-                    print "[ERROR] Timeout"
-                    return {'error': "timed out after %f seconds" % max_expected_time,
-                            'input': text,
-                            'output': incoming}
+                    logger.error("Error: Timeout with input '%s'" % (incoming))
+                    return {'error': "timed out after %f seconds" % max_expected_time}
                 else:
                     continue
             except pexpect.EOF:
                 break
 
-        if VERBOSE: print "%s\n%s" % ('='*40, incoming)
+        if VERBOSE:
+            logger.debug("%s\n%s" % ('='*40, incoming))
         try:
             results = parse_parser_results(incoming)
         except Exception, e:
-            if VERBOSE: print traceback.format_exc()
+            if VERBOSE:
+                logger.debug(traceback.format_exc())
             raise e
 
         return results
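
The new cap above changes the read budget in `_parse`: still 3 seconds plus one second per 20 characters of input, but now topped out at 40 seconds instead of 5. For instance:

    >>> min(40, 3 + len("Hello world.") / 20.0)   # 12 characters: budget is 3.6 s
    3.6
    >>> min(40, 3 + 1000 / 20.0)                  # 1000 characters would be 53 s; capped at 40
    40
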
@@ -225,7 +237,9 @@ def parse(self, text):
         reads in the result, parses the results and returns a list
         with one dictionary entry for each parsed sentence, in JSON format.
         """
-        return json.dumps(self._parse(text))
+        response = self._parse(text)
+        logger.debug("Response: '%s'" % (response))
+        return json.dumps(response)
 
 
 if __name__ == '__main__':
@@ -234,15 +248,15 @@ def parse(self, text):
     """
     parser = optparse.OptionParser(usage="%prog [OPTIONS]")
     parser.add_option('-p', '--port', default='8080',
-                      help='Port to serve on (default 8080)')
+                      help='Port to serve on (default: 8080)')
     parser.add_option('-H', '--host', default='127.0.0.1',
-                      help='Host to serve on (default localhost; 0.0.0.0 to make public)')
+                      help='Host to serve on (default: 127.0.0.1. Use 0.0.0.0 to make public)')
     options, args = parser.parse_args()
     server = jsonrpc.Server(jsonrpc.JsonRpc20(),
                             jsonrpc.TransportTcpIp(addr=(options.host, int(options.port))))
 
     nlp = StanfordCoreNLP()
     server.register_function(nlp.parse)
 
-    print 'Serving on http://%s:%s' % (options.host, options.port)
+    logger.info('Serving on http://%s:%s' % (options.host, options.port))
     server.serve()
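
With those options, starting the server on a public interface and a non-default port looks like this (port 8081 is just an example):

    python corenlp.py --host 0.0.0.0 --port 8081
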
