Updated CoreNLP to v3.4.1 · JHnlp/stanford-corenlp-python@5fc37fa · GitHub

Commit 5fc37fa

Updated CoreNLP to v3.4.1

1 parent 4477bf8 commit 5fc37fa

2 files changed: +69 additions, −46 deletions


README.md

Lines changed: 31 additions & 22 deletions
@@ -1,4 +1,4 @@
-# Python interface to Stanford Core NLP tools v1.3.3
+# Python interface to Stanford Core NLP tools v3.4.1
 
 This is a Python wrapper for Stanford University's NLP group's Java-based [CoreNLP tools](http://nlp.stanford.edu/software/corenlp.shtml). It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server.
 
@@ -8,23 +8,21 @@ This is a Python wrapper for Stanford University's NLP group's Java-based [CoreN
 * Outputs parse trees which can be used by [nltk](http://nltk.googlecode.com/svn/trunk/doc/howto/tree.html).
 
 
-It requires [pexpect](http://www.noah.org/wiki/pexpect) and (optionally) [unidecode](http://pypi.python.org/pypi/Unidecode) to handle non-ASCII text. This script includes and uses code from [jsonrpc](http://www.simple-is-better.org/rpc/) and [python-progressbar](http://code.google.com/p/python-progressbar/).
+It requires [pexpect](http://www.noah.org/wiki/pexpect) and [unidecode](http://pypi.python.org/pypi/Unidecode) to handle non-ASCII text. This script includes and uses code from [jsonrpc](http://www.simple-is-better.org/rpc/) and [python-progressbar](http://code.google.com/p/python-progressbar/).
 
-It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on **Core NLP tools version 1.3.3** released 2012-07-09.
+It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on **Core NLP tools version 3.4.1** released 2014-08-27.
 
-## Download and Usage
+## Download and Usage
 
-To use this program you must [download](http://nlp.stanford.edu/software/corenlp.shtml#Download) and unpack the tgz file containing Stanford's CoreNLP package. By default, `corenlp.py` looks for the Stanford Core NLP folder as a subdirectory of where the script is being run.
+To use this program you must [download](http://nlp.stanford.edu/software/corenlp.shtml#Download) and unpack the compressed file containing Stanford's CoreNLP package. By default, `corenlp.py` looks for the Stanford Core NLP folder as a subdirectory of where the script is being run. In other words:
 
-In other words:
+    sudo pip install pexpect unidecode
+    git clone git://github.com/dasmith/stanford-corenlp-python.git
+    cd stanford-corenlp-python
+    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2014-08-27.zip
+    unzip stanford-corenlp-full-2014-08-27.zip
 
-    sudo pip install pexpect unidecode  # unidecode is optional
-    git clone git://github.com/dasmith/stanford-corenlp-python.git
-    cd stanford-corenlp-python
-    wget http://nlp.stanford.edu/software/stanford-corenlp-2012-07-09.tgz
-    tar xvfz stanford-corenlp-2012-07-09.tgz
-
-Then, to launch a server:
+Then launch the server:
 
     python corenlp.py
 
@@ -39,7 +37,7 @@ Assuming you are running on port 8080, the code in `client.py` shows an example
     import jsonrpc
     from simplejson import loads
     server = jsonrpc.ServerProxy(jsonrpc.JsonRpc20(),
-                                 jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080)))
+                                jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080)))
 
     result = loads(server.parse("Hello world. It is so beautiful"))
     print "Result", result
@@ -112,8 +110,19 @@ To use it in a regular script or to edit/debug it (because errors via RPC are op
     corenlp = StanfordCoreNLP()  # wait a few minutes...
     corenlp.parse("Parse it")
 
+
+## Coreference Resolution
+
+The library supports [coreference resolution](http://en.wikipedia.org/wiki/Coreference), meaning pronouns can be "dereferenced." If an entry in the `coref` list is `[u'Hello world', 0, 1, 0, 2]`, the numbers mean:
+
+* 0 = The reference appears in the 0th sentence (e.g. "Hello world")
+* 1 = The 2nd token, "world", is the [headword](http://en.wikipedia.org/wiki/Head_%28linguistics%29) of that sentence
+* 0 = 'Hello world' begins at the 0th token in the sentence
+* 2 = 'Hello world' ends before the 2nd token in the sentence.
+
 <!--
 
+
 ## Adding WordNet
 
 Note: wordnet doesn't seem to be supported using this approach. Looks like you'll need Java.
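
The span convention in the new list above follows Python's half-open slicing, so `start` and `end` can be used directly as slice indices. A tiny illustration (the token list is invented for the example):

    # one coref mention entry: [text, sentence index, head index, start, end]
    text, sentence, head, start, end = [u'Hello world', 0, 1, 0, 2]
    tokens = ["Hello", "world", "."]   # tokens of sentence 0 (made up here)
    print tokens[start:end]            # ['Hello', 'world'] -- the mention span
    print tokens[head]                 # 'world' -- the headword
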
@@ -129,23 +138,23 @@ tar xvfz WNprolog-3.0.tar.gz
 **Stanford CoreNLP tools require a large amount of free memory**. Java 5+ uses about 50% more RAM on 64-bit machines than 32-bit machines. 32-bit machine users can lower the memory requirements by changing `-Xmx3g` to `-Xmx2g` or even less.
 If pexpect times out while loading models, check to make sure you have enough memory and can run the server alone without your kernel killing the java process:
 
-    java -cp stanford-corenlp-2012-07-09.jar:stanford-corenlp-2012-07-06-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props default.properties
+    java -cp stanford-corenlp-3.4.1.jar:stanford-corenlp-3.4.1-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props default.properties
 
 You can reach me, Dustin Smith, by sending a message on GitHub or through email (contact information is available [on my webpage](http://web.media.mit.edu/~dustin)).
 
 
-# Contributors
+# License & Contributors
 
 This is free and open source software and has benefited from the contribution and feedback of others. Like Stanford's CoreNLP tools, it is covered under the [GNU General Public License v2+](http://www.gnu.org/licenses/gpl-2.0.html), which in short means that modifications to this program must maintain the same free and open source distribution policy.
 
-This project has benefited from the contributions of:
+I gratefully welcome bug fixes and new features. If you have forked this repository, please submit a [pull request](https://help.github.com/articles/using-pull-requests/) so others can benefit from your contributions. This project has already benefited from contributions from these members of the open source community:
 
-* @jcc Justin Cheng
+* [Emilio Monti](https://github.com/emilmont)
+* [Justin Cheng](https://github.com/jcccf)
 * Abhaya Agarwal
 
-## Related Projects
+*Thank you!*
 
-These two projects are python wrappers for the [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml), which includes the Stanford Parser, although the Stanford Parser is another project.
-- [stanford-parser-python](http://projects.csail.mit.edu/spatial/Stanford_Parser) uses [JPype](http://jpype.sourceforge.net/) (interface to JVM)
-- [stanford-parser-jython](http://blog.gnucom.cc/2010/using-the-stanford-parser-with-jython/) uses Python
+## Related Projects
 
+Maintainers of the Core NLP library at Stanford keep an [updated list of wrappers and extensions](http://nlp.stanford.edu/software/corenlp.shtml#Extensions).

corenlp.py

Lines changed: 38 additions & 24 deletions
@@ -1,7 +1,7 @@
 #!/usr/bin/env python
 #
 # corenlp - Python interface to Stanford Core NLP tools
-# Copyright (c) 2012 Dustin Smith
+# Copyright (c) 2014 Dustin Smith
 # https://github.com/dasmith/stanford-corenlp-python
 #
 # This program is free software; you can redistribute it and/or
@@ -18,16 +18,24 @@
 # along with this program; if not, write to the Free Software
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
 
-import json, optparse, os, re, sys, time, traceback
+import json
+import optparse
+import os, re, sys, time, traceback
 import jsonrpc, pexpect
 from progressbar import ProgressBar, Fraction
 from unidecode import unidecode
+import logging
 
 
 VERBOSE = True
+
 STATE_START, STATE_TEXT, STATE_WORDS, STATE_TREE, STATE_DEPENDENCY, STATE_COREFERENCE = 0, 1, 2, 3, 4, 5
 WORD_PATTERN = re.compile('\[([^\]]+)\]')
-CR_PATTERN = re.compile(r"\((\d*),(\d)*,\[(\d*),(\d*)\)\) -> \((\d*),(\d)*,\[(\d*),(\d*)\)\), that is: \"(.*)\" -> \"(.*)\"")
+CR_PATTERN = re.compile(r"\((\d*),(\d)*,\[(\d*),(\d*)\]\) -> \((\d*),(\d)*,\[(\d*),(\d*)\]\), that is: \"(.*)\" -> \"(.*)\"")
+
+# initialize logger
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
 
 
 def remove_id(word):
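
The CR_PATTERN change above is a genuine fix: the old pattern closed each `[start,end` span with `\)`, but the coreference lines CoreNLP prints close it with `]`. A quick check of the corrected pattern against a sample line (the mention text is invented for illustration):

    import re

    CR_PATTERN = re.compile(r"\((\d*),(\d)*,\[(\d*),(\d*)\]\) -> \((\d*),(\d)*,\[(\d*),(\d*)\]\), that is: \"(.*)\" -> \"(.*)\"")
    line = '(2,3,[1,4]) -> (1,1,[1,2]), that is: "the brown dog" -> "the dog"'
    # groups: sentence, head, start, end for the mention and its antecedent, then both texts
    print CR_PATTERN.match(line).groups()
    # ('2', '3', '1', '4', '1', '1', '1', '2', 'the brown dog', 'the dog')
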
@@ -65,7 +73,7 @@ def parse_parser_results(text):
     """
     results = {"sentences": []}
     state = STATE_START
-    for line in unidecode(text).split("\n"):
+    for line in text.encode('utf-8').split("\n"):
        line = line.strip()
 
        if line.startswith("Sentence #"):
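
The swap above changes how non-ASCII text is handled: `unidecode` transliterates characters to their closest ASCII equivalents, while `.encode('utf-8')` keeps them as UTF-8 bytes. A minimal Python 2 illustration:

    # -*- coding: utf-8 -*-
    from unidecode import unidecode

    s = u"naïve café"
    print unidecode(s)        # prints "naive cafe" -- old behavior, ASCII transliteration
    print s.encode('utf-8')   # prints "naïve café" on a UTF-8 terminal -- new behavior
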
@@ -120,19 +128,21 @@ class StanfordCoreNLP(object):
     Command-line interaction with Stanford's CoreNLP java utilities.
     Can be run as a JSON-RPC server or imported as a module.
     """
-    def __init__(self):
+    def __init__(self, corenlp_path=None):
         """
         Checks the location of the jar files.
         Spawns the server as a process.
         """
-        jars = ["stanford-corenlp-2012-07-09.jar",
-                "stanford-corenlp-2012-07-06-models.jar",
+        jars = ["stanford-corenlp-3.4.1.jar",
+                "stanford-corenlp-3.4.1-models.jar",
                 "joda-time.jar",
-                "xom.jar"]
+                "xom.jar",
+                "jollyday.jar"]
 
         # if CoreNLP libraries are in a different directory,
         # change the corenlp_path variable to point to them
-        corenlp_path = "stanford-corenlp-2012-07-09/"
+        if not corenlp_path:
+            corenlp_path = "./stanford-corenlp-full-2014-08-27/"
 
         java_path = "java"
         classname = "edu.stanford.nlp.pipeline.StanfordCoreNLP"
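
Since `__init__` now accepts an optional `corenlp_path`, the jars no longer have to sit in the default subdirectory. A minimal sketch (the install location below is an assumption; note the trailing slash, since jar paths are built by plain string concatenation):

    from corenlp import StanfordCoreNLP

    # point the wrapper at wherever the 3.4.1 distribution was unzipped
    nlp = StanfordCoreNLP(corenlp_path="/opt/stanford-corenlp-full-2014-08-27/")
    print nlp.parse("Hello world.")
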
@@ -144,12 +154,13 @@ def __init__(self):
         jars = [corenlp_path + jar for jar in jars]
         for jar in jars:
             if not os.path.exists(jar):
-                print "Error! Cannot locate %s" % jar
+                logger.error("Error! Cannot locate %s" % jar)
                 sys.exit(1)
 
         # spawn the server
         start_corenlp = "%s -Xmx1800m -cp %s %s %s" % (java_path, ':'.join(jars), classname, props)
-        if VERBOSE: print start_corenlp
+        if VERBOSE:
+            logger.debug(start_corenlp)
         self.corenlp = pexpect.spawn(start_corenlp)
 
         # show progress bar while loading the models
@@ -189,32 +200,33 @@ def _parse(self, text):
         # function of the text's length.
         # anything longer than 5 seconds requires that you also
         # increase timeout=5 in jsonrpc.py
-        max_expected_time = min(5, 3 + len(text) / 20.0)
+        max_expected_time = min(40, 3 + len(text) / 20.0)
         end_time = time.time() + max_expected_time
-
+
         incoming = ""
         while True:
             # Time left, read more data
             try:
                 incoming += self.corenlp.read_nonblocking(2000, 1)
-                if "\nNLP>" in incoming: break
+                if "\nNLP>" in incoming:
+                    break
                 time.sleep(0.0001)
             except pexpect.TIMEOUT:
                 if end_time - time.time() < 0:
-                    print "[ERROR] Timeout"
-                    return {'error': "timed out after %f seconds" % max_expected_time,
-                            'input': text,
-                            'output': incoming}
+                    logger.error("Error: Timeout with input '%s'" % (incoming))
+                    return {'error': "timed out after %f seconds" % max_expected_time}
                 else:
                     continue
             except pexpect.EOF:
                 break
 
-        if VERBOSE: print "%s\n%s" % ('='*40, incoming)
+        if VERBOSE:
+            logger.debug("%s\n%s" % ('='*40, incoming))
         try:
             results = parse_parser_results(incoming)
         except Exception, e:
-            if VERBOSE: print traceback.format_exc()
+            if VERBOSE:
+                logger.debug(traceback.format_exc())
             raise e
 
         return results
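
The new cap above changes the read budget in `_parse`: still 3 seconds plus one second per 20 characters of input, but now topped out at 40 seconds instead of 5. For instance:

    >>> min(40, 3 + len("Hello world.") / 20.0)   # 12 characters: budget is 3.6 s
    3.6
    >>> min(40, 3 + 1000 / 20.0)                  # 1000 characters would be 53 s; capped at 40
    40
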
@@ -225,7 +237,9 @@ def parse(self, text):
         reads in the result, parses the results and returns a list
         with one dictionary entry for each parsed sentence, in JSON format.
         """
-        return json.dumps(self._parse(text))
+        response = self._parse(text)
+        logger.debug("Response: '%s'" % (response))
+        return json.dumps(response)
 
 
 if __name__ == '__main__':
@@ -234,15 +248,15 @@ def parse(self, text):
     """
     parser = optparse.OptionParser(usage="%prog [OPTIONS]")
     parser.add_option('-p', '--port', default='8080',
-                      help='Port to serve on (default 8080)')
+                      help='Port to serve on (default: 8080)')
     parser.add_option('-H', '--host', default='127.0.0.1',
-                      help='Host to serve on (default localhost; 0.0.0.0 to make public)')
+                      help='Host to serve on (default: 127.0.0.1. Use 0.0.0.0 to make public)')
     options, args = parser.parse_args()
     server = jsonrpc.Server(jsonrpc.JsonRpc20(),
                             jsonrpc.TransportTcpIp(addr=(options.host, int(options.port))))
 
     nlp = StanfordCoreNLP()
     server.register_function(nlp.parse)
 
-    print 'Serving on http://%s:%s' % (options.host, options.port)
+    logger.info('Serving on http://%s:%s' % (options.host, options.port))
     server.serve()
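
With those options, starting the server on a public interface and a non-default port looks like this (port 8081 is just an example):

    python corenlp.py --host 0.0.0.0 --port 8081
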
