bugfix · kowey/stanford-corenlp-python@65f5c53 · GitHub

Commit 65f5c53

committed
bugfix
1 parent a78f131 commit 65f5c53

File tree

2 files changed: +70 -198 lines changed

README.md

Lines changed: 7 additions & 156 deletions
@@ -7,15 +7,15 @@ This is a fork of [stanford-corenlp-python](https://github.com/dasmith/stanford-corenlp-python
 * Update to Stanford CoreNLP v1.3.5
 * Fix many bugs & improve performance
 * Using jsonrpclib for stability and performance
-* Can edit the constants as argument such as Stanford Core NLP directory.
+* Can edit the constants as argument such as Stanford Core NLP directory
 * Adjust parameters not to timeout in high load
-* Fix a problem on input long texts by Johannes Castner [stanford-corenlp-python](https://github.com/jac2130/stanford-corenlp-python)
+* Fix a problem on input long texts, by Johannes Castner [stanford-corenlp-python](https://github.com/jac2130/stanford-corenlp-python)
 * Packaging

 ## Requirements
-* [jsonrpclib](https://github.com/joshmarshall/jsonrpclib)
 * [pexpect](http://www.noah.org/wiki/pexpect)
-* [unidecode](http://pypi.python.org/pypi/Unidecode) (optionally)
+* [unidecode](http://pypi.python.org/pypi/Unidecode)
+* [jsonrpclib](https://github.com/joshmarshall/jsonrpclib) (optionally)

 ## Download and Usage

@@ -124,165 +124,16 @@ Not to use JSON-RPC, load the module instead:

 If you need to parse long texts (more than 30-50 sentences), you have to use a batch_parse() function. It reads text files from input directory and returns a generator object of dictionaries parsed each file results:

-    from corenlp import batch_process
+    from corenlp import batch_parse
+    corenlp_dir = "stanford-corenlp-full-2013-04-04/"
     raw_text_directory = "sample_raw_text/"
-    parsed = batch_process(raw_text_directory)  # It returns a generator object
+    parsed = batch_process(raw_text_directory, corenlp_dir)  # It returns a generator object
     print parsed #=> [{'coref': ..., 'sentences': ..., 'file_name': 'new_sample.txt'}]

 ## Developer
 * Hiroyoshi Komatsu [hiroyoshi.komat@gmail.com]
 * Johannes Castner [jac2130@columbia.edu]

-
-Following are the README in original stanford-corenlp-python.
-
--------------------------------------
-
-Python interface to Stanford Core NLP tools v1.3.3
-
-This is a Python wrapper for Stanford University's NLP group's Java-based [CoreNLP tools](http://nlp.stanford.edu/software/corenlp.shtml). It can either be imported as a module or run as a JSON-RPC server. Because it uses many large trained models (requiring 3GB RAM on 64-bit machines and usually a few minutes loading time), most applications will probably want to run it as a server.
-
-
-* Python interface to Stanford CoreNLP tools: tagging, phrase-structure parsing, dependency parsing, named entity resolution, and coreference resolution.
-* Runs an JSON-RPC server that wraps the Java server and outputs JSON.
-* Outputs parse trees which can be used by [nltk](http://nltk.googlecode.com/svn/trunk/doc/howto/tree.html).
-
-
-It requires [pexpect](http://www.noah.org/wiki/pexpect) and (optionally) [unidecode](http://pypi.python.org/pypi/Unidecode) to handle non-ASCII text. This script includes and uses code from [jsonrpc](http://www.simple-is-better.org/rpc/) and [python-progressbar](http://code.google.com/p/python-progressbar/).
-
-It runs the Stanford CoreNLP jar in a separate process, communicates with the java process using its command-line interface, and makes assumptions about the output of the parser in order to parse it into a Python dict object and transfer it using JSON. The parser will break if the output changes significantly, but it has been tested on **Core NLP tools version 1.3.3** released 2012-07-09.
-
-## Download and Usage
-
-To use this program you must [download](http://nlp.stanford.edu/software/corenlp.shtml#Download) and unpack the tgz file containing Stanford's CoreNLP package. By default, `corenlp.py` looks for the Stanford Core NLP folder as a subdirectory of where the script is being run.
-
-In other words:
-
-    sudo pip install pexpect unidecode # unidecode is optional
-    git clone git://github.com/dasmith/stanford-corenlp-python.git
-    cd stanford-corenlp-python
-    wget http://nlp.stanford.edu/software/stanford-corenlp-2012-07-09.tgz
-    tar xvfz stanford-corenlp-2012-07-09.tgz
-
-Then, to launch a server:
-
-    python corenlp.py
-
-Optionally, you can specify a host or port:
-
-    python corenlp.py -H 0.0.0.0 -p 3456
-
-That will run a public JSON-RPC server on port 3456.
-
-Assuming you are running on port 8080, the code in `client.py` shows an example parse:
-
-    import jsonrpc
-    from simplejson import loads
-    server = jsonrpc.ServerProxy(jsonrpc.JsonRpc20(),
-                                 jsonrpc.TransportTcpIp(addr=("127.0.0.1", 8080)))
-
-    result = loads(server.parse("Hello world. It is so beautiful"))
-    print "Result", result
-
-That returns a dictionary containing the keys `sentences` and (when applicable) `corefs`. The key `sentences` contains a list of dictionaries for each sentence, which contain `parsetree`, `text`, `tuples` containing the dependencies, and `words`, containing information about parts of speech, NER, etc:
-
-    {u'sentences': [{u'parsetree': u'(ROOT (S (VP (NP (INTJ (UH Hello)) (NP (NN world)))) (. !)))',
-                     u'text': u'Hello world!',
-                     u'tuples': [[u'dep', u'world', u'Hello'],
-                                 [u'root', u'ROOT', u'world']],
-                     u'words': [[u'Hello',
-                                 {u'CharacterOffsetBegin': u'0',
-                                  u'CharacterOffsetEnd': u'5',
-                                  u'Lemma': u'hello',
-                                  u'NamedEntityTag': u'O',
-                                  u'PartOfSpeech': u'UH'}],
-                                [u'world',
-                                 {u'CharacterOffsetBegin': u'6',
-                                  u'CharacterOffsetEnd': u'11',
-                                  u'Lemma': u'world',
-                                  u'NamedEntityTag': u'O',
-                                  u'PartOfSpeech': u'NN'}],
-                                [u'!',
-                                 {u'CharacterOffsetBegin': u'11',
-                                  u'CharacterOffsetEnd': u'12',
-                                  u'Lemma': u'!',
-                                  u'NamedEntityTag': u'O',
-                                  u'PartOfSpeech': u'.'}]]},
-                    {u'parsetree': u'(ROOT (S (NP (PRP It)) (VP (VBZ is) (ADJP (RB so) (JJ beautiful))) (. .)))',
-                     u'text': u'It is so beautiful.',
-                     u'tuples': [[u'nsubj', u'beautiful', u'It'],
-                                 [u'cop', u'beautiful', u'is'],
-                                 [u'advmod', u'beautiful', u'so'],
-                                 [u'root', u'ROOT', u'beautiful']],
-                     u'words': [[u'It',
-                                 {u'CharacterOffsetBegin': u'14',
-                                  u'CharacterOffsetEnd': u'16',
-                                  u'Lemma': u'it',
-                                  u'NamedEntityTag': u'O',
-                                  u'PartOfSpeech': u'PRP'}],
-                                [u'is',
-                                 {u'CharacterOffsetBegin': u'17',
-                                  u'CharacterOffsetEnd': u'19',
-                                  u'Lemma': u'be',
-                                  u'NamedEntityTag': u'O',
-                                  u'PartOfSpeech': u'VBZ'}],
-                                [u'so',
-                                 {u'CharacterOffsetBegin': u'20',
-                                  u'CharacterOffsetEnd': u'22',
-                                  u'Lemma': u'so',
-                                  u'NamedEntityTag': u'O',
-                                  u'PartOfSpeech': u'RB'}],
-                                [u'beautiful',
-                                 {u'CharacterOffsetBegin': u'23',
-                                  u'CharacterOffsetEnd': u'32',
-                                  u'Lemma': u'beautiful',
-                                  u'NamedEntityTag': u'O',
-                                  u'PartOfSpeech': u'JJ'}],
-                                [u'.',
-                                 {u'CharacterOffsetBegin': u'32',
-                                  u'CharacterOffsetEnd': u'33',
-                                  u'Lemma': u'.',
-                                  u'NamedEntityTag': u'O',
-                                  u'PartOfSpeech': u'.'}]]}],
-     u'coref': [[[[u'It', 1, 0, 0, 1], [u'Hello world', 0, 1, 0, 2]]]]}
-
-To use it in a regular script or to edit/debug it (because errors via RPC are opaque), load the module instead:
-
-    from corenlp import *
-    corenlp = StanfordCoreNLP() # wait a few minutes...
-    corenlp.parse("Parse it")
-
-<!--
-
-## Adding WordNet
-
-Note: wordnet doesn't seem to be supported using this approach. Looks like you'll need Java.
-
-Download WordNet-3.0 Prolog: http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz
-tar xvfz WNprolog-3.0.tar.gz
-
--->
-
-
-## Questions
-
-**Stanford CoreNLP tools require a large amount of free memory**. Java 5+ uses about 50% more RAM on 64-bit machines than 32-bit machines. 32-bit machine users can lower the memory requirements by changing `-Xmx3g` to `-Xmx2g` or even less.
-If pexpect timesout while loading models, check to make sure you have enough memory and can run the server alone without your kernel killing the java process:
-
-    java -cp stanford-corenlp-2012-07-09.jar:stanford-corenlp-2012-07-06-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props default.properties
-
-You can reach me, Dustin Smith, by sending a message on GitHub or through email (contact information is available [on my webpage](http://web.media.mit.edu/~dustin)).
-
-
-# Contributors
-
-This is free and open source software and has benefited from the contribution and feedback of others. Like Stanford's CoreNLP tools, it is covered under the [GNU General Public License v2 +](http://www.gnu.org/licenses/gpl-2.0.html), which in short means that modifications to this program must maintain the same free and open source distribution policy.
-
-This project has benefited from the contributions of:
-
-* @jcc Justin Cheng
-* Abhaya Agarwal
-
 ## Related Projects

 These two projects are python wrappers for the [Stanford Parser](http://nlp.stanford.edu/software/lex-parser.shtml), which includes the Stanford Parser, although the Stanford Parser is another project.
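For illustration, a minimal usage sketch of the batch interface as this commit leaves it. Note that the README snippet above still calls batch_process in the assignment line while importing batch_parse; the function actually defined in corenlp/corenlp.py is batch_parse, so the sketch below assumes that name, and the directory paths are placeholders:

    from corenlp import batch_parse

    corenlp_dir = "stanford-corenlp-full-2013-04-04/"  # unpacked CoreNLP distribution (placeholder path)
    raw_text_directory = "sample_raw_text/"            # folder of plain-text files to parse (placeholder path)

    # Runs the CoreNLP jar over every file in the folder and collects one
    # result per input file, each with 'sentences', 'file_name', and,
    # when coreference was found, 'coref'.
    parsed = batch_parse(raw_text_directory, corenlp_dir)
    for doc in parsed:
        print doc['file_name'], len(doc['sentences'])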

corenlp/corenlp.py

Lines changed: 63 additions & 42 deletions
@@ -25,7 +25,7 @@
 import shutil
 from progressbar import ProgressBar, Fraction
 from unidecode import unidecode
-from jsonrpclib.SimpleJSONRPCServer import SimpleJSONRPCServer
+from subprocess import call

 VERBOSE = False
 STATE_START, STATE_TEXT, STATE_WORDS, STATE_TREE, STATE_DEPENDENCY, STATE_COREFERENCE = 0, 1, 2, 3, 4, 5
@@ -194,40 +194,49 @@ def extract_words_from_xml(sent_node):
         exted = map(lambda x: x['word'], sent_node['tokens']['token'])
         return exted

-    #turning the raw xml into a raw python dictionary:
+    # Turning the raw xml into a raw python dictionary:
     raw_dict = xmltodict.parse(xml)
+    document = raw_dict[u'root'][u'document']
+
+    # Making a raw sentence list of dictionaries:
+    raw_sent_list = document[u'sentences'][u'sentence']
+
+    if document.get(u'coreference') and document[u'coreference'].get(u'coreference'):
+        # Convert coreferences to the format like python
+        coref_flag = True
+
+        # Making a raw coref dictionary:
+        raw_coref_list = document[u'coreference'][u'coreference']
+
+        # To dicrease is for given index different from list index
+        coref_index = [[[int(raw_coref_list[j][u'mention'][i]['sentence'])-1,
+                         int(raw_coref_list[j][u'mention'][i]['head'])-1,
+                         int(raw_coref_list[j][u'mention'][i]['start'])-1,
+                         int(raw_coref_list[j][u'mention'][i]['end'])-1]
+                        for i in xrange(len(raw_coref_list[j][u'mention']))]
+                       for j in xrange(len(raw_coref_list))]
+
+        coref_list = []
+        for j in xrange(len(coref_index)):
+            coref_list.append(coref_index[j])
+            for k, coref in enumerate(coref_index[j]):
+                exted = raw_sent_list[coref[0]]['tokens']['token'][coref[2]:coref[3]]
+                exted_words = map(lambda x: x['word'], exted)
+                coref_list[j][k].insert(0, ' '.join(exted_words))
+
+        coref_list = [[[coref_list[j][i], coref_list[j][0]]
+                       for i in xrange(len(coref_list[j])) if i != 0]
+                      for j in xrange(len(coref_list))]
+    else:
+        coref_flag = False

-    #making a raw sentence list of dictionaries:
-    raw_sent_list = raw_dict[u'root'][u'document'][u'sentences'][u'sentence']
-    #making a raw coref dictionary:
-    raw_coref_list = raw_dict[u'root'][u'document'][u'coreference'][u'coreference']
-
-    #cleaning up the list ...the problem is that this doesn't come in pairs, as the command line version:
-
-    # To dicrease is for given index different from list index
-    coref_index = [[[eval(raw_coref_list[j][u'mention'][i]['sentence'])-1,
-                     eval(raw_coref_list[j][u'mention'][i]['head'])-1,
-                     eval(raw_coref_list[j][u'mention'][i]['start'])-1,
-                     eval(raw_coref_list[j][u'mention'][i]['end'])-1]
-                    for i in xrange(len(raw_coref_list[j][u'mention']))]
-                   for j in xrange(len(raw_coref_list))]
-
-    coref_list = []
-    for j in xrange(len(coref_index)):
-        coref_list.append(coref_index[j])
-        for k, coref in enumerate(coref_index[j]):
-            exted = raw_sent_list[coref[0]]['tokens']['token'][coref[2]:coref[3]]
-            exted_words = map(lambda x: x['word'], exted)
-            coref_list[j][k].insert(0, ' '.join(exted_words))
-
-    coref_list = [[[coref_list[j][i], coref_list[j][0]]
-                   for i in xrange(len(coref_list[j])) if i != 0]
-                  for j in xrange(len(coref_list))]
-
+    # Convert sentences to the format like python
+    # TODO: If there is only one sentence in input sentence,
+    # raw_sent_list is dict and cannot decode following code...
     sentences = [{'dependencies': [[dep['dep'][i]['@type'],
                                     dep['dep'][i]['governor']['#text'],
                                     dep['dep'][i]['dependent']['#text']]
-                                   for dep in raw_sent_list[j][u'dependencies']
+                                   for dep in raw_sent_list.values()[j][u'dependencies']
                                    for i in xrange(len(dep['dep']))
                                    if dep['@type']=='basic-dependencies'],
                   'text': extract_words_from_xml(raw_sent_list[j]),
@@ -238,11 +247,15 @@ def extract_words_from_xml(sent_node):
                                  ('CharacterOffsetBegin', str(token['CharacterOffsetBegin'])),
                                  ('PartOfSpeech', str(token['POS'])),
                                  ('Lemma', str(token['lemma']))])]
-                                for token in raw_sent_list[j]['tokens'][u'token']]}
+                                for token in raw_sent_list[j][u'tokens'][u'token']]}

-                 for j in xrange(len(raw_sent_list))]
+                 for j in xrange(len(raw_sent_list)) ]
+
+    if coref_flag:
+        results = {'coref':coref_list, 'sentences':sentences}
+    else:
+        results = {'sentences': sentences}

-    results = {'coref':coref_list, 'sentences':sentences}
     if file_name:
         results['file_name'] = file_name

@@ -261,7 +274,6 @@ def parse_xml_output(input_dir, corenlp_path="stanford-corenlp-full-2013-04-04/"
     #we get a list of the cleaned files that we want to parse:

     files = [input_dir+'/'+f for f in os.listdir(input_dir)]
-    file_name = re.sub('.xml$', '', f)

     #creating the file list of files to parse

@@ -273,19 +285,20 @@ def parse_xml_output(input_dir, corenlp_path="stanford-corenlp-full-2013-04-04/"

     #creates the xml file of parser output:

-    os.system(command)
+    call(command, shell=True)

     #reading in the raw xml file:
+    result = []
     try:
         for output_file in os.listdir(xml_dir):
             with open(xml_dir+'/'+output_file, 'r') as xml:
-                parsed = xml.read()
-                yield parse_parser_xml_results(parsed, file_name)
+                # parsed = xml.read()
+                file_name = re.sub('.xml$', '', os.path.basename(output_file))
+                result.append(parse_parser_xml_results(xml.read(), file_name))
     finally:
         file_list.close()
-        try:
-            shutil.rmtree(xml_dir)
-        except: pass
+        shutil.rmtree(xml_dir)
+    return result

 class StanfordCoreNLP:
     """
@@ -366,11 +379,12 @@ def clean_up():
         max_expected_time = max(300.0, len(to_send) / 3.0)

         # repeated_input = self.corenlp.except("\n") # confirm it
-        t = self.corenlp.expect(["\nNLP> ", pexpect.TIMEOUT, pexpect.EOF],
+        t = self.corenlp.expect(["\nNLP> ", pexpect.TIMEOUT, pexpect.EOF,
+                                 "\nWARNING: Parsing of sentence failed, possibly because of out of memory."],
                                 timeout=max_expected_time)
         incoming = self.corenlp.before
         if t == 1:
-            # TIMEOUT, clean up anything when raise pexpect.TIMEOUT error
+            # TIMEOUT, clean up anything left in buffer
             clean_up()
             print >>sys.stderr, {'error': "timed out after %f seconds" % max_expected_time,
                                  'input': to_send,
@@ -383,6 +397,12 @@ def clean_up():
                                  'output': incoming}
             self.corenlp.close()
             raise ProcessError("CoreNLP process terminates abnormally while parsing")
+        elif t == 3:
+            # out of memory
+            print >>sys.stderr, {'error': "WARNING: Parsing of sentence failed, possibly because of out of memory.",
+                                 'input': to_send,
+                                 'output': incoming}
+            return

         if VERBOSE: print "%s\n%s" % ('='*40, incoming)
         try:
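The expect() call above returns the index of whichever pattern matches first in the child's output, which is what the new t == 3 branch keys on: the out-of-memory warning is reported and the sentence skipped instead of waiting for a timeout. A small stand-alone sketch of that idiom; the spawned command is only a stand-in that prints the warning text, not CoreNLP itself:

    import sys
    import pexpect

    # Stand-in process that emits the same warning line CoreNLP prints.
    child = pexpect.spawn('/bin/echo',
                          ['WARNING: Parsing of sentence failed, possibly because of out of memory.'])

    t = child.expect(["NLP> ",          # 0: prompt came back, parse succeeded
                      pexpect.TIMEOUT,  # 1: nothing arrived in time
                      pexpect.EOF,      # 2: the process died
                      "WARNING: Parsing of sentence failed, possibly because of out of memory."],  # 3
                     timeout=5)

    if t == 3:
        # Same branch as the new `elif t == 3` above: report and move on.
        print >>sys.stderr, {'error': 'out of memory while parsing', 'output': child.before}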
@@ -429,6 +449,7 @@ def batch_parse(input_folder, corenlp_path="stanford-corenlp-full-2013-04-04/",
     """
     The code below starts an JSONRPC server
     """
+    from jsonrpclib.SimpleJSONRPCServer import SimpleJSONRPCServer
     VERBOSE = True
     parser = optparse.OptionParser(usage="%prog [OPTIONS]")
     parser.add_option('-p', '--port', default='8080',
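Deferring the jsonrpclib import into the __main__ block is what lets the README list jsonrpclib as optional: importing corenlp just for batch_parse no longer requires it, and the dependency is only touched when the module is run as a server. A rough sketch of what that server block wires together, based on the README's defaults; the host, port, and registered method here are assumptions, not copied from the file:

    from jsonrpclib.SimpleJSONRPCServer import SimpleJSONRPCServer
    from corenlp import StanfordCoreNLP

    server = SimpleJSONRPCServer(('127.0.0.1', 8080))  # default port mentioned in the README (assumed)
    nlp = StanfordCoreNLP()                            # loads the models; takes a few minutes
    server.register_function(nlp.parse)                # expose parse() over JSON-RPC
    server.serve_forever()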

0 commit comments