Adding all the parsed stuff and some updated values in the jsonrpc · aped/stanford-corenlp-python@3fb0056 · GitHub

Commit 3fb0056

Adding all the parsed stuff and some updated values in the jsonrpc
and corenlp files, to permit larger memory overhead and longer timeout tolerance.
1 parent 4477bf8 commit 3fb0056

73 files changed: +784135 -8 lines changed

Large diffs are not rendered by default. Lines changed in the unrendered files:

1.json: 4459 additions & 0 deletions
10.json: 7429 additions & 0 deletions
11.json: 13883 additions & 0 deletions
12.json: 14818 additions & 0 deletions
13.json: 5548 additions & 0 deletions
14.json: 4545 additions & 0 deletions
15.json: 3051 additions & 0 deletions
16.json: 14549 additions & 0 deletions
17.json: 13915 additions & 0 deletions
18.json: 10922 additions & 0 deletions
19.json: 5865 additions & 0 deletions
2.json: 13787 additions & 0 deletions
20.json: 3656 additions & 0 deletions
21.json: 9562 additions & 0 deletions
22.json: 10954 additions & 0 deletions
23.json: 10644 additions & 0 deletions
24.json: 13017 additions & 0 deletions
25.json: 13213 additions & 0 deletions
26.json: 13605 additions & 0 deletions
27.json: 6175 additions & 0 deletions
28.json: 12845 additions & 0 deletions
29.json: 13481 additions & 0 deletions
3.json: 12618 additions & 0 deletions
30.json: 13574 additions & 0 deletions
31.json: 11812 additions & 0 deletions
32.json: 11905 additions & 0 deletions
33.json: 11607 additions & 0 deletions
34.json: 2935 additions & 0 deletions
35.json: 11344 additions & 0 deletions
36.json: 13409 additions & 0 deletions
37.json: 10744 additions & 0 deletions
38.json: 12807 additions & 0 deletions
39.json: 12423 additions & 0 deletions
4.json: 14252 additions & 0 deletions
40.json: 12682 additions & 0 deletions
41.json: 11305 additions & 0 deletions
42.json: 7998 additions & 0 deletions
43.json: 11663 additions & 0 deletions
44.json: 9726 additions & 0 deletions
45.json: 12390 additions & 0 deletions
46.json: 12247 additions & 0 deletions
47.json: 13144 additions & 0 deletions
48.json: 12692 additions & 0 deletions
49.json: 13271 additions & 0 deletions
5.json: 13294 additions & 0 deletions
50.json: 13886 additions & 0 deletions
51.json: 6919 additions & 0 deletions
52.json: 13929 additions & 0 deletions
53.json: 11535 additions & 0 deletions
54.json: 12930 additions & 0 deletions
55.json: 9735 additions & 0 deletions
56.json: 12785 additions & 0 deletions
57.json: 11881 additions & 0 deletions
58.json: 12951 additions & 0 deletions
59.json: 12827 additions & 0 deletions
6.json: 13375 additions & 0 deletions
60.json: 12364 additions & 0 deletions
61.json: 13502 additions & 0 deletions
62.json: 9539 additions & 0 deletions
63.json: 12723 additions & 0 deletions
64.json: 13349 additions & 0 deletions
65.json: 13666 additions & 0 deletions
66.json: 11587 additions & 0 deletions
67.json: 12946 additions & 0 deletions
68.json: 11869 additions & 0 deletions
69.json: 12318 additions & 0 deletions
7.json: 13424 additions & 0 deletions
8.json: 12117 additions & 0 deletions
9.json: 12164 additions & 0 deletions

corenlp.py

Lines changed: 7 additions & 7 deletions
@@ -148,22 +148,22 @@ def __init__(self):
                 sys.exit(1)
 
         # spawn the server
-        start_corenlp = "%s -Xmx1800m -cp %s %s %s" % (java_path, ':'.join(jars), classname, props)
+        start_corenlp = "%s -Xmx5g -cp %s %s %s" % (java_path, ':'.join(jars), classname, props)
         if VERBOSE: print start_corenlp
         self.corenlp = pexpect.spawn(start_corenlp)
 
         # show progress bar while loading the models
         widgets = ['Loading Models: ', Fraction()]
         pbar = ProgressBar(widgets=widgets, maxval=5, force_update=True).start()
-        self.corenlp.expect("done.", timeout=20) # Load pos tagger model (~5sec)
+        self.corenlp.expect("done.", timeout=200) # Load pos tagger model (~5sec)
         pbar.update(1)
-        self.corenlp.expect("done.", timeout=200) # Load NER-all classifier (~33sec)
+        self.corenlp.expect("done.", timeout=2000) # Load NER-all classifier (~33sec)
         pbar.update(2)
-        self.corenlp.expect("done.", timeout=600) # Load NER-muc classifier (~60sec)
+        self.corenlp.expect("done.", timeout=6000) # Load NER-muc classifier (~60sec)
         pbar.update(3)
-        self.corenlp.expect("done.", timeout=600) # Load CoNLL classifier (~50sec)
+        self.corenlp.expect("done.", timeout=6000) # Load CoNLL classifier (~50sec)
         pbar.update(4)
-        self.corenlp.expect("done.", timeout=200) # Loading PCFG (~3sec)
+        self.corenlp.expect("done.", timeout=2000) # Loading PCFG (~3sec)
         pbar.update(5)
         self.corenlp.expect("Entering interactive shell.")
         pbar.finish()
@@ -189,7 +189,7 @@ def _parse(self, text):
         # function of the text's length.
         # anything longer than 5 seconds requires that you also
         # increase timeout=5 in jsonrpc.py
-        max_expected_time = min(5, 3 + len(text) / 20.0)
+        max_expected_time = max(30, 3 + len(text) / 20.0)
         end_time = time.time() + max_expected_time
 
         incoming = ""
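
The change to `max_expected_time` in `_parse` swaps a 5-second ceiling for a 30-second floor, which is easy to misread in diff form. The following is a minimal Python 3 sketch of the two formulas side by side. The function names `old_budget` and `new_budget` and the sample texts are hypothetical, introduced only for illustration; the arithmetic is copied from the two diff lines.

```python
def old_budget(text):
    """Pre-commit rule: the wait is capped at 5 seconds."""
    return min(5, 3 + len(text) / 20.0)


def new_budget(text):
    """Post-commit rule: the wait is at least 30 seconds and grows with length."""
    return max(30, 3 + len(text) / 20.0)


# Hypothetical inputs: the linear estimate is 3 + len/20 seconds.
short_text = "x" * 20     # linear estimate: 4 seconds
long_text = "x" * 1000    # linear estimate: 53 seconds

for label, text in (("short", short_text), ("long", long_text)):
    print(label, old_budget(text), new_budget(text))
```

Under the new rule the budget has no upper bound: any text longer than 940 characters yields more than 50 seconds. If the client-side socket timeout in `jsonrpc.py` bounds the whole call (an assumption, not something this diff shows), such requests could still time out on the client before the parser gives up.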

jsonrpc.py

Lines changed: 1 addition & 1 deletion
@@ -743,7 +743,7 @@ class TransportSocket(Transport):
         - improve this (e.g. make sure that connections are closed, socket-files are deleted etc.)
         - exception-handling? (socket.error)
     """
-    def __init__( self, addr, limit=4096, sock_type=socket.AF_INET, sock_prot=socket.SOCK_STREAM, timeout=5.0, logfunc=log_dummy ):
+    def __init__( self, addr, limit=4096, sock_type=socket.AF_INET, sock_prot=socket.SOCK_STREAM, timeout=50.0, logfunc=log_dummy ):
         """
         :Parameters:
             - addr: socket-address
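
For readers unfamiliar with socket timeouts, here is a minimal Python 3 sketch of what a 50-second default means at the standard-library level. It assumes that `TransportSocket` eventually passes its `timeout` argument to `socket.settimeout`, which this hunk does not show; `TIMEOUT_SECONDS` is a hypothetical name. No connection is made.

```python
import socket

# Assumed value, taken from the new default in the diff above.
TIMEOUT_SECONDS = 50.0

# Same address family and socket type as the defaults in the signature.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # After this call, blocking operations such as connect() and recv()
    # raise a timeout error if they take longer than TIMEOUT_SECONDS.
    sock.settimeout(TIMEOUT_SECONDS)
    configured = sock.gettimeout()
    print(configured)
finally:
    sock.close()
```

The timeout applies to each blocking operation separately, so a response that arrives in several slow chunks can take longer than 50 seconds in total without raising.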

parse_protests.py

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+#!/usr/bin/env python2.7
+
+import json
+from corenlp import StanfordCoreNLP
+import codecs
+from lxml.html import parse
+from lxml.html.clean import clean_html
+from StringIO import StringIO
+import pdb
+
+# Get the corpus-file open
+corpusjson = 'protest.json'
+jsonobject = json.load(codecs.open(corpusjson))
+
+
+# Get and clean the text:
+texts = (clean_html(parse(StringIO(obj[4].replace("\n", " ")))).getroot().text_content() for obj in jsonobject)
+print "Story text generator object created."
+
+
+# Turn it into a string object, then an html object, then back into string...
+#texts = clean_html(parse(StringIO(text))).getroot().text_content()
+
+print "Setting up parser: "
+# Set up the parser
+stanford_parser = StanfordCoreNLP()
+
+print "Creating parser generator object: "
+# Parse dat.
+parsed_texts = (stanford_parser.parse(unicode(text)) for text in texts)
+
+# Save the result to a file
+# Not sure how enumerate() works with generators; ostensibly a wrapper which
+# retains laziness, but I don't wanna risk it and introduce more variables.
+i = 0 # So, it's gross. Whatever.
+for story in parsed_texts:
+    i += 1
+    with codecs.open(str(i)+".json", 'w') as fh:
+        json.dump(json.loads(story), fh, indent=2)
+
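
The comment in `parse_protests.py` about `enumerate()` and generators can be checked directly. Below is a minimal Python 3 sketch showing that `enumerate` pulls items from a generator one at a time, so the manual counter could be replaced with `enumerate(parsed_texts, start=1)` without losing laziness. The `parsed_stories` generator and the `consumed` list are hypothetical stand-ins for `parsed_texts`; no parser is involved.

```python
consumed = []


def parsed_stories():
    """Hypothetical stand-in for the lazy parsed_texts generator."""
    for name in ("first", "second", "third"):
        consumed.append(name)  # record when each item is actually produced
        yield name


# Wrapping the generator does not pull any items from it.
numbered = enumerate(parsed_stories(), start=1)
assert consumed == []

# Each next() call produces exactly one item.
first_index, first_story = next(numbered)
assert consumed == ["first"]

# The remaining items are produced only as the loop asks for them.
filenames = ["%d.json" % first_index]
for index, story in numbered:
    filenames.append("%d.json" % index)

print(filenames)
```

With `start=1` the indices match the `1.json`, `2.json`, ... naming that the script produces with its manual counter.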

protest.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
