wikipedia-kyoto-japanese-english: increase REXML entity expansion lim… · red-data-tools/red-datasets@a76b917 · GitHub

Commit a76b917

wikipedia-kyoto-japanese-english: increase REXML entity expansion limit during XML parsing (#198)

Using `Datasets::WikipediaKyotoJapaneseEnglish#each` raised an `entity expansion has grown too large (RuntimeError)`. This error occurs because the entity expansion limit in REXML is set by ruby/rexml#187, and `Datasets::WikipediaKyotoJapaneseEnglish#each` exceeds that limit.

In Red Datasets, increasing the entity expansion limit is not a problem because we want to handle large datasets. Therefore, we temporarily increase the limit.

## How to reproduce

```console
$ cd red-datasets && bundle
$ bundle exec ruby example/wikipedia-kyoto-japanese-english.rb
...
/home/otegami/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/rexml-3.3.4/lib/rexml/parsers/baseparser.rb:560:in `block in unnormalize': entity expansion has grown too large (RuntimeError)
...
```

1 parent 4ebf6ff commit a76b917

1 file changed: +14 -1 lines changed

lib/datasets/wikipedia-kyoto-japanese-english.rb

Lines changed: 14 additions & 1 deletion
@@ -89,7 +89,9 @@ def each(&block)
           next unless base_name.end_with?(".xml")
           listener = ArticleListener.new(block)
           parser = REXML::Parsers::StreamParser.new(entry.read, listener)
-          parser.parse
+          with_increased_entity_expansion_text_limit do
+            parser.parse
+          end
         when :lexicon
           next unless base_name == "kyoto_lexicon.csv"
           is_header = true
@@ -106,6 +108,9 @@ def each(&block)
     end
 
     private
+
+    ENTITY_EXPANSION_TEXT_LIMIT = 163_840
+
     def download_tar_gz
       base_name = "wiki_corpus_2.01.tar.gz"
       data_path = cache_dir_path + base_name
@@ -114,6 +119,14 @@ def download_tar_gz
       data_path
     end
 
+    def with_increased_entity_expansion_text_limit
+      default_limit = REXML::Security.entity_expansion_text_limit
+      REXML::Security.entity_expansion_text_limit = ENTITY_EXPANSION_TEXT_LIMIT
+      yield
+    ensure
+      REXML::Security.entity_expansion_text_limit = default_limit
+    end
+
     class ArticleListener
       include REXML::StreamListener
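
The helper added in this commit follows a save, override, restore pattern around a global REXML setting. The following is a minimal standalone sketch of that pattern, not the code from this commit: the method name `with_entity_expansion_text_limit` and its `limit` parameter are hypothetical, introduced only for illustration. It assumes only that `REXML::Security.entity_expansion_text_limit` and its setter are available, as used in the diff above.

```ruby
require "rexml/document"

# Hypothetical generalization of the helper in this commit: the limit is
# passed as an argument instead of coming from a constant.
def with_entity_expansion_text_limit(limit)
  # Read the current value before entering the protected region, so that
  # the ensure clause always has a valid value to restore.
  original_limit = REXML::Security.entity_expansion_text_limit
  begin
    REXML::Security.entity_expansion_text_limit = limit
    yield
  ensure
    # Runs whether the block returns normally or raises.
    REXML::Security.entity_expansion_text_limit = original_limit
  end
end

before = REXML::Security.entity_expansion_text_limit

# Inside the block the raised limit is visible.
inside = with_entity_expansion_text_limit(163_840) do
  REXML::Security.entity_expansion_text_limit
end

# Even if the block raises, the previous limit is restored.
begin
  with_entity_expansion_text_limit(163_840) { raise "simulated parse failure" }
rescue RuntimeError
  # Swallow the simulated error; only the restored limit matters here.
end

after = REXML::Security.entity_expansion_text_limit
puts "limit inside block: #{inside}"
puts "limit restored: #{after == before}"
```

Because the setting lives on the `REXML::Security` module, the raised limit is visible to the whole process while the block runs, which is why restoring it in `ensure` matters.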
