Hunmorph Open Source Word Analysis
0 Introduction
Word-level analysis and synthesis problems range from strict recognition and approximate matching to full morphological analysis and generation. Our technology is predicated on the observation that all of these problems are, when viewed algorithmically, very similar: the central problem is to dynamically analyze complex structures derived from some lexicon of base forms. Viewing word analysis routines as a unified problem means sharing the same codebase for a wider range of tasks, a design goal carried out by finding the parameters which optimize each of the analysis modes independently of the language-specific resources.

Figure 1: Architecture

Compiling accurate wide-coverage machine-readable dictionaries and coding the morphology of a language can be an extremely labor-intensive task, so the benefit expected from reusing the language-specific input database across tasks can hardly be overestimated. To facilitate this resource sharing and to enable systematic task-dependent optimizations from a central lexical knowledge base, we designed and implemented a powerful offline layer we call hunlex. Hunlex offers an easy-to-use general framework for describing the lexicon and morphology of any language. Using this description it can generate the language-specific aff/dic resources, optimized for the task at hand. The architecture of our toolkit is depicted in Figure 1. Our toolkit is released under a permissive LGPL-style license and can be freely downloaded from mokk.bme.hu/resources/hunmorph.

The rest of this paper is organized as follows. Section 1 is about the runtime layer of our toolkit. We discuss the algorithmic extensions and implementational enhancements in the C/C++ runtime layer over MySpell, and also describe the newly created Java port jmorph. Section 2 gives an overview of the offline layer hunlex. In Section 3 we consider the free open source software alternatives and offer our conclusions.

1 The runtime layer

Our development is a prime example of code reuse, which gives open source software development most of its power. Our codebase is a direct descendant of MySpell, a thread-safe C++ spell-checking library by Kevin Hendricks, which descends from Ispell (Peterson, 1980), which in turn goes back to Ralph Gorin's spell (1971), making it probably the oldest piece of linguistic software still in active use and development (see fmg-www.cs.ucla.edu/fmg-members/geoff/ispell.html).

The key operation supported by this codebase is affix stripping. Affix rules are specified in a static resource (the aff file) by a sequence of conditions, an append string, and a strip string: for example, in the rule forming the plural of body, the strip string would be y and the append string would be ies. The rules are reverse-applied to complex input wordforms: after the append string is stripped and the edge conditions are checked, a pseudo-stem is hypothesized by appending the strip string, and this pseudo-stem is then looked up in the base dictionary (the other static resource, called the dic file).

Lexical entries (base forms) are all associated with sets of affix flags, and affix flags in turn are associated with sets of affix rules. If the hypothesized base is found in the dictionary after the reverse application of an affix rule, the algorithm checks whether its flags contain the one that the affix rule is assigned to. This is a straight table-driven approach, where affix flags can be interpreted directly as lexical features that license entire subparts of morphological paradigms. To pick applicable affix rules efficiently, MySpell uses a fast indexing technique to check affixation conditions.
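Reverse application is easy to picture in code. The following minimal sketch, in Java and with hypothetical class and field names rather than the actual MySpell/hunmorph data structures, strips the append string, restores the strip string, checks the edge condition on the pseudo-stem, and accepts the candidate only if the dictionary entry carries the licensing flag:

// Minimal sketch of reverse affix-rule application with flag licensing.
// Class and field names are illustrative, not the actual MySpell/hunmorph API.
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

final class AffixRule {
    final char flag;          // the flag that licenses this rule
    final String strip;       // string to restore, e.g. "y"
    final String append;      // string stripped from the surface form, e.g. "ies"
    final Pattern condition;  // edge condition checked on the pseudo-stem
    AffixRule(char flag, String strip, String append, String condition) {
        this.flag = flag; this.strip = strip; this.append = append;
        this.condition = Pattern.compile(".*" + condition + "$");
    }
    // Reverse application: strip the append string, restore the strip string,
    // look the pseudo-stem up and check that its flag set licenses this rule.
    String reverseApply(String word, Map<String, Set<Character>> dictionary) {
        if (!word.endsWith(append)) return null;
        String stem = word.substring(0, word.length() - append.length()) + strip;
        if (!condition.matcher(stem).matches()) return null;
        Set<Character> flags = dictionary.get(stem);
        return (flags != null && flags.contains(flag)) ? stem : null;
    }
}

class Demo {
    public static void main(String[] args) {
        Map<String, Set<Character>> dic = Map.of("body", Set.of('P'));
        AffixRule plural = new AffixRule('P', "y", "ies", "[^aeiou]y");
        System.out.println(plural.reverseApply("bodies", dic)); // prints "body"
    }
}

A real resource holds many rules per flag, and MySpell additionally indexes them so that only rules whose conditions can possibly match are tried.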
In theory, affix rules should only specify genuine prefixes and suffixes to be stripped before lexical lookup. But in practice, for languages with rich morphology, the affix stripping mechanism is (ab)used to strip complex clusters of affix morphs in a single step. For instance, in Hungarian, due to productive combinations of derivational and inflectional affixation, a single nominal base can yield up to a million word forms. To treat all these combinations as affix clusters, legacy ispell resources for Hungarian required so many combined affix rule entries that the resource file sizes were not manageable.

To solve this problem we extended the affix stripping technique into a multistep method: after stripping an affix cluster in step i, the resulting pseudo-stem can be stripped of affix clusters in step i+1. Restrictions on rule application are checked with the help of flags associated with affixes analogously to lexical entries: this only required a minor modification of the data structure coding affix entries and a recursive call for affix stripping. By cross-checking the flags of prefixes on the suffix (as opposed to the stem only), simultaneous prefixation and suffixation can be made interdependent, extending the functionality to describe circumfixes like the German participle ge+t or the Hungarian superlative leg+bb, and in general to provide the correct handling of prefix-suffix dependencies like English undrinkable (cf. *undrink); see Németh et al. (2004) for more details.
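A rough sketch of the multistep extension, again with illustrative names rather than the real hunmorph data structures: stripping is attempted recursively, and an outer cluster is accepted only if whatever it was stripped from, a base or an inner cluster, carries the flag the outer rule requires:

// Sketch of multistep affix-cluster stripping: a cluster stripped in step i exposes a
// pseudo-stem that may itself be stripped in step i+1, until a dictionary base is reached.
// Affix rules carry flags just like lexical entries, so an outer cluster checks the flags
// of whatever it was stripped from (base or inner cluster). Illustrative names only.
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ClusterRule {
    final String append, strip;   // surface cluster to remove / material to restore
    final char requires;          // flag the inner stem (base or cluster) must carry
    final Set<Character> carries; // flags this cluster offers to an outer cluster
    ClusterRule(String append, String strip, char requires, Set<Character> carries) {
        this.append = append; this.strip = strip; this.requires = requires; this.carries = carries;
    }
}

class MultistepStripper {
    private final List<ClusterRule> rules;
    private final Map<String, Set<Character>> dictionary; // base -> flags

    MultistepStripper(List<ClusterRule> rules, Map<String, Set<Character>> dictionary) {
        this.rules = rules; this.dictionary = dictionary;
    }

    /** Flags carried by 'form' if it is analyzable within 'stepsLeft' strippings, else null. */
    Set<Character> flagsOf(String form, int stepsLeft) {
        Set<Character> base = dictionary.get(form);
        if (base != null) return base;
        if (stepsLeft == 0) return null;
        for (ClusterRule r : rules) {
            if (!form.endsWith(r.append)) continue;
            String stem = form.substring(0, form.length() - r.append.length()) + r.strip;
            Set<Character> inner = flagsOf(stem, stepsLeft - 1);   // recursive stripping call
            if (inner != null && inner.contains(r.requires)) return r.carries;
        }
        return null;
    }

    boolean recognize(String form, int maxSteps) { return flagsOf(form, maxSteps) != null; }
}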
Due to productive compounding in many languages, proper handling of composite bases is a feature indispensable for achieving wide coverage. Ispell incorporates the possibility of specifying lexical restrictions on compounding, implemented as switches in the base dictionary. However, the algorithm allows any affixed form of a base that has the relevant switch to be a potential member of a compound, which proves not to be restrictive enough. We have improved on this by introducing position-sensitive compounding. This means that lexical features can specify whether a base or affix can occur as the leftmost, rightmost or middle constituent of a compound, and whether it can only appear in compounds. Since these features can also be specified on affixes, this provides a welcome solution to a number of residual problems hitherto problematic for open-source spellcheckers. In some Germanic languages, 'fogemorphemes', morphemes which serve to link compound constituents, can now be handled easily by allowing position-specific compound licensing on the foge-affixes. Another important example is the German common noun: although it is capitalized in isolation, lowercase variants should be accepted when the noun is a compound constituent. By handling lowercasing as a prefix with the compound flag enabled, this phenomenon can be handled in the resource file without language-specific knowledge hard-wired into the codebase.
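The idea of position-sensitive compounding can be illustrated with a small sketch; the data shapes below are hypothetical and stand in for the flag-based encoding actually used in the resources:

// Sketch of position-sensitive compounding: instead of a single compound switch,
// each base (or affixed form) declares in which positions it may occur inside a
// compound. Names are illustrative, not the actual hunmorph data structures.
import java.util.EnumSet;

enum CompoundPosition { LEFTMOST, MIDDLE, RIGHTMOST }

final class Constituent {
    final String form;
    final EnumSet<CompoundPosition> allowed; // positions this constituent may fill
    final boolean onlyInCompounds;           // may not occur as a free-standing word
    Constituent(String form, EnumSet<CompoundPosition> allowed, boolean onlyInCompounds) {
        this.form = form; this.allowed = allowed; this.onlyInCompounds = onlyInCompounds;
    }
}

class CompoundChecker {
    /** A sequence of constituents is a well-formed compound if each member is
     *  licensed for the position it actually occupies. */
    static boolean wellFormed(Constituent[] parts) {
        if (parts.length < 2) return false;
        for (int i = 0; i < parts.length; i++) {
            CompoundPosition pos = (i == 0) ? CompoundPosition.LEFTMOST
                    : (i == parts.length - 1) ? CompoundPosition.RIGHTMOST
                    : CompoundPosition.MIDDLE;
            if (!parts[i].allowed.contains(pos)) return false;
        }
        return true;
    }
}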
1.1 From spellchecking to morphological analysis

We now turn to the extensions of the MySpell algorithm that were required to equip hunmorph with stemming and morphological analysis functionality. The core engine was extended with an optional output handling interface that can process arbitrary string tags associated with the affix rules read from the resources. Once this is done, simply outputting the stem found at the stage of dictionary lookup already yields a stemmer. In multistep affix stripping, registering the output information associated with the rules that apply renders the system capable of morphological analysis or other word annotation tasks. Thus the processing of output tags becomes a mode-dependent parameter that can be:

• switched off (spell-checking)

• turned on only for tag lookup in the dictionary (simple stemming)

• turned on fully to register tags with all rule applications (morphological analysis)

The single most important algorithmic aspect that distinguishes the recognition task from analysis is the handling of ambiguous structures. In the original MySpell design, identical bases are conflated, and once their affix-licensing switch-sets are merged, there is no way to tell them apart. The correct handling of homonyms is crucial for morphological analysis, since base ambiguities can sometimes be resolved by the affixes. Interestingly, our improvement made it possible to rule out homonymous bases with incorrect simultaneous prefixing and suffixing such as English out+number+'s. Earlier these could be handled only by lexical pregeneration of the relevant forms or by duplication of affixes.

Most importantly, ambiguity arises in relation to the number of analyses output by the system. While with spell-checking the algorithm can terminate after the first analysis found, performing an exhaustive search for all alternative analyses is a reasonable requirement in morphological analysis mode as well as in some stemming tasks. Thus the exploration of the search space also becomes an active parameter in our enhanced implementation of the algorithm (both mode parameters are sketched in code after the list):

• search until the first correct analysis

• search for restricted multiple analyses (e.g., disabling compounds)

• search for all alternative analyses
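The two mode-dependent parameters can be pictured as follows. The enum values mirror the two bullet lists above, while the analyzer interface and helper names are purely illustrative, not the hunmorph API:

// Sketch of the two mode-dependent parameters discussed above: how output tags are
// processed and how far the search space is explored.
import java.util.List;

enum TagMode { OFF, DICTIONARY_ONLY, FULL }     // spellchecking / stemming / analysis
enum SearchMode { FIRST, RESTRICTED, ALL }      // one, several, or all analyses

interface WordAnalyzer {
    /** Returns the analyses of a word form under the given modes. */
    List<String> analyze(String form, TagMode tagMode, SearchMode searchMode);
}

class Modes {
    static List<String> spellcheck(WordAnalyzer a, String w) {
        return a.analyze(w, TagMode.OFF, SearchMode.FIRST);             // accept on first hit
    }
    static List<String> stem(WordAnalyzer a, String w) {
        return a.analyze(w, TagMode.DICTIONARY_ONLY, SearchMode.FIRST); // stem from dictionary lookup
    }
    static List<String> morphAnalyze(WordAnalyzer a, String w) {
        return a.analyze(w, TagMode.FULL, SearchMode.ALL);              // exhaustive search
    }
}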
Search until the first analysis is the functionality needed for recognizers used in spell-checking and for stemming in accelerated document indexing. Preemption of potential compound analyses by existing lexical bases serves as a general way of filtering out spurious ambiguities when a reduction of the space of alternative analyses is required. In these cases, frequent compounds which trick the analyzer can be precompiled into the lexicon. Finally, there is the possibility of returning the full set of possible analyses. This output can then be passed to a tagger that disambiguates among the candidate analyses. Parameters can be used that guide the search (such as 'do lexical lookup first at all stages' or 'strip the shortest affix first'), which yield candidate rankings without the use of numerical weights or statistics. These rankings can be used as disambiguation heuristics based on a general idea of blocking (e.g., Times would block an analysis of time+s). All further parametrization is managed offline by the resource compiler layer; see Section 2.

1.2 Reimplementing the runtime layer

In our efforts to gear up the MySpell codebase into a fully functional word analysis library we identified various resource-related, algorithmic and implementational bottlenecks of the affix-rule based technology. With these lessons learned, a new project has been launched in order to provide an even more flexible and efficient open source runtime layer. A principled object-oriented refactoring of the same basic algorithm described above has already been implemented in Java. This port, called jmorph, also uses the aff/dic resource formats.

In jmorph, various algorithmic options guiding the search (shortest/longest matching affix) can be controlled for each individual rule. The implementation keeps track of affix and compound matches, checking conditions only once for a given substring and caching partial results. As a consequence, it ends up being measurably faster than the C++ implementation with the same resources.
The main loop of jmorph is driven by configuring consumers, i.e., objects which monitor the recursive step that is running. For example, the analysis of the form beszédesek 'talkative.PLUR' begins by inspecting the global configuration of the analysis: this initial consumer specifies how many analyses, and of what kind, need to be found. In Step 1, the initial consumer finds the rule that strips ek with stem beszédes, builds a consumer that can apply this rule to the output of the analysis returned by the next consumer, and launches the next step with this consumer and stem. In Step 2, this consumer finds the rule stripping es with stem beszéd, which is found in the lexicon. beszéd is not just a string: it is a complete lexical object which lists the rules that can apply to it and all its homonyms. The consumer creates a new analysis reflecting that beszédes is formed from beszéd by suffixing es (a suffix object), and passes this back to its parent consumer, which verifies whether the ek suffixation rule is applicable. If not, the Step 1 consumer requests further analyses from the Step 2 consumer. If, however, the answer is positive, the Step 1 consumer returns its analysis to the Step 0 (initial) consumer, which decides whether further analyses are needed.
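The consumer chain of the beszédesek example can be sketched as follows. The class shapes are only illustrative (jmorph's actual classes and signatures differ), but the control flow mirrors the description above: a lexicon hit bubbles up through the consumers, each of which verifies its pending rule before extending the analysis and passing it to its parent:

// Sketch of the consumer-driven control loop: each stripping step builds a consumer
// that waits for the analyses of the shorter pseudo-stem, checks whether its own rule
// applies to them, and passes the wrapped analysis on to its parent consumer.
import java.util.ArrayList;
import java.util.List;

final class Rule {
    final String suffix;                                    // e.g. "ek", "es"
    Rule(String suffix) { this.suffix = suffix; }
    boolean applicableTo(Analysis inner) { return true; }   // real code checks flags here
}

final class Analysis {
    final String base; final List<String> suffixes;
    Analysis(String base, List<String> suffixes) { this.base = base; this.suffixes = suffixes; }
    Analysis extend(String suffix) {
        List<String> s = new ArrayList<>(suffixes); s.add(suffix);
        return new Analysis(base, s);
    }
    public String toString() { return base + "+" + String.join("+", suffixes); }
}

class Consumer {
    private final Consumer parent;   // null for the initial (Step 0) consumer
    private final Rule pendingRule;  // rule this consumer wants to apply to inner analyses
    private final int wanted;        // how many analyses the initial consumer needs
    final List<Analysis> collected = new ArrayList<>();

    Consumer(Consumer parent, Rule pendingRule, int wanted) {
        this.parent = parent; this.pendingRule = pendingRule; this.wanted = wanted;
    }

    /** Called by a child consumer (or by dictionary lookup) with an analysis of the stem.
     *  Returns true if further analyses are still needed. */
    boolean offer(Analysis inner) {
        if (parent == null) {                       // Step 0: just collect results
            collected.add(inner);
            return collected.size() < wanted;
        }
        if (pendingRule.applicableTo(inner)) {      // verify our rule applies to this analysis
            return parent.offer(inner.extend(pendingRule.suffix));
        }
        return true;                                // not applicable: request further analyses
    }
}

class ConsumerDemo {
    public static void main(String[] args) {
        Consumer initial = new Consumer(null, null, 1);             // Step 0
        Consumer step1 = new Consumer(initial, new Rule("ek"), 1);  // strips ek, stem beszédes
        Consumer step2 = new Consumer(step1, new Rule("es"), 1);    // strips es, stem beszéd
        step2.offer(new Analysis("beszéd", new ArrayList<>()));     // lexicon hit bubbles up
        System.out.println(initial.collected);                      // [beszéd+es+ek]
    }
}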
In terms of functionality, there are a number of differences between the Java and the C++ variants. jmorph records the full parse tree of rule applications. By offering various ways of serializing this data structure, it allows for more structured information in the outputs than would be possible by simple concatenation of the tag chunks associated with the rules. Class-based restrictions on compounding are implemented and will eventually supersede the overgeneralizing position-based restrictions that the C++ variant and our resources currently use.

Two major additional features of jmorph are its capability for morphological synthesis and its ability to act as a guesser (hypothesizing lemmas). Synthesis is implemented by forward application of affix rules starting from the base. Rules have to be indexed by their tag chunks for the search, so synthesis introduces the non-trivial problem of chunking the input tag string. This is currently implemented by plug-ins for individual tag systems; however, it should ideally be precompiled offline, since the space of possible tags is limited.
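A minimal sketch of this synthesis mode, assuming a pre-chunked tag string and illustrative rule objects rather than the jmorph API:

// Sketch of synthesis by forward rule application: rules are indexed by the tag chunk
// they realize, and the matching rules are applied to the base in order.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SynthRule {
    final String tagChunk;          // e.g. "[PLUR]"
    final String strip, append;
    SynthRule(String tagChunk, String strip, String append) {
        this.tagChunk = tagChunk; this.strip = strip; this.append = append;
    }
    String forwardApply(String stem) {
        if (!stem.endsWith(strip)) return null;
        return stem.substring(0, stem.length() - strip.length()) + append;
    }
}

class Synthesizer {
    private final Map<String, List<SynthRule>> byChunk = new HashMap<>();
    Synthesizer(List<SynthRule> rules) {
        for (SynthRule r : rules)
            byChunk.computeIfAbsent(r.tagChunk, k -> new java.util.ArrayList<>()).add(r);
    }
    /** Generates a word form from a base and a sequence of tag chunks, or null. */
    String synthesize(String base, List<String> tagChunks) {
        String form = base;
        for (String chunk : tagChunks) {
            String next = null;
            for (SynthRule r : byChunk.getOrDefault(chunk, List.of())) {
                next = r.forwardApply(form);
                if (next != null) break;            // first applicable allomorph wins here
            }
            if (next == null) return null;
            form = next;
        }
        return form;
    }
    public static void main(String[] args) {
        Synthesizer s = new Synthesizer(List.of(
                new SynthRule("[PLUR]", "y", "ies"),
                new SynthRule("[PLUR]", "", "s")));
        System.out.println(s.synthesize("body", List.of("[PLUR]")));  // bodies
    }
}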
2 Resource development and offline precompilation

Due to the backward compatibility of the runtime layer with MySpell-style resources, our software can be used as a spellchecker and simplistic stemmer for some 50 languages for which MySpell resources are available; see lingucomponent.openoffice.org/spell_dic.html.

For languages with complex morphology, compiling and maintaining these resources is a painful undertaking. Without a unified framework for morphological description and a principled method of precompilation, resource developers for highly agglutinative languages like Hungarian (see magyarispell.sourceforge.net) have to resort to a maze of scripts to maintain and precompile the aff and dic files. This problem is intolerably magnified once morphological tags or additional lexicographic information are to be entered in order to provide resources for the analysis routines of our runtime layer.

The offline layer of our toolkit seeks to remedy this by offering a high-level description language in which grammar developers can specify rule-based morphologies and lexicons (somewhat in the spirit of lexc (Beesley and Karttunen, 2003), the frontend to Xerox's Finite State Toolkit). This promises rapid resource development which can then be used in various tasks. Once the primary resources are created, hunlex, the offline precompiler, can generate aff and dic resources optimized for the runtime layer based on various compile-time configurations.

Figure 2 illustrates the description language with a fragment of English morphology describing plural formation. Individual rules are separated by commas. The syntax of the rule descriptions is organized around the notion of information blocks. Blocks are introduced by keywords (like IF:) and allow the encoding of various properties of a rule (or a lexical entry), among others affixation (+es), substitution, character truncation before affixation (CLIP: 1), regular expression matches (MATCH: [^o]o), positive and negative lexical feature conditions on application (IF: f-v_altern), feature inheritance, output (continuation) references (OUT: PL_POSS), and output tags (TAG: "[PLUR]").

PL
TAG: "[PLUR]"
OUT: PL_POSS
# house -> houses
, +s MATCH: [^shoxy] IF: regular
# kiss -> kisses
, +es MATCH: [^c]s IF: regular
# ...
# ethics
, + MATCH: cs IF: regular
# body -> bodies; <C> is a regexp macro
, +ies MATCH: <C>y CLIP: 1 IF: regular
# zloty -> zlotys
, +s MATCH: <C>y IF: y-ys
# macro -> macros
, +s MATCH: [^o]o IF: regular
# potato -> potatoes
, +es MATCH: [^o]o IF: o-oes
# wife -> wives
, +ves MATCH: fe CLIP: 2 IF: f-ves
# leaf -> leaves
, +ves MATCH: f CLIP: 1 IF: f-ves
;

Figure 2: hunlex grammar fragment

One can specify the rules that may be applied to the output of a rule, and one can also specify application conditions on the input to a rule. These two possibilities allow for many different styles of morphological description: one based on input feature constraints, one based on continuation classes (paradigm indexes), and any combination between these two extremes. On top of this, regular expression matches on the input can also be used as conditions on rule application.

Affixation rules "grouped together" here under PLUR can be thought of as allomorphic rules of the plural morpheme. Practically, this allows information about the morpheme that is shared among the variants (e.g., morphological tag, recursion level, some output information) to be abstracted into a preamble which then serves as a default for the individual rules. Most importantly, the grouping of rules into morphemes serves to index those rules which can be referenced in output conditions. For example, in the fragment above the plural morpheme specifies that the plural possessive rules can be applied to its output (OUT: PL_POSS). This design makes it possible to handle some morphosyntactic dimensions (part of speech) cleanly separated from the conditions regulating the choice of allomorphs, since the latter can be taken care of by the input feature checking and pattern matching conditions of rules. The lexicon has the same syntax as the grammar, except that morphemes stand for lemmas and the variant rules within a morpheme correspond to stem allomorphs.

Rules with a zero affix morph can be used as filters that decorate their inputs with features based on their orthographic shape or other features present. This architecture enables one to let only the exceptions specify certain features in the lexicon, while regular words left unspecified are assigned a default feature by the filters (see PL_FILTER in Figure 3), potentially conditioned in the same way as any rule application. Feature inheritance is fully supported, that is, filters for particular dimensions of features (such as the plural filter in Figure 3) can be written as independent units. This design makes it possible to engineer sophisticated filter chains decorating lexical items with the various features relevant for their morphological behavior. With this at hand, extending the lexicon with a regular lexeme just boils down to specifying its base and part of speech. On the other hand, independent sets of filter rules make feature assignments transparent and maintainable.

In order to support concise and maintainable grammars, the description language also allows (potentially recursive) macros to abbreviate arbitrary sets of blocks or regular expressions, as illustrated in Figure 3.

REGEXP: C [bcdfgklmnprstvwxyz];

DEFINE: N
OUT: SG PL_FILTER
TAG: NOUN
;

PL_FILTER
OUT:
PL
FILTER:
f-ves
y-ys
o-oes
regular
, DEFAULT:
regular
;

Figure 3: Macros and filters in hunlex

The resource compiler hunlex is a stand-alone program written in OCaml which comes with a command-line as well as a Makefile toplevel control interface. The internal workings of hunlex are as follows. As the morphological grammar is parsed by the precompiler, rule objects are created. A block is read and parsed into functions which each transform the 'affix-rule' data structure by enriching its internal representation according to the semantic content of the block. At the end of each unit, the empty rule is passed to the composition of the block functions, which results in a specific rule. Thanks to OCaml's flexibility of function abstraction and composition, this design makes it easy to implement macros of arbitrary blocks directly as functions. When the grammar is parsed, rules are arranged in a directed (possibly cyclic) graph with edges representing possible rule applications as given by the output specifications.
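hunlex implements this in OCaml; purely to illustrate the idea, and keeping to Java as in the earlier sketches, blocks can be modelled as functions over a rule record, with macros arising naturally as compositions of blocks that are finally applied to an empty rule:

// Java analogy of the block-function design described above (hunlex itself is OCaml).
// All names are illustrative.
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class RuleRecord {
    String affix = "", match = "", tag = "";
    final List<String> out = new ArrayList<>();
    public String toString() { return "+" + affix + " MATCH:" + match + " TAG:" + tag + " OUT:" + out; }
}

class Blocks {
    static UnaryOperator<RuleRecord> affix(String a)  { return r -> { r.affix = a; return r; }; }
    static UnaryOperator<RuleRecord> match(String m)  { return r -> { r.match = m; return r; }; }
    static UnaryOperator<RuleRecord> tag(String t)    { return r -> { r.tag = t; return r; }; }
    static UnaryOperator<RuleRecord> out(String o)    { return r -> { r.out.add(o); return r; }; }

    // A macro is just another function composed out of blocks.
    static UnaryOperator<RuleRecord> regularPlural() {
        return compose(List.of(affix("s"), tag("[PLUR]"), out("PL_POSS")));
    }

    static UnaryOperator<RuleRecord> compose(List<UnaryOperator<RuleRecord>> blocks) {
        return r -> { for (UnaryOperator<RuleRecord> b : blocks) r = b.apply(r); return r; };
    }

    public static void main(String[] args) {
        // at the end of a unit, the composed block functions are applied to an empty rule
        RuleRecord rule = compose(List.of(regularPlural(), match("[^shoxy]"))).apply(new RuleRecord());
        System.out.println(rule);
    }
}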
Precompilation proceeds by performing a recursive closure on this graph, starting from the lexical nodes. Rules are indexed by 'levels', and contiguous rule nodes that are on the same level are merged along the edges if the constraints on rule application (feature and match conditions, etc.) are satisfied. These precompiled affix clusters and complex lexical items are to be placed in the aff and dic file, respectively.

Instead of affix merging, closure between rules a and b on different levels causes the affix clusters in the closure of b to be registered as rules in a hash and their indexes to be recorded on a. After the entire lexicon is read, these index sets registered on rules are considered. The affix cluster rules to be output into the affix file are arranged into maximal subsets such that if two output affix cluster rules a and b are in the same set, then every item or affix to which a can be applied, b can also be applied. These sets of affix clusters correspond to partial paradigms which each full paradigm either includes or is disjoint with. The resulting sets of output rules are each assigned to a flag, and items referencing them specify the appropriate combination of flags in the output dic and aff files. Since equivalent affix cluster rules are conflated, the compiled resources are always optimal in the following three ways.

First, the affix file is redundancy-free: no two affix rules have the same form. With hand-coded affix files this can almost never be guaranteed, since one is always inclined to group affix rules by linguistically motivated paradigms, thereby possibly duplicating entries. A redundancy-free set of affix rules enhances performance by minimizing the search space for affixes. Note that conflation of identical rules by the runtime layer is not possible without reindexing the flags, which would be computationally very intensive if done at runtime.

Second, given the redundancy-free affix set, maximizing the homogeneous rule sets assigned to a flag minimizes the number of flags used. Since the internal representation of flags depends on their number, this has the practical advantage of reducing the memory requirements of the runtime layer.

Third, the identity of output affix rules is calculated relative to the mode and configuration settings; therefore identical morphs with different morphological tags will be conflated for recognizers (spell-checking), where ambiguity is irrelevant, while for analysis they are kept apart. This is impossible to achieve without a precompilation stage. Note that finite-state transducer-based systems perform essentially the same type of optimizations, eliminating symbol redundancy when two symbols behave the same in every rule, and eliminating state redundancy when two states have exactly the same continuations.
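One way to picture the flag assignment, under the simplifying assumption that two rules are grouped exactly when they are referenced by the same set of items (the data shapes are illustrative, not the hunlex implementation):

// Sketch of flag assignment after precompilation: affix-cluster rules usable by exactly
// the same set of lexical items are grouped, each group receives one flag, and an item
// then lists the flags of the groups its paradigm is built from.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FlagAssigner {
    /** ruleUsers maps each affix-cluster rule to the set of items it applies to.
     *  Returns a map from flag to the group of rules that flag licenses. */
    static Map<Character, List<String>> assignFlags(Map<String, Set<String>> ruleUsers) {
        // group rules by their (identical) applicability sets
        Map<Set<String>, List<String>> groups = new LinkedHashMap<>();
        ruleUsers.forEach((rule, users) ->
                groups.computeIfAbsent(new TreeSet<>(users), k -> new java.util.ArrayList<>()).add(rule));
        // each group corresponds to a partial paradigm and is assigned one flag
        Map<Character, List<String>> flagToRules = new LinkedHashMap<>();
        char flag = 'A';
        for (List<String> rules : groups.values()) flagToRules.put(flag++, rules);
        return flagToRules;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> usage = new LinkedHashMap<>();
        usage.put("+s",   Set.of("dog", "cat", "body"));
        usage.put("+'s",  Set.of("dog", "cat", "body"));   // same users -> same flag as +s
        usage.put("+ies", Set.of("body"));                  // gets its own flag
        System.out.println(assignFlags(usage));
    }
}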
Though the bulk of the knowledge used by spellcheckers, by stemmers, and by morphological analysis and generation tools is shared (how affixes combine with stems, which words allow compounding), the ideal resources for these various tasks differ to some extent. Spellcheckers are meant to help one conform to orthographic norms and therefore should be error-sensitive, while stemmers and morphological analyzers are expected to be more robust and error-tolerant, especially towards common violations of standard use. Although this seems at first to justify the individual effort one has to invest in tailoring one's resources to the task at hand, most of the resource specifics are systematic and therefore allow for automatic fine-tuning from a central knowledge base. Configuration within hunlex allows the specification of various features, among others:

• selection of registers and degree of normativity based on usage qualifiers in the database (allows for boosting robustness for analysis, or sticking to normativity for synthesis and spell-checking)

• flexible selection of output information: choice of tagset for different encodings, support for sense indexes

• arbitrary selection of morphemes

• setting the levels of morphemes (grouping of morphs that are precompiled as a cluster to be stripped with one rule application by the runtime layer)

• fine-tuning which morphemes are stripped during stemming

• arbitrary selection of morphophonological features that are to be observed or ignored (allows for enhancing robustness by, e.g., tolerating non-standard regularizations)

The input description language allows arbitrary attributes (ones encoding part of speech, origin, register, etc.) to be specified in the description. Since any set of attributes can be selected to be compiled into the runtime resources, it takes no more than precompiling the central database with the appropriate configuration for the runtime analyzer to be used as an arbitrary word annotation tool, e.g., a style annotator or a part-of-speech tagger. We also provide an implementation of a feature-tree based tag language which we successfully used for the description of Hungarian morphology.

If the resources are created for some filtering task, say, extracting (possibly inflected) proper nouns from a text, the resource optimization described above can save considerable amounts of time compared to full analysis followed by post-processing. While the relevant portion of the dictionary might be easily filtered, thereby speeding up lookup, tailoring a corresponding redundancy-free affix file would be a hopeless enterprise without the precompiler.

As we mentioned, our offline layer can be configured to cluster any or no sets of affixes together on various levels, and therefore resources can be optimized either for memory use (affix-by-affix stripping) or for speed (generally towards one-level stripping). This is a major advantage given potential applications as diverse as spellchecking on the word processor of an old 386 at one end, and industrial-scale stemming of terabytes of web content for IR at the other.

In sum, our offline layer allows for the principled maintenance of a central resource, saving the redundant effort that would otherwise have to be invested in encoding very similar knowledge in a task-specific manner for each word-level analysis task.

3 Conclusion

The importance of word-level analysis can hardly be questioned: spellcheckers reach the extremely wide audience of all word processor users, stemmers are used in a variety of areas ranging from information retrieval to statistical machine translation, and for non-isolating languages morphological analysis is the initial phase of every natural language processing pipeline.

Over the past decades, two closely intertwined methods emerged to handle word analysis tasks: affix stripping and finite-state transducers (FSTs). Since both technologies can provide industrial-strength solutions for most tasks, when it comes to the choice of actual software and its practical use, the differences that have the greatest impact are not lodged in the algorithmic core. Rather, two other factors play a role: the ease with which one can integrate the software into applications, and the infrastructure offered to translate the knowledge of the grammarian into efficient and maintainable computational blocks.

To be sure, in an end-to-end machine learning paradigm, the mundane differences between how the systems interact with the human grammarians would not matter. But as long as the grammars are written and maintained by humans, an offline framework providing a high-level language to specify morphologies, and supporting configurable precompilation that allows for resource sharing across word-analysis tasks, addresses a major bottleneck in resource creation and management.

The Xerox Finite State Toolkit provides comprehensive high-level support for morphology and lexicon development (Beesley and Karttunen, 2003). These descriptions are compiled into minimal deterministic FSTs, which give excellent runtime performance and can also be extended to error-tolerant analysis for spellchecking (Oflazer, 1996). Nonetheless, XFST is not free software, and as long as the work is not driven by academic curiosity alone, the LGPL-style license of our toolkit, explicitly permitting reuse for commercial purposes as well, can already decide the choice.

There are other free open source analyzer technologies, either stand-alone analyzers such as the Stuttgart Finite State Toolkit (SFST, available only under the GPL, see www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html, Smid et al. (2004)), or parts of a powerful integrated NLP platform such as Intex/NooJ (freely available for academic research only to individuals affiliated with a university, see intex.univ-fcomte.fr; a clone called Unitex is available under LGPL, see www-igm.univ-mlv.fr/~unitex). Unfortunately, NooJ has its limitations when it comes to implementing complex morphologies (Vajda et al., 2004), and SFST provides no high-level offline component for grammar description and configurable resource creation.

We believe that the liberal license policy and the powerful offline layer contributed equally to the huge interest that our project has generated, in spite of its relative novelty. MySpell was not just our choice: it is also the spell-checking library incorporated into OpenOffice.org, a free open-source office suite with an ever wider circle of users. The Hungarian build of OpenOffice is already running our C++ runtime library, and OpenOffice is now considering replacing MySpell completely with our code. This would open up the possibility of introducing morphological analysis capabilities into the program, which in turn could serve as the first step towards enhanced grammar checking and hyphenation.
Though in-depth grammars and lexica are available for nearly as many languages in FST-based frameworks (InXight Corporation's LinguistX platform supports 31 languages), very little of this material is available for grammar hacking or open source dictionary development. In addition to the permissive license and easy-to-integrate infrastructure, the fact that the hunmorph routines are backward compatible with already existing and freely available spellchecking resources for some 50 languages goes a long way toward explaining its rapid spread.

For Hungarian, hunlex already serves as the development framework for the MORPHDB project, which merges three independently developed lexical databases by critically unifying their contents and supplying them with a comprehensive morphological grammar. It also provided a framework for our English morphology project, which used the XTAG morphological database for English (see ftp.cis.upenn.edu/pub/xtag/morph-1.5, Karp et al. (1992)). A project describing the morphology of the Beás dialect of Romani with hunlex is also under way.

The hunlex resource precompiler is not architecturally bound to the aff/dic format used by our toolkit, and we are investigating the possibility of generating FST resources with it. This would decouple the offline layer of our toolkit from the details of the runtime technology, and would be an important step towards a unified open source solution for method-independent resource development for word analysis software.

Acknowledgements

The development of hunmorph and hunlex is financially supported by the Hungarian Ministry of Informatics and Telecommunications and by MATÁV Telecom Co., and is led by the Centre for Media Research and Education at the Budapest University of Technology. We would like to thank the anonymous reviewers of the ACL Software Workshop for their valuable comments on this paper.

Availability

References

Kenneth R. Beesley and Lauri Karttunen. 2003. Finite State Morphology. CSLI Publications.

Daniel Karp, Yves Schabes, Martin Zaidel, and Dania Egedi. 1992. A freely available wide coverage morphological analyzer for English. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), Nantes, France.

László Németh, Viktor Trón, Péter Halácsy, András Kornai, András Rung, and István Szakadát. 2004. Leveraging the open-source ispell codebase for minority language analysis. In Proceedings of SALTMIL 2004. European Language Resources Association. URL http://www.lrec-conf.org/lrec2004.

Kemal Oflazer. 1996. Error-tolerant finite-state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, 22(1):73–89.

James Lyle Peterson. 1980. Computer Programs for Spelling Correction: An Experiment in Program Design, volume 96 of Lecture Notes in Computer Science. Springer.

Helmut Smid, Arne Fitschen, and Ulrich Heid. 2004. SMOR: A German computational morphology covering derivation, composition, and inflection. In Proceedings of the IVth International Conference on Language Resources and Evaluation (LREC 2004), pages 1263–1266.

Péter Vajda, Viktor Nagy, and Emília Dancsecs. 2004. A Ragozási szótártól a NooJ morfológiai moduljáig [From a morphological dictionary to a morphological module for NooJ]. In 2nd Hungarian Computational Linguistics Conference, pages 183–190.