Matrix and double-array representations for efficient finite state tokenization
Diewald, 2022
- Document ID
- 5890568513607868029
- Author
- Diewald N
- Publication year
- 2022
- Publication venue
- Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Snippet
This paper presents an algorithm and implementation for efficient tokenization of space-delimited languages based on a deterministic finite state automaton. Two representations of the underlying data structure are presented and a model implementation for German is …
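As a rough illustration of the idea in the snippet, the following minimal sketch drives a tokenizer from a small transition matrix of a deterministic finite state automaton. The character classes, state names, and the `tokenize` function are hypothetical and chosen only for this example; they are not taken from the paper, and the double-array representation is not shown.

```python
# Minimal sketch of DFA-driven tokenization with a transition matrix.
# All names below are illustrative assumptions, not the paper's implementation.

# Hypothetical character classes (matrix columns).
WORD, SPACE, PUNCT = 0, 1, 2

# Hypothetical automaton states (matrix rows).
OUTSIDE, IN_WORD = 0, 1

# TRANSITIONS[state][char_class] gives the next state.
TRANSITIONS = [
    # WORD     SPACE    PUNCT
    [IN_WORD, OUTSIDE, OUTSIDE],  # from OUTSIDE
    [IN_WORD, OUTSIDE, OUTSIDE],  # from IN_WORD
]


def char_class(ch: str) -> int:
    """Map a character to its column in the transition matrix."""
    if ch.isspace():
        return SPACE
    if ch.isalnum():
        return WORD
    return PUNCT


def tokenize(text: str) -> list:
    """Split text into word and punctuation tokens in one left-to-right pass."""
    tokens, buf, state = [], [], OUTSIDE
    for ch in text:
        cls = char_class(ch)
        nxt = TRANSITIONS[state][cls]
        # Leaving IN_WORD marks a token boundary: flush the buffered word.
        if state == IN_WORD and nxt == OUTSIDE:
            tokens.append("".join(buf))
            buf.clear()
        if cls == WORD:
            buf.append(ch)
        elif cls == PUNCT:
            # Each punctuation character becomes its own token.
            tokens.append(ch)
        state = nxt
    if buf:
        # Flush a word that runs to the end of the input.
        tokens.append("".join(buf))
    return tokens


print(tokenize("Der alte Mann, er schläft."))
```

Each input character costs one matrix lookup, which is why a dense table is a natural first representation; a real tokenizer for German would need many more states and classes (abbreviations, URLs, clitics) than this two-state toy.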
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2765—Recognition
- G06F17/277—Lexical analysis, e.g. tokenisation, collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/21—Text processing
- G06F17/22—Manipulating or registering by use of codes, e.g. in sequence of text characters
- G06F17/2217—Character encodings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2705—Parsing
- G06F17/271—Syntactic parsing, e.g. based on context-free grammar [CFG], unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2705—Parsing
- G06F17/2715—Statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/28—Processing or translating of natural language
- G06F17/2863—Processing of non-latin text
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/2795—Thesaurus; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/27—Automatic analysis, e.g. parsing
- G06F17/274—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/28—Processing or translating of natural language
- G06F17/2809—Data driven translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/21—Text processing
- G06F17/211—Formatting, i.e. changing of presentation of document
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/20—Handling natural language data
- G06F17/28—Processing or translating of natural language
- G06F17/2872—Rule based translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30943—Information retrieval; Database structures therefor; File system structures therefor details of database functions independent of the retrieved data type
- G06F17/30964—Querying
- G06F17/30979—Query processing
- G06F17/30985—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/3061—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2207/00—Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F2207/02—Indexing scheme relating to groups G06F7/02 - G06F7/026
- G06F2207/025—String search, i.e. pattern matching, e.g. find identical word or best match in a string
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US5610812A (en) | | Contextual tagger utilizing deterministic finite state transducer |
| Xue et al. | | The penn chinese treebank: Phrase structure annotation of a large corpus |
| US7552051B2 (en) | | Method and apparatus for mapping multiword expressions to identifiers using finite-state networks |
| US7191177B2 (en) | | Keyword extracting device |
| US20060047500A1 (en) | | Named entity recognition using compiler methods |
| JPH0351020B2 (en) | | |
| US20080208566A1 (en) | | Automated word-form transformation and part of speech tag assignment |
| Antony et al. | | Computational morphology and natural language parsing for Indian languages: a literature survey |
| Clark et al. | | Pre-processing very noisy text |
| US7346511B2 (en) | | Method and apparatus for recognizing multiword expressions |
| Jung et al. | | End-to-end Korean part-of-speech tagging using copying mechanism |
| Wong et al. | | iSentenizer‐μ: Multilingual Sentence Boundary Detection Model |
| Agbago et al. | | Truecasing for the Portage system |
| Elshafei | | Machine generation of Arabic diacritical marks |
| Diewald | | Matrix and double-array representations for efficient finite state tokenization |
| Nguyen et al. | | An in-depth analysis of OCR errors for unconstrained Vietnamese handwriting |
| Nguyen et al. | | Named entity recognition for Vietnamese |
| Zhou et al. | | A hybrid approach to Chinese word segmentation around CRFs |
| Hu et al. | | Chinese named entity recognition with CRFs: Two levels |
| Broda et al. | | Towards a set of general purpose morphosyntactic tools for Polish |
| KR100283100B1 (en) | | Statistical Application Extraction Method and Method for Massive Coral |
| JP2002351870A (en) | | Method for analyzing morpheme |
| AlGahtani et al. | | Joint Arabic segmentation and part-of-speech tagging |
| EP1429257B1 (en) | | Method and apparatus for recognizing multiword expressions |
| Sassano | | Using a partially annotated corpus to build a dependency parser for Japanese |