
HK1161646A1 - Automatic context sensitive language generation, correction and enhancement using an internet corpus - Google Patents


Info

Publication number
HK1161646A1
Authority
HK
Hong Kong
Prior art keywords
correction
words
word
input
sentence
Prior art date
Application number
HK12101697.0A
Other languages
Chinese (zh)
Other versions
HK1161646B (en)
Inventor
Y. Karov Zangvil (Y‧卡羅夫贊格威爾)
Original Assignee
Ginger Software Ltd. (金格軟體有限公司)
Priority date
Filing date
Publication date
Priority claimed from PCT/IL2008/001051 (WO2009016631A2)
Application filed by Ginger Software Ltd.
Priority claimed from PCT/IL2009/000130 (WO2010013228A1)
Publication of HK1161646A1
Publication of HK1161646B

Landscapes

  • Machine Translation (AREA)

Description

Automatic context sensitive language generation, correction and enhancement using an internet corpus
Reference to related applications
Reference is hereby made to U.S. Provisional Patent Application No. 60/953,209, filed August 1, 2007 and entitled "METHODS FOR CONTEXT SENSITIVE ERROR DETECTION AND CORRECTION", and to PCT Patent Application No. PCT/IL2008/001051, filed July 31, 2008, the disclosures of which are hereby incorporated by reference and priority of which is hereby claimed pursuant to 37 CFR 1.78(a)(4) and (5)(i).
Technical Field
The present invention relates generally to computer-assisted language generation and correction, and more particularly to computer-assisted language generation and correction suitable for machine translation.
Background
The following publications are believed to represent the prior art:
U.S. Patent Nos. 5,659,771; 5,907,839; 6,424,983; 7,296,019; 5,956,739 and 4,674,065; and
U.S. Published Patent Application Nos. 2006/0247914 and 2007/0106937.
Disclosure of Invention
The present invention seeks to provide improved systems and functionality for computer-assisted language generation.
According to a preferred embodiment of the present invention, there is provided a computer-assisted language generation system including:
a sentence retrieval function operating based on an input text containing words to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text; and
a sentence generation function operating using a plurality of sentences retrieved from the Internet corpus by the sentence retrieval function to generate at least one correct sentence expressing the input text.
Preferably, the sentence retrieval functionality comprises:
an independent phrase generator for dividing the input text into one or more independent phrases;
a stem generator and classifier for operating on each independent phrase to generate stems of the words occurring therein and to assign importance weights thereto; and
a replacement generator for generating replacement stems corresponding to the stems.
According to a preferred embodiment of the present invention, the computer-assisted language generation system further comprises a stem-to-sentence index that interacts with the internet corpus to retrieve the plurality of sentences containing words corresponding to the words in the input text.
Preferably, the sentence generation functionality comprises:
a sentence simplification function for simplifying the sentences retrieved from the internet corpus;
a simplified sentence grouping function for grouping similar simplified sentences provided by the sentence simplification function; and
a simplified sentence group ranking function for ranking the groups of similar simplified sentences.
According to a preferred embodiment of the invention, the simplified sentence group ranking function operates using at least some of the following criteria:
A. the number of simplified sentences contained in the group;
B. the degree of correspondence between the stems of words in the group and the stems in the independent phrase and their replacements;
C. the degree to which the group includes words that do not correspond to the words in the independent phrase or their replacements.
Preferably, the simplified sentence group ranking function operates using at least part of the following process:
defining weights for the stems to indicate the importance of the corresponding words in the language;
calculating a positive matching grade corresponding to criterion B;
calculating a negative matching grade corresponding to criterion C; and
calculating a composite grade based on:
the number of simplified sentences contained in the group, corresponding to criterion A;
the positive matching grade; and
the negative matching grade.
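By way of illustration only, the following sketch shows one way such a composite grade could be computed; the data layout, helper names and the size weighting are assumptions of this sketch, not taken from the text.

```python
from dataclasses import dataclass

@dataclass
class SimplifiedSentenceGroup:
    sentences: list        # similar simplified sentences in the group
    matched_stems: set     # stems corresponding to phrase stems or their replacements (criterion B)
    unmatched_stems: set   # stems with no counterpart in the independent phrase (criterion C)

def composite_grade(group, stem_weights, size_weight=0.2):
    """Combine criterion A (group size), B (positive match) and C (negative match)."""
    positive = sum(stem_weights.get(s, 0.0) for s in group.matched_stems)    # criterion B
    negative = sum(stem_weights.get(s, 0.0) for s in group.unmatched_stems)  # criterion C
    return size_weight * len(group.sentences) + positive - negative         # A + B - C
```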
According to an embodiment of the invention, the computer-assisted language generation system further comprises a machine translation function for providing the input text.
According to a preferred embodiment of the present invention, there is provided a machine translation system including:
a machine translation function;
a sentence retrieval function operating based on an input text provided by the machine translation function to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to words in the input text; and
a sentence generation function operating using a plurality of sentences retrieved by the sentence retrieval function from the internet corpus to generate at least one correct sentence expressing the input text generated by the machine translation function.
Preferably, the machine translation function provides a plurality of alternatives corresponding to words in the input text, and the sentence retrieval function is for retrieving a plurality of sentences including words corresponding to the alternatives from the internet corpus.
According to an embodiment of the invention, the language generation includes text correction.
According to a preferred embodiment of the present invention, there is provided a text correction system including:
a sentence retrieval function that operates based on an input text provided by the text correction function to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text; and
a sentence correction function operating using a plurality of sentences retrieved from the Internet corpus by the sentence retrieval function to generate at least one correct sentence expressing the input text.
Preferably, the system further comprises a sentence search function for providing the input text based on a query word input by a user.
According to a preferred embodiment of the present invention, there is provided a sentence search system including:
a sentence search function for providing input text based on a query word input by a user;
a sentence retrieval function operating based on the input text provided by the sentence search function to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text; and
a sentence generation function operating using a plurality of sentences retrieved from the Internet corpus by the sentence retrieval function to generate at least one correct sentence expressing the input text generated by the sentence search function.
Preferably, the computer-assisted language generation system further comprises a speech to text conversion function for providing the input text.
According to a preferred embodiment of the present invention, there is provided a speech-to-text conversion system including:
a speech to text conversion function for providing input text;
a sentence retrieval function operating based on the input text provided by the speech to text conversion function to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text; and
a sentence generation function operating using a plurality of sentences retrieved by the sentence retrieval function from the Internet corpus to generate at least one correct sentence expressing the input text generated by the speech-to-text conversion function.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a substitution generator to generate a text-based representation based on an input sentence, the text-based representation providing a plurality of substitutions for each of a plurality of words in the sentence; a selector to select among at least the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and a correction generator for providing a correction output based on the selection made by the selector.
Preferably, the selector is for making the selection based on at least one of the following correction functions: spelling correction; correcting misused words; correcting grammar; and vocabulary enhancement.
According to a preferred embodiment of the invention, the selector is adapted to make the selection based on at least two of the following correction functions: spelling correction; correcting misused words; correcting grammar; and vocabulary enhancement.
Further, the selector is to make the selection based on at least one of the following orderings of corrections: spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement; and misused word correction and grammar correction prior to vocabulary enhancement.
Additionally or alternatively, the input sentence is provided by one of the following functions: a word processor function; a machine translation function; a speech to text conversion function; an optical character recognition function; and an instant messaging function; and the selector is to make the selection based on at least one of the following correction functions: correcting misused words; correcting grammar; and vocabulary enhancement.
Preferably, the correction generator comprises a corrected language input generator for providing a corrected language output based on a selection made by the selector without requiring user intervention. Additionally or alternatively, the grammar correction functionality includes at least one of punctuation correction functionality, verb inflection change correction functionality, singular/plural correction functionality, article correction functionality, and preposition correction functionality.
According to a preferred embodiment of the present invention, the grammar correction function includes at least one of a substitution correction function, an insertion correction function, and an omission correction function.
Preferably, the selector comprises a context-based scoring function for ranking the plurality of alternatives based at least in part on a frequency of occurrence of a Contextual Feature Sequence (CFS) in an internet corpus. Additionally, the context-based scoring function is also for ranking the plurality of alternatives based at least in part on normalized CFS frequency of occurrence in the Internet corpus.
The various embodiments outlined above may be combined or include a computer-assisted language correction system that includes at least one of spelling correction functionality, misused word correction functionality, grammar correction functionality and vocabulary enhancement functionality; and a contextual feature sequence function that works in conjunction with at least one of the spelling correction function, the misused word correction function, the grammar correction function and the vocabulary enhancement function and uses an internet corpus.
Preferably, the grammar correction function includes at least one of a punctuation correction function, a verb inflection change correction function, a singular/plural correction function, an article correction function, and a preposition correction function. Additionally or alternatively, the grammar correction functionality includes at least one of a substitution correction functionality, an insertion correction functionality, and an omission correction functionality.
According to a preferred embodiment of the present invention, the computer-assisted language correction system includes at least two of the spelling correction function, the misused word correction function, the grammar correction function and the vocabulary enhancement function, and wherein the contextual feature sequence function works in conjunction with at least two of the spelling correction function, the misused word correction function, the grammar correction function and the vocabulary enhancement function and uses an internet corpus.
Preferably, the computer-assisted language correction system further comprises at least three of the spelling correction function, the misused word correction function, the grammar correction function and the vocabulary enhancement function, and wherein the contextual feature sequence function works in conjunction with at least three of the spelling correction function, the misused word correction function, the grammar correction function and the vocabulary enhancement function and uses an internet corpus.
According to a preferred embodiment of the present invention, the computer-assisted language correction system further comprises the spelling correction functionality; the misused word correction function; the grammar correction function; and the vocabulary enhancement function, and wherein the contextual feature sequence function works in conjunction with the spelling correction function, the misused word correction function, the grammar correction function, and the vocabulary enhancement function and uses an internet corpus.
Preferably, the correction generator comprises a corrected language generator for providing a corrected language output based on a selection made by the selector without requiring user intervention.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a substitution generator to generate a text-based representation based on the language input, the text-based representation providing a plurality of substitutions for each of a plurality of words in the sentence; a selector to select at least among a plurality of alternatives for each of the plurality of words in the linguistic input based at least in part on relationships among selected ones of the plurality of alternatives for at least some of the plurality of words in the linguistic input; and a correction generator for providing a correction output based on the selection made by the selector.
Preferably, the language input includes at least one of an input sentence and an input text. Additionally or alternatively, the language input is speech and the generator converts the language input in speech form into a text-based representation that provides a plurality of alternatives for a plurality of words in the language input.
According to a preferred embodiment of the invention, the language input is at least one of: inputting a text; output of the optical character recognition function; the output of the machine translation function; and an output of a word processing function, and the generator converts the language input in text form into a text-based representation that provides a plurality of alternatives for a plurality of words in the language input.
Preferably, the selector is for making the selection based on at least two of the following correction functions: spelling correction; correcting misused words; correcting grammar; and vocabulary enhancement. Further, the selector is to make the selection based on at least one of the following orderings of corrections: spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement; and misused word correction and grammar correction prior to vocabulary enhancement.
According to a preferred embodiment of the invention, the language input is speech and the selector is adapted to make the selection based on at least one of the following correction functions: correcting misused words; correcting grammar; and vocabulary enhancement.
Preferably, the selector is operable to make the selection by performing at least two of the following functions: selecting a first set or combination of words comprising fewer words than all of the plurality of words used for initial selection in the language input; then, ordering the elements of the first word set or word combination to establish a selected priority; and thereafter, when selecting among a plurality of alternatives for elements of the first set of words, selecting other words of the plurality of words but not all words as context to affect the selection. Additionally or alternatively, the selector is to make the selection by performing the following functions: when selecting an element having at least two words, evaluating each of the plurality of alternatives for each of the at least two words in conjunction with each of the plurality of alternatives for another of the at least two words.
According to a preferred embodiment of the present invention, the correction generator comprises a corrected language input generator for providing a corrected language output based on a selection made by the selector without requiring user intervention.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a misused word suspector for evaluating at least a majority of words in a linguistic input based on their suitability in the context of the linguistic input; and a correction generator to provide a correction output based at least in part on the evaluation performed by the suspector.
Preferably, the computer-assisted language correction system further comprises: a substitution generator for generating a text-based representation based on the language input, the text-based representation providing a plurality of substitutions of at least one of the at least most words in the language input; and a selector for selecting at least among said plurality of alternatives for each of said at least one of said at least most words in said linguistic input; and wherein the correction generator is to provide the correction output based on the selection made by the selector. Additionally or alternatively, the computer-assisted language correction system further comprises: a suspect word output indicator for indicating a degree to which at least some of the at least most words in the linguistic input are suspected of being misused words.
According to a preferred embodiment of the present invention, the correction generator comprises an auto-correcting language generator for providing a corrected text output based at least in part on the evaluation performed by the suspector, without requiring user intervention.
Preferably, the language input is speech and the selector is for making the selection based on at least one of the following corrective functions: correcting misused words; correcting grammar; and vocabulary enhancement.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a misused word suspector for evaluating words in the linguistic input; a substitution generator for generating a plurality of substitutions of at least some of the words in the linguistic input that are evaluated as suspect words by the suspector, at least one of the plurality of substitutions of a word in the linguistic input being consistent with contextual characteristics of the word in the linguistic input in an internet corpus; a selector for selecting at least among said plurality of alternatives; and a correction generator to provide a correction output based at least in part on the selection made by the selector.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a misused word suspector for evaluating words in the linguistic input and identifying suspect words; a substitution generator for generating a plurality of substitutions of the suspect word; a selector for ranking each of the suspect words and ones of the plurality of alternatives generated therefor by the alternative generator according to a plurality of selection criteria and applying a bias in favor of the suspect words with respect to the ones of the plurality of alternatives generated therefor by the alternative generator; and a correction generator to provide a correction output based at least in part on the selection made by the selector.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a substitution generator for generating a plurality of substitutions of at least one word in an input based on the input; a selector for ranking each of the at least one word and some of the plurality of alternatives generated therefor by the alternative generator according to a plurality of selection criteria and applying a bias in favor of the at least one word relative to some of the plurality of alternatives generated therefor by the alternative generator, the bias being a function of an input uncertainty measure for indicating an uncertainty of a person providing the input; and a correction generator for providing a correction output based on the selection made by the selector.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a wrong-word suspector for evaluating at least a majority of words in a speech input, said suspector being responsive, at least in part, to an input uncertainty measure indicative of an uncertainty of a person providing said input, said suspector providing a suspected wrong-word output; and a substitution generator for generating a plurality of substitutions of the suspected error word identified by the suspected error word output; a selector for selecting between each suspect wrong word and the plurality of alternatives generated by the alternative generator; and a correction generator for providing a correction output based on the selection made by the selector.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: at least one of a spelling correction module, a misused word correction module, a grammar correction module, and a vocabulary enhancement module for receiving a multi-word input and providing a correction output, each of the at least one of a spelling correction module, a misused word correction module, a grammar correction module, and a vocabulary enhancement module including a replacement word candidate generator comprising: a phonetic similarity function to propose replacement words based on phonetic similarity to words in the input and to indicate a measure of phonetic similarity; and a string similarity function to propose replacement words based on string similarity to words in the input and to indicate a measure of string similarity for each replacement word; and a selector for selecting either a word in the input or a replacement word candidate proposed by the replacement word candidate generator by using the measure of phonetic similarity and the measure of string similarity together with a context-based selection function.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a suspect word recognition function for receiving a multi-word language input and providing a suspect word output indicative of a suspect word; a feature recognition function for recognizing features including the suspicious word; a replacement selector for identifying a replacement for the suspect word; a feature appearance function for using a corpus and providing an appearance output that ranks each feature including the replacement according to a frequency of use of the each feature in the corpus; and a selector for providing a correction output using the occurrence output, the feature recognition function comprising a feature filtering function comprising at least one of: a function for eliminating features containing suspected errors; a function for negatively biasing features that contain words introduced in early correction iterations of the multi-word input and that have a confidence level that is less than a predetermined threshold of confidence level; and a function for eliminating a feature contained in another feature having an occurrence frequency greater than a predetermined frequency threshold.
Preferably, the selector is for making the selection based on at least two of the following correction functions: spelling correction; correcting misused words; correcting grammar; and vocabulary enhancement. Further, the selector is to make the selection based on at least one of the following orderings of corrections: spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement; and misused word correction and grammar correction prior to vocabulary enhancement.
According to a preferred embodiment of the invention, the language input is speech and the selector is adapted to make the selection based on at least one of the following correction functions: correcting misused words; correcting grammar; and vocabulary enhancement.
Preferably, the correction generator comprises a corrected language input generator for providing a corrected language output based on a selection made by the selector without requiring user intervention.
According to a preferred embodiment of the present invention, the selector is also for making the selection based at least in part on a user input uncertainty metric. Additionally, the user input uncertainty metric is a function based on an uncertainty measure of the input provided by the person. Additionally or alternatively, the selector also uses a user input history learning function.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a suspect word recognition function for receiving a multi-word language input and providing a suspect word output indicative of a suspect word; a feature recognition function for recognizing features including the suspicious word; a replacement selector for identifying a replacement for the suspect word; a function of using a corpus and providing an appearance output that ranks features including the replacement according to their frequency of use in the corpus; and a correction output generator for providing a correction output using the occurrence output, the feature recognition function including at least one of: an N-gram recognition function and a co-occurrence recognition function, and at least one of a skip-over-grammar recognition function, a conversion-grammar recognition function, and a user previous-use feature recognition function.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a grammar error suspector for evaluating at least a majority of words in a linguistic input based on their suitability in a context of the linguistic input; and a correction generator to provide a correction output based at least in part on the evaluation performed by the suspector.
Preferably, the computer-assisted language correction system further comprises: a substitution generator for generating a text-based representation based on the language input, the text-based representation providing a plurality of substitutions of at least one of the at least most words in the language input; and a selector for selecting at least among the plurality of alternatives for each of the at least one of the at least most words in the linguistic input, and wherein the correction generator is for providing the correction output based on the selection made by the selector.
According to a preferred embodiment of the present invention, the computer-assisted language correction system further comprises: a suspect word output indicator for indicating a degree to which at least some of the at least most words in the linguistic input are suspected to contain grammatical errors.
Preferably, the correction generator comprises an auto-correcting language generator for providing a corrected text output based at least in part on the evaluation performed by the suspector, without requiring user intervention.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a grammar error suspector for evaluating words in the language input; a substitution generator for generating a plurality of substitutions of at least some of the words in the linguistic input that are evaluated by the suspector as suspect words, at least one of the plurality of substitutions of words in the linguistic input being consistent with contextual characteristics of the words in the linguistic input; a selector for selecting at least among the plurality of alternatives; and a correction generator to provide a correction output based at least in part on the selection made by the selector.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a grammar error suspector for evaluating words in the linguistic input and identifying suspect words; a substitution generator for generating a plurality of substitutions of the suspect word; a selector for ranking each of the suspect words and ones of the plurality of alternatives generated therefor by the alternative generator according to a plurality of selection criteria and applying a bias in favor of the suspect words with respect to the ones of the plurality of alternatives generated therefor by the alternative generator; and a correction generator to provide a correction output based at least in part on the selection made by the selector.
Preferably, the correction generator comprises a corrected language input generator for providing a corrected language output based on a selection made by the selector without requiring user intervention.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising context-based scoring of each replacement correction, based at least in part on a frequency of occurrence of Contextual Feature Sequences (CFSs) in an internet corpus.
Preferably, the computer-assisted language correction system further comprises at least one of the following functions working in conjunction with the context-based scoring: a spelling correction function; a misused word correction function; a grammar correction function; and vocabulary enhancement functions.
According to a preferred embodiment of the present invention, the context-based scoring is also based at least in part on a normalized frequency of occurrence of CFSs in an internet corpus. Additionally or alternatively, the context-based score is also based at least in part on a CFS importance score. Additionally, the CFS importance score is a function of at least one of: the operation of part-of-speech tagging and sentence analysis functions; the length of the CFS; the frequency of occurrence of each word in the CFS; and the CFS type.
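The text lists these factors without giving a formula. The sketch below shows one way they might be combined; the type weights, the rarity term and the combination itself are assumptions made purely for illustration.

```python
# Purely illustrative weights; the text does not specify a formula.
CFS_TYPE_WEIGHT = {"ngram": 1.0, "skip-gram": 0.8, "switch-gram": 0.8,
                   "co-occurrence": 0.5, "previous-use": 0.5}

def cfs_importance_score(cfs_words, cfs_type, word_frequency, has_content_pos):
    """Factors listed above: POS/sentence-analysis check, CFS length,
    per-word frequency (rarer words weigh more) and CFS type."""
    length_factor = len(cfs_words)
    # word_frequency values are assumed normalized to the 0..1 range
    rarity = sum(1.0 - min(word_frequency.get(w, 0.0), 1.0) for w in cfs_words) / len(cfs_words)
    pos_factor = 1.0 if has_content_pos else 0.3
    return CFS_TYPE_WEIGHT.get(cfs_type, 0.5) * length_factor * rarity * pos_factor
```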
According to another preferred embodiment of the present invention, there is provided a computer-assisted language correction system including vocabulary enhancement functions, the vocabulary enhancement functions including: a vocabulary challenged word recognition function; a replacement vocabulary enhancement generation function; and a context-based scoring function based at least in part on a frequency of occurrence of Contextual Feature Sequences (CFSs) in an internet corpus, the alternative vocabulary enhancement generation function comprising a synonym dictionary pre-processing function for generating alternative vocabulary enhancements.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a substitution generator to generate a text-based representation based on an input sentence, the text-based representation providing a plurality of substitutions for each of a plurality of words in the sentence; a selector for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence; a confidence assigner for assigning a confidence to the alternatives selected from the plurality of alternatives; and a correction generator to provide a correction output based on the selection made by the selector and based at least in part on the confidence level.
Preferably, the plurality of alternatives are evaluated based on a Context Feature Sequence (CFS) and the confidence level is based on at least one of the following parameters: the number, type, and rating of CFSs selected; a measure of statistical significance of the frequency of occurrence of the plurality of substitutions in the context of the CFS; a degree of correspondence in selection of one of the plurality of alternatives based on the preference metric for each of the CFSs and based on the word similarity scores for the plurality of alternatives; a non-contextual similarity score for the one of the plurality of alternatives above a first predetermined minimum threshold; and a degree of availability of context data, the degree being indicated by a number of CFSs having CFS scores greater than a second predetermined minimum threshold and having preference scores above a third predetermined threshold.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a punctuation error suspector for evaluating at least some words and punctuation in a linguistic input based on their suitability within the context of the linguistic input, as indicated by their frequency of occurrence in an internet corpus; and a correction generator to provide a correction output based at least in part on the evaluation performed by the suspector.
Preferably, the correction generator comprises at least one of the following functions: a missing punctuation correction function, a redundant punctuation correction function, and a punctuation replacement correction function.
The various embodiments outlined above may be combined or include a computer-assisted language correction system comprising: a grammatical element error suspector for evaluating at least some words in a linguistic input based on their suitability within the context of the linguistic input, as indicated by the frequency of occurrence of feature grammars of the linguistic input in an internet corpus; and a correction generator to provide a correction output based at least in part on the evaluation performed by the suspector.
Preferably, the correction generator comprises at least one of the following functions: a missing grammatical element correction function, a redundant grammatical element correction function, and a grammatical element replacement correction function. Additionally or alternatively, the grammatical element is one of an article, a preposition, and a conjunction.
Drawings
The present invention will be more fully understood and appreciated from the following detailed description taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a simplified block diagram illustration of a system and functionality for computer-assisted language correction, constructed and operative in accordance with a preferred embodiment of the present invention;
FIG. 2 is a simplified flow diagram illustrating a spelling correction function preferably used in the system and function of FIG. 1;
FIG. 3 is a simplified flow diagram illustrating a misused word and grammar correction function preferably used in the system and function of FIG. 1;
FIG. 4 is a simplified flow diagram illustrating vocabulary enhancement functions preferably used in the system and functions of FIG. 1;
FIG. 5 is a simplified block diagram illustrating a Contextual Feature Sequence (CFS) function preferably used in the system and function of FIG. 1;
FIG. 6A is a simplified flowchart illustrating a spelling correction function forming part of the functionality of FIG. 2 in accordance with a preferred embodiment of the present invention;
FIG. 6B is a simplified flowchart illustrating the misused word and grammar correction functionality forming part of the functionality of FIG. 3 in accordance with a preferred embodiment of the present invention;
FIG. 6C is a simplified flow diagram illustrating vocabulary enhancement functions forming part of the functions of FIG. 4 in accordance with a preferred embodiment of the present invention;
FIG. 7A is a simplified flow diagram illustrating functionality useful in the functionality of FIGS. 2 and 3 for generating replacement corrections;
FIG. 7B is a simplified flow diagram illustrating functionality useful in the functionality of FIG. 4 for generating replacement enhancements;
FIG. 8 is a simplified flow diagram illustrating functionality for non-contextual word similarity based scoring of individual replacement corrections and preferably contextual scoring using an Internet corpus, useful in the spelling correction functionality of FIG. 2;
FIG. 9 is a simplified flow diagram illustrating functionality for non-contextual word similarity based scoring of individual replacement corrections and preferably contextual scoring using an Internet corpus, useful in the misused word and grammar correction functionality of FIGS. 3, 10 and 11 and in the vocabulary enhancement functionality of FIG. 4;
FIG. 10 is a simplified flowchart illustrating the operation of the missing article, preposition, and punctuation correction functions;
FIG. 11 is a simplified flowchart illustrating the operation of the superfluous article, preposition, and punctuation correction functionality;
FIG. 12 is a simplified block diagram illustration of a system and functionality for computer-assisted language translation and generation, constructed and operative in accordance with a preferred embodiment of the present invention;
FIG. 13 is a simplified flow diagram of sentence retrieval functionality which preferably forms part of the system and functionality of FIG. 12;
FIGS. 14A and 14B together are a simplified flow diagram illustrating a sentence generation function that preferably forms part of the system and function of FIG. 12; and
FIG. 15 is a simplified flow diagram illustrating functionality for generating alternatives, useful in the functionality of FIGS. 13, 14A and 14B.
Detailed Description
Referring now to FIG. 1, FIG. 1 is a simplified block diagram illustration of a system and functionality for computer-assisted language correction, constructed and operative in accordance with a preferred embodiment of the present invention. As seen in FIG. 1, text for correction is provided to language correction module 100 from one or more sources, including, without limitation, word processor functionality 102, machine translation functionality 104, speech-to-text conversion functionality 106, optical character recognition functionality 108, and any other source of text 110, such as instant messaging or the Internet.
The language correction module 100 preferably includes a spelling correction function 112, a misused word and grammar correction function 114 and a vocabulary enhancement function 116.
One particular feature of the present invention is that each of the spelling correction function 112, the misused word and grammar correction function 114 and the vocabulary enhancement function 116 interact with a Contextual Feature Sequence (CFS) function 118, the CFS function 118 using an internet corpus 120.
For purposes of the description herein, a contextual feature sequence or CFS is defined to include N-grams (N-grams), skip-grams (skip-grams), switch-grams (switch-grams), co-occurrences (co-occurrences), "user previous use features," and combinations thereof, which are in turn defined hereinafter with reference to FIG. 5. Note that for simplicity and clarity of description, most of the examples that follow use only n-grams. It should be understood that the present invention is not limited thereto.
The use of an internet corpus is important because it provides important statistics for a very large number of contextual feature sequences, resulting in highly robust language correction functionality. In fact, combinations of more than two words have poor statistics in traditional non-internet corpora, but have acceptable or good statistics in internet corpora.
An internet corpus is a large representative sample of natural language text collected from the world wide web, usually by crawling the internet and collecting text from web pages. Preferably, dynamic text is also collected, such as chat transcripts, text from web forums, and text from blogs. The collected text is used to accumulate statistics on natural language text. The size of the internet corpus may be, for example, one trillion (1,000,000,000,000) words or several trillion words, as compared to the more common corpus sizes of up to two billion words. A small sample of the web, such as a web corpus of 10 billion words, is less than one percent of the web text indexed, for example, by GOOGLE. The present invention can work with such a web sample, but preferably uses a much larger sample of the web for the text correction task.
Preferably, the internet corpus is used in one of two ways:
one or more internet search engines are used by using CFS as a search query. The number of results for each such query provides the frequency of occurrence of the CFS.
Local indexes are built over time by crawling and indexing the internet. The number of occurrences of each CFS provides the CFS frequency. The local index and search query may be based on selectable portions of the internet and may be identified with these selected portions. Similarly, portions of the Internet may be excluded or appropriately weighted to correct for anomalies between Internet usage and general language usage. In this way, websites that are reliable in language usage (such as news and government websites) may be given greater weight than other websites (such as chat or user forums).
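As a rough illustration of the second approach, the sketch below builds a toy local index of n-gram counts and looks up a CFS frequency in it; the data layout and function names are assumptions of this sketch, and a real deployment would use a crawled, site-weighted index as described above.

```python
from collections import Counter

def build_local_index(documents, max_n=5):
    """Toy local CFS index: count every n-gram (n <= max_n) over crawled documents.
    Per-site weighting, as discussed above, could be applied per document here."""
    index = Counter()
    for doc in documents:
        words = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                index[" ".join(words[i:i + n])] += 1
    return index

def cfs_frequency(index, cfs):
    """Frequency of occurrence of a CFS in the indexed corpus."""
    return index[cfs.lower()]
```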
Preferably, the input text is first provided to the spelling correction function 112, and then to the misused word and grammar correction function 114. The input text may be any suitable text and, in the context of word processing, is preferably part of a document, such as a sentence. The vocabulary enhancement function 116 preferably operates on the text that has been provided to the spelling correction function 112 and the misused word and grammar correction function 114 in accordance with user preferences.
Preferably, language correction module 100 provides an output that includes corrected text accompanied by one or more suggested replacements for each corrected word or each group of corrected words.
Referring now to fig. 2, fig. 2 is a simplified flow diagram illustrating a spelling correction function preferably used in the system and function of fig. 1. As shown in fig. 2, the spelling correction function preferably includes the following steps:
the spelling errors in the input text are identified, preferably using a conventional dictionary enriched with proper nouns and words in common use on the internet;
the spelling errors are grouped into clusters, which may comprise single or multiple misspelled words (consecutive or nearly consecutive), and the cluster to be corrected is selected (see the sketch following this list). This selection attempts to find the cluster containing the largest amount of correct context data. Preferably, the cluster having the longest sequence or sequences of correctly spelled words in its vicinity is selected. These steps are described in more detail later with reference to fig. 6A.
Generating one or preferably more replacement corrections for each cluster, preferably based on an algorithm described later with reference to FIG. 7A;
preferably, each substitution correction is scored based at least in part on non-contextual word similarity and context scoring, preferably using an internet corpus, based on a spelling correction substitution scoring algorithm described below with reference to fig. 8;
for each cluster, selecting a single spelling correction based on the scores and giving the most preferred alternative spelling correction; and
providing a corrected text output containing a single spelling correction for each misspelled cluster, the single spelling correction replacing the misspelled cluster.
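A minimal sketch of the cluster-selection heuristic described above (choosing the cluster with the longest run of correctly spelled neighbouring words) follows; the data representation and names are illustrative assumptions, not the method of fig. 6A itself.

```python
def select_cluster_to_correct(words, misspelled_indices, clusters):
    """Pick the cluster whose neighbourhood has the longest run of correctly
    spelled words, i.e. the most reliable context for correction."""
    def run_length(start, step):
        n, i = 0, start
        while 0 <= i < len(words) and i not in misspelled_indices:
            n += 1
            i += step
        return n

    def context_score(cluster):                    # cluster = sorted word indices
        before = run_length(min(cluster) - 1, -1)  # correct words preceding the cluster
        after = run_length(max(cluster) + 1, +1)   # correct words following the cluster
        return max(before, after)

    return max(clusters, key=context_score)
```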
The operation of the functionality of fig. 2 may be better understood from consideration of the following example:
the following input text is received:
Physical ecudation can assits in strenghing muscles. Some students should eksersiv daily to inprove their strenth and helth becals thay ea so fate.
the following words are recognized as misspellings:
ecudation; assits; strenghing; eksersiv; inprove; strenth; helth; becals; thay; ea.
Note that "fate" is not recognized as a spelling error because it appears in the dictionary.
The following clusters were selected as shown in table 1:
TABLE 1
Cluster # Cluster
1 eksersiv
2 inprove their strenth
3 ecudation
4 assits in strenghing
5 helth becals thay ea
With respect to cluster 2, note that "their" is correctly spelled, but is still included in the cluster because it is surrounded by misspelled words.
Cluster 1 "eksersiv" is selected for correction because it has the longest correctly spelled word sequence or sequences in its vicinity.
The following substitution corrections are generated for the misspelled word "eksersiv":
excessive, expressive, obsessive, assertive, exercise, extensive, exclusive, exertion, excised, exorcism.
each replacement correction is assigned a non-contextual word similarity score based on similarity to the pronunciation and character string of the misspelled word, for example as shown in table 2:
TABLE 2
Replacement Non-contextual word similarity score
excessive 0.90
expressive 0.83
exercise 0.80
exorcism 0.56
The non-contextual score may be derived in various ways. One example is by using the Levenshtein distance algorithm described at http://en.wikipedia.org/wiki/Levenshtein_distance. The algorithm may be applied to the word strings, to a phonetic representation of the words, or to a combination of both.
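For illustration, a minimal sketch of such a Levenshtein-based non-contextual score, normalized to the 0..1 range used in Table 2, is given below; the normalization by the longer string length is an assumed choice, not one specified in the text.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def non_contextual_similarity(misspelled, candidate):
    """Map edit distance to a 0..1 similarity; the normalization is an assumed choice."""
    longest = max(len(misspelled), len(candidate))
    return 1.0 if longest == 0 else 1.0 - levenshtein(misspelled, candidate) / longest
```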
Each replacement is also given a context score based on its suitability in the context of the input sentence, as shown in table 3. In this example, the context used is "Some students should <eksersiv> daily".
TABLE 3
The context score is preferably derived as described below with reference to fig. 8 and is based on the frequency of the Context Feature Sequences (CFSs) in the internet corpus.
The word "exercise" is selected as the best alternative based on the context score and the non-context word similarity score as described below with reference to fig. 8.
All clusters are corrected in a similar manner. After spelling correction according to a preferred embodiment of the present invention, the spelling corrected input text is:
Physical education can assist in strengthening muscles. Some students should exercise daily to improve their strength and health because they are so fate.
Note that there are still misused words in the spelling corrected input text. The word "fate" needs to be corrected by the misused word and grammar correction algorithm described later with reference to fig. 3.
Referring now to fig. 3, fig. 3 is a simplified flow diagram illustrating the misused word and grammar correction functionality preferably used in the system and functionality of fig. 1. The misused word and grammar correction functionality provides correction for correctly spelled but misused words in the context of the input text and correction for grammatical errors including substitution of grammatically correct words with grammatically incorrect words, use of redundant words, and loss of words and punctuation.
As shown in FIG. 3, the misused word and grammar correction functionality preferably includes the following steps:
preferably, the suspected misused words and words with grammatical errors are identified in the spell corrected input text output from the spell correction function of fig. 2 by evaluating the suitability of at least a majority of the words within the context of the input sentence;
grouping suspected misused words and words with grammatical errors into clusters, the clusters preferably being non-overlapping; and
the cluster to be corrected is selected. The identifying, grouping and selecting steps are preferably based on an algorithm as described later with reference to fig. 6B.
Preferably, one or preferably a plurality of replacement corrections are generated for each cluster based on a replacement correction generation algorithm described later with reference to FIG. 7A;
generating one or preferably more alternative corrections for each cluster based on the missing article, preposition, and punctuation correction algorithm described later with reference to fig. 10;
generating one or preferably more alternative corrections for each cluster based on the superfluous article, preposition, and punctuation correction algorithms described later with reference to fig. 11;
preferably, each substitution correction is scored at least partially on a context basis and scored on a word similarity basis based on the misused word and grammar correction substitution scoring algorithm described hereinafter with reference to FIG. 9;
for each cluster, selecting a single misused word and grammar correction based on the scoring described above, also described later with reference to fig. 9, and giving the most preferred alternative misused word and grammar correction; and
a spelled, misused word and grammar corrected text output is provided that contains a single misused word and grammar correction for each cluster, the corrections replacing incorrect clusters.
Preferably, the scoring comprises applying a bias in favor of a suspect word relative to the plurality of alternatives generated for it, the bias being a function of an input uncertainty metric that is indicative of the uncertainty of the person providing the input.
The operation of the functionality of fig. 3 may be better understood by considering the following example:
the following input text is received:
I have money book
the following words are identified as suspect misused words:
money,book
the following clusters were generated:
money book
the following is an example (partial list) of replacement corrections that are generated for the cluster:
money books; money back; money box; money bulk; money Buick; money ebook; money bank; mini book; mummy book; Monet book; honey book; mannerly book; mono book; Monday book; many books; mini bike; mummy back; monkey bunk; Monday booked; Monarchy back; Mourned brook
the results of at least part of the context score and non-context word similarity based scores using an internet corpus are given in table 4:
TABLE 4
Cluster Non-contextual similarity score Contextual score Total score
money back 0.72 0.30 0.216
many books 0.84 1.00 0.840
mini bike 0.47 0.75 0.352
money box 0.79 0.40 0.316
money bank 0.65 0.50 0.325
Monday booked 0.70 0.50 0.350
monkey bunk 0.54 0.00 0.000
It should be appreciated that there are various ways to derive the overall score. The preferred overall score is based on the algorithm described later with reference to fig. 9.
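For illustration only, the totals in Table 4 are consistent with simply multiplying the non-contextual similarity score by the contextual score, as in the sketch below; the preferred embodiment nevertheless uses the fuller algorithm of fig. 9.

```python
def overall_score(non_contextual, contextual):
    """Illustrative combination only; the preferred overall score is based on
    the fuller algorithm described with reference to fig. 9."""
    return non_contextual * contextual

# e.g. overall_score(0.84, 1.00) -> 0.840 for "many books",
#      overall_score(0.72, 0.30) -> 0.216 for "money back"
```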
Based on the above scores, the alternative "many books" is selected. Thus, the corrected text is:
I have many books.
Referring now to fig. 4, fig. 4 is a simplified flow diagram illustrating vocabulary enhancement functionality for use in the system and functionality of fig. 1. As shown in fig. 4, the vocabulary enhancement function preferably includes the following steps:
identifying vocabulary-challenged words, i.e. words with suspected suboptimal vocabulary usage, in the spelling-, misused word- and grammar-corrected input text output from the misused word and grammar correction function of fig. 3;
grouping the vocabulary challenged words into clusters, which are preferably non-overlapping;
the cluster to be corrected is selected. The identifying, grouping and selecting steps are preferably based on an algorithm as described later with reference to fig. 6C.
Preferably, one or preferably a plurality of alternative vocabulary enhancements are generated for each cluster based on the vocabulary enhancement generation algorithm described below with reference to FIG. 7B;
preferably, each alternative vocabulary enhancement is scored based on non-contextual word similarity and context scoring, preferably using an internet corpus, based on a vocabulary enhancement alternative scoring algorithm described below with reference to fig. 9;
for each cluster, selecting a single lexical enhancement based on the above-described scores also described later with reference to FIG. 9, and giving the most preferred alternative lexical enhancement; and
vocabulary enhancement suggestions are provided for each suboptimal vocabulary cluster.
The operation of the functionality of fig. 4 may be better understood by considering the following example:
the following spellings, misused words and grammar corrected input text are provided:
Wearing colorful clothes will separate us from the rest of the children in the school.
Using the functionality described later with reference to FIG. 6C, the following cluster is selected for vocabulary enhancement:
separate
Using the functionality described later with reference to FIG. 7B, the following alternative cluster corrections are generated based on the pre-processed dictionary database described in FIG. 7B, as shown in Table 5 (partial list):
TABLE 5
Individual replacement vocabulary enhancements are scored using an internet corpus based at least in part on their suitability within the context of the input text and also based on their similarity in meaning to the vocabulary challenged word "separate".
Using the functionality described later with reference to fig. 5, the following CFSs (partial list) are generated:
'will separate', 'separate us', 'clothes will separate', 'will separate us', 'separate us from'
Using the functionality described later with reference to stage IIA of fig. 9, a matrix of the frequencies of occurrence in the internet corpus of the partial list of alternative cluster corrections against the CFS list above is generated, as shown in table 6:
TABLE 6
All CFSs with a frequency of occurrence of 0 for all replacement corrections are eliminated. In this example, the following feature grammar is eliminated:
'clothes will separate'
Thereafter, all CFSs that are entirely included in other CFSs having at least a minimum threshold frequency of occurrence are eliminated. For example, the following feature grammars are eliminated:
'will separate', 'separate us'
In this example, the remaining CFSs are the feature grammars:
'will separate us', 'separate us from'
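A minimal sketch of this two-stage CFS filtering is shown below; the data layout and the frequency threshold are assumptions made for illustration.

```python
def filter_cfss(freq, min_threshold=1):
    """freq maps each CFS string to a dict of occurrence counts per alternative.
    Stage 1 drops CFSs whose frequency is 0 for every alternative; stage 2 drops
    CFSs wholly contained in another surviving CFS that meets a minimum
    frequency threshold, as described above."""
    surviving = [cfs for cfs, counts in freq.items() if any(counts.values())]
    kept = []
    for cfs in surviving:
        contained = any(cfs != other and cfs in other
                        and max(freq[other].values()) >= min_threshold
                        for other in surviving)
        if not contained:
            kept.append(cfs)
    return kept

# With the example above, 'clothes will separate' is removed in stage 1 and
# 'will separate' / 'separate us' in stage 2, leaving
# 'will separate us' and 'separate us from'.
```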
Using the final preference score described later in stages IID and IIE with reference to fig. 9, the alternative "differentiate" is selected, and the enhanced sentence is:
Wearing colorful clothes will differentiate us from the rest of the children in the school.
referring now to FIG. 5, FIG. 5 is a simplified block diagram illustrating a Contextual Feature Sequence (CFS) function 118 (FIG. 1) useful in the systems and functions for computer-assisted language correction of the preferred embodiments of the present invention.
The CFS function 118 preferably includes a feature extraction function that includes an N-gram extraction function, and optionally at least one of the following: a skip-gram extraction function; a switch-gram extraction function; a co-occurrence extraction function; and a user previous use feature extraction function.
The term N-gram, a term known in the art, refers to a sequence of N consecutive words in the input text. The N-gram extraction function may use conventional part-of-speech tagging and sentence analysis functions to avoid generating specific N-grams that are not expected to occur at high frequency in a corpus (preferably an internet corpus) based on grammatical considerations.
For the purposes of this description, the term "bypass grammar extraction function" means a function for extracting "bypass grammars", which are modified N-grams that omit certain non-essential words or phrases such as adjectives, adverbs, adjective phrases and adverb phrases, or that contain only words having predefined grammatical relationships, such as subject-verb, verb-object, adverb-verb or verb-time phrase. The bypass grammar extraction function may use conventional part-of-speech tagging and sentence analysis functions to help decide which words may be skipped in a given context.
For the purpose of the present description, the term "conversion grammar extraction function" means a function of recognizing "conversion grammar", which is a modified N-gram in which the appearance order of specific words is converted. The translation grammar extraction functionality may use conventional part-of-speech tagging and sentence analysis functionality to help decide which words may be translated in order of occurrence in a given context.
For the purposes of this description, the term "co-occurrence extraction function" means a function for identifying, for all words in the input text other than the words included in an N-gram, a conversion grammar or a bypass grammar, word combinations that co-occur with the input text words in the input sentence, or in an input document containing many input sentences, together with an indication of the distance and direction of each such word from the input word, after filtering out common words such as prepositions, articles, conjunctions and other words whose function is mainly grammatical.
For the purposes of this description, the term "feature extraction function previously used by a user" means a function of recognizing words used by the user in other documents after filtering out common words such as prepositions, articles, conjunctions, and other words whose function is mainly a grammatical function.
For the purposes of this description, the N-gram, bypass grammar, conversion grammar, and combinations thereof are referred to as feature grammars.
For the purposes of this description, N-grams, bypass grammars, conversion grammars, co-occurrences, "features previously used by the user", and combinations thereof are referred to as contextual feature sequences or CFSs.
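By way of illustration only, the extraction of N-grams and bypass grammars around a cluster may be sketched as follows; the tokenization, the maximum N and the set of skippable part-of-speech tags are assumptions made for the purposes of this sketch and are not mandated by the described functionality.

```python
# Illustrative sketch of N-gram and bypass grammar (skip-gram) extraction
# around a word cluster; the skippable part-of-speech tags are assumed.
SKIPPABLE_TAGS = {"JJ", "RB"}  # e.g. adjectives and adverbs

def ngrams_around_cluster(tokens, c_start, c_end, max_n=5):
    """All sequences of 2..max_n consecutive words that overlap the
    cluster tokens[c_start:c_end + 1]."""
    grams = []
    for n in range(2, max_n + 1):
        lo = max(0, c_start - n + 1)
        hi = min(c_end, len(tokens) - n)
        for i in range(lo, hi + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def bypass_gram(tagged_window, cluster_positions):
    """A modified N-gram over a window of (word, tag) pairs in which
    non-essential words outside the cluster are omitted."""
    kept = [w for i, (w, tag) in enumerate(tagged_window)
            if i in cluster_positions or tag not in SKIPPABLE_TAGS]
    return " ".join(kept)

tokens = "Cherlock Homes the lead character and chief inspecter".split()
print(ngrams_around_cluster(tokens, 0, 1, max_n=3))
# ['Cherlock Homes', 'Homes the', 'Cherlock Homes the', 'Homes the lead']
```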
The functionality of fig. 5 preferably operates on individual words or word clusters in the input text.
The operation of the functionality of fig. 5 may be better understood by considering the following example:
the following input text is provided:
Cherlock Homes the lead character and chief inspecter has been cold in by the family doctor Dr Mortimer, to invesigate the death of sir Charles
for the cluster "Cherlock Homes" in the input text, the following CFS is generated:
n-gram:
2-element grammar: cherlock Homes; the Homes the
3-element grammar: cherlock Homes the; the lead of Homes
4-element grammar: cherlock Homes the lead; the Homes the lead character
5-element grammar: cherlock Homes the lead character
Skip syntax:
Cherlock Homes the character;Cherlock Homes the chief inspecter;Cherlock Homes the inspecter;Cherlock Homes has been cold
Switch gram:
The lead character Cherlock Home
co-occurrence in the input text:
Character;inspector;investigate;death
co-occurrence in a document containing input text:
Arthur Conan Doyle;story
co-occurrence in other documents of the user:
mystery
for the cluster "cold" in the input text, the following CFS is generated:
n-gram:
2-element grammar: been cold; cold in
3-element grammar: has been cold; been cold in; cold in by
4-element grammar: inspecter has been cold; has been cold in; been cold in by; cold in by the
5-element grammar: chief inspecter has been cold; inspecter has been cold in; has been cold in by; been cold in by the; cold in by the family
Skip syntax:
cold in to investigate;Cherlock has been cold;cold by the doctor;cold by Dr Mortimer;character has been cold
each CFS is assigned an "importance score" based on at least one of the following, preferably more than one of the following, and most preferably all of the following:
a. Operation of conventional part-of-speech tagging and sentence analysis functions. A CFS that includes parts of multiple analysis (parse) tree nodes is given a relatively low score: the larger the number of analysis tree nodes included in a CFS, the lower its score.
b. Length of the CFS. The longer the CFS, the higher the score.
c. Frequency of occurrence of each word in the CFS other than the input word. The higher the frequency of occurrence of such words, the lower the score.
d. Type of CFS. For example, N-grams are preferred over co-occurrences. Co-occurrences in the input sentence are preferred over co-occurrences in the input document, which in turn are preferred over "features previously used by the user".
Referring to the example above, typical scores are shown in table 7:
TABLE 7
These CFSs and their importance scores are used in a function described below with reference to fig. 8 and 9 to perform context-based scoring of various alternative cluster corrections based on the frequency of occurrence of CFSs in the internet corpus.
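A simplified sketch of how criteria a-d above might be combined into a single importance score is given below; the particular type weights, the use of a logarithm for criterion c and the word-frequency source are illustrative assumptions only and are not part of the described functionality.

```python
import math

# Assumed relative weights for criterion d (type of CFS).
TYPE_WEIGHT = {"ngram": 1.0, "co_occurrence_sentence": 0.6,
               "co_occurrence_document": 0.4, "previously_used": 0.2}

def cfs_importance_score(cfs_words, num_parse_tree_nodes, cfs_type,
                         word_frequency, input_words):
    score = TYPE_WEIGHT[cfs_type]          # criterion d: type of CFS
    score *= len(cfs_words)                # criterion b: longer CFS scores higher
    score /= max(1, num_parse_tree_nodes)  # criterion a: more tree nodes, lower score
    for word in cfs_words:
        if word not in input_words:        # criterion c: frequent context words lower the score
            score /= 1.0 + math.log1p(word_frequency.get(word, 0))
    return score
```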
Referring now to fig. 6A, fig. 6A is a simplified flow diagram illustrating functionality for: identifying misspelled words in the input text; grouping misspelled words into clusters, which are preferably non-overlapping; and selecting a cluster to be corrected.
As shown in fig. 6A, recognition of misspelled words is preferably done using a conventional dictionary enriched with proper names and with words that are commonly used on the internet.
Grouping misspelled words into clusters is preferably done by: consecutive or nearly consecutive misspelled words and misspelled words having a grammatical relationship are grouped into a single cluster.
The selection of the cluster to be corrected is preferably performed by: an attempt is made to find the cluster containing the largest amount of non-suspect context data. Preferably, the cluster having the longest sequence or sequences of correctly spelled words in its vicinity is selected.
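These grouping and selection steps may be realized, for example, along the following lines; the gap allowed between "nearly consecutive" misspelled words is an assumed parameter.

```python
def group_misspelled_into_clusters(misspelled_positions, max_gap=1):
    """Group consecutive or nearly consecutive misspelled word positions
    into non-overlapping clusters."""
    clusters = []
    for pos in sorted(misspelled_positions):
        if clusters and pos - clusters[-1][-1] <= max_gap + 1:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

def select_cluster_to_correct(clusters, num_tokens, suspect_positions):
    """Prefer the cluster surrounded by the longest runs of non-suspect
    (correctly spelled) words, i.e. the most non-suspect context data."""
    def surrounding_context(cluster):
        left = cluster[0] - 1
        while left >= 0 and left not in suspect_positions:
            left -= 1
        right = cluster[-1] + 1
        while right < num_tokens and right not in suspect_positions:
            right += 1
        return (cluster[0] - 1 - left) + (right - cluster[-1] - 1)
    return max(clusters, key=surrounding_context)
```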
Referring now to FIG. 6B, FIG. 6B is a simplified flow diagram illustrating functionality for: identifying suspected misused words and words with grammatical errors in the spelling-corrected input text; grouping suspected misused words and words with grammatical errors into clusters, the clusters preferably being non-overlapping; and selecting a cluster to be corrected.
Identifying a suspected misused word preferably proceeds as follows:
generating feature grammars for each word in the spelling-corrected input text;
recording the frequency of occurrence of each feature grammar in a corpus, preferably an internet corpus;
the number of suspect feature grammars for each word is recorded. The suspect feature grammars have frequencies that are much lower than their expected frequencies or below a minimum frequency threshold. The expected frequency of the feature syntax is estimated based on the frequencies of constituent elements of the feature syntax and combinations thereof.
A word is suspect if the number of suspect feature grammars containing the word exceeds a predetermined threshold.
According to a preferred embodiment of the present invention, the frequency of occurrence (FREQ F-G) of each feature grammar in the spelling-corrected input text in the corpus, preferably the internet corpus, is determined. The frequency of occurrence of each word in the spelling-corrected input text in the corpus (FREQ W) is also determined, and additionally the frequency of occurrence of each feature grammar without that word (FREQ F-G-W) is determined.
The expected frequency of occurrence (EFREQ F-G) of each feature grammar is calculated as follows:
EFREQ F-G = (FREQ F-G-W × FREQ W) / (sum of frequencies of all words in the corpus)
If the ratio FREQ F-G/EFREQ F-G of the frequency of occurrence of each feature grammar in the spelling-corrected input text in the corpus, preferably the internet corpus, to the expected frequency of occurrence of each feature grammar is less than a predetermined threshold, or if FREQ F-G is less than another predetermined threshold, the feature grammar is considered to be a suspect feature grammar. Each word included in the suspect feature grammar is considered a suspect misused word or a word having a suspect grammatical error.
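The computation described above may be expressed, for example, as follows; the corpus lookup function, the threshold values and the assumed corpus size are placeholders used only for illustration.

```python
CORPUS_WORD_COUNT = 1_000_000_000_000  # assumed, e.g. one trillion words

def is_suspect_feature_gram(fg_words, word, corpus_freq,
                            ratio_threshold=1.0, min_freq=10):
    """A feature grammar containing `word` is suspect if its observed
    frequency is below a minimum threshold or well below its expected
    frequency. `corpus_freq(words)` returns the corpus frequency of a
    word sequence."""
    freq_fg = corpus_freq(fg_words)                       # FREQ F-G
    freq_w = corpus_freq([word])                          # FREQ W
    freq_fg_without_w = corpus_freq(                      # FREQ F-G-W
        [w for w in fg_words if w != word])
    efreq_fg = freq_fg_without_w * freq_w / CORPUS_WORD_COUNT  # EFREQ F-G
    if freq_fg < min_freq:
        return True
    return efreq_fg > 0 and freq_fg / efreq_fg < ratio_threshold

def is_suspect_word(word, feature_grams, corpus_freq, max_suspect=1):
    """A word is suspect if the number of suspect feature grammars that
    contain it exceeds a predetermined threshold."""
    count = sum(1 for fg in feature_grams
                if is_suspect_feature_gram(fg, word, corpus_freq))
    return count > max_suspect
```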
The operation of the function in FIG. 6B for identifying suspected misused words and words with grammatical errors in spelling-corrected input text may be better understood by considering the following example:
the following spelling-corrected input text is provided:
I have money book
the feature grammar includes the following sections:
I;I have;I have money;I have money book
table 8 indicates the frequency of occurrence of the above feature grammar in the internet corpus:
TABLE 8
The expected frequency of occurrence is calculated for each 2-gram as follows:
EFREQ F-G = (FREQ F-G-W × FREQ W) / (sum of frequencies of all words in the corpus)
For example, for a 2-gram (x, y):
expected 2-gram frequency of (x, y) = (1-gram frequency of x × 1-gram frequency of y) / (number of words in the internet corpus, for example one trillion (1,000,000,000,000) words)
The ratio of the frequency of occurrence of each feature grammar in the spelling-corrected input text in the corpus (preferably in the internet corpus) to the expected frequency of occurrence of each feature grammar is calculated as follows:
FREQ F-G/EFREQ F-G
the ratio of the frequency of occurrence in the corpus (preferably in the internet corpus) of each of the above 2-gram in the spelling-corrected input text to the expected frequency of occurrence of each of the above 2-gram is shown in table 9.
TABLE 9
2-gram FREQ F-G EFREQ F-G FREQ F-G/EFREQ F-G
I have 154980000 4118625.7 37.60
have money 390300 187390.8 2.08
money book 3100 20487.9 0.15
It can be seen that the FREQ F-G of "money book" is much lower than its expected frequency, and therefore, the FREQ F-G/EFREQ F-G can be considered to be less than a predetermined threshold such as 1, and thus the "money book" cluster is suspect.
It can be seen that both the 3-gram and the 4-gram, including the word "money book", have a frequency of 0 in the internet corpus. This may also be the basis for a suspicion of "money book".
Grouping suspected misused words and words with grammatical errors into clusters is preferably performed as follows: grouping consecutive or nearly consecutive suspected misused words into a single cluster; and grouping the suspected misused words having a grammatical relationship therebetween into the same cluster.
The selection of the cluster to be corrected is preferably performed by: an attempt is made to find the cluster containing the largest amount of non-suspect context data. Preferably, the cluster having the longest one or more non-suspect word sequences in its vicinity is selected.
Referring now to fig. 6C, fig. 6C is a simplified flow diagram illustrating functionality for: identifying vocabulary challenged words having suspect suboptimal vocabulary usage in spelled, misused words and grammar-corrected input text; grouping the vocabulary challenged words into clusters, which are preferably non-overlapping; and selecting a cluster to be corrected.
Recognizing the vocabulary challenged word preferably proceeds as follows:
pre-processing a thesaurus to assign a language richness score to each word, the score indicating the ranking of the word in a hierarchy in which written language is preferred over spoken language, in which, among internet sources, articles and books are preferred over, for example, chats and forums, and in which less frequently used words are preferred over more frequently used words;
further pre-processing the thesaurus to eliminate words that are unlikely candidates for vocabulary enhancement based on the results of the preceding pre-processing step and grammar rules;
performing additional pre-processing to indicate for each remaining word a lexical enhancement candidate having a higher language richness score than the input word; and
checking whether each word in the input text after spelling, misused words and grammar correction appears as a remaining word in the plurality of preprocessed synonym dictionaries, and identifying each such word appearing as a remaining word as a candidate for vocabulary enhancement.
Grouping the vocabulary challenged words into preferably non-overlapping clusters is optional and preferably proceeds as follows:
grouping consecutive vocabulary challenged words into a single cluster; and
the vocabulary challenged words with grammatical relations are grouped into the same cluster.
The selection of the cluster to be corrected is preferably performed by: an attempt is made to find a cluster containing the largest number of non-lexical challenged words. Preferably, the cluster having the longest sequence or sequences of non-lexical challenged words in its vicinity is selected.
Referring now to fig. 7A, fig. 7A is a simplified flowchart illustrating functionality useful in the functionality of fig. 2 and 3 for generating replacement corrections for clusters.
If the original input word is spelled correctly, it is considered as a replacement.
As shown in fig. 7A, a plurality of replacement corrections is first generated for each word in a cluster in the following manner:
retrieving a plurality of words that are similar to each word in the cluster, based on their written appearance, expressed as string similarity, and on pronunciation or phonetic similarity, from a dictionary. This functionality is known, and free software providing it, such as GNU Aspell and Google GSpell, is available on the internet. The retrieved and prioritized words provide a first plurality of replacement corrections. For example, given the input word feezix, the word "physics" will be retrieved from the dictionary based on similar pronunciation, even though it has only one character in common, namely "i". The word "felix" will be retrieved based on its string similarity, although it does not have a similar pronunciation.
Additional alternatives may be generated by using rules based on known alternative usages and accumulated user input. For example, u → you, r → are, Im → I am.
Further alternatives may be generated based on grammar rules, preferably using predefined lists. Some examples are as follows:
singular/plural rules: if the input sentence is "fall off trees in the autumn ", resulting in a plurality of substitutions" leaves ".
The article rule: if the input text is "a old lady", the alternative articles "an" and "the" are produced.
Preposition rules: if the input text is "I am interested football", the replacement prepositions "in", "at", "to", "on", "through", … are generated.
Verb inflection change (inflection) rule: if the input text is "He leave the room", the alternative verb inflections "left", "leaves", "had left", … are produced.
Merging words and dividing word rules: if the input text is "get alot fitter", the replacement "a lot" is generated.
If the input text is "we have toout ", a replacement" watch "is generated.
If the input text is "do mann", then the replacement" sit ups "is generated.
A particular feature of one preferred embodiment of the present invention is the use of contextual information such as CFSs, and more particularly feature grammars, not only for scoring such "contextually retrieved" replacement corrections but also for generating them. Frequently occurring word combinations, such as CFSs, and more particularly feature grammars, may be retrieved from an existing corpus, such as an internet corpus.
The following examples describe this aspect of the invention:
if The input statement is "The cat has", then the word" kts "is not sufficiently similar in pronunciation or writing to the word" kittens "such that" kittens "cannot be an alternative without this aspect of the invention.
According to this aspect of the invention, by looking up the words that normally appear after the N-gram "cat has" in the internet corpus, i.e., all words found as an in the query "cat has", the following alternatives are retrieved:
nine lives;left;fleas;dandruff;kittens;tapeworms;adopted;retractileclaws;been;urinated;diarrhea;eaten;swallowed;hairballs;gone;alwaysbeen
according to a preferred embodiment of the present invention, the alternatives for the "contextual search" are then filtered so that only alternatives for contextual searches having some degree of phonetic or written similarity to the original word (in this example, "kts") remain. In this example, the alternative "kittens" with the highest phonetic and writing similarity is retrieved.
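This contextual retrieval and similarity filtering may be sketched as follows; the corpus completion query and the similarity threshold are assumed, and a simple string-similarity ratio stands in for the combined phonetic and written similarity measure.

```python
from difflib import SequenceMatcher

def contextual_alternatives(left_context, original_word, corpus_completions,
                            min_similarity=0.4):
    """Retrieve words that frequently follow `left_context` in the corpus
    (e.g. after "cat has") and keep only those with at least some written
    similarity to the original word (e.g. "kts").
    `corpus_completions(ngram)` is an assumed corpus query returning
    candidate next words."""
    kept = []
    for candidate in corpus_completions(left_context):
        similarity = SequenceMatcher(None, original_word.lower(),
                                     candidate.lower()).ratio()
        if similarity >= min_similarity:
            kept.append((candidate, similarity))
    return [w for w, _ in sorted(kept, key=lambda p: p[1], reverse=True)]
```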
Where the input text is automatically generated by an external system such as an optical character recognition, speech to text or machine translation system, further alternatives may be received directly from such a system. These additional alternatives are typically generated during operation of such systems. For example, in a machine translation system, alternate translations of foreign language words may be provided to the system for use as alternates.
Once all substitutions have been generated for each word in the cluster, cluster substitutions for the entire cluster are generated by: all possible combinations of alternatives are determined and then filtered based on their frequency of occurrence in a corpus, preferably an internet corpus.
The following examples are illustrative:
if the input cluster is "money book" and the substitution of the word "money" is:
Monday;many;monkey
and alternatives to the word "book" are:
books;box;back
the following cluster replacement occurs:
money books;money box;money back;Monday books;Monday box;Monday back;many books;many box;many back;monkey books;monkey box;monkey back;many book;monkey book;Monday book
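The combination and filtering step may be sketched as follows; the corpus frequency function and the minimum frequency threshold are assumptions for illustration.

```python
from itertools import product

def cluster_replacements(per_word_alternatives, corpus_freq, min_freq=1):
    """Form every combination of the per-word alternatives and keep only
    combinations whose frequency in the corpus reaches the threshold."""
    combinations = (" ".join(words)
                    for words in product(*per_word_alternatives))
    return [c for c in combinations if corpus_freq(c) >= min_freq]

# e.g. cluster_replacements([["money", "Monday", "many", "monkey"],
#                            ["book", "books", "box", "back"]], corpus_freq)
```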
referring now to fig. 7B, fig. 7B is a simplified flowchart illustrating alternate enhanced functionality for generating clusters useful in the functionality of fig. 4.
As shown in fig. 7B, a number of alternative enhancements are first generated in the following manner:
if the original input word is correctly spelled, it is considered as a replacement.
A plurality of words that are lexically related to each word in the cluster, as synonyms, supersets or subsets, is retrieved from a synonym dictionary or other lexical database, such as Princeton WordNet, which is freely available on the internet. The retrieved and prioritized words provide a first plurality of alternative enhancements.
Additional replacements are generated by using rules based on known replacement uses and accumulated user input.
One particular feature of the preferred embodiment of the present invention is the use of contextual information such as CFS (and more particularly such as feature syntax) to generate replacement enhancements, not just for scoring replacement enhancements for such "context retrieval". Frequently occurring word combinations, such as CFSs, and more particularly, feature grammars, may be retrieved from existing corpus, such as internet corpus.
Once all substitutions have been generated for each word within a cluster, a replacement for the entire cluster is generated by: all possible combinations of individual word substitutions are determined and the resulting multi-word combinations are filtered based on their frequency of occurrence in an existing corpus, such as the internet.
The following example illustrates the functionality of FIG. 7B:
the following input text is provided:
it was nice to meet you
the following substitutions (partial list) of the word "nice" are generated by a lexical database such as Princeton WordNet:
pleasant、good、kind、polite、fine、decent、gracious、courteous、considerate、enjoyable、agreeable、satisfying、delightful、entertaining、amiable、friendly、elegant、precise、careful、meticulous
the following replacement of the word "nice" is generated by applying a predetermined rule:
cool
for example, in response to a corpus query for the words appearing between "it was" and "to meet", the following contextual search alternatives for the word "nice" are generated:
great;a pleasure;wonderful;lovely;good;impossible;fun;awesome;refreshing;exciting;agreed;fantastic;decided;inspiring
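Retrieval of lexically related words from a lexical database such as Princeton WordNet may be sketched as follows, here through the NLTK interface as one possible access route; treating hypernyms and hyponyms as the "supersets" and "subsets" referred to above is an interpretive assumption.

```python
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus

def lexical_alternatives(word):
    """Synonyms, hypernyms ("supersets") and hyponyms ("subsets") of a
    word, retrieved from WordNet."""
    alternatives = set()
    for synset in wn.synsets(word):
        related = list(synset.lemmas())
        for neighbor in synset.hypernyms() + synset.hyponyms():
            related.extend(neighbor.lemmas())
        for lemma in related:
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower():
                alternatives.add(name)
    return alternatives

# usage: lexical_alternatives("nice") returns a set of lexically related words
```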
referring now to fig. 8, fig. 8 is a simplified flow diagram illustrating functionality for context-based and word similarity-based scoring of various alternative enhancements that is useful in the spelling correction functionality of fig. 2.
As shown in fig. 8, the context-based and word similarity-based scoring of individual replacement corrections occurs in the following general stages:
I. Non-contextual scoring: the individual cluster replacements are scored based on their similarity to the cluster in the input text, in terms of written appearance and pronunciation. This score does not take into account any context outside the given cluster.
II. Contextual scoring: the individual cluster replacements are also scored based on extracted contextual feature sequences (CFSs), which are provided as described above with reference to fig. 5. This scoring includes the following sub-stages:
IIA. A frequency of occurrence analysis is performed, preferably using an internet corpus, for each alternative cluster correction produced by the functionality of fig. 7A, in the context of the CFSs extracted as described above with reference to fig. 5.
IIB. CFS selection and weighting of the individual CFSs is performed based on the results of the frequency of occurrence analysis of sub-stage IIA. The weighting is also based on the relative intrinsic importance of the individual CFSs. It should be appreciated that some CFSs may be given a weight of 0 and thus not selected. The selected CFSs are preferably given relative weights.
IIC. A frequency of occurrence metric is assigned to each alternative correction in each CFS selected in sub-stage IIB.
IID. A reduced set of replacement cluster corrections is generated based on the results of the frequency of occurrence analysis of sub-stage IIA, the frequency of occurrence metrics of sub-stage IIC, and the CFS selection and weighting of sub-stage IIB.
IIE. The cluster correction having the highest non-contextual similarity score of stage I is selected from the reduced set of sub-stage IID for use as a reference cluster correction.
IIF. A frequency of occurrence metric is assigned to the reference cluster correction of sub-stage IIE for each CFS selected in sub-stage IIB.
IIG. A ratio metric is assigned to each CFS selected in sub-stage IIB, representing the ratio of the occurrence frequency metric of each alternative correction for that CFS to the occurrence frequency metric assigned to the reference cluster correction of sub-stage IIE.
III. The most preferred replacement cluster correction is selected based on the stage I results and the stage II results.
IV. A confidence score is assigned to the most preferred replacement cluster correction.
A more detailed description of the functions described above in phases II-IV is given below:
with reference to sub-stage IIA, all CFSs including the cluster to be corrected are generated as described above in fig. 5. CFS containing suspected errors (except for errors in the input cluster) are eliminated.
A matrix is generated that indicates the frequency of occurrence in the corpus (preferably the internet corpus) of each alternate correction of the clusters in each CFS. All CFSs for which all replacement corrections have an occurrence frequency of 0 are eliminated. Thereafter, all CFSs included entirely in the other CFSs having at least the lowest threshold occurrence frequency are eliminated.
The following example illustrates the generation of an occurrence frequency matrix:
the following input text is provided:
I lik tw play outside a lot
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
lik tw
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
like to;like two;lick two;lack two;lack true;like true
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘lik tw’;‘I lik tw’;‘lik tw play’;‘I lik tw play’;‘lik tw play outside’;‘I lik tw play outside’;‘lik tw play outside a’
using the functionality described above with reference to stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 10 is generated for the above alternate cluster correction list in the above CFS list:
watch 10
All CFSs for which all replacement corrections have a frequency of occurrence of 0 are eliminated. In this example, the following feature syntax is eliminated:
‘lik tw play outside a’
thereafter, all CFSs included entirely in other CFSs having at least the minimum threshold occurrence frequency are eliminated. In this example, the following feature grammars are eliminated:
‘lik tw’;‘I lik tw’;‘lik tw play’;‘I lik tw play’;‘lik tw play outside’
in this example, the only remaining CFS is the feature grammar:
′I lik tw play outside′。
the resulting matrix is shown in table 11:
TABLE 11
CFS/replacement cluster correction ‘I lik tw play outside’
like to 330
like two 0
lick two 0
lack two 0
lack true 0
like true 0
The above example illustrates the generation of a matrix according to a preferred embodiment of the invention. In this example, it is clear that "like to" is the preferred alternative correction. It will be appreciated that in practice, the selection is often not so straightforward. Thus, in the further examples given below, functionality is provided for making a much more difficult selection in a replacement correction.
Returning to the consideration of sub-phase IIB, optionally, each remaining CFS is assigned a score as described above with reference to FIG. 5. Additionally, CFSs that contain words introduced in earlier correction iterations of the multi-word input and have a confidence less than a predetermined confidence threshold are negatively biased.
In the general case, similarly to sub-stage IIC described above, a normalized frequency matrix is generated, indicating the normalized frequency of occurrence of each CFS in the internet corpus. The normalized frequency matrix is typically generated from the frequency matrix by dividing each CFS frequency by a function of the frequency of occurrence of the associated cluster replacement.
This normalization serves to attenuate the effect of differences in the overall frequency of occurrence of the individual replacement corrections. A suitable normalization factor is based on the overall frequency of occurrence of each replacement correction in the corpus as a whole, irrespective of the particular CFS.
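The normalization described above may be expressed, for example, as follows; dividing by the overall corpus frequency of the replacement is one possible choice of normalization function.

```python
def normalized_frequencies(freq_matrix, corpus_freq):
    """freq_matrix[alt][cfs] holds the corpus frequency of the CFS with the
    cluster replaced by `alt`; each entry is divided by the overall corpus
    frequency of the replacement itself."""
    normalized = {}
    for alt, row in freq_matrix.items():
        overall = max(1, corpus_freq(alt))
        normalized[alt] = {cfs: freq / overall for cfs, freq in row.items()}
    return normalized
```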
The following example illustrates the generation of a normalized frequency of occurrence matrix:
the following input text is provided:
footprints of a mysterious haund said to be six feet tall
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
haund
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
hound;hand;sound
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘a mysterious haund’;‘haund said’
using the functionality described above with reference to stage IIC, the occurrence frequency matrix in the internet corpus and the normalized occurrence frequency matrix as shown in table 12 are generated for the above alternate cluster correction list in the above CFS list:
TABLE 12
It can be appreciated from the above example that the word with the highest frequency of occurrence may not necessarily have the highest normalized frequency of occurrence because of the fundamental differences in the entirety of the respective replacement corrections. In the above example, "hound" has the highest normalized frequency of occurrence, and it is apparent from the context of the input text that the correct word is "hound" rather than "hand" which has a higher frequency of occurrence in the internet corpus.
A particular feature of the invention is that a normalized frequency of occurrence is preferably used in the selection of the substitution corrections, which attenuates fundamental differences in the ensemble of the respective substitution corrections. It will be appreciated that other measures of frequency of occurrence other than normalized frequency of occurrence may alternatively or additionally be used as a measure. In cases where the frequency of occurrence is relatively low or particularly high, additional or alternative metrics may be beneficial.
It will be appreciated from the discussion that follows that additional functionality is often useful in selecting among various alternative corrections. These functions are described below.
In sub-stage IID, each replacement cluster correction that is less preferred than another replacement cluster correction according to both of the following metrics is eliminated:
i. it has a lower word similarity score than the other replacement cluster correction; and
ii. it has a lower frequency of occurrence, and preferably also a lower normalized frequency of occurrence, than the other replacement cluster correction for all CFSs.
The following example illustrates the elimination of the replacement correction as described above:
the following input text is provided:
I leav un a big house
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
leav un
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
leave in;live in;love in
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘I leav un a’;‘leav un a big’
using the functionality described above with reference to stage IIC, the occurrence frequency matrix in the internet corpus and the normalized occurrence frequency matrix as shown in table 13 are generated for the above alternate cluster correction list in the above CFS list:
watch 13
In this example, the non-contextual similarity scores of the replacement cluster corrections are shown in table 14:
TABLE 14
Replacement cluster correction Similarity score
leave in 0.9
live in 0.8
love in 0.7
The replacement cluster correction "love in" is eliminated because it has a lower similarity score and a lower frequency of occurrence and a lower normalized frequency of occurrence than "live in". At this stage, the replacement cluster correction "leave in" is not eliminated because its similarity score is higher than "live in".
As can be appreciated from the above, the result of the functionality of sub-stage IID is a reduced frequency matrix, and preferably also a reduced normalized frequency matrix, indicating the frequency of occurrence, and preferably also the normalized frequency of occurrence, of each of a reduced plurality of replacement corrections in each of a reduced plurality of CFSs, each replacement correction also having a similarity score. The reduced set of replacement cluster corrections is preferably used in all further replacement cluster selection functionality, as seen in the examples that follow.
For each alternative correction in the reduced frequency matrix and preferably also in the reduced normalized frequency matrix, a final preference measure is generated. One or more of the following substitution metrics may be used to generate a final preference score for each substitution correction.
The term "frequency function" is used below to refer to a function of frequency, normalized frequency, or both.
A. One possible preference metric is the highest occurrence frequency function of each replacement cluster correction over any of the CFSs in the reduced matrix or matrices. For example, each replacement cluster correction would be scored as follows:
the following input text is provided:
A big agle in the sky
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
agle
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
ogle;eagle;angel
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘big agle’;‘agle in the sky’
using the functionality described above with reference to stage IIC, the occurrence frequency matrices in the internet corpus and the normalized occurrence frequency matrices shown in table 15 are generated for the above alternate cluster correction list in the above CFS list:
watch 15
In this example, the non-contextual similarity scores of the replacement cluster corrections are shown in table 16:
TABLE 16
Replacement cluster correction Similarity ofDegree score
Ogle 0.97
Eagle 0.91
Angel 0.83
The alternative "eagle" is chosen because it includes the CFS with the largest frequency of occurrence.
B. Another possible preference metric is the average frequency of occurrence function of all CFSs for each replacement correction. For example, the individual replacement corrections will be scored as follows:
the following input text is provided:
A while ago sthe lived 3 dwarfs
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
sthe
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
the;they;she;there
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘ago sthe lived’;‘sthe lived 3’
using the functionality described above with reference to stage IIC, the occurrence frequency matrices in the internet corpus, normalized occurrence frequency matrices, and average occurrence frequency matrices shown in tables 17 and 18 are generated for the above alternate cluster correction list in the above CFS list:
TABLE 17
Watch 18
Note that "there" is selected based on the average frequency of occurrence.
In this example, the non-contextual similarity scores of the replacement cluster corrections are shown in table 19:
watch 19
Replacement cluster correction Similarity score
the 0.97
they 0.86
she 0.88
there 0.67
Note that the replacement cluster correction with the highest similarity score was not selected.
C. Another possible preference metric is, for each replacement correction, the weighted sum over all CFSs of the occurrence frequency function for each CFS multiplied by the score of that CFS as calculated by the functionality described above with reference to fig. 5.
D. Generating a particular replacement correction/CFS preference metric by any one or more, more preferably most, and most preferably all of the following operations on the replacement correction in the reduced one or more matrices, as described above with reference to sub-phases IIE-IIG:
i. the replacement cluster correction with the highest non-contextual similarity score is selected as the reference cluster.
ii. Generating a modified matrix wherein, in each preference metric, the frequency of occurrence function of each replacement correction in each feature grammar is replaced by the ratio of the frequency of occurrence function of that replacement correction to the frequency of occurrence function of the reference cluster.
iii. A modified matrix of the type described above in ii is further modified by replacing the ratio in each preference metric with a function of the ratio which reduces the computational importance of large differences in the ratio. A suitable such function is a logarithmic function. The purpose of this operation is to reduce the importance of large differences in frequency of occurrence in the final preference score of the most preferred replacement correction, while maintaining the importance of large differences in frequency of occurrence in the final preference score of the least preferred replacement correction, thus enabling elimination of the least preferred replacement correction.
iv. Additionally modifying a modified matrix of the type described in ii or iii above by multiplying the applicable ratio or ratio function in each preference metric by the appropriate CFS score. This provides emphasis based on correct grammatical usage and other factors reflected in the CFS score.
v. additionally modifying a modified matrix of the type described above in ii, iii or iv by generating a function of the applicable ratio, the ratio function, the frequency of occurrence and the normalized frequency of occurrence. The preference function is generated by multiplying the applicable ratio or ratio function in each preference metric by the frequency of occurrence of the CFS.
E. Based on the particular replacement correction/CFS preference metric as described above in D, a final preference metric is calculated for each replacement correction by multiplying the similarity score of the replacement correction by the sum of the particular replacement correction/CFS preference metrics of all CFSs of the replacement correction.
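A condensed sketch of the ratio-based computation of steps D and E is given below; the logarithmic damping and the way the CFS score and similarity score enter are modelled on steps D(ii)-D(iv) and E, with the handling of the reference cluster simplified for illustration.

```python
import math

def final_preference_scores(freq, cfs_scores, similarity, reference_alt):
    """freq[alt][cfs]: frequency function value of alternative `alt` for a
    CFS; cfs_scores[cfs]: CFS score; similarity[alt]: non-contextual
    similarity score; reference_alt: the reference cluster correction."""
    scores = {}
    for alt, row in freq.items():
        total = 0.0
        for cfs, value in row.items():
            reference = max(freq[reference_alt].get(cfs, 0), 1e-6)
            ratio = value / reference                   # step D(ii)
            damped = math.log1p(ratio)                  # step D(iii): damp large ratios
            total += damped * cfs_scores.get(cfs, 1.0)  # step D(iv): weight by CFS score
        scores[alt] = similarity[alt] * total           # step E
    best = max(scores, key=scores.get)
    return best, scores
```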
Examples illustrating the use of such modified matrices are as follows:
the following input text is provided:
I will be able to tach base with you next week
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
tach
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
teach;touch
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘able to tach’;‘to tach base’
using the functionality described above with reference to sub-stages IIA and IIC above, the occurrence frequency matrices in the internet corpus and the normalized occurrence frequency matrices shown in table 20 are generated for the above alternate cluster correction lists in the above CFS list:
watch 20
Note that, for one feature, both the occurrence frequency of "teach" and the normalized occurrence frequency are greater than the frequency of "touch", but for another feature, both the occurrence frequency of "touch" and the normalized occurrence frequency are greater than the frequency of "teach". For a correct choice of replacement correction, the ratio metric described above with reference to sub-phase IIG is preferably used as described below.
In this example, the non-contextual similarity scores of the replacement cluster corrections are shown in table 21:
TABLE 21
Replacement cluster correction Similarity score
teach 0.94
touch 0.89
It can be seen that the reference cluster is "teach" because it has the highest similarity score. Nonetheless, "touch" is selected based on the final preference score as described above. This is not intuitive, as can be appreciated from the above matrix, which indicates that "teach" has both the highest frequency of occurrence and the highest normalized frequency of occurrence. In this example, the final preference score indicates that "touch" is selected instead of "teach" because the ratio of the frequencies of occurrence for the feature favoring "touch" is much greater than the corresponding ratio for the feature favoring "teach".
F. Optionally, the replacement correction may be filtered based on a comparison of the frequency function value and preference metric of the replacement correction with the frequency function value and preference metric of the reference cluster using one or more of the following decision rules:
1. for at least one feature having a CFS score greater than a predetermined threshold, filtering out replacement corrections having a similarity score lower than the predetermined threshold and having a CFS frequency function less than that of the reference cluster.
2. For at least one feature having a CFS score greater than another predetermined threshold, filtering out replacement corrections having a similarity score lower than the predetermined threshold and having a preference metric less than the predetermined threshold.
3. A replacement correction is filtered out based on the following steps:
a. determining the CFS score for each CFS;
b. for each CFS, determining a CFS frequency function of the reference cluster and the replacement correction, thereby determining whether the reference cluster or the replacement correction has a higher frequency function for that CFS;
c. summing the CFS scores of the replacement corrected CFSs having a higher frequency than the reference cluster;
d. summing the CFS scores of CFSs of reference clusters having higher frequencies than the replacement corrections; and
e. if the sum in c is less than the sum in d, the replacement correction is filtered out.
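Steps a-e above may be implemented, for example, as follows; the dictionary-based data structures are assumptions carried over from the earlier sketches.

```python
def filter_by_cfs_score_sums(alt_freq, ref_freq, cfs_scores):
    """Compare, CFS by CFS, whether the replacement correction or the
    reference cluster has the higher frequency function (steps a-b), sum
    the CFS scores on each side (steps c-d), and report whether the
    replacement correction should be filtered out (step e)."""
    alt_sum = sum(score for cfs, score in cfs_scores.items()
                  if alt_freq.get(cfs, 0) > ref_freq.get(cfs, 0))
    ref_sum = sum(score for cfs, score in cfs_scores.items()
                  if ref_freq.get(cfs, 0) > alt_freq.get(cfs, 0))
    return alt_sum < ref_sum  # True: filter this replacement correction out
```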
The following example illustrates the filtering function as described above.
The following input text is provided:
I am faelling en love
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
faelling en
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
falling on;falling in;feeling on;feeling in
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘am faelling en’;‘faelling en love’;‘am faelling en love’;‘I am faelling en’
using the functionality described above with reference to sub-stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 22 is generated here for the above alternate cluster correction list in the above CFS list:
TABLE 22
All CFSs that are entirely included in other CFSs having at least the lowest threshold frequency of occurrence are eliminated. In this example, the following feature grammars are eliminated:
‘am faelling en’;‘faelling en love’
in this example, the remaining CFSs are feature grammars:
‘am faelling en love’;‘I am faelling en’
in this example, the non-contextual similarity scores of the replacement cluster corrections are shown in table 23:
TABLE 23
Replacement cluster correction Similarity score
falling on 0.89
falling in 0.89
feeling on 0.82
feeling in 0.82
The replacement corrections "falling on", "feeling on" and "feeling in" are filtered out because they have an occurrence frequency of 0 for one of the CFSs.
G. As discussed above with reference to stage III, the replacement corrections remaining after the filtering in F are ranked based on the final preference metric derived as described above in A-E. The replacement correction with the highest final preference score is selected.
H. As discussed above with reference to phase IV, a confidence is assigned to the selected replacement correction. The confidence is calculated based on one or more of the following parameters:
a. the number, type and score of selected CFSs provided in sub-stage IIB above;
b. statistical significance of frequency of occurrence of individual replacement cluster corrections in the context of the CFS;
c. a degree of correspondence in the selection of replacement corrections based on the preference metric for each CFS and the word similarity score for the respective replacement correction;
d. (ii) a non-contextual similarity score of the selected replacement cluster corrections above a predetermined minimum threshold (stage I);
e. a degree of availability of the context data, the degree being indicated by a number of CFSs in the reduced matrix having CFS scores greater than a predetermined minimum threshold and having preference scores exceeding another predetermined threshold.
If the confidence level is greater than a predetermined threshold, the selected replacement correction is implemented without user interaction. If the confidence level is less than the predetermined threshold but greater than a lower predetermined threshold, the selected replacement correction is implemented but user interaction is invited. If the confidence level is less than the lower predetermined threshold, the user is invited to select from a prioritized list of replacement corrections.
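The use of the confidence level to choose among these three modes of operation may be sketched as follows; the numeric threshold values are assumed for illustration only.

```python
def act_on_correction(confidence, best_alternative, ranked_alternatives,
                      high_threshold=0.9, low_threshold=0.6):
    """Decide how to apply the selected replacement correction based on
    the confidence level (threshold values are illustrative)."""
    if confidence >= high_threshold:
        return ("apply", best_alternative)           # no user interaction
    if confidence >= low_threshold:
        return ("apply_and_invite_review", best_alternative)
    return ("present_choices", ranked_alternatives)  # prioritized list for the user
```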
The following example illustrates the use of confidence scores:
the following input text is provided:
He was not feeling wehl when he returned
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
wehl
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
wale、well
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘was not feeling wehl’;‘not feeling wehl when’;‘feeling wehl when he’;‘wehl when he returned’
using the functionality described above with reference to sub-stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 24 is generated here for the above alternate cluster correction list in the above CFS list:
watch 24
The above example illustrates that the choice of "well" instead of "wale" has a high confidence level according to all of the criteria given above in H.
In the following example, the confidence is somewhat lower, because the replacement correction 'back' has a higher frequency of occurrence than 'beach' in the CFS 'bech in the summer', whereas 'beach' has a higher frequency of occurrence than 'back' in the CFSs 'on the bech in' and 'the bech in the'. Based on criterion H(c), the replacement correction 'beach' is selected with an intermediate confidence.
The following input text is provided:
I like to work on the bech in the summer
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
bech
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
beach;beech;back
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘on the bech in’;‘the bech in the’;‘bech in the summer’
using the functionality described above with reference to sub-stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 25 is generated for the above alternate cluster correction list in the above CFS list:
TABLE 25
Based on the criterion h (c), an alternative correction 'beach' is selected with an intermediate confidence.
In the following example, based on the criterion h (a), the confidence is smaller:
the following input text is received:
Exarts are restoring the British Museum′s round reading room
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
Exarts
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
Experts;Exerts;Exits
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘Exarts are’;‘Exarts are restoring’;‘Exarts are restoring the’;‘Exarts are restoring the British’
using the functionality described above with reference to stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 26 is generated for the above alternate cluster correction list in the above CFS list:
watch 26
All CFSs for which all replacement corrections have a frequency of occurrence of 0 are eliminated. In this example, the following feature grammars are eliminated:
‘Exarts are restoring’;‘Exarts are restoring the’;‘Exarts are restoring the British’
in this example, the only remaining CFS is the feature grammar:
‘Exarts are’
as can be seen from the above example, the only CFS that remains after the filtering process is 'Exarts are'. As a result, the confidence is relatively low, because the selection is based on only a single CFS, which is relatively short and includes only one word in addition to the suspect word, and that word is a frequently occurring word.
Referring now to fig. 9, fig. 9 is a simplified flowchart illustrating functionality for context-based scoring and word similarity-based scoring of various alternative corrections that is useful in the misused word and grammar correction functionality of fig. 3, 10 and 11 and in the vocabulary enhancement functionality of fig. 4.
As shown in FIG. 9, the context-based and word similarity-based scoring of individual replacement corrections proceeds in the following general stages:
I. Non-contextual scoring: the individual cluster replacements are scored based on their similarity to the cluster in the input text, in terms of written appearance and pronunciation. This score does not take into account any context outside the given cluster.
II. Contextual scoring: the individual cluster replacements are also scored based on extracted contextual feature sequences (CFSs), which are provided as described above with reference to fig. 5. This scoring includes the following sub-stages:
IIA. A frequency of occurrence analysis is performed, preferably using an internet corpus, for each alternative cluster correction produced by the functionality of fig. 7A or 7B, in the context of the CFSs extracted as described above with reference to fig. 5.
IIB. CFS selection and weighting of the individual CFSs is performed based on the results of the frequency of occurrence analysis of sub-stage IIA. The weighting is also based on the relative intrinsic importance of the individual CFSs. It should be appreciated that some CFSs may be given a weight of 0 and thus not selected. The selected CFSs are preferably given relative weights.
IIC. A frequency of occurrence metric is assigned to each alternative correction in each CFS selected in sub-stage IIB.
IID. A reduced set of replacement cluster corrections is generated based on the results of the frequency of occurrence analysis of sub-stage IIA, the frequency of occurrence metrics of sub-stage IIC, and the CFS selection and weighting of sub-stage IIB.
IIE. The input cluster is selected for use as the reference cluster correction.
IIF. A frequency of occurrence metric is assigned to the reference cluster correction of sub-stage IIE for each CFS selected in sub-stage IIB.
IIG. A ratio metric is assigned to each CFS selected in sub-stage IIB, representing the ratio of the occurrence frequency metric of each alternative correction for that CFS to the occurrence frequency metric assigned to the reference cluster correction of sub-stage IIE.
III. The most preferred replacement cluster correction is selected based on the stage I results and the stage II results.
IV. A confidence score is assigned to the most preferred replacement cluster correction.
A more detailed description of the functions described above in phases II-IV is given below:
with reference to sub-stage IIA, all CFSs including the cluster to be corrected are generated as described above in fig. 5. CFS containing suspected errors (except for errors in the input cluster) are eliminated.
A matrix is generated that indicates the frequency of occurrence in the corpus (preferably the internet corpus) of each alternate correction of the clusters in each CFS. All CFSs for which all replacement corrections have an occurrence frequency of 0 are eliminated. Thereafter, all CFSs included entirely in the other CFSs having at least the lowest threshold occurrence frequency are eliminated.
The following example illustrates the generation of an occurrence frequency matrix:
the following input text is provided:
I lick two play outside a lot
using the functionality described above with reference to fig. 6A, the following clusters are selected for correction:
lick two
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
like to;like two;lick two;lack two;lack true;like true
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘lick two’;‘I lick two’;‘lick two play’;‘I lick two play’;‘lick two play outside’;‘I lick two play outside’;‘lick two play outside a’
using the functionality described above with reference to sub-stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 27 is generated for the above alternate cluster correction list in the above CFS list:
watch 27
All CFSs for which all replacement corrections have a frequency of occurrence of 0 are eliminated. In this example, the following feature syntax is eliminated:
‘lick two play outside a’
thereafter, all CFSs included entirely in other CFSs having at least the minimum threshold occurrence frequency are eliminated. In this example, the following feature grammars are eliminated:
‘lick two’;‘I lick two’;‘lick two play’;‘I lick two play’;‘lick two playoutside’
in this example, the only remaining CFS is the following feature grammar:
′I lick two play outside′.
the resulting matrix is shown in table 28:
watch 28
CFS/replacement cluster correction ‘I lick two play outside’
like to 330
like two 0
lick two 0
lack two 0
lack true 0
like true 0
The above example illustrates the generation of a matrix according to a preferred embodiment of the invention. In this example, it is clear that "like to" is the preferred alternative correction. It will be appreciated that in practice, the selection is often not so straightforward. Thus, in the further examples given below, functionality is provided for making a much more difficult selection in a replacement correction.
Returning to the consideration of sub-phase IIB, optionally, each remaining CFS is assigned a score as described above with reference to FIG. 5. Additionally, CFSs that contain words introduced in earlier correction iterations of the multi-word input and have a confidence less than a predetermined confidence threshold are negatively biased.
In the general case, similarly to sub-stage IIC described above, a normalized frequency matrix is generated, indicating the normalized frequency of occurrence of each CFS in the internet corpus. The normalized frequency matrix is typically generated from the frequency matrix by dividing each CFS frequency by a function of the frequency of occurrence of the associated cluster replacement.
This normalization serves to attenuate the effect of differences in the overall frequency of occurrence of the individual replacement corrections. A suitable normalization factor is based on the overall frequency of occurrence of each replacement correction in the corpus as a whole, irrespective of the particular CFS.
The following example illustrates the generation of a normalized frequency of occurrence matrix:
typically, the following input text is provided by speech recognition:
footprints of a mysterious [hound/hand] said to be six feet tall
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
hound
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
hound;hand;sound
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘a mysterious hound’;‘hound said’
using the functionality described above with reference to sub-phase IIC, the occurrence frequency matrices in the internet corpus and the normalized occurrence frequency matrices as shown in table 29 are generated here for the above alternate cluster correction list in the above CFS list:
watch 29
It can be appreciated from the above example that the word with the highest frequency of occurrence may not necessarily have the highest normalized frequency of occurrence because of the fundamental differences in the entirety of the respective replacement corrections. In the above example, "hound" has the highest normalized frequency of occurrence, and it is apparent from the context of the input text that the correct word is "hound" rather than "hand" which has a higher frequency of occurrence in the internet corpus.
A particular feature of the present invention is the use of a normalized frequency in selecting among the replacement corrections that attenuates the fundamental differences in the ensemble of the individual replacement corrections. It will be appreciated that other measures of frequency of occurrence other than normalized frequency of occurrence may alternatively or additionally be used as a measure. In cases where the frequency of occurrence is relatively low or particularly high, additional or alternative metrics may be beneficial.
It will be appreciated from the discussion that follows that additional functionality is often useful in selecting among various alternative corrections. These functions are described below.
In sub-phase IID, each replacement cluster correction that is less preferred than another replacement cluster correction is eliminated according to the following two metrics:
i. a lower word similarity score than another alternative cluster correction; and
ii. a lower frequency of occurrence for all CFSs, and preferably also a lower normalized frequency of occurrence, than another replacement cluster correction.
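A minimal Python sketch of this elimination rule follows. It assumes the frequency and normalized-frequency matrices are available as dictionaries keyed by (replacement, CFS) pairs; it is an illustration of the rule stated in i and ii above, not the patented implementation of sub-phase IID.

```python
def eliminate_dominated_corrections(replacements, similarity, freq, norm_freq, cfs_list):
    """Drop every replacement cluster correction that is strictly worse than
    some other correction on similarity score and on frequency (and normalized
    frequency) for every CFS, per metrics i and ii above."""
    kept = []
    for a in replacements:
        dominated = any(
            similarity[a] < similarity[b]
            and all(freq[(a, c)] < freq[(b, c)] for c in cfs_list)
            and all(norm_freq[(a, c)] < norm_freq[(b, c)] for c in cfs_list)
            for b in replacements if b != a
        )
        if not dominated:
            kept.append(a)
    return kept
```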
The following example illustrates the elimination of the replacement correction as described above:
the following input text is provided:
I leave on a big house
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
leave on
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
leave in;live in;love in;leave on
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘I leave on a’;‘leave on a big’
using the functionality described above with reference to stage IIE, the occurrence frequency matrices in the internet corpus and the normalized occurrence frequency matrices as shown in table 30 are generated for the above alternate cluster correction list in the above CFS list:
Table 30
In this example, the non-contextual similarity scores of the replacement cluster corrections are shown in table 31:
Table 31
Replacement cluster correction Similarity score
leave in 0.90
live in 0.78
love in 0.67
leave on 1.00
The replacement cluster correction "love in" is eliminated because it has a lower similarity score and a lower frequency of occurrence and a lower normalized frequency of occurrence than "live in". The replacement cluster correction "leave in" is not eliminated at this stage because its similarity score is larger than "live in".
As can be appreciated from the above, the result of the operation of the functionality of sub-phase IID is a reduced frequency matrix, and preferably also a reduced normalized frequency matrix, indicating for each of a reduced plurality of CFSs the frequency of occurrence, and preferably the normalized frequency of occurrence, of each of a reduced plurality of replacement corrections, each replacement correction having a similarity score. The reduced set of replacement cluster corrections is preferably used for all further replacement cluster selection functionality, as seen from the following examples.
For each alternative correction in the reduced frequency matrix and preferably also in the reduced normalized frequency matrix, a final preference measure is generated. One or more of the following substitution metrics may be used to generate a final preference score for each substitution correction.
The term "frequency function" is used below to refer to a function of frequency, normalized frequency, or both.
A. One possible preference metric is the highest frequency of occurrence function of each replacement cluster correction, taken over all of the CFSs in the reduced matrix or matrices. For example, each replacement cluster correction would be scored as follows:
the following input text is provided:
I am vary satisfied with your work
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
vary
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
vary;very
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘am vary’;‘vary satisfied’;‘I am vary satisfied with’
using the functionality described above with reference to sub-phase IIC, the occurrence frequency matrices in the internet corpus and the normalized occurrence frequency matrices shown in tables 32 and 33 are generated for the above alternate cluster correction list in the above CFS list:
Table 32
Table 33
In this example, "very" has the highest frequency of occurrence function, as can be seen from both the frequency of occurrence and the normalized frequency of occurrence.
B. Another possible preference metric is the average frequency of occurrence function of all CFSs for each replacement correction. For example, the individual replacement corrections will be scored as follows:
the following input text is provided:
A while ago the lived 3 dwarfs
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
the
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
the;they;she;there
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘ago the lived’;‘the lived 3’
using the functionality described above with reference to sub-phase IIC, the occurrence frequency matrices in the internet corpus, normalized occurrence frequency matrices, and average occurrence frequency matrices shown in tables 34 and 35 are generated for the above alternate cluster correction list in the above CFS list:
Table 34
Table 35
Note that "then" is selected based on the average frequency of occurrence, although "there" has a CFS whose frequency of occurrence is the largest frequency of occurrence in the matrix.
In this example, the non-contextual similarity scores for the replacement cluster corrections are shown in table 36:
Table 36
Replacement cluster correction Similarity score
the 1.00
they 0.86
she 0.76
there 0.67
Note that the replacement cluster correction with the highest similarity score was not selected.
C. Another possible preference metric is, for each replacement correction, the sum over all CFSs of the frequency of occurrence function for each CFS multiplied by the score of that CFS, as calculated by the functionality described above with reference to fig. 5.
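The three preference metrics A, B and C can be sketched in Python as follows, assuming the frequency function values of a single replacement correction are held in a dictionary keyed by CFS; the function names and the demonstration figures are illustrative only.

```python
def metric_highest(freqs):
    """Metric A: the highest frequency function over all CFSs of a correction."""
    return max(freqs.values())


def metric_average(freqs):
    """Metric B: the average frequency function over all CFSs of a correction."""
    return sum(freqs.values()) / len(freqs)


def metric_cfs_weighted(freqs, cfs_scores):
    """Metric C: the frequency function of each CFS multiplied by that CFS's
    score, summed over all CFSs of a correction."""
    return sum(freqs[cfs] * cfs_scores[cfs] for cfs in freqs)


# Illustrative frequencies of one replacement correction over two CFSs.
freqs = {"am vary satisfied": 80, "I am vary satisfied with": 5}
cfs_scores = {"am vary satisfied": 0.7, "I am vary satisfied with": 1.0}
print(metric_highest(freqs), metric_average(freqs), metric_cfs_weighted(freqs, cfs_scores))
```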
D. Generating a particular replacement correction/CFS preference metric by any one or more, more preferably most, and most preferably all of the following operations on the replacement correction in the reduced one or more matrices, as described above with reference to sub-phases IIE-IIG:
i. the cluster selected for correction from the original input text is selected as the reference cluster.
ii. Generating a modified matrix wherein, in each preference metric, the frequency of occurrence function of each replacement correction for each feature grammar is replaced by the ratio of that frequency of occurrence function to the frequency of occurrence function of the reference cluster.
iii. Further modifying a modified matrix of the type described in ii above by replacing each ratio in each preference metric with a function of that ratio which reduces the computational significance of large differences in the ratio. A suitable such function is a logarithmic function. The purpose of this operation is to reduce the importance of large differences in frequency of occurrence in the final preference score of the most preferred replacement corrections, while maintaining the importance of large differences in frequency of occurrence in the final preference score of the least preferred replacement corrections, thus assisting in eliminating the least preferred replacement corrections.
iv. Additionally modifying a modified matrix of the type described in ii or iii above by multiplying the applicable ratio or ratio function in each preference metric by the appropriate CFS score. This provides emphasis based on correct grammatical usage and on the other factors reflected in the CFS score.
v. Additionally modifying a modified matrix of the type described in ii, iii or iv above by multiplying the applicable ratio or ratio function in each preference metric by a function of a user input uncertainty metric. Some examples of user input uncertainty metrics include: the number of editing actions performed in the word processor on the input word or cluster relative to the editing actions on other words of the document; the timing of writing of the input word or cluster in the word processor relative to the timing of writing of other words of the document; and the timing of utterance of the input word or cluster in a speech recognition input functionality relative to the timing of utterance of other words by the user. The user input uncertainty metric provides an indication of how certain the user was of this choice of words. This step takes the bias calculated with respect to the reference cluster and modifies it by a function of the user's certainty or uncertainty regarding that cluster.
vi. Additionally modifying a modified matrix of the type described in ii, iii or iv above by generating a function of the applicable ratio or ratio function and of the frequency of occurrence and normalized frequency of occurrence. For example, such a preference function is generated by multiplying the applicable ratio or ratio function in each preference metric by the frequency of occurrence of the corresponding CFS.
E. Based on the particular replacement correction/CFS preference metric as described above in D, a final preference metric is calculated for each replacement correction by multiplying the similarity score of the replacement correction by the sum of the particular replacement correction/CFS preference metrics of all CFSs of the replacement correction.
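A simplified Python sketch of operations D and E follows. It uses log(1 + ratio) as one possible ratio-damping function for step iii, omits the user-uncertainty modification of step v and the occurrence-frequency weighting of step vi, and assumes the frequency matrix is a dictionary keyed by (correction, CFS); all names are illustrative.

```python
import math


def final_preference_scores(replacements, reference, similarity, freq, cfs_scores):
    """Sketch of operations D(ii)-D(iv) and E above: for every CFS, replace the
    frequency function of a replacement correction by its ratio to the
    reference cluster's frequency, damp large ratios with a logarithm, weight
    by the CFS score, and multiply the sum by the correction's similarity
    score to obtain a final preference score."""
    scores = {}
    for r in replacements:
        total = 0.0
        for cfs, cfs_score in cfs_scores.items():
            ref_freq = freq[(reference, cfs)]
            alt_freq = freq[(r, cfs)]
            if ref_freq == 0:
                continue  # no evidence for the reference cluster in this CFS
            ratio = alt_freq / ref_freq
            total += math.log(1.0 + ratio) * cfs_score  # logarithm damps large ratios
        scores[r] = similarity[r] * total
    return scores
```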
An example of such a modified matrix is as follows:
the following input text is provided:
I will be able to teach base with you next week
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
teach
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
teach;touch
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘able to teach’;‘to teach base’
using the functionality described above with reference to sub-stages IIA and IIC above, the occurrence frequency matrices in the internet corpus and the normalized occurrence frequency matrices shown in table 37 are generated for the above alternate cluster correction lists in the above CFS list:
Table 37
Note that, for one feature, both the occurrence frequency of "teach" and the normalized occurrence frequency are greater than the frequency of "touch", but for another feature, both the occurrence frequency of "touch" and the normalized occurrence frequency are greater than the frequency of "teach". For a correct choice of replacement correction, the ratio metric described above with reference to sub-phase IIG is preferably used as described below.
In this example, the non-contextual similarity scores of the replacement cluster corrections are shown in table 38:
Table 38
Replacement cluster correction Similarity score
teach 1.00
touch 0.89
It can be seen that the reference cluster is "teach" because it has the highest similarity score. Nonetheless, "touch" is selected based on the last preference score as described above. As can be appreciated from consideration of the above matrix indicating that "teach" has the highest frequency of occurrence and the highest normalized frequency of occurrence, this is not intuitive. In this example, the final preference score indicates that "touch" is selected instead of "teach" because the ratio of the frequency of occurrence of a feature in favor of "touch" is much greater than the ratio of the frequency of occurrence of another feature in favor of "teach".
F. Optionally, the replacement correction may be filtered based on a comparison of the frequency function value and preference metric of the replacement correction with the frequency function value and preference metric of the reference cluster using one or more of the following decision rules:
1. for at least one feature having a CFS score greater than a predetermined threshold, filtering out replacement corrections having a similarity score lower than the predetermined threshold and having a CFS frequency function less than that of the reference cluster.
2. For at least one feature having a CFS score greater than another predetermined threshold, filtering out replacement corrections having a similarity score lower than the predetermined threshold and having a preference metric less than the predetermined threshold.
3. Filtering out replacement corrections according to the following procedure: a. determining the CFS score for each CFS;
b. for each CFS, determining a CFS frequency function of the reference cluster and the replacement correction, thereby determining whether the reference cluster or the replacement correction has a higher frequency function for that CFS;
c. summing the CFS scores of the replacement corrected CFSs having a higher frequency than the reference cluster;
d. summing the CFS scores of CFSs of reference clusters having higher frequencies than the replacement corrections; and
e. if the sum in c is less than the sum in d, the replacement correction is filtered out.
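The summation test of steps a through e above can be sketched as follows; this is an illustration, assuming frequency values keyed by (correction, CFS) and CFS scores keyed by CFS, rather than the definitive implementation.

```python
def survives_cfs_vote(replacement, reference, freq, cfs_scores):
    """Sum the CFS scores of the CFSs in which the replacement correction is
    more frequent than the reference cluster, and of those in which the
    reference cluster is more frequent; the replacement is filtered out if
    the first sum is smaller than the second."""
    favour_replacement = 0.0
    favour_reference = 0.0
    for cfs, score in cfs_scores.items():
        if freq[(replacement, cfs)] > freq[(reference, cfs)]:
            favour_replacement += score
        elif freq[(reference, cfs)] > freq[(replacement, cfs)]:
            favour_reference += score
    return favour_replacement >= favour_reference
```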
The following example illustrates the filtering function as described above.
The following input text is typically provided by a speech recognition function:
I want [two/to/too] items, please.
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
[two/to/too]
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
too;to;two
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘I want two’;‘want two items’
using the functionality described above with reference to stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 39 is generated here for the above alternate cluster correction list in the above CFS list:
Table 39
The replacement corrections "too" and "to" are filtered out because each has a frequency of occurrence of 0 for one of the CFSs, even though it has a high frequency of occurrence for the other CFS. Thus, the replacement correction that remains here is "two".
G. As discussed above with reference to phase III, the replacement corrections retained in the filtering in F are ranked based on the last preference metric derived as described above at a-E. The replacement correction with the highest final preference score is selected.
H. As discussed above with reference to phase IV, a confidence is assigned to the selected replacement correction. The confidence is calculated based on one or more of the following parameters:
a. the number, type and score of selected CFSs provided in sub-stage IIB above;
b. statistical significance of frequency of occurrence of individual replacement cluster corrections in the context of the CFS;
c. a degree of correspondence in the selection of replacement corrections based on the preference metric for each CFS and the word similarity score for the respective replacement correction;
d. (ii) a non-contextual similarity score of the selected replacement cluster corrections above a predetermined minimum threshold (stage I);
e. a degree of availability of the context data, the degree being indicated by a number of CFSs in the reduced matrix having CFS scores greater than a predetermined minimum threshold and having preference scores exceeding another predetermined threshold.
If the confidence level is greater than a predetermined threshold, the selected replacement correction is implemented without user interaction. If the confidence level is less than the predetermined threshold but greater than a lower predetermined threshold, the selected replacement correction is implemented but user interaction is invited. If the confidence level is less than the lower predetermined threshold, the user is invited to select from a prioritized list of replacement corrections.
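The three-way decision just described can be sketched as follows; the threshold values are placeholders for illustration and are not values specified by this description.

```python
def act_on_confidence(confidence, selected, ranked_replacements,
                      high_threshold=0.9, low_threshold=0.6):
    """Apply silently, apply but invite review, or offer a prioritized list,
    depending on where the confidence falls relative to the two thresholds."""
    if confidence > high_threshold:
        return ("apply", selected)                    # correct without user interaction
    if confidence > low_threshold:
        return ("apply_and_invite_review", selected)  # correct, but invite interaction
    return ("offer_choices", ranked_replacements)     # present a prioritized list
```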
The following example illustrates the use of confidence scores:
the following input text is provided:
He was not feeling wale when he returned
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
wale
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
wale;well
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘was not feeling wale’;‘not feeling wale when’;‘feeling wale whenhe’;‘wale when he returned’
using the functionality described above with reference to sub-stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 40 is generated here for the above alternate cluster correction list in the above CFS list:
Table 40
The above example illustrates that the selection of "well" instead of "wale" has a high confidence level according to all of the criteria given in H above.
In the following example, the confidence is somewhat lower because the replacement correction 'back' has a higher frequency of occurrence than 'beach' for the CFS 'beech in the summer', while 'beach' has a higher frequency of occurrence than 'back' for the CFSs 'on the beech in' and 'the beech in the'. Based on criterion H(c), the replacement correction 'beach' is selected with an intermediate confidence.
The following input text is provided:
I like to work on the beech in the summer
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
beech
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
beach;beech;back
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘on the beech in’;‘the beech in the’;‘beech in the summer’
using the functionality described above with reference to stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 41 is generated for the above alternate cluster correction list in the above CFS list:
Table 41
Based on criterion H(c), the replacement correction 'beach' is selected with an intermediate confidence.
In the following example, based on criterion H(a), the confidence is lower still:
the following input text is received:
Exerts are restoring the British Museum′s round reading room
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
Exerts
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
Experts;Exerts;Exits
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘Exerts are’;‘Exerts are restoring’;‘Exerts are restoring the’;‘Exerts are restoring the British’
using the functionality described above with reference to sub-stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 42 is generated for the above alternate cluster correction list in the above CFS list:
Table 42
All CFSs for which all replacement corrections have a frequency of occurrence of 0 are eliminated. In this example, the following feature syntax is eliminated:
‘Exerts are restoring’;‘Exerts are restoring the’;‘Exerts are restoring the British’
In this example, the only remaining CFS is the feature grammar:
‘Exerts are’
As can be seen from the above example, the only CFS that remains after the filtering process is "Exerts are". As a result, the confidence is relatively low, because the selection is based on only a single CFS, which is relatively short and includes only one word in addition to the suspect word, and that word is a frequently occurring word.
The following example illustrates the use of the final preference scoring metric described in stages D and E above.
The following input text is provided:
Some kids don′t do any sport and sit around doing nothing and getting fastso you will burn some calories and get a lot fitter if you exercise.
using the functionality described above with reference to fig. 6B, the following clusters are selected for correction:
fast
using the functionality described above with reference to fig. 7A, the following alternate cluster corrections (partial list) are generated:
fat;fast
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘and getting fast’;‘getting fast so’;‘fast so you’;‘fast so you will’
using the functionality described above with reference to sub-stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 43 is generated here for the above alternate cluster correction list in the above CFS list:
Table 43
In this example, the non-contextual similarity scores of the replacement cluster corrections are shown in table 44.
Table 44
Replacement cluster correction Similarity score
fast 1
fat 0.89
Using the last preference score metric described in stages D and E above, an alternative correction "fat" is selected with low confidence.
Referring now to fig. 10, fig. 10 is a detailed flow chart illustrating the operation of the missing item correction function. The missing item correction function is used to correct missing articles, prepositions, punctuation and other items that have a predominantly grammatical function in the input text. This function preferably operates on the spell-corrected input text output from the spelling correction function of fig. 1.
The identification of suspected missing items is preferably performed in the following manner:
a feature grammar is first generated for the spelling-corrected input text. The frequency of occurrence (FREQ F-G) of each feature grammar in the spelling-corrected input text in a corpus, preferably an Internet corpus, is determined.
The expected frequency of occurrence (EFREQ F-G) of each feature grammar is calculated as follows:
Assume that the feature grammar contains n words, identified as W1-Wn.
Wi denotes the i-th word in the feature grammar.
The expected frequency of occurrence of a given feature grammar is taken to be the highest of the expected frequencies obtained for the divisions of the feature grammar into two consecutive parts following each of the words W1, ..., Wn-1.
The expected frequency for the division of the feature grammar into two consecutive parts following the word Wi may be expressed as follows:
EFREQ F-G with respect to Wi = (FREQ(W1-Wi) * FREQ(Wi+1-Wn)) / (sum of the frequencies of all words in the corpus)
The expected frequency of each feature grammar is thus calculated for all possible divisions of its words into two consecutive parts.
If the ratio of the actual frequency of occurrence of the feature grammar to its expected frequency with respect to Wi is less than a predetermined threshold, the feature grammar is considered suspect with respect to Wi in the following respect: an article, preposition or punctuation mark may be missing between Wi and Wi+1 in the feature grammar.
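The expected-frequency test above can be sketched as follows, assuming a callback freq_of that returns the corpus frequency of a word sequence and a total word count for the corpus; the threshold value and all names are illustrative placeholders.

```python
def expected_frequency(words, freq_of, total_words, i):
    """Expected frequency of the feature grammar W1..Wn when divided after the
    i-th word (1-based): FREQ(W1..Wi) * FREQ(Wi+1..Wn) / total corpus words."""
    left = " ".join(words[:i])
    right = " ".join(words[i:])
    return freq_of(left) * freq_of(right) / total_words


def suspect_missing_item_positions(words, freq_of, total_words, threshold=0.1):
    """Return positions i at which an item (article, preposition, punctuation)
    may be missing between words[i-1] and words[i], i.e. where the actual
    frequency of the feature grammar is far below its expected frequency."""
    actual = freq_of(" ".join(words))
    suspects = []
    for i in range(1, len(words)):
        expected = expected_frequency(words, freq_of, total_words, i)
        if expected > 0 and actual / expected < threshold:
            suspects.append(i)
    return suspects
```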
Suspect word junctions between two consecutive words in the spelling-corrected input text are preferably selected for correction by attempting to find word junctions surrounded by the largest amount of non-suspect contextual data. Preferably, the word junction having the longest non-suspect word junction sequence or sequences in its vicinity is selected.
One or preferably more alternative insertions are generated for each such word junction, based on a predefined set of possibly missing punctuation marks, articles, prepositions, conjunctions and other items (usually excluding nouns, verbs and adjectives).
Hereinafter, a context-based and word similarity-based score for each replacement insertion is preferably provided based on a corrected replacement scoring algorithm as described above with reference to FIG. 9.
The following examples are illustrative:
the following input text is provided:
I can′t read please help me
using the functionality described above with reference to fig. 5, the following feature syntax (partial list) is generated:
I can′t read;can′t read please;read please help;please help me
using the functionality described above, a frequency of occurrence matrix in the Internet corpus is generated for the above list of feature grammars, which is typically shown in Table 45:
TABLE 45
Feature grammar Frequency of occurrence
I can′t read 5600
can′t read please 0
read please help 55
please help me 441185
For each feature grammar, the expected frequency of occurrence with respect to each word Wi in the feature grammar is calculated according to the following expression:
EFREQ F-G with respect to Wi = (FREQ(W1-Wi) * FREQ(Wi+1-Wn)) / (sum of the frequencies of all words in the corpus)
Exemplary results for some of these calculations are shown in tables 46 and 47:
TABLE 46
Table 47
Feature grammar Frequency of occurrence
read 157996585
please help 1391300
As seen from the above results, the actual frequency of occurrence of each feature grammar is less than its expected frequency of occurrence. This indicates that there is a suspicion of the absence of items such as punctuation.
A list of alternative insertions following the word "read" is generated. The list preferably comprises items from a predetermined list of punctuation marks, articles, conjunctions and prepositions. In particular, it includes a period.
The partial list of alternatives is:
‘read please’;‘read. Please’;‘read of please’;‘read a please’
using the functionality described above with reference to fig. 5, the following CFSs are generated:
‘I can’t read[?]’;‘read[?]please help’;‘[?]please help me’
using the functionality described in stage IIA of fig. 9, the frequency of occurrence matrix in the internet corpus shown in table 48 is generated for the above alternate cluster correction list and the above CFS list:
when 'is included in a cluster, the CFS occurrence frequency including the cluster with' is respectively retrieved for the text before and after 'i.e., the feature syntax "can't read.
Table 48
Note: when calculating the frequency of occurrence of a feature grammar in a corpus, a '.' at the beginning of the feature grammar is ignored. For example, the frequency of ". Please help me" is the same as the frequency of "Please help me".
Using the functionality described in stages D and E of fig. 9 above, the replacement correction "read. Please" is selected based on the final preference metric, and the corrected input text is:
I can't read. Please help me.
The following example illustrates the functionality of adding missing prepositions.
The following input text is provided:
I sit the sofa
using the functions described below, the following clusters were selected for correction:
‘sit the’
using the functions described below, the following alternate cluster corrections (partial list) were generated:
sit on the;sit of the;sit the
using the functionality described above with reference to fig. 5, the following CFSs are generated:
‘I sit the’;‘sit the sofa’
using the functionality described in stage IIA with reference to fig. 9, the frequency of occurrence matrix in the internet corpus shown in table 49 is generated for the above alternate cluster correction list and the above CFS list:
Table 49
CFS/replacement cluster correction ‘I sit[?]the’ ‘sit[?]the sofa’
sit on the 26370 7400
sit of the 0 0
sit the 2100 0
Using the functionality described in stages IID and IIE of fig. 9, the replacement correction "sit on the" is selected based on the final preference metric, and the corrected input text is:
I sit on the sofa.
Referring now to fig. 11, fig. 11 is a detailed flowchart illustrating the operation of the superfluous item correction function. The superfluous item correction function is used to correct superfluous articles, prepositions, punctuation and other items that have a predominantly grammatical function in the input text. This function preferably operates on the spell-corrected input text output from the spelling correction function of fig. 1.
It should be appreciated that the functionality of fig. 11 may be combined with the functionality of fig. 10, or alternatively performed in parallel therewith, prior to it, or after its operation.
The identification of suspicious redundant items is preferably performed in the following manner:
a search is performed on the spelling-corrected input text to identify items that belong to a predefined set of possibly superfluous punctuation, articles, prepositions, conjunctions, and other items (typically excluding nouns, verbs, or adjectives).
For each such item, feature grammars are generated for all portions of the input text, following spelling correction, misused-word correction and grammar correction, that contain that item. A frequency of occurrence is calculated for each such feature grammar and for the corresponding feature grammar in which the item is omitted.
An item is considered suspect if the frequency of occurrence of the feature grammar in which the item is omitted exceeds the frequency of occurrence of the corresponding feature grammar in which the item is present.
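The suspicion test just described can be sketched as follows. The set of possibly superfluous items and the aggregation of the comparison over all generated feature grammars are illustrative simplifications; a per-grammar comparison works similarly, and freq_of stands in for a corpus frequency lookup.

```python
# Illustrative (not exhaustive) set of items that may be superfluous.
POSSIBLY_SUPERFLUOUS = {",", ".", ";", "a", "an", "the", "of", "to", "on", "and"}


def is_suspect_superfluous(grams_with_item, grams_without_item, freq_of):
    """An item is suspect if the feature grammars with the item omitted occur
    more often in the corpus than the corresponding feature grammars in which
    the item is present."""
    freq_with = sum(freq_of(g) for g in grams_with_item)
    freq_without = sum(freq_of(g) for g in grams_without_item)
    return freq_without > freq_with
```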
Suspect items in the spelling-corrected input text after the misused word and grammar correction are preferably selected for correction by attempting to find items surrounded by the largest amount of non-suspect context data. Preferably, the item having the longest non-suspect word sequence or sequences in its vicinity is selected.
A possible item deletion is made for each suspect item. The context-based and word similarity-based scores for the various replacements (i.e., deleted items or not deleted items) are provided below, preferably based on the corrected replacement scoring algorithm described above with reference to fig. 9.
The following examples are illustrative.
The following input text is provided:
It is a nice,thing to wear.
the input text is searched to identify any items belonging to a predetermined list of common, redundant items, such as punctuation, prepositions, conjunctions, and articles.
In this example, commas "," are identified as belonging to such a list.
Using the functionality described above with reference to fig. 5, the feature grammars including the comma "," shown in Table 50 are generated, together with the corresponding feature grammars in which the comma is omitted (partial list).
Table 50
Feature grammar with commas Comma-free feature grammar
is a nice,thing is a nice thing
a nice,thing to a nice thing to
nice,thing to wear nice thing to wear
Using the functionality as described above, a frequency of occurrence matrix in the internet corpus is generated for the above list of feature grammars, typically as shown in table 51:
Table 51
As shown in the above matrix, the frequency of occurrence of the feature grammars in which the "," is omitted exceeds the frequency of occurrence of the corresponding feature grammars in which the "," is present. Thus, the "," is considered a suspect superfluous item.
The possible deletion of commas is considered based on the following context-based scores for comma-preserving and comma-omitting substitutions:
‘nice,’;‘nice’
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘a nice,’;‘nice,thing’;‘is a nice,’;‘a nice,thing’;‘nice,thing to’
using the functionality described above with reference to stage IIA of fig. 9, the frequency of occurrence matrix in the internet corpus shown in table 52 is generated for the above alternate cluster correction list in the above CFS list.
Table 52
All CFSs that are entirely included in the other CFSs having at least the minimum threshold frequency of occurrence are eliminated. For example, the following feature syntax is eliminated:
‘a nice,’;‘nice,thing’
in this example, the remaining CFSs are feature grammars:
‘is a nice,’;‘a nice,thing’;‘nice,thing to’
the replacement correction "nice" without comma is selected using the last preference score described in stages D and E of fig. 9 above. The input text after comma deletion is:
It is a nice thing to wear.
the following example illustrates the function of removing the superfluous article.
The following input text is provided:
We should provide them a food and water.
using the functionality described above with reference to fig. 11, the following clusters are selected for correction:
a food
using the functionality described above with reference to fig. 11, the following alternative cluster corrections (partial list) are generated:
a food;food
using the functionality described above with reference to fig. 5, the following CFSs (partial list) are generated:
‘provide them a food’;‘them a food and’;‘a food and water’
using the functionality described above with reference to sub-stage IIA, the frequency of occurrence matrix in the internet corpus shown in table 53 is generated here for the above alternate cluster correction list in the above CFS list:
Table 53
Using the scoring functionality described in fig. 9, the replacement correction "food" is selected based on the final preference metric, and the corrected input text is:
We should provide them food and water.
referring now to fig. 12, fig. 12 is a simplified block diagram illustration of a system and functionality for computer-assisted language translation and generation, constructed and operative in accordance with a preferred embodiment of the present invention. As shown in fig. 12, input text is provided to language generation module 200 from one or more sources, including but not limited to:
a sentence search function 201 that assists a user in constructing a sentence by enabling the user to enter a query comprising several words and receiving a complete sentence containing the words;
a machine text generation function 202 that generates natural language statements from a machine representation system such as a knowledge base or logical form;
a word processor function 203, which may generate any suitable text, preferably a portion of a document, such as a sentence;
a machine translation function 204 that converts text in a source language to text in a target language and is capable of providing a plurality of alternate translated texts, phrases and/or words in the target language that can be processed by the language generation module as alternate input texts, alternate phrases and/or alternate words;
a speech-to-text conversion function 205 that converts speech to text and is capable of providing a plurality of replacement words that can be processed by the language generation module as input text with a replacement for each word;
an optical character recognition function 206 that converts characters into text and is capable of providing multiple alternatives for each word that can be processed by the language generation module as input text with alternatives for each word; and
any other text source 210, such as instant messages or text transmitted over the internet.
The language generation module 200 preferably includes statement retrieval functionality 212 and statement generation functionality 214.
One particular feature of the present invention is that the sentence retrieval function 212 interacts with a stem-to-sentence index 216 using an internet corpus 220.
The use of an internet corpus is important: it provides an extremely large number of sentences, resulting in a highly robust language production function.
An internet corpus is a large representative sample of natural language text, typically collected from the world wide web by crawling the internet and collecting text from web pages. Preferably, dynamic text is also collected, such as chat transcripts, text from web forums and text from blogs. The collected text is used to accumulate statistics on natural language text. The size of the internet corpus may be, for example, one trillion (1,000,000,000,000) words or several trillion words, as compared with the more common corpus sizes of up to two billion words. A smaller web sample, such as a web corpus of about 10 billion words, is much smaller, amounting to less than one percent of the web text indexed by a search engine such as GOOGLE. The present invention can work with a web sample such as a web corpus, but preferably uses much larger web samples for the text generation task.
Preferably, the internet corpus is used in one of two ways:
one or more internet search engines are used by using the modified input text as a search query. Statements including words contained in the search query may be extracted from the search results.
The stem-to-sentence index 216 is built over time by crawling and indexing the internet. Preferably, this is done by: inflected words that appear in an internet corpus are narrowed down to their respective stems and all sentences that include words with such stems in the corpus are listed. The stem-to-sentence index and the search query may be based on selectable portions of the internet and may be identified using these selected portions. Similarly, portions of the Internet may be excluded or appropriately weighted to correct for anomalies between Internet usage and general language usage. In this way, websites that are reliable in language usage (such as news and government websites) may be given greater weight than other websites (such as chat or user forums).
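A minimal sketch of such a stem-to-sentence index follows, using a toy two-sentence corpus and a deliberately crude stemming function in place of a real stemmer and a real crawled corpus; all names are illustrative.

```python
from collections import defaultdict


def build_stem_to_sentence_index(sentences, stem):
    """Reduce every word of every corpus sentence to its stem and list the
    sentence under each stem. `stem` stands in for any stemming function."""
    index = defaultdict(set)
    for sentence_id, text in enumerate(sentences):
        for word in text.split():
            index[stem(word)].add(sentence_id)
    return index


def retrieve_sentences(index, sentences, stems):
    """Return all corpus sentences containing every requested stem."""
    if not stems:
        return []
    hits = set.intersection(*(index.get(s, set()) for s in stems))
    return [sentences[i] for i in sorted(hits)]


# Tiny illustrative corpus and a crude stand-in stemmer.
corpus = ["If you have any problems, please call us",
          "Please call me if there is a problem"]
crude_stem = lambda w: w.lower().strip(",.").rstrip("s")
index = build_stem_to_sentence_index(corpus, crude_stem)
print(retrieve_sentences(index, corpus, ["problem", "call"]))
```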
Preferably, the input text is first provided to the sentence retrieval function 212. The operation of statement retrieval function 212 is described below with additional reference to FIG. 13.
The sentence retrieval function 212 is used to divide the input text into independent phrases which are then independently processed in the sentence generation module 214. Stems are generated for all words in each independent phrase. Alternatively, a stem is generated for some or all of the words in each independent phrase, and in this case the words themselves are used in a word-to-sentence index to retrieve sentences from an internet corpus.
The stems are then classified as mandatory or optional. An optional stem is a stem of an adjective, adverb, article, preposition, punctuation mark or other item that has a predominantly grammatical function in the input text, as well as of items in a predefined list of alternative words. Mandatory stems are all stems that are not optional stems. The optional stems may be ranked according to their importance in the input text.
For each independent phrase, stem-to-sentence index 216 is used to retrieve all sentences in internet corpus 220 that include all stems.
For each independent phrase, if the number of sentences retrieved is less than a predetermined threshold, then all sentences including all mandatory stems are retrieved in the internet corpus 220 using the stem to sentence index 216.
For each independent phrase, if the number of retrieved sentences that include all the mandatory stems is less than another predetermined threshold, then a stem replacement generator is used to generate a replacement for all the mandatory stems, as described below with reference to FIG. 15.
Thereafter, for each independent phrase, all sentences that include as many mandatory stems as possible, but not less than one mandatory stem, and also include replacements for all remaining mandatory stems are retrieved in the Internet corpus 220 using the stem-to-sentence index 216.
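The staged retrieval described in the preceding paragraphs can be sketched as follows; retrieve(stems) is assumed to return all corpus sentences containing the given stems, replacements maps each mandatory stem to its generated replacement stems, and the min_hits threshold is a placeholder. This is an illustration, not the patented retrieval functionality.

```python
from itertools import combinations, product


def retrieve_for_phrase(retrieve, all_stems, mandatory_stems, replacements, min_hits=50):
    """Staged retrieval sketch: (1) all stems, (2) mandatory stems only,
    (3) as many mandatory stems as possible, but at least one, with the
    remaining mandatory stems satisfied by their generated replacements."""
    sentences = retrieve(all_stems)
    if len(sentences) >= min_hits:
        return sentences
    sentences = retrieve(mandatory_stems)
    if len(sentences) >= min_hits:
        return sentences
    for keep in range(len(mandatory_stems) - 1, 0, -1):
        results = []
        for kept in combinations(mandatory_stems, keep):
            relaxed = [s for s in mandatory_stems if s not in kept]
            # Every relaxed mandatory stem must be covered by one of its replacements.
            for combo in product(*(replacements.get(s, []) for s in relaxed)):
                results.extend(retrieve(list(kept) + list(combo)))
        if results:
            return results
    return sentences
```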
The output of the sentence retrieval function 212 is preferably as follows:
an independent phrase;
for each independent phrase:
mandatory and optional stems and their rankings;
sentences retrieved from internet corpus 212.
The above output of the statement retrieval function 212 is provided to a statement generation function 214. The operation of the statement generation function 214 is described below with additional reference to fig. 14A and 14B.
For each independent phrase, a simplification of the statements retrieved from Internet corpus 212 is performed as described below:
as shown in fig. 14A, all words in a sentence retrieved from an internet corpus are first classified as mandatory or non-required, preferably using the same criteria for classifying stems in independent phrases. Unnecessary words are deleted unless their stem appears in the corresponding independent phrase or replaces one of the stems.
Phrases are extracted from all of the sentences using standard parsing functionality. Phrases that do not include any stem that appears in the corresponding independent phrase, or any replacement stem, are deleted.
The thus simplified sentences resulting from the above steps are grouped into groups having at least a predetermined similarity for each independent phrase, and the number of simplified sentences in each group is counted.
As shown in fig. 14B, each such group is ranked using the following criteria:
A. the number of simplified sentences contained therein;
B. degree of matching of stems of words in the group with stems in independent phrases and their substitutions;
C. the group includes the degree of words that do not match the words in the independent phrase and their alternatives.
A suitable composite ranking based on criteria A, B and C is preferably provided.
Groups with ranks below a predetermined threshold, which are individually obtained according to all criteria A, B and C, are eliminated. In addition, groups ranked lower than another according to all criteria A, B and C are eliminated.
The remaining groups are concatenated to correspond to the input text and are preferably presented to the user with an indication of the ranking in the order of their weighted composite ranking.
If the composite rating of the highest rated group is greater than a predetermined threshold, it is confirmed for automatic text generation.
Referring now to fig. 15, fig. 15 is a simplified flow diagram illustrating functionality for generating stem replacements useful in the functionality of fig. 12 and 13.
As shown in fig. 15, for each stem, a plurality of alternatives are first generated in the following manner:
a plurality of words similar to each stem is retrieved from the dictionary based on a written appearance expressed in character string similarity and based on pronunciation or phonetic similarity. This functionality is known and is free software available on the Internet, such as GNU Aspell and GoogleGspell. The retrieved and prioritized words provide a first plurality of alternatives.
Additional alternatives may be generated by using rules based on known alternative usages and accumulated user input. For example, u → you, r → are, Im → I am.
A plurality of words lexically related to the stem, for example as synonyms, supersets or subsets, is retrieved from a synonym dictionary or other lexical database, such as Princeton WordNet, which is freely available on the internet.
One particular feature of the preferred embodiment of the present invention is the use of contextual information such as CFS (and more specifically such as feature syntax) to generate the replacement. Stems that often appear in the same context can be effective replacements. Frequently occurring word combinations, such as CFSs, and more particularly, feature grammars, may be retrieved from existing corpus, such as internet corpus.
Where the input text is automatically generated by an external system such as an optical character recognition, speech to text or machine translation system, further alternatives may be received directly from such a system. These additional alternatives are typically generated during operation of such systems. For example, in a machine translation system, alternate translations of foreign language words may be provided to the system for use as alternates.
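The combination of the replacement sources described above can be sketched as follows. The difflib matcher merely approximates string similarity and stands in for Aspell/GSpell-style lookup, the rewrite rules and vocabulary are illustrative, and no real spelling or WordNet API is called.

```python
import difflib

# Illustrative rewrite rules of the kind mentioned above (u -> you, r -> are).
REWRITE_RULES = {"u": ["you"], "r": ["are"], "im": ["i am"]}


def generate_stem_replacements(stem, vocabulary, lexically_related=None, external=None):
    """Combine replacement sources: written-form similarity against a
    vocabulary, rewrite rules, lexically related words (e.g. from WordNet)
    and alternatives supplied by an external system such as machine
    translation, OCR or speech recognition."""
    candidates = difflib.get_close_matches(stem, vocabulary, n=5, cutoff=0.75)
    candidates += REWRITE_RULES.get(stem.lower(), [])
    candidates += list(lexically_related or [])
    candidates += list(external or [])
    seen, unique = set(), []
    for c in candidates:
        if c != stem and c not in seen:  # drop duplicates and the stem itself
            seen.add(c)
            unique.append(c)
    return unique


print(generate_stem_replacements("proble", ["problem", "probe", "parable", "call"],
                                 lexically_related=["issue", "trouble"]))
```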
The following examples illustrate the functionality of FIGS. 12-15:
the following input text is received from the word processor or machine translation function:
Be calling if to problem please
in this case, the input text is composed of a single independent phrase. The stem generation and classification of the mandatory/optional stems provides the following results:
Mandatory stems: call, if, problem, please
Optional stems: be, to
Some, but not all, of the statements, corresponding simplified statements, groups of simplified statements, and group level information retrieved from the internet corpus for the above results are given in the following tables.
In this example, the following hierarchical process is used, it being understood that the invention is not limited to using this process, which is only one example:
the stem is weighted to indicate the importance of the word in the language. For stems in independent phrases, the weight is equal to 1 if the stem is mandatory, and equal to or less than 1 if the stem is optional.
In the table, a weight is indicated in parentheses after each stem. For example, "you (0.5)" means that the stem 'you' has an importance weight of 0.5.
A positive match rating (corresponding to criterion B (fig. 14B)) is calculated, equal to the sum of the above weights of the stems that occur both in the independent phrase and in the corresponding simplified sentence group, divided by the sum of the weights of all stems occurring in the independent phrase.
A negative match rating (corresponding to criterion C (fig. 14B)) is calculated, equal to 1 minus the sum of the above weights of the stems that appear in the corresponding simplified sentence group but not in the independent phrase, divided by the sum of the weights of all stems appearing in the corresponding simplified sentence group.
A composite rating (fig. 14B) is calculated based on:
composite rating is a function of the group count multiplied by the weighted sum of positive and negative matching ratings.
More specific examples are given by the following expressions, it being understood that the present invention is not limited to the above general expressions or the following specific expressions:
composite grading (group count) (+ 0.8 positive match grading +0.2 negative match grading)
Based on the composite ranking, a second group is selected.
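The rating used in this example can be sketched as follows; the group count, stems and weights in the demonstration are illustrative values, not those of the example's tables, and replacement stems are assumed to have already been merged into the set of phrase stems.

```python
def composite_rating(group_count, phrase_stems, group_stems, weights):
    """Positive and negative match ratings computed from the stem weights,
    combined as group_count * (0.8 * positive + 0.2 * negative)."""
    w = lambda s: weights.get(s, 1.0)
    common = phrase_stems & group_stems
    extra = group_stems - phrase_stems
    positive = sum(w(s) for s in common) / sum(w(s) for s in phrase_stems)
    negative = 1.0 - sum(w(s) for s in extra) / sum(w(s) for s in group_stems)
    return group_count * (0.8 * positive + 0.2 * negative)


# Illustrative values only.
print(composite_rating(12,
                       {"call", "if", "problem", "please"},
                       {"call", "if", "problem", "please", "you"},
                       {"call": 1, "if": 1, "problem": 1, "please": 1, "you": 0.5}))
```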
From the above, it will be appreciated that the invention as described above with reference to fig. 12-15 is capable of entering text as follows:
Be calling if to problem please
the following statements are converted:
If you have any problems, please call
although the statement does not appear in this precise form in statements retrieved from the internet corpus.
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and subcombinations of the various features described and illustrated hereinabove, as well as modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not in the prior art.

Claims (156)

1. A computer-assisted language generation method, comprising:
a sentence retrieval step of operating based on an input text containing words to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text;
a sentence generation step operating using a plurality of sentences retrieved from the Internet corpus by the sentence retrieval step to generate at least one correct sentence expressing the input text,
wherein the sentence generation step comprises:
a sentence simplification step of simplifying the sentences retrieved from the internet corpus;
a simplified sentence grouping step of grouping similar simplified sentences provided by the sentence simplification step; and
a simplified sentence group classification step of classifying the group of the similar simplified sentences;
a substitution generation step of generating a text-based representation based on an input sentence, the text-based representation providing a plurality of substitutions for each of a plurality of words in the sentence;
a selection step for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
a correction generation step of providing a correction output based on the selection made by the selection step;
wherein the selecting step comprises a context-based scoring step for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
2. The method of claim 1 and wherein the statement retrieval step comprises:
an independent phrase generating step for dividing the input text into one or more independent phrases;
a stem generation and classification step for operating on each independent phrase to generate a stem appearing in the word and assigning an importance weight thereto; and
and a replacement stem generation step of generating a replacement stem corresponding to the stem.
3. The method of claim 2, and further comprising a stem-to-sentence index that interacts with the internet corpus to retrieve the plurality of sentences that contain words corresponding to the words in the input text.
4. A method according to claim 1 and wherein said simplified sentence set ranking step operates using at least some of the following criteria:
A. the number of simplified sentences contained in the group;
B. the degree of correspondence of stems of words in the group with stems in independent phrases and their substitutions;
C. the set includes a degree of words that do not correspond to the words and their substitutions in the independent phrase.
5. The method of claim 4 and wherein said simplified sentence set ranking step operates using at least a portion of the following process:
defining a weight of the stem to indicate an importance of the word in the language;
calculating a positive matching grade corresponding to the criterion B;
calculating a negative matching grade corresponding to the criterion C;
a composite ranking is calculated based on:
the number of simplified sentences contained in the group, corresponding to criterion A;
the positive match is ranked; and
the negative match is ranked.
6. The method according to any one of the preceding claims, and further comprising:
a machine translation step for providing the input text.
7. A method according to claim 6 and wherein said machine translation step provides a plurality of alternatives corresponding to words in said input text and said sentence retrieval step is for retrieving from said Internet corpus a plurality of sentences including words corresponding to said alternatives.
8. The method according to any of the preceding claims 1-5, wherein the language generation comprises text correction.
9. The method according to any of the preceding claims 1-5, and further comprising:
a sentence searching step for providing the input text based on a query word input by a user.
10. The method according to any of the preceding claims 1-5, and further comprising:
a speech to text conversion step for providing the input text.
11. The method of claim 1, wherein the selecting step is for making the selection based on at least one of the following corrective functions:
spelling correction;
correcting misused words;
correcting grammar; and
the vocabulary is enhanced.
12. The method of claim 1, wherein the selecting step is for making the selection based on at least two of the following correction functions:
spelling correction;
correcting misused words;
correcting grammar; and
the vocabulary is enhanced.
13. A method according to claim 12 and wherein said selecting step is operative to make said selection based on at least one of the following time series of corrections:
spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement; and
misused word correction and grammar correction are performed prior to vocabulary enhancement.
14. The method of claim 1 and wherein:
providing the input sentence by one of the following steps:
a word processing step;
a machine translation step;
a step of converting voice into text;
an optical character recognition step; and
an instant message transmission step; and is
The selecting step is for making the selection based on at least one of the following corrective functions:
correcting misused words;
correcting grammar; and
the vocabulary is enhanced.
15. A method according to claim 1 and wherein said correction generation step comprises a corrected language input generation step for providing a corrected language output based on the selection made by said selection step without requiring user intervention.
16. A method according to any of claims 11-14 and wherein said grammar correction step comprises at least one of punctuation correction, verb inflection change correction, singular/plural correction, article correction and preposition correction.
17. The method according to any of claims 11-14 and wherein said syntax correction step comprises at least one of a substitution correction step, an insertion correction step and an omission correction step.
18. A method according to claim 1 and wherein said context-based scoring step is also used to rank said plurality of alternatives based at least in part on normalized CFS frequency of occurrence in said internet corpus.
19. The method of claim 1, and further comprising:
at least one of the following steps:
a spelling correction step;
correcting the misused words;
a grammar correction step; and
a vocabulary enhancement step; and
a contextual feature sequence step that works in conjunction with at least one of the spelling correction step, the misused word correction step, the grammar correction step and the vocabulary enhancement step and uses an internet corpus.
20. A method according to claim 19 and wherein said grammar correction step comprises at least one of punctuation correction, verb inflection change correction, singular/plural correction, article correction and preposition correction.
21. The method according to claim 19 and wherein said syntax correction step comprises at least one of a substitution correction step, an insertion correction step and an omission correction step.
22. The method according to claim 19, and comprising:
at least two of the following steps:
the spelling correction step;
correcting the misused words;
the grammar correcting step; and
the vocabulary is enhanced, and
wherein the contextual feature sequence step works in conjunction with at least two of the spelling correction step, the misused word correction step, the grammar correction step and the vocabulary enhancement step and uses an internet corpus.
23. The method according to claim 19, and comprising:
at least three of the following steps:
the spelling correction step;
correcting the misused words;
the grammar correcting step;
the vocabulary is enhanced, and
wherein the contextual feature sequence step works in conjunction with at least three of the spelling correction step, the misused word correction step, the grammar correction step and the vocabulary enhancement step and uses an internet corpus.
24. The method of claim 19, and comprising
The spelling correction step;
correcting the misused words;
the grammar correcting step; and
the vocabulary is enhanced, and
wherein the contextual feature sequence step works in conjunction with the spelling correction step, the misused word correction step, the grammar correction step and the vocabulary enhancement step and uses an internet corpus.
25. A method according to claim 19 and wherein said correction generation step comprises a corrected language generation step for providing a corrected language output based on a selection made by said selection step without requiring user intervention.
26. The method of claim 1, wherein,
the selecting step is further based at least in part on relationships between selected ones of the plurality of alternatives for at least some of the plurality of words in the language input.
27. The method of claim 26 and wherein the language input comprises at least one of an input sentence and an input text.
28. A method according to claim 26 and wherein said language input is speech and said producing step converts said language input in speech form into a text-based representation providing a plurality of alternatives for a plurality of words in said language input.
29. The method according to claim 26, and wherein,
the language input is at least one of:
an input text;
an output of an optical character recognition step;
an output of a machine translation step; and
an output of a word processing step, and
the generating step converts the language input in text form into a text-based representation that provides a plurality of alternatives for a plurality of words in the language input.
30. A method according to claim 26 and wherein said selecting step is operative to make said selection based on at least two of the following correcting steps:
spelling correction;
misused word correction;
grammar correction; and
vocabulary enhancement.
31. A method according to claim 30 and wherein said selecting step is operative to make said selection based on at least one of the following orderings of correction in time:
spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement; and
misused word correction and grammar correction prior to vocabulary enhancement.
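By way of non-limiting illustration only, the ordering of corrections in time recited in claim 31 can be sketched as a simple pipeline in which spelling correction precedes misused word correction and grammar correction, which in turn precede vocabulary enhancement. The placeholder correction functions below are hypothetical stand-ins for the corpus-based correction steps.

# Illustrative sketch of the correction ordering of claim 31. Each stage is a
# hypothetical placeholder; in the described method each stage would use the
# internet corpus rather than fixed string replacements.

def correct_spelling(text):
    return text.replace("teh", "the")                 # placeholder

def correct_misused_words(text):
    return text.replace("there car", "their car")     # placeholder

def correct_grammar(text):
    return text.replace("car are", "car is")          # placeholder

def enhance_vocabulary(text):
    return text.replace("nice", "well maintained")    # placeholder

def correction_pipeline(text):
    # Ordering matters: later stages assume earlier errors are already fixed.
    for stage in (correct_spelling, correct_misused_words,
                  correct_grammar, enhance_vocabulary):
        text = stage(text)
    return text

if __name__ == "__main__":
    print(correction_pipeline("I parked there car in teh garage"))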
32. A method according to claim 26 and wherein said language input is speech and said selecting step is operative to make said selection based on at least one of the following corrective functions:
misused word correction;
grammar correction; and
vocabulary enhancement.
33. A method according to claim 26 and wherein said selecting step is operative to make said selection by performing at least two of the following functions:
selecting a first set or combination of words comprising fewer words than all of the plurality of words used for initial selection in the language input;
then, ordering the elements of the first word set or word combination to establish a priority of selection; and
thereafter, when selecting among a plurality of alternatives for elements of the first set of words, selecting other words, but not all of the plurality of words, as context to influence the selection.
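By way of non-limiting illustration only, the staged selection recited in claim 33 might proceed as sketched below: a small set of words is chosen for initial selection, its elements are ordered by priority, and each element is then resolved while the remaining words serve only as context. The scoring function and data are hypothetical stand-ins for the corpus-based scoring.

# Illustrative sketch of the staged selection of claim 33; names and data are hypothetical.

KNOWN_PAIRS = {("their", "car"), ("car", "is")}   # toy stand-in for corpus evidence

def context_score(candidate, context_words):
    # Hypothetical scorer; the described method would instead query an internet
    # corpus for the frequency of CFSs built around the candidate.
    return sum(1 for w in context_words if (candidate, w) in KNOWN_PAIRS)

def staged_selection(words, alternatives, priority):
    """words: the input words; alternatives: {index: [candidate words]};
    priority: indices of the words chosen for initial selection, highest priority first."""
    resolved = list(words)
    for idx in priority:                              # resolve in priority order
        context = [w for j, w in enumerate(resolved) if j != idx]
        best = max(alternatives.get(idx, [words[idx]]),
                   key=lambda c: context_score(c, context))
        resolved[idx] = best                          # later selections see this choice
    return resolved

if __name__ == "__main__":
    print(staged_selection(["there", "car", "is", "red"],
                           {0: ["there", "their", "they're"]},
                           priority=[0]))             # -> ['their', 'car', 'is', 'red']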
34. A method according to claim 26 and wherein said selecting step is operative to make said selection by performing the following functions:
when selecting an element having at least two words, evaluating each of the plurality of alternatives for each of the at least two words in conjunction with each of the plurality of alternatives for another of the at least two words.
35. A method according to claim 26 and wherein said correction generation step comprises a corrected language input generation step for providing a corrected language output based on the selection made by said selection step without requiring user intervention.
36. The method of claim 1, and further comprising:
a misused word suspicion step for evaluating at least most words in a linguistic input based on their suitability in the context of the linguistic input; and
a suspected misused word correction generation step for providing a correction output based at least in part on the evaluation performed by the misused word suspicion step.
37. The method of claim 36, and further comprising:
a suspected misused word substitution generating step for generating a text based representation based on the linguistic input, the text based representation providing a plurality of substitutions of at least one of the at least most words in the linguistic input; and
a suspected misused word replacement selection step for selecting at least among said plurality of replacements of each of said at least one of said at least most words in said linguistic input; and wherein
The suspected misused word correction generating step is for providing the correction output based on the selection made by the suspected misused word replacement selection step.
38. The method of claim 36, and further comprising:
a suspect word output indicating step for indicating a degree to which at least some of said at least most words in said linguistic input are suspected of being misused words.
39. A method according to claim 36 and wherein said suspected misused word correction generating step comprises an auto-corrected language generating step for providing a corrected text output based at least in part on an evaluation performed by said suspicion step without requiring user intervention.
40. A method according to claim 36 and wherein said linguistic input is speech and said suspected misused word replacement selection step is operative to make said selection based on at least one of the following corrective functions:
misused word correction;
grammar correction; and
vocabulary enhancement.
41. The method of claim 1, and further comprising:
a misused word suspicion step for evaluating words in the linguistic input;
a suspected misused word substitution generation step of generating a plurality of substitutions of at least some of the words in the linguistic input that are assessed by the suspicion step as suspect words, at least one of the plurality of substitutions of words in the linguistic input being consistent with contextual characteristics of the words in the linguistic input in an internet corpus;
a suspected misused word replacement selection step for selecting at least between the plurality of replacements; and
a suspected misused word correction generation step for providing a correction output based at least in part on the selection made by the suspected misused word replacement selection step.
42. The method of claim 1, and further comprising:
a misused word suspicion step for evaluating words in the linguistic input and identifying suspect words;
a suspected misused word substitution generation step for generating a plurality of substitutions of said suspect word;
a suspected misused word replacement selection step of ranking each of said suspect words and some of said plurality of alternatives generated therefor by said suspected misused word replacement generating step according to a plurality of selection criteria and applying a bias in favour of said suspect words with respect to some of said plurality of alternatives generated therefor by said suspected misused word replacement generating step; and
a suspected misused word correction generation step for providing a correction output based at least in part on the selection made by the suspected misused word replacement selection step.
43. The method of claim 1, and further comprising:
the selecting step further comprises: ranking each of the at least one word and some of the plurality of alternatives generated therefor by the alternative generating step according to a plurality of selection criteria, and applying a bias in favor of the at least one word relative to some of the plurality of alternatives generated therefor by the alternative generating step, the bias being a function of an input uncertainty metric indicative of the uncertainty of the person providing the input.
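By way of non-limiting illustration only, the biased ranking recited in claim 43 might be sketched as follows, with the size of the bias toward the originally input word depending on an input uncertainty metric. The bias schedule and the scores are hypothetical.

# Illustrative sketch of claim 43: the original word competes with its alternatives
# but receives a bias in its favour; a more uncertain writer gets a smaller bias.

def biased_selection(original, alternatives, context_score, uncertainty):
    """uncertainty in [0, 1]: 0 = confident writer, 1 = very uncertain writer."""
    bias = 1.0 + (1.0 - uncertainty)                  # hypothetical bias schedule
    candidates = {original: context_score.get(original, 0.0) * bias}
    for alt in alternatives:
        candidates[alt] = context_score.get(alt, 0.0)
    return max(candidates, key=candidates.get)

if __name__ == "__main__":
    scores = {"there": 0.30, "their": 0.45}
    print(biased_selection("there", ["their"], scores, uncertainty=0.1))  # keeps "there"
    print(biased_selection("there", ["their"], scores, uncertainty=0.9))  # corrects to "their"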
44. The method of claim 1, and further comprising:
a wrong-word suspect step for evaluating at least a majority of words in a speech input, said suspect step being at least partially responsive to an input uncertainty measure indicative of an uncertainty of a person providing said input, said suspect step providing a suspect wrong-word output; and
a suspected wrong word replacement generating step of generating a plurality of replacements of the suspected wrong word identified by the suspected wrong word output;
a suspected wrong word replacement selection step of selecting between each suspected wrong word and the plurality of replacements generated by the suspected wrong word replacement generation step; and
a suspected wrong word correction generating step for providing a correction output based on the selection made by the suspected wrong word replacement selecting step.
45. The method of claim 1, and further comprising:
at least one of a spelling correction step, a misused word correction step, a grammar correction step and a vocabulary enhancement step for receiving a multiword input and providing a correction output, each of said at least one of a spelling correction step, a misused word correction step, a grammar correction step and a vocabulary enhancement step comprising:
a replacement word candidate generating step including:
a speech similarity step of proposing a replacement word based on a speech similarity with a word in the input and indicating a measure of the speech similarity; and
a character string similarity step of proposing replacement words based on character string similarity with the words in the input and indicating a measure of character string similarity for each replacement word; and
the selecting step further comprises: selecting a word in the output or a replacement word candidate proposed by the replacement word candidate generating step by using the measure of speech similarity and the measure of string similarity together with a context-based selecting step.
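By way of non-limiting illustration only, the candidate generation and selection recited in claim 45 might be sketched as follows, combining a speech (phonetic) similarity measure and a string similarity measure with a context-based score. The crude phonetic key, the equal weighting and the example scores are hypothetical simplifications.

# Illustrative sketch of claim 45; the similarity measures and weights are hypothetical.
import difflib

def phonetic_key(word):
    # Very rough stand-in for a phonetic encoding (e.g. a Soundex-like key).
    return "".join(c for c in word.lower() if c not in "aeiou")

def phonetic_similarity(a, b):
    return difflib.SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()

def string_similarity(a, b):
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_candidates(word, candidates, context_score):
    """Combine both similarity measures with a (hypothetical) context-based score."""
    scored = []
    for c in candidates:
        similarity = 0.5 * phonetic_similarity(word, c) + 0.5 * string_similarity(word, c)
        scored.append((similarity * context_score.get(c, 0.0), c))
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    print(score_candidates("fisical", ["physical", "fiscal", "physics"],
                           {"physical": 0.9, "fiscal": 0.4, "physics": 0.2}))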
46. The method of claim 1, and further comprising:
a suspicious word recognition step for receiving a multi-word language input and providing a suspicious word output indicative of the suspicious word;
a feature identification step for identifying features including the suspicious word;
a replacement selection step for identifying a replacement for the suspect word;
a feature occurrence step of using a corpus and providing an occurrence output that ranks each feature including the replacement according to the frequency of use of that feature in the corpus; and
the selecting step further comprises using the occurrence output to provide a correction output,
the feature identification step comprises a feature filtering step comprising at least one of the following steps:
a step for eliminating features containing suspected errors;
a step for negatively biasing features that contain words introduced in early correction iterations of the multi-word input and have a confidence less than a predetermined threshold of confidence; and
a step for eliminating a feature contained in another feature having an occurrence frequency greater than a predetermined frequency threshold.
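By way of non-limiting illustration only, the feature filtering recited in claim 46 might be sketched as follows. The thresholds, the down-weighting factor and the example data are hypothetical.

# Illustrative sketch of the three filtering rules of claim 46; all values are hypothetical.

CONFIDENCE_THRESHOLD = 0.6
FREQUENCY_THRESHOLD = 1000

def filter_features(features, target, suspect_words, introduced_confidence, corpus_freq):
    """features: word tuples built around the suspect word `target`;
    suspect_words: other words currently suspected of being errors;
    introduced_confidence: {word: confidence assigned in an earlier correction iteration};
    corpus_freq: {feature: frequency of that feature in the corpus}."""
    kept = []
    for feat in features:
        # Rule 1: eliminate features containing another suspected error.
        if any(w in suspect_words and w != target for w in feat):
            continue
        # Rule 3: eliminate features contained in a longer feature whose frequency
        # of occurrence exceeds the (hypothetical) frequency threshold.
        if any(feat != other and set(feat) < set(other)
               and corpus_freq.get(other, 0) > FREQUENCY_THRESHOLD
               for other in features):
            continue
        # Rule 2: negatively bias features containing words introduced in an earlier
        # correction iteration with confidence below the (hypothetical) threshold.
        weight = 0.5 if any(introduced_confidence.get(w, 1.0) < CONFIDENCE_THRESHOLD
                            for w in feat) else 1.0
        kept.append((feat, weight))
    return kept

if __name__ == "__main__":
    feats = [("red", "care"), ("the", "red", "care", "is"), ("bought", "red")]
    print(filter_features(feats, target="red", suspect_words={"care"},
                          introduced_confidence={"bought": 0.4},
                          corpus_freq={("the", "red", "care", "is"): 5000}))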
47. A method according to any of claims 43, 45-46 and wherein said selecting step is operative to make said selection based on at least two of the following corrective functions:
spelling correction;
misused word correction;
grammar correction; and
vocabulary enhancement.
48. A method according to claim 47 and wherein said selecting step is operative to make said selection based on at least one of the following orderings of correction in time:
spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement; and
misused word correction and grammar correction prior to vocabulary enhancement.
49. A method according to any of claims 43, 45-46 and wherein said language input is speech and said selecting step is operative to make said selection based on at least one of the following corrective functions:
grammar correction;
misused word correction; and
vocabulary enhancement.
50. A method according to any of claims 43 and 45-46 and wherein said correction generation step comprises a corrected language input generation step for providing a corrected language output based on a selection made by said selection step without requiring user intervention.
51. A method according to any of claims 45 and 46 and wherein said selecting step also serves to make said selection based at least in part on a user input uncertainty metric.
52. A method according to claim 51 and wherein said user input uncertainty metric is a function of an uncertainty measure based on a person providing said input.
53. A method according to any of claims 43, 45-46 and wherein said selecting step also uses a user input history learning function.
54. The method of claim 1, and further comprising:
a suspicious word recognition step for receiving a multi-word language input and providing a suspicious word output indicative of the suspicious word;
a feature identification step for identifying features including the suspicious word;
a replacement selection step for identifying a replacement for the suspect word;
an occurrence step of using a corpus and providing an occurrence output that ranks the features including the replacement according to their frequency of use in the corpus; and
a correction output generating step of providing a correction output using the occurrence output,
the feature identification step comprises:
at least one of the following steps:
an N-gram recognition step; and
a co-occurrence identification step; and
at least one of the following steps:
a skip-gram recognition step;
a switch-gram recognition step; and
a step of recognizing features previously used by the user.
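By way of non-limiting illustration only, the feature types recited in claim 54 (N-grams, co-occurrences and skip-grams built around a suspect word) might be generated as sketched below. Previously used features would come from a per-user history store, which is omitted here; the window sizes are hypothetical.

# Illustrative sketch of feature extraction around a suspect word; sizes are hypothetical.

def ngrams_around(words, idx, n=3):
    """All n-grams of the sentence that include the word at position idx."""
    feats = []
    for start in range(max(0, idx - n + 1), min(idx + 1, len(words) - n + 1)):
        feats.append(tuple(words[start:start + n]))
    return feats

def cooccurrences(words, idx, window=4):
    """Co-occurrence pairs of the suspect word with nearby words."""
    lo, hi = max(0, idx - window), min(len(words), idx + window + 1)
    return [(words[idx], words[j]) for j in range(lo, hi) if j != idx]

def skipgrams(words, idx):
    """Trigrams that skip one intervening word (skip-grams)."""
    feats = []
    if idx - 2 >= 0:
        feats.append((words[idx - 2], "*", words[idx]))
    if idx + 2 < len(words):
        feats.append((words[idx], "*", words[idx + 2]))
    return feats

if __name__ == "__main__":
    s = "i would like to here your opinion".split()
    i = s.index("here")
    print(ngrams_around(s, i), cooccurrences(s, i), skipgrams(s, i), sep="\n")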
55. The method of claim 1, and further comprising:
a syntax error suspicion step for evaluating at least most words in a linguistic input based on their suitability in the context of the linguistic input; and
a suspected syntax error correction generating step for providing a correction output based at least in part on the evaluation performed by the suspicion step.
56. The method according to claim 55, and further comprising:
a suspected grammatical error substitution generation step for generating a text-based representation based on the linguistic input, the text-based representation providing a plurality of substitutions of at least one of the at least most words in the linguistic input; and
a suspected grammatical error substitution selection step for selecting among said plurality of substitutions for each of said at least one of said at least most words in at least said linguistic input, and wherein
The suspected syntax error correction generating step is for providing the correction output based on the selection made by the selecting step.
57. The method according to claim 55, and further comprising:
a suspect word output indicating step for indicating a degree to which at least some of said at least most words in said linguistic input are suspected to contain grammatical errors.
58. The method according to claim 55 and wherein said suspected grammar error correction generating step comprises an auto-corrected language generating step for providing a corrected text output based at least in part on an evaluation performed by said suspicion step without requiring user intervention.
59. The method of claim 1, and further comprising:
a syntax error suspicion step for evaluating a word in a language input;
a suspected grammatical error substitution generation step for generating a plurality of substitutions of at least some of the words in the linguistic input that are evaluated by the suspicion step as suspect words, at least one of the plurality of substitutions of words in the linguistic input being consistent with a contextual characteristic of the words in the linguistic input;
a suspected syntax error replacement selection step for selecting at least among said plurality of replacements; and
a suspected syntax error correction generating step for providing a correction output based at least in part on the selection made by the suspected syntax error replacement selecting step.
60. The method of claim 1, and further comprising:
a syntax error suspicion step for evaluating words in the linguistic input and identifying suspect words;
a suspected grammatical error substitution generation step for generating a plurality of substitutions of the suspect word;
a suspected grammar error substitution selection step of ranking each of the suspect words and ones of the plurality of substitutions generated therefor by the suspected grammar error substitution generation step according to a plurality of selection criteria and applying a bias favoring the suspect words over ones of the plurality of substitutions generated therefor by the suspected grammar error substitution generation step; and
a suspected syntax error correction generating step for providing a correction output based at least in part on the selection made by the suspected syntax error replacement selecting step.
61. A method according to claim 59 or 60 and wherein said suspected grammar error correction generating step includes a corrected language input generating step for providing a corrected language output based on selections made by said suspected grammar error replacement selecting step without requiring user intervention.
62. The method of claim 1, and further comprising:
a step for context-based scoring of the respective alternative corrections based at least in part on the frequency of occurrence of the contextual feature sequences CFSs in the internet corpus.
63. The method according to claim 62 and further comprising at least one of the following steps working in conjunction with said context-based scoring:
a spelling correction step;
a misused word correction step;
a grammar correction step; and
a vocabulary enhancement step.
64. A method as recited in claim 62, and wherein said context-based scoring is also based, at least in part, on normalized CFS frequencies of occurrence in an Internet corpus.
65. A method as recited in claim 62, and wherein said context-based scoring is also based, at least in part, on a CFS importance score.
66. The method of claim 65 and wherein said CFS importance score is a function of at least one of:
the operation of part-of-speech tagging and sentence parsing functions; the length of the CFS; the frequency of occurrence of each of the words in the CFS; and the CFS type.
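By way of non-limiting illustration only, a CFS importance score of the kind recited in claim 66 might be computed as sketched below. The weights, and the stop-word list standing in for part-of-speech tagging and sentence parsing, are hypothetical.

# Illustrative sketch of a CFS importance score; all weights and lists are hypothetical.

TYPE_WEIGHT = {"ngram": 1.0, "skipgram": 0.8, "cooccurrence": 0.5}
STOP_WORDS = {"the", "a", "an", "of", "to", "is"}     # crude stand-in for POS tagging

def cfs_importance(cfs, cfs_type, word_freq):
    """cfs: tuple of words; word_freq: {word: corpus frequency of that word}."""
    length_factor = min(len(cfs), 5) / 5.0              # longer CFSs are more specific
    content_words = [w for w in cfs if w not in STOP_WORDS]
    pos_factor = len(content_words) / max(len(cfs), 1)  # favour content-bearing CFSs
    # Rare words make a CFS more informative than very frequent words.
    rarity = sum(1.0 / (1.0 + word_freq.get(w, 0)) for w in content_words)
    return TYPE_WEIGHT.get(cfs_type, 0.5) * length_factor * pos_factor * (1.0 + rarity)

if __name__ == "__main__":
    freqs = {"weather": 2000000, "forecast": 300000, "the": 10**9}
    print(cfs_importance(("the", "weather", "forecast"), "ngram", freqs))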
67. The method of claim 1, and further comprising a lexical enhancement step, the lexical enhancement step comprising:
a vocabulary challenged word identification step;
a replacement vocabulary enhancement generation step; and
a context-based scoring step based at least in part on the frequency of occurrence of the contextual feature sequences CFS in the Internet corpus,
the replacement vocabulary enhancement generating step includes a thesaurus preprocessing step for generating replacement vocabulary enhancements.
68. The method of claim 1, and further comprising:
a confidence level assigning step of assigning a confidence level to a replacement selected from the plurality of replacements;
wherein the correction output is based at least in part on the confidence level.
69. The method according to claim 68 and wherein said plurality of alternatives are evaluated based on a context feature sequence, CFS, and said confidence level is based on at least one of the following parameters:
the number, type, and rating of CFSs selected;
a measure of statistical significance of the frequency of occurrence of the plurality of substitutions in the context of the CFS;
a degree of correspondence in selection of one of the plurality of alternatives based on the preference metric for each of the CFSs and based on the word similarity scores for the plurality of alternatives;
a non-contextual similarity score for the one of the plurality of alternatives above a first predetermined minimum threshold; and
a degree of availability of context data, the degree being indicated by the number of CFSs having CFS scores greater than a second predetermined minimum threshold and having preference scores above a third predetermined threshold.
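By way of non-limiting illustration only, a confidence level of the kind recited in claim 69 might combine such parameters as sketched below. The thresholds, the weighting and the record format are hypothetical.

# Illustrative sketch of confidence assignment over CFS evidence; values are hypothetical.

MIN_SIMILARITY = 0.7      # first predetermined minimum threshold (hypothetical)
MIN_CFS_SCORE = 0.2       # second predetermined minimum threshold (hypothetical)
MIN_PREFERENCE = 0.1      # third predetermined threshold (hypothetical)

def confidence(cfs_records, selected, similarity_score):
    """cfs_records: list of dicts with keys 'score' (CFS score), 'preferred'
    (the alternative this CFS prefers) and 'margin' (significance of its
    frequency advantage). selected: the chosen alternative."""
    usable = [r for r in cfs_records
              if r["score"] > MIN_CFS_SCORE and r["margin"] > MIN_PREFERENCE]
    if not usable:
        return 0.0                                    # no usable context evidence
    agreement = sum(1 for r in usable if r["preferred"] == selected) / len(usable)
    significance = sum(r["margin"] for r in usable) / len(usable)
    availability = min(len(usable) / 5.0, 1.0)
    similarity_ok = 1.0 if similarity_score >= MIN_SIMILARITY else 0.5
    return agreement * similarity_ok * min(1.0, 0.5 * significance + 0.5 * availability)

if __name__ == "__main__":
    records = [{"score": 0.9, "preferred": "their", "margin": 0.8},
               {"score": 0.4, "preferred": "their", "margin": 0.3}]
    print(confidence(records, "their", similarity_score=0.85))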
70. The method of claim 1, and further comprising:
a punctuation error suspicion step of evaluating at least some words and punctuation in a linguistic input based on their suitability within the context of the linguistic input, as indicated by their frequency of occurrence in an internet corpus; and
a suspected punctuation error correction generating step for providing a correction output based at least in part on the evaluation performed by the suspicion step.
71. The method according to claim 70 and wherein said suspected punctuation error correction generating step comprises at least one of: a missing punctuation correction step, a superfluous punctuation correction step and a punctuation replacement correction step.
72. The method of claim 1, and further comprising:
a syntax element error suspicion step of evaluating at least some words in a linguistic input based on their suitability within the context of the linguistic input, as indicated by the frequency of occurrence of a characteristic grammar of the linguistic input in an internet corpus; and
a suspected syntax element error correction generating step for providing a correction output based at least in part on the evaluation performed by the suspicion step.
73. The method according to claim 72 and wherein said suspected syntax element error correction generating step comprises at least one of: a missing syntax element correction step, a superfluous syntax element correction step and a syntax element replacement correction step.
74. A method according to claim 72 or 73 and wherein said syntax element is one of an article, a preposition and a conjunction.
75. A machine translation method, comprising:
a machine translation step;
a sentence retrieval step of operating based on the input text provided by the machine translation step to retrieve a plurality of sentences from an internet corpus, the plurality of sentences including words corresponding to the words in the input text;
a sentence generation step of operating using a plurality of sentences retrieved from the internet corpus by the sentence retrieval step to generate at least one correct sentence expressing the input text generated by the machine translation step,
wherein the sentence generation step comprises:
a sentence simplification step of simplifying the sentences retrieved from the internet corpus;
a simplified sentence grouping step of grouping similar simplified sentences provided by the sentence simplification step; and
a simplified sentence group classification step of classifying the group of the similar simplified sentences;
a substitution generation step of generating a text-based representation based on an input sentence, the text-based representation providing a plurality of substitutions for each of a plurality of words in the sentence;
a selection step for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
a correction generation step of providing a correction output based on the selection made by the selection step;
wherein the selecting step comprises a context-based scoring step for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
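By way of non-limiting illustration only, the sentence simplification, grouping and classification stages recited in claim 75 might be sketched as follows. The simplification rule (lowercasing, stripping punctuation and dropping stop words) and the size-based classification are hypothetical simplifications of the described steps.

# Illustrative sketch of sentence simplification, grouping and group classification.
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "very", "really"}     # hypothetical stop-word list

def simplify(sentence):
    """Reduce a retrieved sentence to its essential words."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    return tuple(w for w in words if w and w not in STOP_WORDS)

def group_simplified(sentences):
    """Group retrieved sentences whose simplified forms are identical."""
    groups = defaultdict(list)
    for s in sentences:
        groups[simplify(s)].append(s)
    return groups

def classify_groups(groups):
    """Classify (rank) the groups, here simply by the number of sentences they contain."""
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)

if __name__ == "__main__":
    retrieved = ["The weather is nice today.",
                 "The weather is very nice today!",
                 "Weather was nice yesterday."]
    for simplified, members in classify_groups(group_simplified(retrieved)):
        print(simplified, len(members))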
76. A text correction method, comprising:
a sentence retrieval step of operating based on an input text provided by a text correction function to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text;
a sentence generation step operating using a plurality of sentences retrieved from the Internet corpus by the sentence retrieval step to generate at least one correct sentence expressing the input text,
wherein the sentence generation step comprises:
a sentence simplification step of simplifying the sentences retrieved from the internet corpus;
a simplified sentence grouping step of grouping similar simplified sentences provided by the sentence simplification step; and
a simplified sentence group classification step of classifying the group of the similar simplified sentences;
a substitution generation step of generating a text-based representation based on an input sentence, the text-based representation providing a plurality of substitutions for each of a plurality of words in the sentence;
a selection step for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
a correction generation step of providing a correction output based on the selection made by the selection step;
wherein the selecting step comprises a context-based scoring step for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
77. A sentence searching method, comprising:
a sentence searching step for providing an input text based on a query word input by a user;
a sentence retrieval step of operating based on the input text provided by the sentence search step to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text;
a sentence generation step of operating using a plurality of sentences retrieved from the internet corpus by the sentence retrieval step to generate at least one correct sentence expressing the input text generated by the sentence search step,
wherein the sentence generation step comprises:
a sentence simplification step of simplifying the sentences retrieved from the internet corpus;
a simplified sentence grouping step of grouping similar simplified sentences provided by the sentence simplification step; and
a simplified sentence group classification step of classifying the group of the similar simplified sentences;
a substitution generation step of generating a text-based representation based on an input sentence, the text-based representation providing a plurality of substitutions for each of a plurality of words in the sentence;
a selection step for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
a correction generation step of providing a correction output based on the selection made by the selection step;
wherein the selecting step comprises a context-based scoring step for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
78. A speech to text conversion method comprising:
a speech-to-text conversion step for providing an input text;
a sentence retrieval step of operating based on the input text provided by the speech-to-text conversion step to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text;
a sentence generation step of operating using a plurality of sentences retrieved from the internet corpus by the sentence retrieval step to generate at least one correct sentence expressing the input text generated by the speech-to-text conversion step,
wherein the sentence generation step comprises:
a sentence simplification step of simplifying the sentences retrieved from the internet corpus;
a simplified sentence grouping step of grouping similar simplified sentences provided by the sentence simplification step; and
a simplified sentence group classification step of classifying the group of the similar simplified sentences;
a substitution generation step of generating a text-based representation based on an input sentence, the text-based representation providing a plurality of substitutions for each of a plurality of words in the sentence;
a selection step for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
a correction generation step of providing a correction output based on the selection made by the selection step;
wherein the selecting step comprises a context-based scoring step for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
79. A computer-assisted language generation apparatus comprising:
sentence retrieval means operative based on an input text containing words to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text;
sentence generation means operative using a plurality of sentences retrieved from the internet corpus by the sentence retrieval means to generate at least one correct sentence expressing the input text,
wherein the sentence generation apparatus comprises:
sentence simplification means for simplifying the sentences retrieved from the internet corpus;
simplified sentence grouping means for grouping similar simplified sentences provided by the sentence simplification means; and
simplified sentence group classification means for classifying the group of the similar simplified sentences;
substitution generating means for generating a text-based representation based on an input sentence, the text-based representation providing a plurality of alternatives for each of a plurality of words in the sentence;
selecting means for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
correction generating means for providing a correction output based on the selection made by the selecting means;
wherein the selecting means comprises context-based scoring means for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
80. Apparatus according to claim 79 and wherein said sentence retrieval means comprises:
independent phrase generating means for dividing the input text into one or more independent phrases;
stem generation and classification means for operating on each independent phrase to generate a stem appearing in the word and assigning an importance weight thereto; and
and a replacement stem generation device for generating a replacement stem corresponding to the stem.
81. The apparatus of claim 80, and further comprising a stem-to-sentence index that interacts with the internet corpus to retrieve the plurality of sentences that contain words corresponding to the words in the input text.
82. Apparatus according to claim 79 and wherein said simplified sentence group ranking means operates using at least some of the following criteria:
A. the number of simplified sentences contained in the group;
B. the degree of correspondence of the stems of words in the group with the stems in the independent phrase and their substitutions;
C. the degree to which the group includes words that do not correspond to the words in the independent phrase and their substitutions.
83. The apparatus according to claim 82 and wherein said simplified sentence group ranking means operates using at least part of the following procedure:
defining a weight of the stem to indicate an importance of the word in the language;
calculating a positive matching grade corresponding to the criterion B;
calculating a negative matching grade corresponding to the criterion C;
calculating a composite ranking based on:
the number of simplified sentences contained in the group, corresponding to criterion A;
the positive matching grade; and
the negative matching grade.
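By way of non-limiting illustration only, the composite ranking procedure recited in claim 83 might be sketched as follows. The stemming rule, the stem weights and the combination formula are hypothetical.

# Illustrative sketch of the composite group ranking of claim 83; values are hypothetical.

def stem(word):
    return word.lower().rstrip("s")                   # crude hypothetical stemmer

def rank_group(group_sentences, phrase_stems, stem_weight):
    """group_sentences: simplified sentences (tuples of words) in one group;
    phrase_stems: stems of the independent phrase and of their substitutions;
    stem_weight: {stem: importance weight of that stem in the language}."""
    group_stems = {stem(w) for sentence in group_sentences for w in sentence}
    # Positive matching grade (criterion B): weighted overlap with the phrase stems.
    positive = sum(stem_weight.get(s, 1.0) for s in group_stems & set(phrase_stems))
    # Negative matching grade (criterion C): weighted words not found in the phrase.
    negative = sum(stem_weight.get(s, 1.0) for s in group_stems - set(phrase_stems))
    # Composite ranking also reflects the number of sentences in the group (criterion A).
    return len(group_sentences) * positive / (1.0 + negative)

if __name__ == "__main__":
    group = [("weather", "nice", "today"), ("weather", "nice", "today")]
    phrase_stems = [stem(w) for w in ["weather", "nice"]]
    weights = {"weather": 2.0, "nice": 1.5, "today": 0.5}
    print(rank_group(group, phrase_stems, weights))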
84. The apparatus according to any one of claims 79-83, and further comprising:
and the machine translation device is used for providing the input text.
85. An apparatus according to claim 84 and wherein said machine translation means provides a plurality of alternatives corresponding to words in said input text and said sentence retrieval means is operative to retrieve a plurality of sentences including words corresponding to said alternatives from said internet corpus.
86. The apparatus of any of claims 79-83, wherein the language generation includes text correction.
87. The apparatus according to any one of claims 79-83, and further comprising:
sentence searching means for providing the input text based on a query word input by a user.
88. The apparatus according to any one of claims 79-83, and further comprising:
speech to text conversion means for providing said input text.
89. Apparatus according to claim 79, wherein said selection means is operable to make said selection based on at least one of the following corrective functions:
spelling correction;
misused word correction;
grammar correction; and
vocabulary enhancement.
90. Apparatus according to claim 79, wherein said selection means is operable to make said selection based on at least two of the following correction functions:
spelling correction;
misused word correction;
grammar correction; and
vocabulary enhancement.
91. Apparatus according to claim 90 and wherein said selecting means is operative to make said selection based on at least one of the following orderings of correction in time:
spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement; and
misused word correction and grammar correction prior to vocabulary enhancement.
92. The apparatus according to claim 79, and wherein:
providing the input sentence by one of the following means:
a word processing device;
a machine translation device;
a speech to text conversion device;
an optical character recognition device; and
an instant messaging device; and is
The selection means is for making the selection based on at least one of the following correction functions:
misused word correction;
grammar correction; and
vocabulary enhancement.
93. Apparatus according to claim 79 and wherein said correction generation means comprises correction language input generation means for providing a corrected language output based on a selection made by said selection means without requiring user intervention.
94. An apparatus according to any one of claims 89-92 and wherein said grammar correction functionality is implemented by at least one of the group consisting of punctuation correction means, verb inflection change correction means, singular/plural correction means, article correction means and preposition correction means.
95. Apparatus according to any one of claims 89-92 and wherein said grammar correction function is implemented by at least one of substitution correction means, insertion correction means and omission correction means.
96. An apparatus according to claim 79 and wherein said context-based scoring device is also operative to rank said plurality of alternatives based, at least in part, on normalized CFS frequencies of occurrence in said Internet corpus.
97. The apparatus according to claim 79, and further comprising:
at least one of the following:
spelling correcting means;
misused word correcting means;
grammar correcting means; and
vocabulary enhancement means; and
a contextual feature sequence means cooperating with at least one of the spelling correction means, the misused word correction means, the grammar correction means and the vocabulary enhancement means and using an internet corpus.
98. An apparatus according to claim 97, and wherein said grammar correction means includes at least one of punctuation correction means, verb inflection change correction means, singular/plural correction means, article correction means, and preposition correction means.
99. Apparatus according to claim 97 and wherein said grammar correction means comprises at least one of substitution correction means, insertion correction means and omission correction means.
100. The apparatus according to claim 97, and comprising:
at least two of the following:
the spelling correction means;
the misused word correcting means;
the grammar correction means; and
the vocabulary enhancing means, and
wherein the contextual feature sequence means cooperates with at least two of the spelling correction means, the misused word correction means, the grammar correction means and the vocabulary enhancement means and uses an internet corpus.
101. The apparatus according to claim 97, and comprising:
at least three of the following:
the spelling correction means;
the misused word correcting means;
the grammar correction means;
the vocabulary enhancing means, and
wherein the contextual feature sequence means cooperates with at least three of the spelling correction means, the misused word correction means, the grammar correction means and the vocabulary enhancement means and uses an internet corpus.
102. The apparatus according to claim 97, and comprising
The spelling correction means;
the misused word correcting means;
the grammar correction means; and
the vocabulary enhancing means, and
wherein the contextual feature sequence means cooperates with the spelling correction means, the misused word correction means, the grammar correction means and the vocabulary enhancement means and uses an internet corpus.
103. An apparatus according to claim 97 and wherein said correction generation means comprises correction language generation means for providing a corrected language output based on a selection made by said selection means without requiring user intervention.
104. The apparatus of claim 79, wherein:
the selection means further selects at least among the plurality of alternatives for each of the plurality of words in the language input based at least in part on relationships between selected ones of the plurality of alternatives for at least some of the plurality of words in the language input.
105. The device of claim 104 and wherein said language input comprises at least one of an input sentence and an input text.
106. An apparatus according to claim 104 and wherein said language input is speech and said generating means converts said language input in speech form into a text-based representation providing a plurality of alternatives for a plurality of words in said language input.
107. The apparatus according to claim 104, and wherein,
the language input is at least one of:
an input text;
an output of the optical character recognition device;
an output of the machine translation device; and
an output of the word processing means, and
the generating means converts the language input in text form into a text-based representation that provides a plurality of alternatives for a plurality of words in the language input.
108. Apparatus according to claim 104 and wherein said selecting means is operative to make said selection based on at least two of the following correction means:
spelling correction;
misused word correction;
grammar correction; and
vocabulary enhancement.
109. Apparatus according to claim 108 and wherein said selecting means is operative to make said selection based on at least one of the following orderings of correction in time:
spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement; and
misused word correction and grammar correction prior to vocabulary enhancement.
110. An apparatus according to claim 104 and wherein said language input is speech and said selecting means is operative to make said selection based on at least one of the following corrective functions:
misused word correction;
grammar correction; and
vocabulary enhancement.
111. The apparatus according to claim 104 and wherein said selecting means is operative to make said selection by performing at least two of the following functions:
selecting a first set or combination of words comprising fewer words than all of the plurality of words used for initial selection in the language input;
then, ordering the elements of the first word set or word combination to establish a priority of selection; and
thereafter, when selecting among a plurality of alternatives for elements of the first set of words, selecting other words, but not all of the plurality of words, as context to influence the selection.
112. The apparatus according to claim 104 and wherein said selecting means is operative to make said selection by performing the functions of:
when selecting an element having at least two words, evaluating each of the plurality of alternatives for each of the at least two words in conjunction with each of the plurality of alternatives for another of the at least two words.
113. An apparatus according to claim 104 and wherein said correction generation means comprises correction language input generation means for providing a corrected language output based on a selection made by said selection means without requiring user intervention.
114. The apparatus according to claim 79, and further comprising:
misused word suspicion means for evaluating at least most words in a linguistic input based on their suitability in the context of the linguistic input; and
suspected misused word correction generating means for providing a correction output based at least in part on the evaluation performed by the misused word suspicion means.
115. The apparatus according to claim 114, and further comprising:
suspected misused word substitution generating means for generating a text-based representation based on the linguistic input, the text-based representation providing a plurality of substitutions of at least one of the at least most words in the linguistic input; and
suspected misused word replacement selection means for selecting at least among said plurality of replacements of each of said at least one of said at least most words in said language input; and wherein
The suspected misused word correction generating means is for providing the correction output based on a selection made by the suspected misused word replacement selection means.
116. The apparatus according to claim 114, and further comprising:
suspect word output indication means for indicating the extent to which at least some of said at least most words in said linguistic input are suspected of being misused words.
117. An apparatus according to claim 114 and wherein said suspected misused word correction generating means comprises automatic correction language generating means for providing a corrected text output based at least in part on an evaluation performed by said suspicion means without requiring user intervention.
118. An apparatus according to claim 114 and wherein said linguistic input is speech and said suspected misused word replacement selection means is operative to make said selection based on at least one of the following corrective functions:
misused word correction;
grammar correction; and
vocabulary enhancement.
119. The apparatus according to claim 79, and further comprising:
misused word suspicion means for evaluating words in the language input;
suspected misused word substitution generating means for generating a plurality of substitutions of at least some of the words in the linguistic input that are assessed by the suspicion means as suspect words, at least one of the plurality of substitutions of words in the linguistic input being consistent with contextual characteristics of the words in the linguistic input in an internet corpus;
suspected misused word replacement selection means for selecting at least between said plurality of replacements; and
suspected misused word correction generating means for providing a correction output based at least in part on the selection made by said suspected misused word replacement selection means.
120. The apparatus according to claim 79, and further comprising:
misused word suspicion means for evaluating words in the linguistic input and identifying suspect words;
suspected misused word substitution generating means for generating a plurality of substitutions of said suspect word;
suspected misused word replacement selection means for ranking each of said suspect words and ones of said plurality of alternatives generated therefor by said suspected misused word replacement generating means in accordance with a plurality of selection criteria and applying a bias in favour of said suspect words with respect to ones of said plurality of alternatives generated therefor by said suspected misused word replacement generating means; and
suspected misused word correction generating means for providing a correction output based at least in part on the selection made by said suspected misused word replacement selection means.
121. The apparatus of claim 79, wherein:
the selection means further ranks each of the at least one word and some of the plurality of alternatives generated therefor by the alternative generation means according to a plurality of selection criteria, and applies a bias in favor of the at least one word relative to some of the plurality of alternatives generated therefor by the alternative generation means, the bias being a function of an input uncertainty measure indicative of an uncertainty of a person providing the input.
122. The apparatus according to claim 79, and further comprising:
false word suspicion means for evaluating at least a majority of words in a speech input, said suspicion means being at least partially responsive to an input uncertainty measure indicative of an uncertainty of a person providing said input, said suspicion means providing a suspected false word output; and
suspected wrong word replacement generating means for generating a plurality of replacements of the suspected wrong word identified by the suspected wrong word output;
suspected wrong word replacement selection means for selecting between each suspected wrong word and the plurality of replacements generated by the suspected wrong word replacement generation means; and
suspected wrong word correction generating means for providing a correction output based on the selection made by said suspected wrong word replacement selecting means.
123. The apparatus according to claim 79, and further comprising:
at least one of a spelling correction device, a misused word correction device, a grammar correction device and a vocabulary enhancement device for receiving a multiple word input and providing a correction output, each of said at least one of a spelling correction device, a misused word correction device, a grammar correction device and a vocabulary enhancement device comprising:
a replacement word candidate generating device including:
speech similarity means for proposing a replacement word based on speech similarity with a word in the input and indicating a measure of speech similarity; and
string similarity means for proposing replacement words based on string similarity with the words in the input and indicating a measure of string similarity for each replacement word; and
wherein the selecting means further selects a word in the output or a replacement word candidate proposed by the replacement word candidate generating means by using the measure of speech similarity and the measure of string similarity together with context-based selecting means.
124. The apparatus according to claim 79, and further comprising:
suspicious word recognition means for receiving a multi-word language input and providing a suspicious word output indicative of the suspicious word;
feature recognition means for recognizing features including the suspect word;
replacement selection means for identifying a replacement for the suspect word;
feature occurrence means for using a corpus and providing an occurrence output that ranks each feature including the replacement according to the frequency of use of that feature in the corpus; and
wherein the selection means further uses the occurrence output to provide a correction output,
the feature recognition device comprises a feature filtering device comprising at least one of the following devices:
means for removing features containing suspected errors;
means for negatively biasing features that contain words introduced in earlier correction iterations of the multi-word input and that have a confidence less than a predetermined threshold of confidence; and
means for eliminating a feature contained in another feature having a frequency of occurrence greater than a predetermined frequency threshold.
125. The apparatus as claimed in any one of claims 121, 123 and 124 and wherein the selection means is arranged to make the selection based on at least two of the following correction functions:
spelling correction;
misused word correction;
grammar correction; and
vocabulary enhancement.
126. An apparatus according to claim 125 and wherein said selecting means is operative to make said selection based on at least one of the following orderings of correction in time:
spelling correction prior to at least one of misused word correction, grammar correction and vocabulary enhancement; and
misused word correction and grammar correction prior to vocabulary enhancement.
127. Apparatus according to any one of claims 121, 123 and 124 and wherein said language input is speech and said selection means is operative to make said selection based on at least one of the following corrective functions:
grammar correction;
misused word correction; and
vocabulary enhancement.
128. Apparatus according to any of claims 121, 123 and 124 and wherein said correction generating means comprises correction language input generating means for providing a corrected language output based on a selection made by said selecting means without requiring user intervention.
129. An apparatus according to any of claims 123 and 124 and wherein said selecting means is also operative to make said selection based, at least in part, on a user input uncertainty metric.
130. The device of claim 129, and wherein the user input uncertainty metric is a function of an uncertainty measure based on a person providing the input.
131. The apparatus as claimed in any one of claims 121, 123 and 124 and wherein the selection means also uses a user input history learning function.
132. The apparatus according to claim 79, and further comprising:
suspicious word recognition means for receiving a multi-word language input and providing a suspicious word output indicative of the suspicious word;
feature recognition means for recognizing features including the suspect word;
replacement selection means for identifying a replacement for the suspect word;
occurrence means for using a corpus and providing an occurrence output that ranks the features including the replacement according to their frequency of use in the corpus; and
correction output generating means for providing a correction output using said occurrence output,
the feature recognition apparatus includes:
at least one of the following:
an N-gram recognition device; and
co-occurrence identification means; and
at least one of the following:
skip-gram recognition means;
switch-gram recognition means; and
means for recognizing features previously used by the user.
133. The apparatus according to claim 79, and further comprising:
a syntax error suspicion means for evaluating at least a majority of words in a linguistic input based on their suitability in the context of the linguistic input; and
suspected syntax error correction generating means for providing a correction output based at least in part on the evaluation performed by the suspicion means.
134. The apparatus according to claim 133, and further comprising:
suspected grammatical error substitution generating means for generating a text-based representation based on the linguistic input, the text-based representation providing a plurality of substitutions of at least one of the at least most words in the linguistic input; and
suspected grammatical error substitution selection means for selecting among said plurality of substitutions for each of said at least one of said at least most words in at least said linguistic input, and wherein
The suspected syntax error correction generating means is for providing the correction output based on a selection made by the suspected syntax error replacement selecting means.
135. The apparatus according to claim 133, and further comprising:
suspect word output indication means for indicating a degree to which at least some of said at least most words in said linguistic input are suspected to contain grammatical errors.
136. An apparatus according to claim 133, and wherein said suspected grammar error correction generating means includes auto-corrected language generating means for providing a corrected text output based at least in part on an evaluation performed by said suspicion means without requiring user intervention.
137. The apparatus according to claim 79, and further comprising:
a syntax error suspicion means for evaluating a word in the language input;
suspected grammatical error substitution generating means for generating a plurality of substitutions of at least some of the words in the linguistic input that are evaluated by the suspicion means as suspect words, at least one of the plurality of substitutions of words in the linguistic input being consistent with a contextual characteristic of the words in the linguistic input;
suspected syntax error substitution selection means for selecting at least among said plurality of substitutions; and
suspected syntax error correction generating means for providing a correction output based at least in part on the selection made by the suspected syntax error replacement selecting means.
138. The apparatus according to claim 79, and further comprising:
a syntax error suspicion means for evaluating words in the linguistic input and identifying suspect words;
suspected grammatical error substitution generating means for generating a plurality of substitutions of said suspect word;
suspected grammar error substitution selection means for ranking each of said suspect words and ones of said plurality of substitutions produced therefor by said suspected grammar error substitution generation means in accordance with a plurality of selection criteria and applying a bias favoring said suspect words over ones of said plurality of substitutions produced therefor by said suspected grammar error substitution generation means; and
suspected syntax error correction generating means for providing a correction output based at least in part on the selection made by the suspected syntax error replacement selecting means.
139. Apparatus according to claim 137 or 138 and wherein said suspected grammar error correction generating means includes corrected language input generating means for providing a corrected language output based on selections made by said suspected grammar error replacement selecting means without requiring user intervention.
140. The apparatus according to claim 79, and further comprising:
means for context-based scoring of the respective alternative corrections based at least in part on a frequency of occurrence of the contextual feature sequences CFS in the Internet corpus.
141. The apparatus according to claim 140 and also comprising at least one of the following means working in conjunction with said context-based scoring:
spelling correcting means;
misused word correcting means;
grammar correcting means; and
vocabulary enhancement means.
142. An apparatus according to claim 140 and wherein said context-based scoring is also based, at least in part, on normalized CFS frequency of occurrence in an internet corpus.
143. An apparatus as claimed in claim 140 and wherein said context-based score is also based, at least in part, on a CFS importance score.
144. The device of claim 143 and wherein said CFS importance score is a function of at least one of:
the operation of part-of-speech tagging and sentence parsing functions; the length of the CFS; the frequency of occurrence of each of the words in the CFS; and the CFS type.
145. The apparatus according to claim 79 and also comprising vocabulary enhancement means, said vocabulary enhancement means comprising:
a vocabulary challenged word recognition device;
a replacement vocabulary enhancement generating means; and
a context-based scoring device based at least in part on the frequency of occurrence of the contextual feature sequences CFS in the Internet corpus,
the replacement vocabulary enhancement generating means includes thesaurus preprocessing means for generating replacement vocabulary enhancements.
146. The apparatus according to claim 79, and further comprising:
confidence level assigning means for assigning a confidence level to a replacement selected from the plurality of replacements;
wherein the correction output is based at least in part on the confidence level.
147. The device of claim 146 and wherein said plurality of alternatives are evaluated based on a contextual feature sequence, CFS, and said confidence level is based on at least one of the following parameters:
the number, type, and rating of CFSs selected;
a measure of statistical significance of the frequency of occurrence of the plurality of substitutions in the context of the CFS;
a degree of correspondence in selection of one of the plurality of alternatives based on the preference metric for each of the CFSs and based on the word similarity scores for the plurality of alternatives;
a non-contextual similarity score for the one of the plurality of alternatives above a first predetermined minimum threshold; and
a degree of availability of context data, the degree being indicated by a number of CFSs having CFS scores greater than a second predetermined minimum threshold and having preference scores above a third predetermined threshold.
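Claims 146 and 147 assign a confidence level to the selected alternative from parameters such as the number and scores of the available CFSs, the degree of agreement among them, and the non-contextual similarity score. The sketch below shows one plausible way to combine such signals into a single value; the saturation point, thresholds and weighting are assumptions made for illustration only, not the claimed method.

```python
# Illustrative sketch: deriving a confidence level for a selected alternative
# from CFS-related signals. All thresholds and weights here are assumptions.

def assign_confidence(cfs_scores, agreement_ratio, similarity_score,
                      min_similarity=0.6, min_cfs_score=0.1):
    """Combine (a) how many strong CFSs support the decision, (b) the share of
    CFSs that agree on the chosen alternative, and (c) its non-contextual
    similarity score, into a confidence value in [0, 1]."""
    strong = [s for s in cfs_scores if s > min_cfs_score]
    availability = min(len(strong) / 5.0, 1.0)   # saturate at 5 strong CFSs
    similarity_ok = 1.0 if similarity_score >= min_similarity else 0.5
    return availability * agreement_ratio * similarity_ok

# Toy usage: 4 strong CFSs, 75% of CFSs prefer the chosen word, high similarity.
print(assign_confidence([0.3, 0.25, 0.4, 0.2, 0.05], 0.75, 0.9))
```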
148. The apparatus according to claim 79, and further comprising:
punctuation error suspicion means for evaluating at least some words and punctuation marks in a language input in terms of their suitability within the context of the language input, based on their frequency of occurrence in an internet corpus; and
suspected punctuation error correction generating means for providing a correction output based at least in part on the evaluation performed by said punctuation error suspicion means.
149. The apparatus according to claim 148 and wherein said suspected punctuation error correction generating means comprises at least one of: missing punctuation correction means, redundant punctuation correction means and punctuation substitution correction means.
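Claims 148 and 149 cover missing, redundant and substituted punctuation, judged by frequency of occurrence in context. The sketch below illustrates one minimal reading of the decision for a single comma position; the `phrase_counts` lookup and the margin factor are assumptions invented for the example.

```python
# Illustrative sketch: deciding whether a comma is missing, redundant, or
# correct at a given position by comparing corpus frequencies of the phrase
# with and without the comma. `phrase_counts` is a stand-in for corpus queries.

def punctuation_decision(left: str, right: str, has_comma: bool,
                         phrase_counts: dict, margin: float = 2.0) -> str:
    """Return 'insert', 'remove' or 'keep' for a comma between left and right."""
    with_comma = phrase_counts.get(f"{left}, {right}", 0)
    without_comma = phrase_counts.get(f"{left} {right}", 0)
    if not has_comma and with_comma > margin * without_comma:
        return "insert"
    if has_comma and without_comma > margin * with_comma:
        return "remove"
    return "keep"

# Toy usage: "However we decided" -> a comma after "However" is far more common.
counts = {"However, we decided": 90, "However we decided": 10}
print(punctuation_decision("However", "we decided", False, counts))  # 'insert'
```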
150. The apparatus according to claim 79, and further comprising:
syntax element error suspicion means for evaluating at least some words in a language input in terms of their suitability within the context of the language input, based on their frequency of occurrence in an internet corpus; and
suspected syntax element error correction generating means for providing a correction output based at least in part on the evaluation performed by said syntax element error suspicion means.
151. The apparatus according to claim 150 and wherein said suspected syntax element error correction generating means comprises at least one of: missing syntax element correcting means, superfluous syntax element correcting means, and syntax element substitution correcting means.
152. The apparatus according to claim 150 or 151 and wherein the syntax element is one of an article, a preposition, and a conjunction.
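Claims 150 to 152 apply the same frequency-based evaluation to syntax elements such as articles, prepositions and conjunctions, covering the missing, superfluous and substituted cases. A small sketch of the substitution case for prepositions follows; the candidate list and the `phrase_counts` lookup are illustrative assumptions, not the claimed implementation.

```python
# Illustrative sketch: choosing among candidate prepositions (a syntax element
# substitution) by phrase frequency in a corpus, with the empty candidate
# covering the "superfluous element" case. The candidate set and
# `phrase_counts` are assumptions made for this example.

PREPOSITIONS = ["in", "on", "at", "to", "for", "of"]

def best_preposition(left: str, right: str, phrase_counts: dict) -> str:
    """Pick the preposition whose surrounding phrase is most frequent, also
    considering the case where no preposition should appear at all."""
    candidates = {p: phrase_counts.get(f"{left} {p} {right}", 0) for p in PREPOSITIONS}
    candidates[""] = phrase_counts.get(f"{left} {right}", 0)   # no-preposition case
    return max(candidates, key=candidates.get)

# Toy usage: "interested ___ the results"
counts = {"interested in the results": 85, "interested on the results": 1,
          "interested at the results": 0, "interested the results": 2}
print(repr(best_preposition("interested", "the results", counts)))  # 'in'
```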
153. A machine translation apparatus comprising:
machine translation means;
sentence retrieval means operative based on an input text provided by the machine translation means to retrieve a plurality of sentences from an internet corpus, the plurality of sentences including words corresponding to words in the input text;
sentence generation means operative using a plurality of sentences retrieved from the internet corpus by the sentence retrieval means to generate at least one correct sentence expressing the input text generated by the machine translation means,
wherein the sentence generation apparatus comprises:
sentence simplification means for simplifying the sentences retrieved from the internet corpus;
simplified sentence grouping means for grouping similar simplified sentences provided by the sentence simplification means; and
simplified sentence group classification means for classifying the group of the similar simplified sentences;
alternative generating means for generating a text-based representation based on an input sentence, the text-based representation providing a plurality of alternatives for each of a plurality of words in the sentence;
selecting means for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
correction generating means for providing a correction output based on the selection made by the selecting means;
wherein the selecting means comprises context-based scoring means for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
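Claim 153 (and the parallel claims that follow) describes a pipeline in which sentences retrieved from an internet corpus are simplified, grouped by similarity and the groups ranked before a correct sentence is produced. The sketch below illustrates one minimal reading of that pipeline on toy data; the stop-word list, the grouping key and `retrieve_sentences` are assumptions, not the claimed implementation.

```python
# Illustrative sketch of the claimed pipeline: retrieve sentences that share
# content words with the input, simplify them, group similar simplified
# sentences, and rank the groups by size. The simplification (stop-word
# removal) and the grouping key are assumptions made for this example.

from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on"}

def simplify(sentence: str) -> tuple:
    """Simplify a sentence by lowercasing and dropping stop words."""
    return tuple(w for w in sentence.lower().split() if w not in STOP_WORDS)

def retrieve_sentences(input_text: str, corpus: list) -> list:
    """Keep corpus sentences sharing at least one content word with the input."""
    content = set(simplify(input_text))
    return [s for s in corpus if content & set(simplify(s))]

def generate_sentence(input_text: str, corpus: list) -> str:
    """Group simplified retrieved sentences and return a sentence from the
    largest (best-ranked) group as the generated output."""
    retrieved = retrieve_sentences(input_text, corpus)
    groups = Counter(simplify(s) for s in retrieved)
    if not groups:
        return input_text
    best_group, _ = groups.most_common(1)[0]
    # Return the first original sentence whose simplified form matches.
    for s in retrieved:
        if simplify(s) == best_group:
            return s
    return input_text

# Toy usage: a rough machine-translation output is mapped onto a fluent sentence.
toy_corpus = ["He went to the store yesterday", "He went to store yesterday",
              "She sells flowers", "he went to the store yesterday"]
print(generate_sentence("he go to store yesterday", toy_corpus))
```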
154. A text correction apparatus comprising:
sentence retrieval means that operates based on an input text provided by a text correction function to retrieve a plurality of sentences from an internet corpus, the plurality of sentences including words corresponding to words in the input text;
sentence generation means operative using a plurality of sentences retrieved from the internet corpus by the sentence retrieval means to generate at least one correct sentence expressing the input text,
wherein the sentence generation apparatus comprises:
sentence simplification means for simplifying the sentences retrieved from the internet corpus;
simplified sentence grouping means for grouping similar simplified sentences provided by the sentence simplification means; and
simplified sentence group classification means for classifying the group of the similar simplified sentences;
alternative generating means for generating a text-based representation based on an input sentence, the text-based representation providing a plurality of alternatives for each of a plurality of words in the sentence;
selecting means for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
correction generating means for providing a correction output based on the selection made by the selecting means;
wherein the selecting means comprises context-based scoring means for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
155. A sentence search apparatus comprising:
sentence search means for providing an input text based on a query word input by a user;
sentence retrieval means that operates based on the input text provided by the sentence search means to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text;
sentence generation means operating using a plurality of sentences retrieved from the internet corpus by the sentence retrieval means to generate at least one correct sentence expressing the input text generated by the sentence search means,
wherein the sentence generation apparatus comprises:
sentence simplification means for simplifying the sentences retrieved from the internet corpus;
simplified sentence grouping means for grouping similar simplified sentences provided by the sentence simplification means; and
simplified sentence group classification means for classifying the group of the similar simplified sentences;
alternative generating means for generating a text-based representation based on an input sentence, the text-based representation providing a plurality of alternatives for each of a plurality of words in the sentence;
selecting means for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
correction generating means for providing a correction output based on the selection made by the selecting means;
wherein the selecting means comprises context-based scoring means for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
156. A speech to text conversion apparatus comprising:
speech-to-text conversion means for providing input text;
sentence retrieval means that operates based on the input text provided by the speech-to-text conversion means to retrieve a plurality of sentences from an internet corpus, the plurality of sentences containing words corresponding to the words in the input text;
sentence generation means operative using a plurality of sentences retrieved from the internet corpus by the sentence retrieval means to generate at least one correct sentence expressing the input text generated by the speech-to-text conversion means,
wherein the sentence generation apparatus comprises:
sentence simplification means for simplifying the sentences retrieved from the internet corpus;
simplified sentence grouping means for grouping similar simplified sentences provided by the sentence simplification means; and
simplified sentence group classification means for classifying the group of the similar simplified sentences;
alternative generating means for generating a text-based representation based on an input sentence, the text-based representation providing a plurality of alternatives for each of a plurality of words in the sentence;
selecting means for selecting at least among the plurality of alternatives for each of the plurality of words of the sentence based at least in part on an Internet corpus; and
correction generating means for providing a correction output based on the selection made by the selecting means;
wherein the selecting means comprises context-based scoring means for ranking the plurality of alternatives based at least in part on a frequency of occurrence of contextual feature sequences, CFSs, in an Internet corpus.
HK12101697.0A 2008-07-31 2009-02-04 Automatic context sensitive language generation, correction and enhancement using an internet corpus HK1161646B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
PCT/IL2008/001051 WO2009016631A2 (en) 2007-08-01 2008-07-31 Automatic context sensitive language correction and enhancement using an internet corpus
ILPCT/IL2008/001051 2008-07-31
PCT/IL2009/000130 WO2010013228A1 (en) 2008-07-31 2009-02-04 Automatic context sensitive language generation, correction and enhancement using an internet corpus

Publications (2)

Publication Number Publication Date
HK1161646A1 (en) 2012-07-27
HK1161646B (en) 2015-10-23



Legal Events

Date Code Title Description
PC Patent ceased (i.e. patent has lapsed due to the failure to pay the renewal fee)

Effective date: 20190207