WO2020000764A1 - Hindi-oriented multilingual mixed input method and device - Google Patents
Hindi-oriented multilingual mixed input method and device
- Publication number
- WO2020000764A1 (PCT/CN2018/109507)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- hindi
- vocabulary
- input
- language model
- latin
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/02—Input arrangements using manually operated switches, e.g. using keyboards or dials
- G06F3/023—Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
- G06F3/0233—Character input methods
- G06F3/0237—Character input methods using prediction or retrieval techniques
Definitions
- The invention relates to the technical field of input methods, and in particular to a Hindi-oriented multilingual mixed input method and device.
- Multilingual mixed input can be achieved by switching input modes. For example, when the user is typing Latin characters on the English keyboard and wants to enter a Hindi character, the user has to switch to the Hindi input method, enter the character, and then switch back to the English keyboard to continue typing Latin characters.
- The present invention provides a Hindi-oriented multilingual mixed input method and device to solve the technical problem in the prior art that achieving multilingual mixed input by switching input modes is inefficient and extremely time-consuming.
- An embodiment of one aspect of the present invention provides a multilingual mixed input method for Hindi, including:
- obtaining the first candidate character string list of Latin character forms corresponding to the Latin character sequence according to the first language model includes:
- when the Latin character sequence is a Hindi word in complete Latin character spelling form, adding the Hindi word corresponding to the Latin character sequence to the first candidate character string list;
- when the Latin character sequence is a Hindi word in incomplete Latin character spelling form, obtaining an extended option, where the extended option includes a Hindi word or word fragment in Latin character spelling form that contains the Latin character sequence, and adding the extended option to the first candidate character string list.
- obtaining the first candidate character string list of the Latin character form corresponding to the Latin character sequence according to the first language model further includes:
- the method further includes:
- predicting a subsequent vocabulary of the input vocabulary according to a language model corresponding to the input vocabulary, and generating a second candidate word list according to the prediction result including:
- the subsequent input vocabulary is predicted according to a second language model, which is a pre-established language model that spells Hindi in the form of Hindi characters.
- the first candidate character string list of Latin character forms corresponding to the Latin character sequence is obtained according to the first language model, which is a language model that spells Hindi in Latin characters, where,
- the pre-establishment of the first language model includes:
- constructing a language model using the collated corpus includes:
- using the collated corpus to construct a language model in N-gram form and calculating the parameters of the language model, where the parameters include the words in the language model and, for each N-gram word arrangement, the conditional probability of the Nth word given the preceding N-1 words, where N is a positive integer.
- The Hindi-oriented multilingual mixed input method obtains the Latin character sequence of the current input word typed in the input method interface, and then obtains, according to a first language model, a first candidate character string list of Latin character forms corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Then, according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi words, it obtains the Hindi character spelling form corresponding to each Hindi word in Latin character spelling form in the first candidate character string list, and generates a first candidate word list containing words in both Latin character spelling form and Hindi character spelling form. Finally, it displays the first candidate word list on the input method interface, obtains a selection operation on a word in the first candidate word list, and inputs the selected word as the input word.
- Determining the Hindi character spelling form through the pre-established mapping relationship can improve the accuracy of the output result.
- An embodiment of another aspect of the present invention provides a multilingual mixed input device for Hindi, including:
- an input character acquisition module configured to acquire the Latin character sequence of the current input word typed in the input method interface;
- a first candidate character string generating module configured to obtain, according to a first language model, a first candidate character string list of Latin character forms corresponding to the Latin character sequence, where the first language model is a language model that spells Hindi in Latin characters;
- a vocabulary mapping module configured to obtain a target Hindi vocabulary list according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi words, where the target Hindi vocabulary list includes the Hindi character spelling forms corresponding to the Hindi words in Latin character spelling form in the first candidate character string list;
- a first candidate word list generating module configured to generate, according to the first candidate character string list and the target Hindi vocabulary list, a first candidate word list containing words in Latin character spelling form and words in Hindi character spelling form;
- a first candidate word list display module configured to display the first candidate word list on the input method interface;
- a first candidate word input module configured to obtain a selection operation on a word in the first candidate word list and to input the selected word as the input word.
- the first candidate string generating module is specifically configured to:
- when the Latin character sequence is a Hindi word in complete Latin character spelling form, add the Hindi word corresponding to the Latin character sequence to the first candidate character string list;
- when the Latin character sequence is a Hindi word in incomplete Latin character spelling form, obtain an extended option, where the extended option includes a Hindi word or word fragment in Latin character spelling form that contains the Latin character sequence, and add the extended option to the first candidate character string list.
- the first candidate string generating module is further configured to:
- the device further includes:
- a second candidate word list generating module configured to predict a subsequent vocabulary of the input vocabulary according to the language model corresponding to the input vocabulary, and generate a second candidate word list according to the prediction result;
- a second candidate word list display module configured to display the second candidate word list on an input method interface
- a second candidate word input module is configured to obtain a selection operation of a vocabulary of the second candidate word list, and input the selected vocabulary as a next input vocabulary.
- the second candidate word list generating module is specifically configured to:
- the subsequent input vocabulary is predicted according to a second language model, which is a pre-established language model that spells Hindi in the form of Hindi characters.
- the device further includes:
- a first language model creation module is used to establish a first language model.
- the first language model creation module includes:
- a corpus acquisition unit configured to acquire corpus data spelling Hindi in the form of Latin characters, and preprocess the corpus data to remove the erroneous corpus and low-frequency corpus to obtain valid corpus;
- a corpus deduplication unit for removing redundant parts in the valid corpus data to obtain a collated corpus
- a language model building unit is used to build a language model using the collated corpus.
- the language model construction unit is specifically configured to:
- use the collated corpus to construct a language model in N-gram form and calculate the parameters of the language model, where the parameters include the words in the language model and, for each N-gram word arrangement, the conditional probability of the Nth word given the preceding N-1 words, where N is a positive integer.
- The Hindi-oriented multilingual mixed input device obtains the Latin character sequence of the current input word typed in the input method interface, and then obtains, according to a first language model, a first candidate character string list of Latin character forms corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Then, according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi words, it obtains the Hindi character spelling form corresponding to each Hindi word in Latin character spelling form in the first candidate character string list, and generates a first candidate word list containing words in both Latin character spelling form and Hindi character spelling form. Finally, it displays the first candidate word list on the input method interface, obtains a selection operation on a word in the first candidate word list, and inputs the selected word as the input word.
- Determining the Hindi character spelling form through the pre-established mapping relationship can improve the accuracy of the output result.
- Another embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the Hindi-oriented multilingual mixed input method proposed by the above embodiment of the present invention is implemented.
- An embodiment of the fourth aspect of the present invention provides a computer program product; when instructions in the computer program product are executed by a processor, the Hindi-oriented multilingual mixed input method according to the foregoing embodiment of the present invention is implemented.
- An embodiment of the fifth aspect of the present invention provides a computing device including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the Hindi-oriented multilingual mixed input method according to the above embodiment of the present invention is implemented.
- The computer program product and the computing device have beneficial effects similar to those of the Hindi-oriented multilingual mixed input method and device according to the first and second aspects of the present invention, which are not repeated here.
- FIG. 1 is a schematic flowchart of a Hindi-oriented multilingual mixed input method according to a first embodiment of the present invention
- FIG. 2 is a schematic flowchart of lexical association input in a Hindi-oriented multilingual mixed input method according to an embodiment of the present invention
- FIG. 3 is a schematic flowchart of establishing a language model according to an embodiment of the present invention.
- FIG. 4 is a structural block diagram of a multi-lingual mixed input device for Hindi according to an embodiment of the present invention.
- FIG. 5 is a structural block diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention.
- The first method is to switch the input mode. For example, when the user is typing Latin characters on the English keyboard and wants to enter a Hindi character, the user has to switch to the Hindi input method, enter the character, and then switch back to the English keyboard to continue typing Latin characters.
- The second method is to enter a temporary input mode through a preset operation; in the temporary input mode the user can type characters in the second language. For example, in Chinese and English input methods, the user can switch the input method by pressing the Shift key.
- In the third method, some input methods support two encoding schemes in the language model; that is, based on the user's input, the most suitable encoding rule is selected automatically and the characters are displayed.
- With the first method, the efficiency of mixed-language input is low.
- With the second method, special processing of characters is required after entering the temporary input mode, which lengthens the development cycle.
- With the third method, when the encoding differences between the two languages are small, the accuracy of the language model's output is low.
- the present invention mainly aims at the technical problems of low efficiency of multilingual mixed input and low accuracy of output results in the prior art, and proposes a multilingual mixed input method oriented to Hindi.
- The language model in N-gram form is based on the following assumption: the occurrence of the Nth word depends only on the preceding N-1 words and on no other words. The probability of occurrence of each word can be obtained through statistics over corpus data.
- That is, the probability of the Nth word is determined by the words w_1, w_2, w_3, ..., w_{N-1} that have appeared before it.
- The preceding words are used to predict the next word, and observing a large amount of text yields how likely each predicted word is to follow those preceding words. The constructed language model can therefore be regarded as an (N-1)-order Markov model, or N-gram language model.
- The value of N can be 2, 3, 4, etc.
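The N-gram assumption above can be illustrated with a small sketch. This is only an illustration under stated assumptions: the toy corpus, the `<s>` padding token, and the function names `train_ngram` and `cond_prob` are hypothetical, and the plain maximum-likelihood estimate used here (no smoothing) is one possible way to compute the conditional probabilities, not necessarily the one used in the patent.

```python
from collections import Counter

def train_ngram(sentences, n=2):
    """Count n-grams and their (n-1)-word contexts in tokenized sentences."""
    ngram_counts, context_counts = Counter(), Counter()
    for tokens in sentences:
        # Pad the start so the first word also has a full context (assumption).
        padded = ["<s>"] * (n - 1) + tokens
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngram_counts[context + (padded[i],)] += 1
            context_counts[context] += 1
    return ngram_counts, context_counts

def cond_prob(word, context, ngram_counts, context_counts):
    """Maximum-likelihood estimate of P(word | context); 0.0 for unseen contexts."""
    context = tuple(context)
    total = context_counts[context]
    return ngram_counts[context + (word,)] / total if total else 0.0

# Hypothetical toy corpus of Hindi spelled in Latin characters.
corpus = [["main", "bhi", "nahi"], ["main", "ne", "kaha"], ["main", "bhi", "to"]]
counts, contexts = train_ngram(corpus, n=2)
p_bhi = cond_prob("bhi", ["main"], counts, contexts)  # 2 of the 3 words following "main"
```

With N = 2 each probability depends only on the single preceding word; passing a larger `n` lengthens the context accordingly.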
- FIG. 1 is a schematic flowchart of a Hindi-oriented multilingual mixed input method according to an embodiment of the present invention.
- The Hindi-oriented multilingual mixed input method provided by the embodiment of the present invention may be implemented by the Hindi-oriented multilingual mixed input device provided by the embodiment of the present invention, and the device may be configured in any computing device so that the computing device implements a multilingual mixed input function for Hindi.
- The computing device may be a hardware device such as a personal computer (PC), a cloud device, or a mobile device.
- The mobile device may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, and/or other display hardware.
- the multilingual mixed input method for Hindi includes the following steps:
- Step 101 Obtain a Latin character sequence of a current input vocabulary typed by an input method interface.
- the computing device may be provided with an input method interface, and a user may enter a Latin character sequence through the input method interface.
- For example, when the computing device is a mobile phone, the user can type the Latin character sequence on the touch screen; when the computing device is a PC, the user can type the Latin character sequence on the keyboard.
- The computing device may be provided with a listener to monitor the user's input operations. When an input operation is detected, the Latin character sequence of the current input word typed by the user in the input method interface can be obtained from it.
- For example, when the user wants to enter "mobile phone", the user can type "mobile" in the input method interface.
- Step 102 Obtain a first candidate character string list of Latin character forms corresponding to the Latin character sequence according to the first language model.
- the first language model is a pre-established language model that spells Hindi in the form of Latin characters.
- corpus data that spells Hindi in the form of Latin characters can be obtained, and then a language model is constructed based on the corpus data to obtain a first language model.
- When a Latin character sequence is acquired, it may be input to the first language model to obtain the first candidate character string list of Latin character forms corresponding to the Latin character sequence.
- When the Latin character sequence is a Hindi word in complete Latin character spelling form, the Hindi word corresponding to the Latin character sequence may be added directly to the first candidate character string list.
- When the Latin character sequence corresponds to a Hindi word in incomplete Latin character spelling form, an extended option can be obtained in order to improve the user's input efficiency, or to correct and complete the Latin character sequence entered by the user.
- The extended option includes a Hindi word or word fragment in Latin character spelling form that contains the Latin character sequence; the extended option is then added to the first candidate character string list.
- The input method may also provide an error correction function. That is, obtaining the first candidate character string list of Latin character forms corresponding to the Latin character sequence according to the first language model may further include: when the first language model contains no Hindi word in Latin character spelling form that contains the Latin character sequence, obtaining the Hindi word in Latin character spelling form with the highest similarity to the Latin character sequence and adding it to the first candidate character string list as an extended option.
- the extension options can be: Mai, Nai, Main, Maine.
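The three cases of Step 102 (complete spelling, completion of a partial spelling, and error correction) might be sketched as follows. This is a minimal sketch, not the patented implementation: the `lexicon` dictionary and its frequencies are made up for illustration, the name `first_candidate_strings` and the `max_options` parameter are hypothetical, and the standard-library `difflib` ratio stands in for the similarity measure, which the text does not specify.

```python
import difflib

def first_candidate_strings(typed, lexicon, max_options=4):
    """Return Latin-form candidate strings for a typed Latin character sequence.

    lexicon maps Latin spellings of Hindi words to a frequency (assumed data).
    """
    typed = typed.lower()
    candidates = []
    if typed in lexicon:
        # Case 1: the sequence is already a complete Latin spelling.
        candidates.append(typed)
    # Case 2: extended options, i.e. known words that contain the sequence,
    # most frequent first.
    extended = sorted(
        (w for w in lexicon if typed in w and w != typed),
        key=lambda w: -lexicon[w],
    )
    candidates.extend(extended[:max_options])
    if not candidates:
        # Case 3: error correction; fall back to the single most similar word.
        candidates = difflib.get_close_matches(typed, list(lexicon), n=1, cutoff=0.0)
    return candidates

# Hypothetical lexicon with made-up frequencies.
lexicon = {"main": 50, "maine": 30, "nahi": 40, "abhishek": 5}
completed = first_candidate_strings("Mai", lexicon)
corrected = first_candidate_strings("nahu", lexicon)
```

Ranking extended options by frequency is a design choice made for this sketch; a real input method would more likely rank them with the language model's probabilities.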
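- Step 103 Obtain a target Hindi vocabulary list according to the pre-established mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi words; the target Hindi vocabulary list includes the Hindi character spelling forms corresponding to the Hindi words in Latin character spelling form in the first candidate character string list.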
- a mapping relationship between the spelling form of the Latin characters of the Hindi vocabulary and the spelling form of the Hindi characters may be established in advance.
- The Latin character spelling form of a Hindi word takes two forms. One is a Latin spelling transliterated directly from the pronunciation of the Hindi characters; for example, the Latin characters "dena" correspond to a word spelled in Hindi characters. "dena" has no practical meaning in other contexts and is meaningful only when the user types it to obtain the corresponding Hindi characters. The other is English words that do not exist in Hindi, for example the English word "mobile".
- By establishing a mapping between the Latin character spellings and the Hindi character spellings of Hindi words, for example between "mobile" and its Hindi character spelling, the mapping relationship between the Latin character spelling form and the Hindi character spelling form of Hindi words can be kept one-to-one.
- The Hindi character spelling form corresponding to each Hindi word in Latin character spelling form in the first candidate character string list can be obtained by querying the above mapping relationship, which is simple and easy to implement. Because the corresponding Hindi character spelling form is determined through the pre-established mapping relationship, the accuracy of the output result is further improved.
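The mapping lookup of Step 103 can be pictured as a plain dictionary query. The sketch below is illustrative only: the table entries are examples supplied for this sketch (the Hindi characters are missing from the extracted text), and the names `LATIN_TO_HINDI`, `is_one_to_one` and `target_hindi_list` are hypothetical.

```python
# Illustrative mapping from Latin spellings to Hindi (Devanagari) spellings.
# These entries are assumptions for the example, not data from the patent.
LATIN_TO_HINDI = {
    "dena": "देना",
    "main": "मैं",
    "nahi": "नहीं",
    "mobile": "मोबाइल",
}

def is_one_to_one(mapping):
    """True if no two Latin spellings share the same Hindi spelling."""
    return len(set(mapping.values())) == len(mapping)

def target_hindi_list(candidate_strings, mapping=LATIN_TO_HINDI):
    """Look up the Hindi spelling of every candidate that has a mapping entry."""
    return [mapping[word] for word in candidate_strings if word in mapping]

hindi_forms = target_hindi_list(["main", "maine", "dena"])
```

Candidates without an entry (here "maine") are simply skipped, so they would appear in the final list only in their Latin form.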
- Step 104 Generate a first candidate word list of words including Latin character spelling form and Hindi character spelling form according to the first candidate character string list and the target Hindi vocabulary list.
- The first candidate word list, containing words in both Latin character spelling form and Hindi character spelling form, may be generated from the first candidate character string list and the Hindi character spelling forms corresponding to the Hindi words in Latin character spelling form in that list.
- The first candidate word list may include all the Hindi words in Latin character spelling form in the first candidate character string list together with the corresponding words in Hindi character spelling form.
- Alternatively, a first number of Hindi words in Latin character spelling form from the first candidate character string list and a second number of corresponding words in Hindi character spelling form may be selected.
- The first and second numbers can be the same or different.
- For example, the first number can be two and the second number can be three.
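Step 104 can be sketched as a simple merge of the two lists, truncated to the first and second numbers. The function name `first_candidate_words`, the sample mapping, and the ordering (Latin forms first, Hindi forms after) are assumptions made for illustration; the text does not prescribe an ordering.

```python
def first_candidate_words(latin_candidates, latin_to_hindi,
                          first_number=2, second_number=3):
    """Combine Latin-form and Hindi-form candidates into one display list.

    Keeps at most first_number Latin spellings and at most second_number
    Hindi spellings; Latin forms are listed first (an assumed ordering).
    """
    latin_part = latin_candidates[:first_number]
    hindi_part = [
        latin_to_hindi[word] for word in latin_candidates if word in latin_to_hindi
    ][:second_number]
    return latin_part + hindi_part

# Hypothetical inputs for illustration.
mapping = {"main": "मैं", "nahi": "नहीं"}
merged = first_candidate_words(["main", "maine", "nahi"], mapping)
```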
- Step 105 Display the first candidate word list on the input method interface.
- the first candidate word list may be displayed on the input method interface.
- the first candidate word list displayed on the input method interface may be: Nai, Main, Maine.
- Step 106 Acquire a selection operation of a word in the first candidate word list, and input the selected word as an input word.
- the selection operation is triggered by a user, and the selection operation may be, for example, a user's click operation, or the user triggers an operation corresponding to a number or a space key on the keyboard, which is not limited.
- the user may select a word from the first candidate word list for input according to actual needs.
- The computing device may be provided with a listener to monitor the selection operation triggered by the user. When a selection operation is detected, the selected word is determined according to the selection operation and is then input as the input word.
- the user can select "Main" as an input word for input.
- The present invention takes mixed input of Hindi and Latin characters as an example, but is not limited thereto; those skilled in the art can implement mixed input of any two languages based on the present invention, which therefore has strong scalability.
- the subsequent vocabulary of the input vocabulary can be predicted, so that the user can input the next vocabulary according to the prediction result. Therefore, there is no need for the user to manually type the next vocabulary, which further improves the user's multilingual mixed input efficiency.
- FIG. 2 is a schematic flowchart of lexical association input in a Hindi-oriented multilingual mixed input method according to an embodiment of the present invention.
- the Hindi-oriented multilingual mixed input method may further include the following steps:
- Step 201 Predict the subsequent vocabulary of the input vocabulary according to the language model corresponding to the input vocabulary, and generate a second candidate word list according to the prediction result.
- When the spelling form of the input word is Latin characters, the subsequent input word can be predicted according to the first language model; when the spelling form of the input word is Hindi characters, the subsequent input word is predicted according to the second language model, where
- the second language model is a pre-established language model that spells Hindi in the form of Hindi characters. For example, Hindi corpus data spelled with Hindi characters can be obtained, and then a language model is constructed based on the corpus data to obtain a second language model.
- For example, when the input word is "Main", its spelling form is Latin characters, so the subsequent input word is predicted according to the first language model.
- The prediction result can be: bhi, ne, to, nahi, khud, hi.
- When the spelling form of the input word is Hindi characters, the subsequent input word is predicted according to the second language model.
- the prediction result can be
- The second candidate word list may include all words in the prediction result. Alternatively, because the display area of the computing device is limited, the second candidate word list may include only a third number of words from the prediction result, where the third number is preset.
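Choosing the language model by the script of the last input word and truncating the prediction to the third number might look like the sketch below. The models are represented as nested dictionaries with made-up probabilities, and the names `is_devanagari` and `predict_next` are hypothetical; detecting Hindi characters by the Devanagari Unicode block (U+0900 to U+097F) is an assumption about how the spelling form could be identified.

```python
def is_devanagari(word):
    """True if any character lies in the Devanagari Unicode block (U+0900..U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in word)

def predict_next(word, latin_model, hindi_model, third_number=3):
    """Return up to third_number likely next words for the given input word.

    Each model maps a word to {next_word: probability}; the second (Hindi
    character) model is used when the word is spelled in Hindi characters,
    otherwise the first (Latin character) model is used.
    """
    model = hindi_model if is_devanagari(word) else latin_model
    followers = model.get(word.lower(), {})
    ranked = sorted(followers, key=followers.get, reverse=True)
    return ranked[:third_number]

# Hypothetical models with made-up probabilities.
latin_model = {"main": {"bhi": 0.30, "ne": 0.25, "to": 0.20, "nahi": 0.15}}
hindi_model = {"मैं": {"भी": 0.40, "ने": 0.35}}
latin_prediction = predict_next("Main", latin_model, hindi_model)
hindi_prediction = predict_next("मैं", latin_model, hindi_model)
```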
- Step 202 Display the second candidate word list on the input method interface.
- the second candidate word list may be displayed on the input method interface.
- Step 203 Acquire a vocabulary selection operation of the second candidate word list, and input the selected vocabulary as the next input vocabulary.
- the user may select a word from the second candidate word list for input according to actual needs.
- The computing device may be provided with a listener to monitor the selection operation triggered by the user. When a selection operation is detected, the selected word is determined according to the selection operation and is then input as the next input word.
- The Hindi-oriented multilingual mixed input method can thus perform error correction, completion and prediction of the input words while the user is typing.
- the first candidate word list obtained can be:
- the user can select the vocabulary "Main”, and then predict the subsequent input vocabulary according to the first language model.
- the second candidate word list obtained can be:
- the user can select the vocabulary "bhi", and then predict the subsequent input vocabulary based on the first language model, and the obtained second candidate word list can be:
- the vocabulary that the user wants to output is Hindi, which is spelled in the form of Hindi characters corresponding to "nahi”. At this time, the user can enter "nahi".
- the obtained first A candidate list can be:
- the first candidate word list obtained after the first language model and the query mapping relationship can be:
- the user can select the vocabulary "meri”. After that, the vocabulary that the user wants to output is the Hindi spelled in the form of Hindi characters corresponding to "kahani". At this time, the user can enter "kahani" and pass the first language After the model and the query mapping relationship, the first candidate word list obtained can be:
- As another application scenario, the user wants to enter a Hindi word spelled in Hindi characters but does not know how the word is spelled, and knows only part of the word's Latin-character spelling.
- For example, the Latin-character spelling of the word the user wants to enter is "Abhishek", and the user remembers only its first part, "Abhis".
- 1) The user can enter "Abhis"; after the first language model completes and corrects it and the mapping relationship is queried, the first candidate word list obtained can be:
- FIG. 3 is a schematic flowchart of establishing a language model according to an embodiment of the present invention.
- the process of establishing the first language model may include the following steps:
- Step 301: Obtain corpus data in which Hindi is spelled in Latin characters, and preprocess the corpus data to remove erroneous and low-frequency entries, obtaining a valid corpus.
- Corpus data from India in which Hindi is spelled in Latin characters can be collected and then preprocessed to remove erroneous and low-frequency entries, obtaining a valid corpus.
- For example, the corpus data can undergo preprocessing operations such as removal of non-text noise, spell checking and correction, data cleaning, data format normalization, and selection of high-frequency words, so as to ensure the performance of the trained first language model.
- Step 302 Remove redundant parts in the valid corpus data to obtain a collated corpus.
- the redundant part in the effective corpus data can be removed to obtain a collated corpus, thereby reducing the redundancy of the corpus data and the storage space occupied by it, and improving the learning efficiency of the first language model.
- Step 303: Construct a language model using the collated corpus.
- Once the collated corpus is obtained, it can be used to construct a language model.
- When constructing the language model, to avoid data overflow and improve the performance of the language model, logarithms can be taken so that addition replaces multiplication.
- The language model can be a language model in N-Gram form, that is, an N-gram language model.
- Step 303 may specifically include: constructing a language model in N-Gram form using the collated corpus, and calculating the parameters of the language model, where the parameters include the words in the language model and, for each N-word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer.
- Step 303 may further include: smoothing the conditional probability data so that the conditional probability of any N-word sequence that does not appear in the collated corpus is not zero.
- Optionally, a data smoothing technique can be used to lower the conditional probabilities of the N-word sequences that do appear in the collated corpus, so that the conditional probabilities of the N-word sequences that do not appear are not zero.
- the present invention also proposes a multilingual mixed input device oriented to Hindi.
- the implementation of the device may include one or more computing devices.
- the computing device includes a processor and a memory, and the memory stores an application program including computer program instructions executable on the processor.
- the application program can be divided into a plurality of program modules for corresponding functions of each component of the system.
- the division of program modules is logical rather than physical.
- Each program module can run on one or more computing devices, and one computing device can also run one or more program modules.
- the device of the present invention is described in detail according to the functional logic division of the program module.
- FIG. 4 is a schematic structural diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention.
- the multilingual mixed input device 100 for Hindi may be implemented by using a computing device including a processor and a memory.
- The memory stores program modules that can be executed by the processor. When each program module is executed, it controls the computing device to implement the corresponding function.
- As shown in FIG. 4, the multilingual mixed input device 100 for Hindi includes: an input character acquisition module 101, a first candidate character string generation module 102, a vocabulary mapping module 103, a first candidate word list generation module 104, a first candidate word list display module 105, and a first candidate word input module 106, where:
- the input character acquisition module 101 is configured to acquire a Latin character sequence of a current input vocabulary typed by an input method interface.
- a first candidate character string generating module 102 is configured to obtain a first candidate character string list in the form of a Latin character corresponding to a Latin character sequence according to a first language model.
- The first language model is a language model that spells Hindi in the form of Latin characters.
- The vocabulary mapping module 103 is configured to obtain a target Hindi vocabulary list according to a pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words; the target Hindi vocabulary list includes the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list.
- The first candidate word list generating module 104 is configured to generate, according to the first candidate character string list and the target Hindi vocabulary list, a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling.
- the first candidate word list display module 105 is configured to display the first candidate word list on an input method interface.
- the first candidate word input module 106 is configured to obtain a selection operation of a word in the first candidate word list, and input the selected word as an input word.
- In a possible implementation, referring to FIG. 5 and building on the embodiment shown in FIG. 4, the Hindi-oriented multilingual mixed input device 100 may further include:
- The first candidate character string generating module 102 is specifically configured to: when the Latin character sequence is a complete Latin-character spelling of a Hindi word, add the Hindi word corresponding to the Latin character sequence to the first candidate character string list; and obtain extended options and add them to the first candidate character string list.
- The extended options include Hindi words or word fragments whose Latin-character spelling contains the Latin character sequence.
- The first candidate character string generating module 102 may be further configured to: when the first language model contains no Hindi word whose Latin-character spelling contains the Latin character sequence, obtain the Latin-character Hindi word with the highest similarity to the Latin character sequence and add it to the first candidate character string list as an extended option.
- a second candidate word list generating module 107 is configured to predict a subsequent vocabulary of the input vocabulary according to a language model corresponding to the input vocabulary, and generate a second candidate word list according to the prediction result.
- the second candidate word list display module 108 is configured to display the second candidate word list on the input method interface.
- the second candidate word input module 109 is configured to obtain a selection operation of a vocabulary in the second candidate word list, and input the selected vocabulary as a next input vocabulary.
- The second candidate word list generating module 107 is specifically configured to: determine whether the input word is spelled in Latin characters or in Hindi characters; when the input word is spelled in Latin characters, predict the subsequent input word according to the first language model; and when the input word is spelled in Hindi characters, predict the subsequent input word according to the second language model, which is a pre-established language model that spells Hindi in Hindi characters.
- the first language model creation module 110 is configured to establish a first language model.
- the first language model creation module 110 includes:
- a corpus acquisition unit 111 is configured to acquire corpus data spelling Hindi in the form of Latin characters, and preprocess the corpus data to remove the erroneous corpus and low-frequency corpus therein to obtain a valid corpus.
- the corpus de-redundant unit 112 is used to remove redundant parts in the effective corpus data to obtain a collated corpus.
- the language model constructing unit 113 is configured to construct a language model using the corpus after arrangement.
- The language model constructing unit 113 is specifically configured to: construct a language model in N-Gram form using the collated corpus, and calculate the parameters of the language model, where the parameters include the words in the language model and, for each N-word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer; and smooth the conditional probability data so that the conditional probability of any N-word sequence that does not appear in the collated corpus is not zero.
- The multilingual mixed input device for Hindi obtains the Latin character sequence of the current input word typed in the input method interface, and then obtains, according to a first language model, a first candidate character string list in Latin-character form corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters.
- Next, according to the pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, it obtains the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list, and generates a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling.
- Finally, it displays the first candidate word list on the input method interface, obtains a selection operation on a word in the first candidate word list, and inputs the selected word as the input word.
- In addition, determining the Hindi-character spelling from the mapping relationship can improve the accuracy of the output result.
- the present invention also provides a non-transitory computer-readable storage medium.
- the non-transitory computer-readable storage medium stores executable instructions thereon.
- When the executable instructions are run on a processor, the Hindi-oriented multilingual mixed input method proposed in the foregoing embodiments of the present invention is implemented.
- The storage medium may be provided on the device as part of the device; or, when the device can be remotely controlled by a server, the storage medium may be provided on the remote server that controls the device.
- the computer instructions for implementing the method of the present invention may be carried in any combination of one or more computer-readable media.
- the so-called non-transitory computer-readable medium may include any computer-readable medium, except for the signal itself which is temporarily propagated.
- the computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
- a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
- the present invention also provides a computer program product.
- Computer program code for performing the operations of the present invention may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
- the program code can be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer, partly on a remote computer, or entirely on a remote computer or server.
- Where a remote computer is involved, it can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
- the present invention also provides a computing device.
- a computing device includes a memory, a processor, and a computer program stored on the memory and executable on the processor.
- When the processor executes the program, the Hindi-oriented multilingual mixed input method according to the foregoing embodiments of the present invention is implemented.
- the computing device may be implemented by a central control unit of a computer device as part of the function of the central control unit of the computer device. It can also be implemented by a separate computing device, which is communicatively connected with the central control unit of the computer device.
- The implementation of the computing device may include, but is not limited to, a single-chip microcomputer, a programmable logic controller (PLC), a complex programmable logic device (CPLD), a programmable gate array (PGA), a field-programmable gate array (FPGA), a dedicated neural network chip, and so on.
- The non-transitory computer-readable storage medium, computer program product, and computing device according to the embodiments of the present invention may be implemented with reference to the content specifically described in the foregoing embodiments, and have beneficial effects similar to those of the Hindi-oriented multilingual mixed input method proposed by the foregoing embodiments, which are not repeated here.
- The terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means two or more, such as two or three, unless specifically defined otherwise.
- Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, fragment, or portion of code that includes one or more executable instructions for implementing a particular logical function or process step. The scope of the preferred embodiments of the present invention also includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention pertain.
- A sequenced list of executable instructions for implementing logical functions can be embodied in any computer-readable medium for use by an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them), or for use in combination with such an instruction execution system, apparatus, or device.
- a "computer-readable medium” may be any device that can contain, store, communicate, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
- each part of the present invention may be implemented by hardware, software, firmware, or a combination thereof.
- Multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system.
- For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: discrete logic circuits, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
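A Hindi-oriented multilingual mixed input method and device, wherein the method includes: obtaining the Latin character sequence of the current input word typed in an input method interface; obtaining, according to a first language model, a first candidate character string list in Latin-character form corresponding to the Latin character sequence; obtaining, according to a mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list; generating a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling; displaying the first candidate word list on the input method interface; and obtaining a selection operation on a word in the first candidate word list and inputting the selected word as the input word. The method can improve the efficiency of multilingual mixed input and the user's input experience.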
Description
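Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. "201810713058.9", entitled "一种面向印地语的多语言混合输入方法及装置" (A Hindi-oriented multilingual mixed input method and device), filed on June 29, 2018 by 北京金山安全软件有限公司 (Beijing Kingsoft Security Software Co., Ltd.).
The present invention relates to the technical field of input methods, and in particular to a Hindi-oriented multilingual mixed input method and device.
As international communication becomes more frequent, mixed input of two or even more languages is increasingly common. India's two official languages, English and Hindi, are written in the Latin alphabet and the Devanagari script respectively, so Indian users need to mix Latin-script text and Hindi.
In the prior art, multilingual mixed input is achieved by switching input modes. For example, when a user is typing Latin characters on an English keyboard and wants to enter a Hindi character, the user has to switch to a Hindi input method, enter the character, and then switch back to the English keyboard to continue typing Latin characters.
With this approach the user has to switch input modes back and forth, so multilingual mixed input is inefficient and extremely time-consuming.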
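Summary of the Invention
The present invention provides a Hindi-oriented multilingual mixed input method and device, to solve the technical problem in the prior art that achieving multilingual mixed input by switching input modes is inefficient and extremely time-consuming.
An embodiment of one aspect of the present invention proposes a Hindi-oriented multilingual mixed input method, including:
obtaining the Latin character sequence of the current input word typed in the input method interface;
obtaining, according to a first language model, a first candidate character string list in Latin-character form corresponding to the Latin character sequence, the first language model being a pre-established language model that spells Hindi in Latin characters;
obtaining a target Hindi vocabulary list according to a pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, the target Hindi vocabulary list including the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list;
generating, according to the first candidate character string list and the target Hindi vocabulary list, a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling;
displaying the first candidate word list on the input method interface;
obtaining a selection operation on a word in the first candidate word list, and inputting the selected word as the input word.
As a first possible implementation of the present invention, obtaining, according to the first language model, the first candidate character string list in Latin-character form corresponding to the Latin character sequence includes:
when the Latin character sequence is a complete Latin-character spelling of a Hindi word, adding the Hindi word corresponding to the Latin character sequence to the first candidate character string list; and
obtaining extended options, the extended options including Hindi words or word fragments whose Latin-character spelling contains the Latin character sequence, and adding the extended options to the first candidate character string list.
As a second possible implementation of the present invention, obtaining, according to the first language model, the first candidate character string list in Latin-character form corresponding to the Latin character sequence further includes:
when the first language model contains no Hindi word whose Latin-character spelling contains the Latin character sequence, obtaining the Latin-character Hindi word with the highest similarity to the Latin character sequence and adding it to the first candidate character string list as an extended option.
As a third possible implementation of the present invention, after obtaining the selection operation on a word in the first candidate word list and inputting the selected word as the input word, the method further includes:
predicting the words that follow the input word according to the language model corresponding to the input word, and generating a second candidate word list from the prediction result;
displaying the second candidate word list on the input method interface;
obtaining a selection operation on a word in the second candidate word list, and inputting the selected word as the next input word.
As a fourth possible implementation of the present invention, predicting the words that follow the input word according to the language model corresponding to the input word, and generating the second candidate word list from the prediction result, includes:
determining whether the input word is spelled in Latin characters or in Hindi characters;
when the input word is spelled in Latin characters, predicting the subsequent input word according to the first language model;
when the input word is spelled in Hindi characters, predicting the subsequent input word according to a second language model, the second language model being a pre-established language model that spells Hindi in Hindi characters.
As a fifth possible implementation of the present invention, in obtaining, according to the first language model, the first candidate character string list in Latin-character form corresponding to the Latin character sequence, the first language model being a pre-established language model that spells Hindi in Latin characters,
the pre-establishment of the first language model includes:
obtaining corpus data in which Hindi is spelled in Latin characters, and preprocessing the corpus data to remove erroneous and low-frequency entries, obtaining a valid corpus;
removing the redundant parts of the valid corpus data to obtain a collated corpus;
constructing a language model using the collated corpus.
As a sixth possible implementation of the present invention, constructing a language model using the collated corpus includes:
constructing a language model in N-Gram form using the collated corpus, and calculating the parameters of the language model, where the parameters include the words in the language model and, for each N-word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer; and
smoothing the conditional probability data so that the conditional probability of any N-word sequence that does not appear in the collated corpus is not zero.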
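The Hindi-oriented multilingual mixed input method of the embodiments of the present invention obtains the Latin character sequence of the current input word typed in the input method interface, and then obtains, according to the first language model, a first candidate character string list in Latin-character form corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Next, according to the pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, it obtains the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list, and, from the first candidate character string list and those Hindi-character spellings, generates a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling. Finally, it displays the first candidate word list on the input method interface and obtains a selection operation on a word in the list, so that the selected word is input as the input word. The user's need to mix Hindi and Latin-script input is thus met without frequent switching of input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi-character spelling from the mapping relationship can improve the accuracy of the output result.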
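An embodiment of another aspect of the present invention proposes a Hindi-oriented multilingual mixed input device, including:
an input character acquisition module, configured to obtain the Latin character sequence of the current input word typed in the input method interface;
a first candidate character string generation module, configured to obtain, according to a first language model, a first candidate character string list in Latin-character form corresponding to the Latin character sequence, the first language model being a language model that spells Hindi in Latin characters;
a vocabulary mapping module, configured to obtain a target Hindi vocabulary list according to a pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, the target Hindi vocabulary list including the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list;
a first candidate word list generation module, configured to generate, according to the first candidate character string list and the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list, a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling;
a first candidate word list display module, configured to display the first candidate word list on the input method interface;
a first candidate word input module, configured to obtain a selection operation on a word in the first candidate word list and input the selected word as the input word.
As a first possible implementation of the present invention, the first candidate character string generation module is specifically configured to:
when the Latin character sequence is a complete Latin-character spelling of a Hindi word, add the Hindi word corresponding to the Latin character sequence to the first candidate character string list; and
obtain extended options, the extended options including Hindi words or word fragments whose Latin-character spelling contains the Latin character sequence, and add the extended options to the first candidate character string list.
As a second possible implementation of the present invention, the first candidate character string generation module is further configured to:
when the first language model contains no Hindi word whose Latin-character spelling contains the Latin character sequence, obtain the Latin-character Hindi word with the highest similarity to the Latin character sequence and add it to the first candidate character string list as an extended option.
As a third possible implementation of the present invention, the device further includes:
a second candidate word list generation module, configured to predict the words that follow the input word according to the language model corresponding to the input word, and generate a second candidate word list from the prediction result;
a second candidate word list display module, configured to display the second candidate word list on the input method interface;
a second candidate word input module, configured to obtain a selection operation on a word in the second candidate word list and input the selected word as the next input word.
As a fourth possible implementation of the present invention, the second candidate word list generation module is specifically configured to:
determine whether the input word is spelled in Latin characters or in Hindi characters;
when the input word is spelled in Latin characters, predict the subsequent input word according to the first language model;
when the input word is spelled in Hindi characters, predict the subsequent input word according to a second language model, the second language model being a pre-established language model that spells Hindi in Hindi characters.
As a fifth possible implementation of the present invention, the device further includes:
a first language model creation module, configured to establish the first language model, the first language model creation module including:
a corpus acquisition unit, configured to obtain corpus data in which Hindi is spelled in Latin characters, and preprocess the corpus data to remove erroneous and low-frequency entries, obtaining a valid corpus;
a corpus de-redundancy unit, configured to remove the redundant parts of the valid corpus data to obtain a collated corpus;
a language model construction unit, configured to construct a language model using the collated corpus.
As a sixth possible implementation of the present invention, the language model construction unit is specifically configured to:
construct a language model in N-Gram form using the collated corpus, and calculate the parameters of the language model, where the parameters include the words in the language model and, for each N-word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer; and
smooth the conditional probability data so that the conditional probability of any N-word sequence that does not appear in the collated corpus is not zero.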
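The Hindi-oriented multilingual mixed input device of the embodiments of the present invention obtains the Latin character sequence of the current input word typed in the input method interface, and then obtains, according to the first language model, a first candidate character string list in Latin-character form corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Next, according to the pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, it obtains the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list, and, from the first candidate character string list and those Hindi-character spellings, generates a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling. Finally, it displays the first candidate word list on the input method interface and obtains a selection operation on a word in the list, so that the selected word is input as the input word. The user's need to mix Hindi and Latin-script input is thus met without frequent switching of input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi-character spelling from the mapping relationship can improve the accuracy of the output result.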
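An embodiment of yet another aspect of the present invention provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the Hindi-oriented multilingual mixed input method proposed in the above embodiments of the present invention.
To achieve the above objective, an embodiment of a fourth aspect of the present invention provides a computer program product; when instructions in the computer program product are executed by a processor, the Hindi-oriented multilingual mixed input method proposed in the above embodiments of the present invention is implemented.
To achieve the above objective, an embodiment of a fifth aspect of the present invention provides a computing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the Hindi-oriented multilingual mixed input method proposed in the above embodiments of the present invention is implemented.
The non-transitory computer-readable storage medium, computer program product, and computing device according to the third to fifth aspects of the present invention have beneficial effects similar to those of the Hindi-oriented multilingual mixed input method and device according to the first and second aspects, which are not repeated here.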
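The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of the Hindi-oriented multilingual mixed input method provided by Embodiment 1 of the present invention;
FIG. 2 is a schematic flowchart of word-association input in the Hindi-oriented multilingual mixed input method according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of establishing a language model according to an embodiment of the present invention;
FIG. 4 is a structural block diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention;
FIG. 5 is a structural block diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention.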
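Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and cannot be understood as limiting it.
At present, a user's need for multilingual mixed input can be met in the following three ways.
In the first way, multilingual mixed input is achieved by switching input modes. For example, when a user is typing Latin characters on an English keyboard and wants to enter a Hindi character, the user has to switch to a Hindi input method, enter the character, and then switch back to the English keyboard to continue typing Latin characters.
In the second way, a preset operation enters a temporary input mode in which the user can type characters of a second language. For example, in Chinese-English input methods, the user can switch input methods by pressing the Shift key.
In the third way, some input methods support two encoding schemes in the language model at once; that is, based on the user's input, they automatically select the most suitable encoding rule and display characters accordingly.
In the first way, multilingual mixed input is inefficient. In the second way, special handling of characters is required after entering the temporary input mode, which lengthens the development cycle. In the third way, when the encoding schemes of the two languages differ only slightly, the accuracy of the language model's output is low.
The present invention mainly addresses the technical problems of low efficiency and low output accuracy of multilingual mixed input in the prior art, and proposes a Hindi-oriented multilingual mixed input method.
The Hindi-oriented multilingual mixed input method of the embodiments of the present invention obtains the Latin character sequence of the current input word typed in the input method interface, and then obtains, according to the first language model, a first candidate character string list in Latin-character form corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Next, according to the pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, it obtains the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list, and, from the first candidate character string list and those Hindi-character spellings, generates a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling. Finally, it displays the first candidate word list on the input method interface and obtains a selection operation on a word in the list, so that the selected word is input as the input word. The user's need to mix Hindi and Latin-script input is thus met without frequent switching of input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi-character spelling from the mapping relationship can improve the accuracy of the output result.
The Hindi-oriented multilingual mixed input method and device of the embodiments of the present invention are described in detail below with reference to the accompanying drawings. Before the embodiments are described, a common technical term is introduced to aid understanding: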
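A language model in N-Gram form rests on the following assumption: the occurrence of the $n$-th word depends on the preceding $n-1$ words and on no other words, and the probability of a whole sentence equals the product of the occurrence probabilities of its words. The occurrence probability of each word can be obtained by statistical calculation over corpus data.
Assume that a sentence $T$ consists of the word sequence $w_1, w_2, w_3, \ldots, w_N$; the language model in N-Gram form can then be expressed by the following formula:

$$P(w_N \mid w_1, \ldots, w_{N-1})$$

This formula states that the probability of the $N$-th word is determined by the words $w_1, w_2, w_3, \ldots, w_{N-1}$ that have already appeared. In this process, the preceding words are used to predict the next word to appear, and from observation of a large amount of text, the likelihood of each predicted word following the words already seen can be estimated. The language model constructed can therefore be an $(n-1)$-order Markov model, that is, an N-gram language model. For input method applications, unlike applications such as machine translation, understanding of long sentences and word-order prediction are usually unnecessary, so $N$ typically takes a value such as 2, 3, or 4.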
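The N-Gram assumption can be written out as a short derivation. This is the standard textbook formulation rather than a formula taken from the patent; the lowercase order $n$ and the count function $C(\cdot)$ (the number of times a word sequence occurs in the corpus) are notation introduced here for illustration.

$$
\begin{aligned}
P(T) &= \prod_{i=1}^{N} P(w_i \mid w_1, \ldots, w_{i-1}) \\
     &\approx \prod_{i=1}^{N} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}), \\
P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) &\approx \frac{C(w_{i-n+1}, \ldots, w_{i})}{C(w_{i-n+1}, \ldots, w_{i-1})}.
\end{aligned}
$$

The first line is the chain rule, the second applies the assumption that only the preceding $n-1$ words matter, and the third is the relative-frequency estimate obtained from corpus statistics. For a bigram model ($n = 2$) the last line reduces to $P(w_i \mid w_{i-1}) \approx C(w_{i-1}, w_i) / C(w_{i-1})$.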
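FIG. 1 is a schematic flowchart of a Hindi-oriented multilingual mixed input method according to an embodiment of the present invention.
The Hindi-oriented multilingual mixed input method provided by the embodiments of the present invention can be implemented by the Hindi-oriented multilingual mixed input device provided by the embodiments of the present invention. The device can be configured in any computing device so that the computing device implements the Hindi-oriented multilingual mixed input function.
The computing device may be, for example, a hardware device such as a personal computer (PC), a cloud device, or a mobile device. The mobile device may be, for example, a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or another hardware device with an operating system, a touch screen, and/or a display screen.
As shown in FIG. 1, the Hindi-oriented multilingual mixed input method includes the following steps:
Step 101: Obtain the Latin character sequence of the current input word typed in the input method interface.
In the embodiments of the present invention, the computing device may provide an input method interface through which the user can type a Latin character sequence. For example, when the computing device is a mobile phone, the user can type the Latin character sequence manually on the touch screen; when the computing device is a PC, the user can type it manually on the keyboard.
Optionally, a listener may be provided in the computing device to monitor typing operations triggered by the user. When a typing operation is detected, the Latin character sequence of the current input word typed in the input method interface can be obtained from it. For example, when the user wants to enter "手机" (mobile phone), the user can type "mobile" in the input method interface.
Step 102: According to the first language model, obtain a first candidate character string list in Latin-character form corresponding to the Latin character sequence, the first language model being a pre-established language model that spells Hindi in Latin characters.
In the embodiments of the present invention, the first language model is a pre-established language model that spells Hindi in Latin characters. For example, corpus data in which Hindi is spelled in Latin characters can be obtained, and a language model can then be constructed from the corpus data to obtain the first language model.
In the embodiments of the present invention, once the Latin character sequence is obtained, it can be input into the first language model to obtain the first candidate character string list in Latin-character form corresponding to the Latin character sequence.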
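Specifically, when the Latin character sequence is a complete Latin-character spelling of a Hindi word, the Hindi word corresponding to the Latin character sequence can be added directly to the first candidate character string list. When the Latin character sequence corresponds to an incomplete Latin-character spelling of a Hindi word, extended options can be obtained in order to improve the user's input efficiency or to correct and complete the Latin character sequence the user typed. The extended options include Hindi words or word fragments whose Latin-character spelling contains the Latin character sequence; the extended options are then added to the first candidate character string list.
Users sometimes make spelling mistakes, so in some embodiments the input method can also provide error correction. That is, obtaining, according to the first language model, the first candidate character string list in Latin-character form corresponding to the Latin character sequence may further include: when the first language model contains no Hindi word whose Latin-character spelling contains the Latin character sequence, obtaining the Latin-character Hindi word with the highest similarity to the Latin character sequence and adding it to the first candidate character string list as an extended option.
For example, suppose the sentence the user wants to enter is "Main bhi […] meri […]", and the Latin-character spelling of the Hindi words in this sentence is "Main bhi nahi meri kahani hai". If the first Hindi word the user types is "Mai", the result output by the first language model, that is, the extended options, can be: Mai, Nai, Main, Maine.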
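The candidate-string step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the toy `LEXICON` and its frequencies are invented, the function name `first_candidate_strings` is hypothetical, and similarity is measured with the standard library's `difflib.SequenceMatcher` because the source does not name a similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon: romanized Hindi word -> corpus frequency (invented values).
LEXICON = {"mai": 5, "main": 50, "maine": 20, "nai": 8, "nahi": 40, "kahani": 10}


def first_candidate_strings(typed, lexicon=LEXICON, max_options=4):
    """Return Latin-form candidate strings for a typed Latin character sequence."""
    key = typed.lower()
    candidates = []
    # A complete romanized word is added directly.
    if key in lexicon:
        candidates.append(key)
    # Extended options: other words whose spelling contains the typed
    # sequence, most frequent first.
    extended = [w for w in lexicon if key in w and w != key]
    candidates.extend(sorted(extended, key=lambda w: (-lexicon[w], w)))
    # Error-correction fallback: no word contains the sequence, so take
    # the most similar spelling instead.
    if not candidates:
        candidates.append(
            max(lexicon, key=lambda w: SequenceMatcher(None, key, w).ratio())
        )
    return candidates[:max_options]
```

With this toy lexicon, typing "Mai" yields the complete word followed by its completions, and a misspelling such as "kahanee" falls back to the closest entry. The sketch applies the similarity fallback only when nothing contains the typed sequence, as the claims describe; a production system could also mix similar words such as "Nai" into the list, as the example in the text suggests.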
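Step 103: Obtain a target Hindi vocabulary list according to the pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words; the target Hindi vocabulary list may include the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list.
In the embodiments of the present invention, a mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words can be established in advance. The Latin-character spellings of Hindi words take two forms. One is the Latin-character spelling obtained by transliterating the pronunciation of the Hindi-character spelling directly; for example, the Hindi-character word […] corresponds to the Latin characters "dena". "dena" has no meaning of its own in other contexts, and typing it is meaningful only when the Hindi-character word […] is wanted. The other consists of certain English words that do not occur in Hindi; for example, the English word "mobile" does not exist in Hindi.
By establishing the mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, for example between "mobile" and […], the mapping can be guaranteed to be one-to-one. Once the first candidate character string list in Latin-character form has been determined, the Hindi-character spellings corresponding to the Latin-character Hindi words in that list can be obtained by querying the mapping, which is simple to operate and easy to implement. Determining the corresponding Hindi-character spelling through the pre-established mapping can also further improve the accuracy of the output result.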
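The mapping query in step 103 amounts to a dictionary lookup. The sketch below assumes a plain in-memory `dict`; the entries in `LATIN_TO_HINDI` are illustrative pairs chosen for this example (the Hindi-script words in the source text were lost in extraction), and the function name `target_hindi_words` is hypothetical.

```python
# Illustrative one-to-one mapping from Latin spelling to Hindi (Devanagari)
# spelling. These pairs are assumptions made for the example, not data from
# the patent.
LATIN_TO_HINDI = {
    "dena": "देना",
    "mobile": "मोबाइल",
    "main": "मैं",
    "nahi": "नहीं",
    "meri": "मेरी",
    "kahani": "कहानी",
}


def target_hindi_words(candidate_strings, mapping=LATIN_TO_HINDI):
    """Look up the Hindi-script spelling of each Latin candidate that has one."""
    result = []
    for candidate in candidate_strings:
        hindi = mapping.get(candidate.lower())
        if hindi is not None:  # candidates without a mapping are skipped
            result.append(hindi)
    return result
```

Because the mapping is one-to-one, each Latin candidate contributes at most one Hindi-script word, and the lookup cost does not depend on the size of the language model.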
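Step 104: According to the first candidate character string list and the target Hindi vocabulary list, generate a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling.
In the embodiments of the present invention, after the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list are obtained, a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling can be generated from the first candidate character string list and those Hindi-character spellings.
Optionally, the first candidate word list may include all of the Latin-character Hindi words in the first candidate character string list together with the corresponding words in Hindi-character spelling.
Further, because the display area of the computing device is limited, a first number of Latin-character Hindi words and a second number of the corresponding Hindi-character words can be selected from the first candidate character string list, and the first candidate word list can then be generated from the selected words. The first number and the second number may be the same or different; for example, the first number may be 2 and the second number may be 3.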
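A minimal sketch of assembling the first candidate word list under a display limit follows. The parameters `first_count` and `second_count` correspond to the first and second numbers in the text (2 and 3 in its example); the function name, the choice to list Latin spellings before Hindi-script spellings, and the sample mapping are assumptions made for illustration.

```python
def first_candidate_word_list(candidate_strings, mapping, first_count=2, second_count=3):
    """Combine Latin spellings and their Hindi-script spellings into one list.

    candidate_strings: ranked Latin-form candidates (best first).
    mapping: dict from Latin spelling to Hindi-script spelling.
    """
    # Keep the top `first_count` Latin spellings.
    latin_part = candidate_strings[:first_count]
    # Keep the Hindi-script spellings of the top `second_count` candidates
    # that actually have a mapping entry.
    hindi_part = [mapping[c] for c in candidate_strings if c in mapping][:second_count]
    return latin_part + hindi_part
```

For example, with candidates `["main", "maine", "nahi", "meri"]` and a mapping that lacks "maine", the list contains two Latin words followed by three Hindi-script words.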
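Step 105: Display the first candidate word list on the input method interface.
In the embodiments of the present invention, to meet the user's need to mix Hindi and Latin-script input, the first candidate word list can be displayed on the input method interface once it has been obtained.
Step 106: Obtain a selection operation on a word in the first candidate word list, and input the selected word as the input word.
In the embodiments of the present invention, the selection operation is triggered by the user. It may be, for example, a tap by the user, or the operation corresponding to the user pressing a number key or the space bar on the keyboard; no limitation is imposed here.
Specifically, after the first candidate word list is displayed on the input method interface, the user can select a word from it for input according to actual needs. A listener may be provided in the computing device to monitor selection operations triggered by the user. When a selection operation is detected, the selected word is determined from it and is then input as the input word.
Continuing the example above, the user can select "Main" as the input word.
It should be noted that the present invention takes mixed input of Hindi and Latin-script text as an example but is not limited to it; those skilled in the art can build on the present invention to implement mixed input of any two languages, so the approach is highly extensible.
The Hindi-oriented multilingual mixed input method of the embodiments of the present invention obtains the Latin character sequence of the current input word typed in the input method interface, and then obtains, according to the first language model, a first candidate character string list in Latin-character form corresponding to the Latin character sequence, where the first language model is a pre-established language model that spells Hindi in Latin characters. Next, according to the pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words, it obtains the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list, and, from the first candidate character string list and those Hindi-character spellings, generates a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling. Finally, it displays the first candidate word list on the input method interface and obtains a selection operation on a word in the list, so that the selected word is input as the input word. The user's need to mix Hindi and Latin-script input is thus met without frequent switching of input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi-character spelling from the mapping relationship can improve the accuracy of the output result.
As a possible implementation, to improve the user's input efficiency, the words that follow the input word can also be predicted after the selected word has been input, so that the user can enter the next word from the prediction result. The user then does not need to type the next word manually, which further improves the efficiency of multilingual mixed input. This process is described in detail below with reference to FIG. 2.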
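FIG. 2 is a schematic flowchart of word-association input in the Hindi-oriented multilingual mixed input method according to an embodiment of the present invention.
As shown in FIG. 2, on the basis of the embodiment shown in FIG. 1, the Hindi-oriented multilingual mixed input method may further include the following steps after step 106:
Step 201: Predict the words that follow the input word according to the language model corresponding to the input word, and generate a second candidate word list from the prediction result.
Specifically, when the input word is spelled in Latin characters, the subsequent input word can be predicted according to the first language model; when the input word is spelled in Hindi characters, the subsequent input word is predicted according to the second language model, where the second language model is a pre-established language model that spells Hindi in Hindi characters. For example, corpus data in which Hindi is spelled in Hindi characters can be obtained, and a language model can then be constructed from the corpus data to obtain the second language model.
For example, when the input word is "Main", it is spelled in Latin characters, so the subsequent input word is predicted according to the first language model; the prediction result can be: bhi, ne, to, nahi, khud, hi.
In the embodiments of the present invention, the second candidate word list may include all of the words in the prediction result. Further, because the display area of the computing device is limited, the second candidate word list may include a third number of words from the prediction result, where the third number is preset.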
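Step 201 can be sketched as script detection followed by a lookup in the matching model. The sketch assumes each model is a nested `dict` of follower counts (word to next word to count), which is one simple way to hold bigram statistics; the function names and the counts used in the usage example are invented. The Devanagari check relies on the Unicode block U+0900 to U+097F.

```python
def is_hindi_script(word):
    """True if the word contains any character from the Devanagari block."""
    return any("\u0900" <= ch <= "\u097f" for ch in word)


def predict_next(word, latin_model, hindi_model, third_count=6):
    """Return up to `third_count` likely next words for an input word.

    latin_model / hindi_model: dict mapping a word to a dict of
    follower -> count (hypothetical bigram counts).
    """
    # Choose the model that matches the script of the word just entered.
    model = hindi_model if is_hindi_script(word) else latin_model
    followers = model.get(word, {})
    # Rank followers by descending count, breaking ties alphabetically.
    ranked = sorted(followers, key=lambda w: (-followers[w], w))
    return ranked[:third_count]
```

With a Latin model in which "Main" is followed by bhi, ne, to, nahi, khud, hi in descending frequency, `predict_next("Main", ...)` reproduces the ordering given in the text; a word typed in Devanagari is routed to the second model instead.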
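Step 202: Display the second candidate word list on the input method interface.
In the embodiments of the present invention, after the second candidate word list is generated, it can be displayed on the input method interface.
Step 203: Obtain a selection operation on a word in the second candidate word list, and input the selected word as the next input word.
In the embodiments of the present invention, after the second candidate word list is displayed on the input method interface, the user can select a word from it for input according to actual needs. A listener may be provided in the computing device to monitor selection operations triggered by the user. When a selection operation is detected, the selected word is determined from it and is then input as the next input word.
As an application scenario, when the user wants to efficiently input a sentence that mixes Latin-script and Hindi-script words, the Hindi-oriented multilingual mixed input method of the embodiments of the present invention can perform error correction, completion, and prediction of the input words while the user is typing.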
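1) When the user enters the word "Mai", the first language model completes and corrects it and the mapping relationship is queried; the first candidate word list obtained can be:
2) The user can select the word "Main"; subsequent input words are then predicted according to the first language model, and the second candidate word list obtained can be:
bhi, ne, to, nahi, khud, hi
3) The user can select the word "bhi"; subsequent input words are then predicted according to the first language model, and the second candidate word list obtained can be:
nahi, bhi, to, ho, hai, na
4) The word the user wants to output is the Hindi-character spelling corresponding to "nahi". The user can enter "nahi"; after the first language model is applied and the mapping relationship is queried, the first candidate word list obtained can be:
6) The user can select the word "meri". The word the user next wants to output is the Hindi-character spelling corresponding to "kahani"; the user can enter "kahani", and after the first language model is applied and the mapping relationship is queried, the first candidate word list obtained can be:
As another application scenario, the user wants to enter a Hindi word spelled in Hindi characters but does not know how the word is spelled, and knows only part of the word's Latin-character spelling. For example, the word the user wants to enter is […], whose Latin-character spelling is "Abhishek", and the user remembers only the first part of that spelling, "Abhis".
1) The user can enter "Abhis"; after the first language model completes and corrects it and the mapping relationship is queried, the first candidate word list obtained can be: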
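As a possible implementation, refer to FIG. 3, which is a schematic flowchart of establishing a language model according to an embodiment of the present invention. The process of establishing the first language model may specifically include the following steps:
Step 301: Obtain corpus data in which Hindi is spelled in Latin characters, and preprocess the corpus data to remove erroneous and low-frequency entries, obtaining a valid corpus.
In the embodiments of the present invention, corpus data from India in which Hindi is spelled in Latin characters can be collected and then preprocessed to remove erroneous and low-frequency entries, obtaining a valid corpus. For example, the corpus data can undergo preprocessing operations such as removal of non-text noise, spell checking and correction, data cleaning, data format normalization, and selection of high-frequency words, so as to ensure the performance of the trained first language model.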
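A few of the preprocessing operations in step 301 can be sketched as follows. This is a simplified illustration: it covers removal of non-text noise (markup and URLs), format normalization (lowercasing, keeping only ASCII letters), and low-frequency filtering, and it omits spell checking. The function name `preprocess` and the `min_count` threshold are assumptions made for this example.

```python
import re
from collections import Counter


def preprocess(lines, min_count=2):
    """Clean raw romanized-Hindi lines and drop low-frequency words.

    Returns a list of token lists (one per non-empty cleaned line).
    """
    tokenized = []
    for line in lines:
        # Remove non-text noise: HTML-like tags and URLs.
        text = re.sub(r"<[^>]+>|https?://\S+", " ", line)
        # Normalize format: lowercase, keep only runs of Latin letters.
        words = re.findall(r"[a-z]+", text.lower())
        if words:
            tokenized.append(words)
    # Count word frequencies over the whole corpus.
    counts = Counter(w for sentence in tokenized for w in sentence)
    # Keep only words that occur at least `min_count` times.
    cleaned = [[w for w in sentence if counts[w] >= min_count] for sentence in tokenized]
    return [sentence for sentence in cleaned if sentence]
```

The threshold trades vocabulary coverage for model size; a real corpus would likely need a higher value than the one used in this toy setting.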
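Step 302: Remove the redundant parts of the valid corpus data to obtain a collated corpus.
It should be understood that the valid corpus data obtained often contains a large amount of redundant information; constructing a language model directly from it would seriously reduce the learning efficiency of the first language model. Therefore, in the present invention, the redundant parts of the valid corpus data can be removed to obtain a collated corpus, which reduces the redundancy of the corpus data and the storage space it occupies, and improves the learning efficiency of the first language model.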
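One simple reading of step 302 is dropping exact duplicate sentences while preserving order, sketched below. The source does not define what counts as redundant, so treating redundancy as verbatim repetition is an assumption, as is the function name `remove_redundancy`.

```python
def remove_redundancy(sentences):
    """Drop exact duplicate sentences, keeping the first occurrence of each."""
    seen = set()
    collated = []
    for sentence in sentences:
        key = tuple(sentence)  # lists are unhashable, so use a tuple as the key
        if key not in seen:
            seen.add(key)
            collated.append(sentence)
    return collated
```

Note that removing duplicates changes the n-gram counts derived from the corpus, so whether to deduplicate before or after counting is a design choice the source leaves open.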
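Step 303: Construct a language model using the collated corpus.
In the embodiments of the present invention, once the collated corpus is obtained, it can be used to construct a language model. When constructing the language model, to avoid data overflow and improve the performance of the language model, logarithms can be taken so that addition replaces multiplication.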
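The effect of replacing multiplication with addition of logarithms can be seen in a short Python sketch. With probabilities below 1, the practical risk in floating-point arithmetic is underflow to zero; the function name `sentence_log_prob` and the probability values are illustrative assumptions.

```python
import math


def sentence_log_prob(word_probs):
    """Sum of log-probabilities; equals the log of the product of the inputs."""
    return sum(math.log(p) for p in word_probs)


# 400 words, each with conditional probability 0.001 (illustrative values).
probs = [1e-3] * 400

# Multiplying directly underflows to 0.0 in double precision ...
direct_product = math.prod(probs)

# ... while the log-space sum stays finite and still ranks sentences correctly.
log_prob = sentence_log_prob(probs)
```

Because the logarithm is monotonic, comparing candidates by summed log-probability gives the same ranking as comparing products, without losing the value to underflow. `math.prod` requires Python 3.8 or later.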
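As a possible implementation, because subsequent input words have to be predicted from the language model and the input word, and the occurrence of a subsequent input word depends only on the words that appeared before it and on no other words, the language model can be a language model in N-Gram form, that is, an N-gram language model. Step 303 may then specifically include: constructing a language model in N-Gram form using the collated corpus, and calculating the parameters of the language model, where the parameters include the words in the language model and, for each N-word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer.
Assume the words in the language model are $w_1, w_2, w_3, \ldots, w_N$; the conditional probability of the $N$-th word given the preceding $N-1$ words is then:

$$P(w_N \mid w_1, \ldots, w_{N-1})$$

It should be noted that, if the language model contains 1000 words, a bigram language model forms a $1000 \times 1000$ matrix and a trigram language model forms a $1000 \times 1000 \times 1000$ matrix. The resulting matrix contains a large number of zero values, that is, it is a sparse matrix, and the sparse data in it needs to be smoothed. Step 303 may therefore further include: smoothing the conditional probability data so that the conditional probability of any N-word sequence that does not appear in the collated corpus is not zero.
Optionally, a data smoothing technique can be used to smooth the conditional probability data, lowering the conditional probabilities of the N-word sequences that do appear in the collated corpus so that the conditional probabilities of the N-word sequences that do not appear are not zero.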
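A bigram model with add-one (Laplace) smoothing illustrates the idea. The source does not name a specific smoothing technique, so add-one smoothing is a choice made here for simplicity; the class name `BigramModel` and the training sentences are likewise hypothetical.

```python
from collections import Counter


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # times a word appears as a left context
        self.bigram_counts = Counter()   # times a (previous, current) pair appears
        self.vocab = set()
        for sentence in sentences:
            self.vocab.update(sentence)
            for prev, cur in zip(sentence, sentence[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def prob(self, prev, cur):
        """Smoothed P(cur | prev): every bigram count is incremented by one."""
        numerator = self.bigram_counts[(prev, cur)] + 1
        denominator = self.context_counts[prev] + len(self.vocab)
        return numerator / denominator


model = BigramModel([["main", "bhi", "nahi"], ["main", "bhi", "to"], ["main", "ne"]])
```

In this toy corpus the unsmoothed estimate of P(bhi | main) is 2/3; smoothing lowers it to 3/8 and gives the unseen pair (main, to) a probability of 1/8 instead of zero, while the probabilities over the vocabulary still sum to 1.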
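To implement the above embodiments, the present invention also proposes a Hindi-oriented multilingual mixed input device.
The implementation of the device may include one or more computing devices. A computing device includes a processor and a memory, and the memory stores an application program that includes computer program instructions executable on the processor. The application program can be divided into multiple program modules for the corresponding functions of the components of the system. The division into program modules is logical rather than physical: each program module can run on one or more computing devices, and one computing device can run one or more program modules. The device of the present invention is described in detail below according to the functional and logical division of the program modules.
FIG. 4 is a schematic structural diagram of a Hindi-oriented multilingual mixed input device according to an embodiment of the present invention.
The Hindi-oriented multilingual mixed input device 100 can be implemented with a computing device that includes a processor and a memory. The memory stores program modules that can be executed by the processor; when each program module is executed, it controls the computing device to implement the corresponding function.
As shown in FIG. 4, the Hindi-oriented multilingual mixed input device 100 includes: an input character acquisition module 101, a first candidate character string generation module 102, a vocabulary mapping module 103, a first candidate word list generation module 104, a first candidate word list display module 105, and a first candidate word input module 106, where:
The input character acquisition module 101 is configured to obtain the Latin character sequence of the current input word typed in the input method interface.
The first candidate character string generation module 102 is configured to obtain, according to the first language model, a first candidate character string list in Latin-character form corresponding to the Latin character sequence, the first language model being a language model that spells Hindi in Latin characters.
The vocabulary mapping module 103 is configured to obtain a target Hindi vocabulary list according to the pre-established mapping relationship between the Latin-character spellings and the Hindi-character spellings of Hindi words; the target Hindi vocabulary list includes the Hindi-character spellings corresponding to the Latin-character Hindi words in the first candidate character string list.
The first candidate word list generation module 104 is configured to generate, according to the first candidate character string list and the target Hindi vocabulary list, a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling.
The first candidate word list display module 105 is configured to display the first candidate word list on the input method interface.
The first candidate word input module 106 is configured to obtain a selection operation on a word in the first candidate word list and input the selected word as the input word.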
进一步地,在本发明实施例的一种可能的实现方式中,参见图5,在图4所示实施例的基础上,该面向印地语的多语言混合输入装置100还可以包括:
第一候选字符串生成模块102,具体用于:当拉丁字符序列为完整的拉丁字符拼写形式的印地语词汇时,将拉丁字符序列对应的印地语词汇加入第一候选字符串列表;以及获取扩展选项,扩展选项包括:含有拉丁字符序列的拉丁字符拼写形式的印地语词汇或者词汇片段,将扩展选项加入第一候选字符串列表。
第一候选字符串生成模块102,还可以用于:当所述第一语言模型中不存在含有所述拉丁字符序列的拉丁字符拼写形式的印地语词汇时,获取与所述拉丁字符序列相似度最高的拉丁字符拼写形式的印地语词汇,并将之作为扩展选项加入第一候选字符串列表。
第二候选词列表生成模块107,用于根据输入词汇对应的语言模型,预测输入词汇的后续词汇,并根据预测结果生成第二候选词列表。
第二候选词列表显示模块108,用于在输入法界面展示第二候选词列表。
第二候选词输入模块109,用于获取对第二候选词列表的词汇的选择操作,将被选中的词汇作为下一个输入词汇进行输入。
作为一种可能的实现方式,第二候选词列表生成模块107,具体用于:判断输入词汇的拼写形式是拉丁字符还是印地语字符;当输入词汇的拼写形式是拉丁字符时,根据第一语言模型预测后续输入词汇;当输入词汇的拼写形式是印地语字符时,根据第二语言模型预测后续输入词汇,第二语言模型为预先建立的以印地语字符形式拼写印地语的语言模型。
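A first language model creation module 110, configured to build the first language model.
As a possible implementation, the first language model creation module 110 includes:
A corpus acquisition unit 111, configured to obtain corpus data of Hindi spelled in Latin characters, and to preprocess the corpus data to remove erroneous and low-frequency entries, yielding a valid corpus.
A corpus de-redundancy unit 112, configured to remove redundant parts of the valid corpus data, yielding a cleaned corpus.
A language model construction unit 113, configured to build the language model from the cleaned corpus.
As a possible implementation, the language model construction unit 113 is specifically configured to: build an N-Gram language model from the cleaned corpus and compute the parameters of the language model, where the parameters include the words in the language model and, for each N-gram word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer; and smooth the conditional probability data so that the conditional probability of any N-gram word sequence that does not appear in the cleaned corpus is not zero.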
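The corpus preparation performed by units 111 and 112 can be sketched as follows. The patent does not define how erroneous or low-frequency entries are detected; the sketch assumes a simple token-frequency threshold as a proxy and exact-duplicate removal for de-redundancy. The function name, threshold and sample lines are hypothetical.

```python
from collections import Counter

def clean_corpus(lines, min_count=2):
    """Drop rare tokens (proxy for erroneous/low-frequency data), then de-duplicate."""
    tokenised = [line.lower().split() for line in lines if line.strip()]
    counts = Counter(token for tokens in tokenised for token in tokens)

    # Unit 111 (sketch): keep only tokens seen at least min_count times.
    valid = [[t for t in tokens if counts[t] >= min_count] for tokens in tokenised]

    # Unit 112 (sketch): remove empty and repeated sentences, preserving order.
    seen, cleaned = set(), []
    for tokens in valid:
        key = tuple(tokens)
        if tokens and key not in seen:
            seen.add(key)
            cleaned.append(tokens)
    return cleaned

# Hypothetical raw corpus lines of romanised Hindi (illustrative only).
raw = ["main ghar ja raha hoon", "main ghar ja raha hoon",
       "main ghar par hoon", "   ", "xqzt"]
cleaned = clean_corpus(raw)
```

The resulting token lists are the kind of input from which unit 113 would then count N-grams and estimate conditional probabilities.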
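For details of how the functions and roles of the modules in the Hindi-oriented multilingual mixed input apparatus 100 are implemented, refer to the implementation of the corresponding steps in the method described above. Because the apparatus embodiments essentially correspond to the method embodiments, the foregoing explanation of the method embodiments also applies to the apparatus embodiments. To avoid redundancy, not all details are repeated for the apparatus embodiments; for anything not covered here, refer to the description of the method embodiments given above with reference to FIG. 1 to FIG. 3.
The Hindi-oriented multilingual mixed input apparatus of the embodiments of the present invention obtains the Latin character sequence of the current input word typed in the input method interface, and then obtains, according to the first language model, a first candidate string list in Latin-character form corresponding to the Latin character sequence, where the first language model is a pre-established language model of Hindi spelled in Latin characters. Next, according to the pre-established mapping between the Latin-character spelling and the Hindi-character spelling of Hindi words, it obtains the Hindi-character spellings corresponding to the Hindi words in Latin-character spelling in the first candidate string list, and from the first candidate string list and these Hindi-character spellings it generates a first candidate word list that includes words in both Latin-character spelling and Hindi-character spelling. Finally, it displays the first candidate word list in the input method interface and obtains a selection operation on a word in the list, so that the selected word is input as the input word. As a result, the user's need to mix Hindi and Latin-script text is met without frequent switching of input modes, which improves the efficiency of multilingual mixed input and the user's input experience. In addition, determining the Hindi-character spelling from the mapping can improve the accuracy of the output.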
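To implement the above embodiments, the present invention further provides a non-transitory computer-readable storage medium.
The non-transitory computer-readable storage medium of the embodiments of the present invention stores executable instructions which, when run on a processor, implement the Hindi-oriented multilingual mixed input method provided by the foregoing embodiments of the present invention. The storage medium may be provided on a device as part of that device; or, when the device can be remotely controlled by a server, the storage medium may be provided on the remote server that controls the device.
The computer instructions for implementing the method of the present invention may be carried on any combination of one or more computer-readable media. A non-transitory computer-readable medium may include any computer-readable medium other than a transitory propagating signal itself. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.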
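To implement the above embodiments, the present invention further provides a computer program product.
In the computer program product of the embodiments of the present invention, when the instructions in the computer program product are executed by a processor, the Hindi-oriented multilingual mixed input method provided by the foregoing embodiments of the present invention is implemented.
The computer program code for carrying out the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).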
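To implement the above embodiments, the present invention further provides a computing device.
The computing device of the embodiments of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the Hindi-oriented multilingual mixed input method provided by the foregoing embodiments of the present invention is implemented.
The computing device may be implemented by the central control unit of a computer device, as part of the functions of that central control unit. It may also be implemented by a separate computing device that is communicatively connected to the central control unit of the computer device. Implementations of the computing device may include, but are not limited to, a microcontroller, a programmable logic controller (PLC), a complex programmable logic device (CPLD), a programmable gate array (PGA), a field-programmable gate array (FPGA), a dedicated neural network chip, and so on.
For the specific implementation of the relevant parts of the above storage medium and computing device, refer to the corresponding embodiments of the Hindi-oriented multilingual mixed input method or apparatus of the present invention; they have beneficial effects similar to those of the corresponding method or apparatus, which are not repeated here.
The non-transitory computer-readable storage medium, computer program product, and computing device of the embodiments of the present invention can be implemented with reference to the content specifically described in the foregoing embodiments, and have beneficial effects similar to those of the Hindi-oriented multilingual mixed input method provided by the foregoing embodiments, which are not repeated here.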
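It should be noted that, in this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means two or more, for example two or three, unless otherwise expressly and specifically defined.
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.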
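In this specification, any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device.
It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented with any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Although embodiments of the present invention have been shown and described above, it will be understood that these embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.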
Claims (17)
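- A Hindi-oriented multilingual mixed input method, characterized by comprising: obtaining the Latin character sequence of the current input word typed in an input method interface; obtaining, according to a first language model, a first candidate string list in Latin-character form corresponding to the Latin character sequence, the first language model being a pre-established language model of Hindi spelled in Latin characters; obtaining a target Hindi word list according to a pre-established mapping between the Latin-character spelling and the Hindi-character spelling of Hindi words, the target Hindi word list including the Hindi-character spellings corresponding to the Hindi words in Latin-character spelling in the first candidate string list; generating, from the first candidate string list and the target Hindi word list, a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling; displaying the first candidate word list in the input method interface; and obtaining a selection operation on a word in the first candidate word list, and inputting the selected word as the input word.
- The Hindi-oriented multilingual mixed input method according to claim 1, characterized in that obtaining, according to the first language model, the first candidate string list in Latin-character form corresponding to the Latin character sequence comprises: when the Latin character sequence is a complete Hindi word in Latin-character spelling, adding the Hindi word corresponding to the Latin character sequence to the first candidate string list; and obtaining expansion options, the expansion options including Hindi words or word fragments in Latin-character spelling that contain the Latin character sequence, and adding the expansion options to the first candidate string list.
- The Hindi-oriented multilingual mixed input method according to claim 2, characterized in that obtaining, according to the first language model, the first candidate string list in Latin-character form corresponding to the Latin character sequence further comprises: when the first language model contains no Hindi word in Latin-character spelling that contains the Latin character sequence, obtaining the Hindi word in Latin-character spelling that has the highest similarity to the Latin character sequence, and adding it to the first candidate string list as an expansion option.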
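- The Hindi-oriented multilingual mixed input method according to claim 1, characterized in that, after obtaining the selection operation on a word in the first candidate word list and inputting the selected word as the input word, the method further comprises: predicting the words that follow the input word according to the language model corresponding to the input word, and generating a second candidate word list from the prediction result; displaying the second candidate word list in the input method interface; and obtaining a selection operation on a word in the second candidate word list, and inputting the selected word as the next input word.
- The Hindi-oriented multilingual mixed input method according to claim 4, characterized in that predicting the words that follow the input word according to the language model corresponding to the input word, and generating the second candidate word list from the prediction result, comprises: determining whether the input word is spelled in Latin characters or in Hindi characters; when the input word is spelled in Latin characters, predicting the subsequent input word according to the first language model; and when the input word is spelled in Hindi characters, predicting the subsequent input word according to a second language model, the second language model being a pre-established language model of Hindi spelled in Hindi characters.
- The Hindi-oriented multilingual mixed input method according to claim 1, characterized in that, in obtaining, according to the first language model, the first candidate string list in Latin-character form corresponding to the Latin character sequence, the first language model is a pre-established language model of Hindi spelled in Latin characters, wherein pre-establishing the first language model comprises: obtaining corpus data of Hindi spelled in Latin characters, and preprocessing the corpus data to remove erroneous and low-frequency entries, yielding a valid corpus; removing redundant parts of the valid corpus data, yielding a cleaned corpus; and building the language model from the cleaned corpus.
- The Hindi-oriented multilingual mixed input method according to claim 6, characterized in that building the language model from the cleaned corpus comprises: building an N-Gram language model from the cleaned corpus and computing the parameters of the language model, wherein the parameters of the language model include the words in the language model and, for each N-gram word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer; and smoothing the conditional probability data so that the conditional probability of any N-gram word sequence that does not appear in the cleaned corpus is not zero.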
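- A Hindi-oriented multilingual mixed input apparatus, characterized by comprising: an input character acquisition module, configured to obtain the Latin character sequence of the current input word typed in an input method interface; a first candidate string generation module, configured to obtain, according to a first language model, a first candidate string list in Latin-character form corresponding to the Latin character sequence, the first language model being a language model of Hindi spelled in Latin characters; a word mapping module, configured to obtain a target Hindi word list according to a pre-established mapping between the Latin-character spelling and the Hindi-character spelling of Hindi words, the target Hindi word list including the Hindi-character spellings corresponding to the Hindi words in Latin-character spelling in the first candidate string list; a first candidate word list generation module, configured to generate, from the first candidate string list and the target Hindi word list, a first candidate word list that includes words in Latin-character spelling and in Hindi-character spelling; a first candidate word list display module, configured to display the first candidate word list in the input method interface; and a first candidate word input module, configured to obtain a selection operation on a word in the first candidate word list and to input the selected word as the input word.
- The Hindi-oriented multilingual mixed input apparatus according to claim 8, characterized in that the first candidate string generation module is specifically configured to: when the Latin character sequence is a complete Hindi word in Latin-character spelling, add the Hindi word corresponding to the Latin character sequence to the first candidate string list; and obtain expansion options, the expansion options including Hindi words or word fragments in Latin-character spelling that contain the Latin character sequence, and add the expansion options to the first candidate string list.
- The Hindi-oriented multilingual mixed input apparatus according to claim 9, characterized in that the first candidate string generation module is further configured to: when the first language model contains no Hindi word in Latin-character spelling that contains the Latin character sequence, obtain the Hindi word in Latin-character spelling that has the highest similarity to the Latin character sequence, and add it to the first candidate string list as an expansion option.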
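- The Hindi-oriented multilingual mixed input apparatus according to claim 8, characterized by further comprising: a second candidate word list generation module, configured to predict the words that follow the input word according to the language model corresponding to the input word, and to generate a second candidate word list from the prediction result; a second candidate word list display module, configured to display the second candidate word list in the input method interface; and a second candidate word input module, configured to obtain a selection operation on a word in the second candidate word list and to input the selected word as the next input word.
- The Hindi-oriented multilingual mixed input apparatus according to claim 11, characterized in that the second candidate word list generation module is specifically configured to: determine whether the input word is spelled in Latin characters or in Hindi characters; when the input word is spelled in Latin characters, predict the subsequent input word according to the first language model; and when the input word is spelled in Hindi characters, predict the subsequent input word according to a second language model, the second language model being a pre-established language model of Hindi spelled in Hindi characters.
- The Hindi-oriented multilingual mixed input apparatus according to claim 8, characterized by further comprising a first language model creation module, configured to build the first language model, the first language model creation module comprising: a corpus acquisition unit, configured to obtain corpus data of Hindi spelled in Latin characters, and to preprocess the corpus data to remove erroneous and low-frequency entries, yielding a valid corpus; a corpus de-redundancy unit, configured to remove redundant parts of the valid corpus data, yielding a cleaned corpus; and a language model construction unit, configured to build the language model from the cleaned corpus.
- The Hindi-oriented multilingual mixed input apparatus according to claim 13, characterized in that the language model construction unit is specifically configured to: build an N-Gram language model from the cleaned corpus and compute the parameters of the language model, wherein the parameters of the language model include the words in the language model and, for each N-gram word sequence, the conditional probability of the N-th word given the preceding N-1 words, N being a positive integer; and smooth the conditional probability data so that the conditional probability of any N-gram word sequence that does not appear in the cleaned corpus is not zero.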
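- A non-transitory computer-readable storage medium storing a computer program, characterized in that, when the program is executed by a processor, the Hindi-oriented multilingual mixed input method according to any one of claims 1-7 is implemented.
- A computer program product, characterized in that, when the instructions in the computer program product are executed by a processor, the Hindi-oriented multilingual mixed input method according to any one of claims 1-7 is implemented.
- A computing device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when the processor executes the program, the Hindi-oriented multilingual mixed input method according to any one of claims 1-7 is implemented.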
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810713058.9A CN108897438A (zh) | 2018-06-29 | 2018-06-29 | Hindi-oriented multilingual mixed input method and apparatus |
CN201810713058.9 | 2018-06-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020000764A1 (zh) | 2020-01-02 |
Family
ID=64348154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/109507 WO2020000764A1 (zh) | Hindi-oriented multilingual mixed input method and apparatus | 2018-06-29 | 2018-10-09 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108897438A (zh) |
WO (1) | WO2020000764A1 (zh) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109739367A (zh) * | 2018-12-28 | 2019-05-10 | 北京金山安全软件有限公司 | 候选词列表生成方法及装置 |
CN112506359B (zh) * | 2020-12-21 | 2023-07-21 | 北京百度网讯科技有限公司 | 输入法中候选长句的提供方法、装置及电子设备 |
CN112764551A (zh) * | 2020-12-31 | 2021-05-07 | 维沃移动通信有限公司 | 词汇显示方法、装置和电子设备 |
CN112987943B (zh) * | 2021-03-10 | 2023-03-14 | 江西航智信息技术有限公司 | 一种远程控制学生移动终端输入法的云架构系统 |
CN112987940B (zh) * | 2021-04-27 | 2021-08-27 | 广州智品网络科技有限公司 | 一种基于样本概率量化的输入方法、装置和电子设备 |
WO2022241640A1 (en) * | 2021-05-18 | 2022-11-24 | Citrix Systems, Inc. | A split keyboard with different languages as input |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1983129A (zh) * | 2005-12-12 | 2007-06-20 | 北京优耐数码科技有限公司 | 数字键盘中印地语的智能输入技术 |
CN101882025A (zh) * | 2010-06-29 | 2010-11-10 | 汉王科技股份有限公司 | 手写输入方法及系统 |
CN102929571A (zh) * | 2012-10-15 | 2013-02-13 | 深圳市视得安罗格朗电子股份有限公司 | 多语言配置显示系统和装置 |
CN106156014A (zh) * | 2016-07-29 | 2016-11-23 | 宇龙计算机通信科技(深圳)有限公司 | 一种信息处理方法及装置 |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7636083B2 (en) * | 2004-02-20 | 2009-12-22 | Tegic Communications, Inc. | Method and apparatus for text input in various languages |
CN101493732A (zh) * | 2009-02-27 | 2009-07-29 | 广东国笔科技股份有限公司 | 一种用于印欧语系的语言输入系统 |
CN102193643B (zh) * | 2010-03-15 | 2014-07-02 | 北京搜狗科技发展有限公司 | 一种文字输入方法和具有翻译功能的输入法系统 |
US20140278349A1 (en) * | 2013-03-14 | 2014-09-18 | Microsoft Corporation | Language Model Dictionaries for Text Predictions |
US9785252B2 (en) * | 2015-07-28 | 2017-10-10 | Fitnii Inc. | Method for inputting multi-language texts |
-
2018
- 2018-06-29 CN CN201810713058.9A patent/CN108897438A/zh active Pending
- 2018-10-09 WO PCT/CN2018/109507 patent/WO2020000764A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN108897438A (zh) | 2018-11-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020000764A1 (zh) | Hindi-oriented multilingual mixed input method and apparatus | |
JP7301922B2 (ja) | 意味検索方法、装置、電子機器、記憶媒体およびコンピュータプログラム | |
US20210312139A1 (en) | Method and apparatus of generating semantic feature, method and apparatus of training model, electronic device, and storage medium | |
WO2020062770A1 (zh) | 一种领域词典的构建方法、装置、设备及存储介质 | |
US10789431B2 (en) | Method and system of translating a source sentence in a first language into a target sentence in a second language | |
WO2018205389A1 (zh) | 语音识别方法、系统、电子装置及介质 | |
JP7413630B2 (ja) | 要約生成モデルの訓練方法、装置、デバイス及び記憶媒体 | |
TWI629601B (zh) | 提供翻譯與分類翻譯結果的系統,電腦可讀存儲媒體,檔案分配系統及其方法 | |
JP5513898B2 (ja) | 共有された言語モデル | |
US11907671B2 (en) | Role labeling method, electronic device and storage medium | |
JP2016218995A (ja) | 機械翻訳方法、機械翻訳装置及びプログラム | |
US20210209472A1 (en) | Method and apparatus for determining causality, electronic device and storage medium | |
CN106202059A (zh) | 机器翻译方法以及机器翻译装置 | |
US11321370B2 (en) | Method for generating question answering robot and computer device | |
KR102456535B1 (ko) | 의료 사실 검증 방법, 장치, 전자 기기, 저장 매체 및 프로그램 | |
JP7093825B2 (ja) | マンマシン対話方法、装置、及び機器 | |
CN102567306B (zh) | 一种不同语言间词汇相似度的获取方法及系统 | |
CN110874536A (zh) | 语料质量评估模型生成方法和双语句对互译质量评估方法 | |
CN102193912A (zh) | 短语划分模型建立方法、统计机器翻译方法以及解码器 | |
JP2021192283A (ja) | 情報照会方法、装置及び電子機器 | |
CN117093729A (zh) | 一种基于医疗科研信息的检索方法、系统及检索终端 | |
CN109710952B (zh) | 基于人工智能的翻译历史检索方法、装置、设备和介质 | |
JP7241122B2 (ja) | スマート応答方法及び装置、電子機器、記憶媒体並びにコンピュータプログラム | |
JP2023007369A (ja) | 翻訳方法、分類モデルの訓練方法、装置、デバイス及び記憶媒体 | |
JP2014010634A (ja) | 対訳表現抽出装置、対訳表現抽出方法及び対訳表現抽出のためのコンピュータプログラム |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18924071 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 12.05.2021 DATED 12/05/2021) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18924071 Country of ref document: EP Kind code of ref document: A1 |