CN105095322A - Personnel name unit dictionary expansion method, personnel name language recognition method, personnel name unit dictionary expansion device and personnel name language recognition device - Google Patents
- Publication number
- CN105095322A (application CN201410221701.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
Disclosed are a name unit dictionary expansion method, a name language recognition method, and corresponding devices. The name unit dictionary expansion method includes: calculating the initial weight of each term in name unit dictionaries of a predetermined plurality of languages; dividing each name into name units and matching the divided name units against the terms in each dictionary to determine matched name units and unmatched name units; determining, from the weights of the matched name units in each dictionary, the weight in each dictionary of each name containing matched name units; calculating the weight of each unmatched name unit in each dictionary from the weights in each dictionary of all names containing that unmatched name unit, and adding the unmatched name unit to the dictionaries; updating the weight of each matched name unit in each dictionary from the weights in each dictionary of all names containing that matched name unit; and repeating the above processing until a predetermined condition is satisfied, thereby obtaining weight-annotated name unit dictionaries.
Description
Technical Field
The present disclosure relates to the technical field of information processing and, more specifically, to a name unit dictionary expansion method, a name language recognition method, and corresponding devices.
Background Art
Name language recognition is widely used in natural language processing fields such as text processing and machine translation, where it can effectively improve processing performance. In general, name language recognition can be treated as a classification problem, and classification performance depends on the training corpus and on feature selection. How to obtain a large, high-coverage training corpus for short texts such as personal names, and how to extract features effectively, are therefore the research focus in this field.
Specifically, building a name training corpus annotated with the country of each name requires a large investment of resources, and it is difficult to make such a corpus cover every possibility in every language. Performing name language recognition without sufficient training data is therefore a challenge for those skilled in the art. Moreover, because a name is a short text, it offers fewer usable features than ordinary text, which makes language recognition of names harder than language recognition of ordinary text. How to construct the features for training a classifier is therefore also a major difficulty.
Summary of the Invention
A brief summary of the present disclosure is given below in order to provide a basic understanding of some aspects of the disclosure. It should be understood, however, that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical parts of the disclosure, nor to limit its scope. Its sole purpose is to present some concepts of the disclosure in simplified form as a prelude to the more detailed description given later.
In view of the above, an object of the present disclosure is to provide a name unit dictionary expansion method, a name language recognition method, and a name language recognition device that reduce the requirements on the training corpus while maintaining performance.
According to one aspect of the present disclosure, there is provided a name unit dictionary expansion method, which may include: a term initial weight calculation step of calculating the initial weight of each term in the name unit dictionaries of a predetermined plurality of languages, based on the number of name unit dictionaries in which the term appears; a name unit matching step of dividing each of a plurality of names serving as training samples into name units, matching the divided name units against the terms in the name unit dictionaries of the predetermined plurality of languages, and determining each name unit that matches as a matched name unit and each name unit that does not match as an unmatched name unit; a name weight calculation step of determining the weight, in each name unit dictionary, of each name containing matched name units, according to the weights of those matched name units in the name unit dictionaries of the predetermined plurality of languages; an unmatched name unit processing step of calculating the weight of each unmatched name unit in each name unit dictionary according to the weights, in each name unit dictionary, of all names containing the unmatched name unit, and adding the unmatched name unit as a term to the name unit dictionaries that contain the matched name units of all names containing the unmatched name unit; a matched name unit weight update step of updating the weight of each matched name unit in each name unit dictionary according to the weights, in each name unit dictionary, of all names containing the matched name unit; and repeating the processing in the name unit matching step, the name weight calculation step, the unmatched name unit processing step, and the matched name unit weight update step until the weight change of every term in the name unit dictionaries of the predetermined plurality of languages is less than a predetermined threshold, thereby obtaining name unit dictionaries in which every term carries a weight annotation.
According to another aspect of the present disclosure, there is also provided a name language recognition method, which may include: a name division step of dividing an input name into n-gram substrings, where an n-gram substring is a unit consisting of n consecutive characters of the name and n is an integer greater than or equal to 2; an n-gram substring weight calculation step of calculating the weight of each n-gram substring in each name unit dictionary according to the weights, in the weight-annotated name unit dictionaries according to the present disclosure, of all terms containing that n-gram substring; and a recognition step of identifying the language to which the name belongs according to the weights of all n-gram substrings of the name in each name unit dictionary.
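The three recognition steps above can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the text does not say how the weights of the terms containing an n-gram are combined, nor how the n-gram weights are turned into a decision, so summation and an arg-max are assumed here. The function names, the language codes, and the sample weights are all hypothetical.

```python
def ngrams(name, n=2):
    """Split a name into its substrings of n consecutive characters."""
    text = name.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_weights(gram, term_weights, languages):
    """Weight of one n-gram per dictionary.

    Assumption: sum the weights of every dictionary term containing the n-gram.
    """
    totals = {lang: 0.0 for lang in languages}
    for term, weights in term_weights.items():
        if gram in term:
            for lang, w in weights.items():
                totals[lang] += w
    return totals


def identify_language(name, term_weights, languages, n=2):
    """Assumption: add up the n-gram weights and pick the best-scoring language."""
    scores = {lang: 0.0 for lang in languages}
    for gram in ngrams(name, n):
        for lang, w in ngram_weights(gram, term_weights, languages).items():
            scores[lang] += w
    return max(scores, key=scores.get)


# Hypothetical weight-annotated dictionary: term -> {language: weight}.
term_weights = {
    "lin": {"zh": 0.8, "en": 0.2},
    "ling": {"zh": 1.0},
    "smith": {"en": 1.0},
}
languages = ["zh", "en"]
```

With this sample data, `identify_language("lin", term_weights, languages)` scores the bigrams `li` and `in` against every term that contains them. The sketch treats the name as a single token; a name containing spaces would also produce n-grams that span the space.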
According to another aspect of the present disclosure, there is also provided a name language recognition device, which may include: a name division unit configured to divide an input name into n-gram substrings, where an n-gram substring is a unit consisting of n consecutive characters of the name and n is an integer greater than or equal to 2; an n-gram substring weight calculation unit configured to calculate the weight of each n-gram substring in each name unit dictionary according to the weights, in the weight-annotated name unit dictionaries according to the present disclosure, of all terms containing that n-gram substring; and a recognition unit configured to identify the language to which the name belongs according to the weights of all n-gram substrings of the name in each name unit dictionary.
According to another aspect of the present disclosure, there is also provided a storage medium including machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the following steps: a term initial weight calculation step of calculating the initial weight of each term in the name unit dictionaries of a predetermined plurality of languages, based on the number of name unit dictionaries in which the term appears; a name unit matching step of dividing each of a plurality of names serving as training samples into name units, matching the divided name units against the terms in the name unit dictionaries of the predetermined plurality of languages, and determining each name unit that matches as a matched name unit and each name unit that does not match as an unmatched name unit; a name weight calculation step of determining the weight, in each name unit dictionary, of each name containing matched name units, according to the weights of those matched name units in the name unit dictionaries of the predetermined plurality of languages; an unmatched name unit processing step of calculating the weight of each unmatched name unit in each name unit dictionary according to the weights, in each name unit dictionary, of all names containing the unmatched name unit, and adding the unmatched name unit as a term to the name unit dictionaries that contain the matched name units of all names containing the unmatched name unit; a matched name unit weight update step of updating the weight of each matched name unit in each name unit dictionary according to the weights, in each name unit dictionary, of all names containing the matched name unit; and repeating the processing in the name unit matching step, the name weight calculation step, the unmatched name unit processing step, and the matched name unit weight update step until the weight change of every term in the name unit dictionaries of the predetermined plurality of languages is less than a predetermined threshold, thereby obtaining name unit dictionaries in which every term carries a weight annotation.
According to another aspect of the present disclosure, there is also provided a program product including machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the following steps: a term initial weight calculation step of calculating the initial weight of each term in the name unit dictionaries of a predetermined plurality of languages, based on the number of name unit dictionaries in which the term appears; a name unit matching step of dividing each of a plurality of names serving as training samples into name units, matching the divided name units against the terms in the name unit dictionaries of the predetermined plurality of languages, and determining each name unit that matches as a matched name unit and each name unit that does not match as an unmatched name unit; a name weight calculation step of determining the weight, in each name unit dictionary, of each name containing matched name units, according to the weights of those matched name units in the name unit dictionaries of the predetermined plurality of languages; an unmatched name unit processing step of calculating the weight of each unmatched name unit in each name unit dictionary according to the weights, in each name unit dictionary, of all names containing the unmatched name unit, and adding the unmatched name unit as a term to the name unit dictionaries that contain the matched name units of all names containing the unmatched name unit; a matched name unit weight update step of updating the weight of each matched name unit in each name unit dictionary according to the weights, in each name unit dictionary, of all names containing the matched name unit; and repeating the processing in the name unit matching step, the name weight calculation step, the unmatched name unit processing step, and the matched name unit weight update step until the weight change of every term in the name unit dictionaries of the predetermined plurality of languages is less than a predetermined threshold, thereby obtaining name unit dictionaries in which every term carries a weight annotation.
According to another aspect of the present disclosure, there is also provided a storage medium including machine-readable program code which, when executed on an information processing device, causes the information processing device to perform the following steps: a name division step of dividing an input name into n-gram substrings, where an n-gram substring is a unit consisting of n consecutive characters of the name and n is an integer greater than or equal to 2; an n-gram substring weight calculation step of calculating the weight of each n-gram substring in each name unit dictionary according to the weights, in the weight-annotated name unit dictionaries according to the present disclosure, of all terms containing that n-gram substring; and a recognition step of identifying the language to which the name belongs according to the weights of all n-gram substrings of the name in each name unit dictionary.
According to another aspect of the present disclosure, there is also provided a program product including machine-executable instructions which, when executed on an information processing device, cause the information processing device to perform the following steps: a name division step of dividing an input name into n-gram substrings, where an n-gram substring is a unit consisting of n consecutive characters of the name and n is an integer greater than or equal to 2; an n-gram substring weight calculation step of calculating the weight of each n-gram substring in each name unit dictionary according to the weights, in the weight-annotated name unit dictionaries according to the present disclosure, of all terms containing that n-gram substring; and a recognition step of identifying the language to which the name belongs according to the weights of all n-gram substrings of the name in each name unit dictionary.
Other aspects of the embodiments of the present disclosure are set out in the following description, in which the detailed description serves to fully disclose preferred embodiments without limiting them.
Brief Description of the Drawings
The present disclosure can be better understood by referring to the detailed description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to designate the same or similar parts. The drawings, together with the detailed description below, are incorporated in and form part of this specification, and serve to further illustrate preferred embodiments of the present disclosure and to explain its principles and advantages. In the drawings:
FIG. 1 is a flowchart showing a process example of a name unit dictionary expansion method according to an embodiment of the present disclosure;
FIG. 2 is a block diagram showing a functional configuration example of a name unit dictionary expansion device according to an embodiment of the present disclosure;
FIG. 3 is a flowchart showing a process example of a name language recognition method according to an embodiment of the present disclosure;
FIG. 4 is a block diagram showing a functional configuration example of a name language recognition device according to an embodiment of the present disclosure; and
FIG. 5 is a block diagram of an example structure of a personal computer serving as an information processing device that can be employed in embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that many implementation-specific decisions must be made in developing any such actual embodiment in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. It should also be understood that, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted that, to avoid obscuring the present disclosure with unnecessary detail, the drawings show only the device structures and/or processing steps closely related to the solution according to the present disclosure, and omit other details that have little bearing on it.
Next, embodiments of the present disclosure are described with reference to FIGS. 1 to 5.
First, a process example of the name unit dictionary expansion method according to an embodiment of the present disclosure is described with reference to FIG. 1. FIG. 1 is a flowchart showing a process example of the name unit dictionary expansion method according to an embodiment of the present disclosure.
As shown in FIG. 1, the name unit dictionary expansion method according to an embodiment of the present disclosure includes a term initial weight calculation step S102, a name unit matching step S104, a name weight calculation step S106, an unmatched name unit processing step S108, and a matched name unit weight update step S110.
First, in the term initial weight calculation step S102, the initial weight of each term in the name unit dictionaries of the predetermined plurality of languages is calculated based on the number of name unit dictionaries in which the term appears.
It should be understood that the terms in the name unit dictionaries of the plurality of languages are name units (that is, components of a name, such as a family name or a given name). The term initial weight calculation step is performed to resolve ambiguity: a single term may appear in several name unit dictionaries at once, and a name usually consists of several name units that may themselves belong to different name unit dictionaries, which makes the language of the name ambiguous.
Specifically, as an example, in the term initial weight calculation step S102, if a term appears in three name unit dictionaries at once, its initial weight in each of those three dictionaries is 1/3.
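The 1/3 example can be sketched in a few lines. This is a minimal sketch under the reading that a term found in k dictionaries receives weight 1/k in each of them; the function name, the language codes, and the sample terms are hypothetical.

```python
from collections import defaultdict


def initial_weights(dictionaries):
    """Map each term to {language: initial weight}.

    `dictionaries` maps a language code to the set of terms in its dictionary.
    A term that appears in k dictionaries gets weight 1/k in each of them.
    """
    containing = defaultdict(list)
    for language, terms in dictionaries.items():
        for term in terms:
            containing[term].append(language)
    return {
        term: {lang: 1.0 / len(langs) for lang in langs}
        for term, langs in containing.items()
    }


# Hypothetical sample dictionaries for three languages.
dictionaries = {
    "zh": {"lin", "wang"},
    "en": {"lin", "smith"},
    "ko": {"lin", "kim"},
}
weights = initial_weights(dictionaries)
```

Here `weights["lin"]` assigns 1/3 to each of the three dictionaries, while a term found in only one dictionary, such as `"kim"`, starts with weight 1.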
Next, in the name unit matching step S104, each of the plurality of names serving as training samples is divided into name units, the divided name units are matched against the terms in the name unit dictionaries of the predetermined plurality of languages, and each name unit that matches is determined as a matched name unit while each name unit that does not match is determined as an unmatched name unit. Methods for dividing a name into name units are well known in the art and are not described here.
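The matching step can be illustrated with a short sketch. The text leaves the segmentation method open, so splitting on whitespace and lowercasing are assumptions made purely for illustration; the function name and sample data are hypothetical.

```python
def match_units(name, dictionaries):
    """Split a name into units and sort them into matched and unmatched.

    Assumption: name units are separated by whitespace and compared
    case-insensitively; a unit counts as matched if any dictionary holds it.
    """
    units = name.lower().split()
    known = set().union(*dictionaries.values())
    matched = [u for u in units if u in known]
    unmatched = [u for u in units if u not in known]
    return matched, unmatched


# Hypothetical dictionaries: language code -> set of terms.
dictionaries = {"zh": {"lin"}, "en": {"smith"}}
matched, unmatched = match_units("Lin Kuai", dictionaries)
```

In this example `lin` is found in a dictionary and `kuai` is not, so `kuai` becomes a candidate for the unmatched name unit processing step.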
Then, in the name weight calculation step S106, the weight in each name unit dictionary of each name containing matched name units is determined according to the weights of those matched name units in the name unit dictionaries of the predetermined plurality of languages.
Specifically, suppose the name unit dictionaries of the predetermined plurality of languages form the set $D^k = \{d_1^k, d_2^k, \ldots, d_i^k, \ldots, d_n^k\}$, where $d_i^k$ denotes the i-th name unit dictionary, k denotes the k-th iteration, and n denotes the number of name unit dictionaries. Suppose further that each name unit dictionary is defined as $d_i^k = \{W_{i,1}, \ldots, W_{i,j}, \ldots, W_{i,\mathrm{LEN}(d_i)}\}$, where $W_{i,j}$ denotes the j-th term in the i-th name unit dictionary, k denotes the k-th iteration, and $\mathrm{LEN}(d_i)$ denotes the length of the dictionary $d_i$ (that is, the number of terms it contains). Suppose a name serving as a training sample is $p_m = \{N_1, \ldots, N_t\}$, where $p_m$ denotes the m-th name in the training samples and $N_t$ denotes the t-th name unit of the name $p_m$.
Here, as an example, the weight distribution of the name $p_m$ over all name unit dictionaries is defined as a vector [formula not preserved in the source text].
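The formula for this vector is not available in the text, so the following is only a sketch under an assumed definition: each component of the vector is the sum of the weights, in that dictionary, of the name's matched name units. The original formula may differ. All names and sample values are hypothetical.

```python
def name_weight_vector(units, term_weights, languages):
    """Weight of a name in each dictionary.

    Assumption (the source formula is lost): sum the per-dictionary weights
    of the name's units; units absent from every dictionary contribute nothing.
    Every language used in `term_weights` must be listed in `languages`.
    """
    vector = {lang: 0.0 for lang in languages}
    for unit in units:
        for lang, w in term_weights.get(unit, {}).items():
            vector[lang] += w
    return vector


# Hypothetical term weights: term -> {language: weight}.
term_weights = {
    "lin": {"zh": 0.5, "en": 0.5},
    "wang": {"zh": 1.0},
}
vector = name_weight_vector(["lin", "wang"], term_weights, ["zh", "en"])
```

Under this assumption the name made of `lin` and `wang` leans toward the `zh` dictionary, because `wang` contributes only there.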
Next, in the unmatched name unit processing step S108, the weight of each unmatched name unit in each name unit dictionary is calculated according to the weights, in each name unit dictionary, of all names containing the unmatched name unit, and the unmatched name unit is added as a term to the name unit dictionaries that contain the matched name units of all names containing the unmatched name unit.
Specifically, as an example, suppose the unmatched name unit is "kuai" and it is contained in two names, $p_{12}$ and $p_{43}$; its weight is then computed from the weights of these two names [formula not preserved in the source text].
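Since the formula for the "kuai" example is not available in the text, the sketch below assumes the simplest combination consistent with the description: the unmatched unit's weight in each dictionary is the average of the weights of the names that contain it. It also shows the dictionary update, in which the unit is added to every dictionary that holds a matched unit from those names. Function names, language codes, and values are hypothetical.

```python
def unit_weight_from_names(name_vectors):
    """Assumption: average the weight vectors of all names containing the unit."""
    languages = list(name_vectors[0])
    count = len(name_vectors)
    return {lang: sum(v[lang] for v in name_vectors) / count for lang in languages}


def add_to_dictionaries(unit, co_occurring_units, dictionaries):
    """Add `unit` to every dictionary holding a matched unit it co-occurs with."""
    for terms in dictionaries.values():
        if any(m in terms for m in co_occurring_units):
            terms.add(unit)


# Hypothetical weight vectors of the two names that contain "kuai".
p12 = {"zh": 0.75, "en": 0.25}
p43 = {"zh": 0.25, "en": 0.75}
kuai_weights = unit_weight_from_names([p12, p43])

# "kuai" co-occurs with the matched unit "lin", which is in the zh dictionary.
dictionaries = {"zh": {"lin"}, "en": {"smith"}}
add_to_dictionaries("kuai", ["lin"], dictionaries)
```

After the call, `kuai` is a term of the `zh` dictionary only, because that is the only dictionary containing a matched unit from the names in which `kuai` appears.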
As can be seen, adding the name units that occur together in one name to the same name unit dictionary makes it possible to identify the language of that name more accurately in subsequent name recognition.
Next, in the matched name unit weight update step S110, the weight of each matched name unit in each name unit dictionary is updated according to the weights, in each name unit dictionary, of all names containing the matched name unit.
Specifically, suppose the matched name unit is "lin" and it is contained in three names, $p_{1234}$, $p_{43567}$, and $p_{89352}$; the updated weight of the name unit "lin" is then computed from the weights of these three names [formula not preserved in the source text].
It should be understood that the specific weight calculation formulas given above are examples only and not limitations; those skilled in the art may modify them according to the principles of the present disclosure, and such variations should be considered to fall within the scope of the present disclosure.
Next, the processing in the name unit matching step, the name weight calculation step, the unmatched name unit processing step, and the matched name unit weight update step is repeated until the weight change of every term in the name unit dictionaries of the predetermined plurality of languages is less than a predetermined threshold or until the processing has been repeated a predetermined number of times, thereby obtaining name unit dictionaries in which every term carries a weight annotation.
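The stopping rule can be sketched independently of the individual weight formulas. In the sketch below, `one_pass` is a hypothetical stand-in for one round of steps S104 to S110; the threshold and iteration limit are arbitrary illustrative values, and the toy pass exists only to make the loop runnable.

```python
def max_weight_change(old, new):
    """Largest absolute change of any term weight between two iterations."""
    change = 0.0
    for term in set(old) | set(new):
        before, after = old.get(term, {}), new.get(term, {})
        for lang in set(before) | set(after):
            delta = abs(after.get(lang, 0.0) - before.get(lang, 0.0))
            change = max(change, delta)
    return change


def expand_until_stable(weights, one_pass, threshold=1e-4, max_iterations=50):
    """Repeat `one_pass` until every weight changes by less than `threshold`,
    or until `max_iterations` passes have been made."""
    for iteration in range(1, max_iterations + 1):
        updated = one_pass(weights)
        if max_weight_change(weights, updated) < threshold:
            return updated, iteration
        weights = updated
    return weights, max_iterations


def toy_pass(weights):
    """Hypothetical stand-in for steps S104 to S110: move each weight halfway to 1."""
    return {t: {l: (w + 1.0) / 2 for l, w in ws.items()} for t, ws in weights.items()}


final, iterations = expand_until_stable({"lin": {"zh": 0.0}}, toy_pass)
```

With the toy pass, the change halves on every iteration, so the loop stops once the change drops below the threshold; passing a small `max_iterations` instead demonstrates the second stopping condition.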
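As can be seen, the name unit dictionary expansion method according to the embodiment of the present disclosure makes it possible, with only a limited training corpus, to build name unit dictionaries in which every term carries a weight annotation, for use in identifying the language of a name more accurately.
In addition, the name unit dictionary expansion method according to the embodiment of the present disclosure preferably further includes a normalization step (shown by the dashed box), in which the weights of each name in the name unit dictionaries, as calculated in the name weight calculation step, are normalized. As an example, the normalization may be performed as follows: [formula not preserved in the source text].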
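The normalization formula is not available in the text. A common choice, assumed here purely for illustration, is to divide each component of a name's weight vector by the sum of all components so that the weights add up to 1; the patent's own formula may differ. The function name and values are hypothetical.

```python
def normalize(vector):
    """Assumption: scale a name's per-dictionary weights so they sum to 1."""
    total = sum(vector.values())
    if total == 0:
        # A name with no matched units has nothing to normalize.
        return dict(vector)
    return {lang: w / total for lang, w in vector.items()}


normalized = normalize({"zh": 1.5, "en": 0.5})
```

Normalizing this way keeps names with many matched units from dominating the later unit weight calculations simply because their raw sums are larger.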
优选地,在未匹配人名单元处理步骤和匹配人名单元权重更新步骤中,可利用归一化之后的人名的权重进行相应计算。Preferably, in the step of processing the unmatched person name unit and the weight update step of the matched person name unit, the weight of the person name after normalization can be used for corresponding calculation.
应理解,以上参照图1描述的根据本公开的实施例的人名单元词典扩充方法的处理过程仅为示例而非限制,并且本领域技术人员可以根据本公开的原理对上述处理过程进行修改、组合等。It should be understood that the processing process of the method for expanding the personal name unit dictionary according to the embodiment of the present disclosure described above with reference to FIG. combination etc.
接下来,将参照图2描述根据本公开的实施例的人名单元词典扩充装置的功能配置示例。图2是示出根据本公开的实施例的人名单元词典扩充装置的功能配置示例的框图。Next, a functional configuration example of a person-name unit dictionary expanding device according to an embodiment of the present disclosure will be described with reference to FIG. 2 . FIG. 2 is a block diagram showing an example of a functional configuration of a personal-name unit dictionary expansion device according to an embodiment of the present disclosure.
如图2所示,根据本公开的实施例的人名单元词典扩充装置200可包括词项初始权重计算单元202、人名单元匹配单元204、人名权重计算单元206、未匹配人名单元处理单元208、匹配人名单元权重更新步骤210和控制单元212。As shown in FIG. 2 , the apparatus 200 for expanding a person-name unit dictionary according to an embodiment of the present disclosure may include a term initial weight calculation unit 202 , a person-name unit matching unit 204 , a person-name weight calculation unit 206 , and an unmatched person-name unit processing unit. 208 , the step 210 and the control unit 212 for updating the weight of the matching person name unit.
The term initial weight calculation unit 202 may be configured to calculate the initial weight of each term in the person-name unit dictionaries of a predetermined plurality of languages, based on the number of person-name unit dictionaries in which the term appears.
The person-name unit matching unit 204 may be configured to divide each of a plurality of person names serving as training samples into person-name units, match the resulting units against the terms in the person-name unit dictionaries of the predetermined plurality of languages, and determine each unit that matches as a matched person-name unit and each unit that does not as an unmatched person-name unit.
The person-name weight calculation unit 206 may be configured to determine the weight, in each person-name unit dictionary, of a person name containing matched person-name units, according to the weights of those matched units in the person-name unit dictionaries of the predetermined plurality of languages.
The unmatched person-name unit processing unit 208 may be configured to calculate the weight of an unmatched person-name unit in each person-name unit dictionary according to the weights, in each dictionary, of all person names containing that unmatched unit, and to add the unmatched unit as a term to the person-name unit dictionaries that contain the matched person-name units of those person names.
The matched person-name unit weight update unit 210 may be configured to update the weight of a matched person-name unit in each person-name unit dictionary according to the weights, in each dictionary, of all person names containing that matched unit.
The control unit 212 may be configured to control the person-name unit matching unit 204, the person-name weight calculation unit 206, the unmatched person-name unit processing unit 208, and the matched person-name unit weight update unit 210 to repeat their respective processing until the weight change of every term in the person-name unit dictionaries of the predetermined plurality of languages is less than a predetermined threshold, or until the processing has been repeated a predetermined number of times, thereby obtaining person-name unit dictionaries in which every term is annotated with a weight.
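The stopping rule applied by the control unit 212 can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the flat `(term, language)` weight layout, the function name `iterate_until_stable`, the `update_once` callback standing in for one pass of units 204 to 210, and the default values of `threshold` and `max_rounds` are all assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# (term, language) -> weight; this flat layout is an assumption for illustration.
Weights = Dict[Tuple[str, str], float]


def iterate_until_stable(
    weights: Weights,
    update_once: Callable[[Weights], Weights],
    threshold: float = 1e-4,
    max_rounds: int = 50,
) -> Tuple[Weights, int]:
    """Repeat update_once until every weight changes by less than threshold,
    or until max_rounds passes have been made."""
    rounds = 0
    while rounds < max_rounds:
        new_weights = update_once(weights)
        rounds += 1
        # A term added during this pass counts as a change from 0.0.
        keys = set(weights) | set(new_weights)
        max_change = max(
            (abs(new_weights.get(k, 0.0) - weights.get(k, 0.0)) for k in keys),
            default=0.0,
        )
        weights = new_weights
        if max_change < threshold:
            break
    return weights, rounds
```

Checking the largest single change is one way to express "the weight change of every term is less than the threshold"; the round counter covers the alternative condition of a predetermined number of repetitions.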
In addition, the person-name unit dictionary expansion device 200 may preferably further include a normalization unit (shown as a dashed box). The normalization unit may be configured to normalize the weights of the person names in each person-name unit dictionary calculated by the person-name weight calculation unit 206, and the unmatched person-name unit processing unit 208 and the matched person-name unit weight update unit 210 may use the normalized person-name weights in their respective calculations.
It should be understood that the person-name unit dictionary expansion device described with reference to FIG. 2 is the device embodiment corresponding to the person-name unit dictionary expansion method described above. For anything not described in detail in the device embodiment, refer to the corresponding description in the method embodiment; it is not repeated here.
It should also be noted that although an example of the functional configuration of the person-name unit dictionary expansion device according to an embodiment of the present disclosure has been described above with reference to FIG. 2, this is merely an example and not a limitation. Those skilled in the art may combine and/or omit the functional modules described in the above embodiment and/or add one or more functional modules as actually needed, and such variations should be considered to fall within the scope of the present disclosure.
Next, a process example of the person-name language recognition method according to an embodiment of the present disclosure is described with reference to FIG. 3. Specifically, it is described how person-name language recognition is performed using the person-name unit dictionaries constructed as described above, in which the terms are annotated with weights.
FIG. 3 is a flowchart showing a process example of the person-name language recognition method according to an embodiment of the present disclosure.
As shown in FIG. 3, the person-name language recognition method according to an embodiment of the present disclosure includes a person-name division step S302, an n-gram substring weight calculation step S304, and a recognition step S306.
First, in the person-name division step S302, the input person name is divided into n-gram substrings, where an n-gram substring is a unit consisting of n consecutive characters of the person name and n is an integer greater than or equal to 2.
Specifically, in the person-name division step S302, all n-gram substrings that can occur in English person names are constructed (as combinations of the space and the 26 English letters). In addition, a special character (for example, "_") may preferably be added before the first character and after the last character of the person name before division. Taking trigram substrings (that is, n = 3) as an example, the person name "linshuhao" is divided into the trigrams "_li", "lin", ..., "ao_".
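The division in step S302 can be sketched in a few lines. The function name `split_ngrams` and its parameters are hypothetical; the sketch assumes a single boundary character is added on each side, as in the "_li" ... "ao_" example.

```python
from typing import List


def split_ngrams(name: str, n: int = 3, pad: str = "_") -> List[str]:
    """Divide a person name into overlapping n-gram substrings.

    One boundary character is added before the first and after the last
    character, so that the start and end of the name are visible in the n-grams.
    """
    if n < 2:
        raise ValueError("n must be an integer greater than or equal to 2")
    padded = f"{pad}{name}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


trigrams = split_ngrams("linshuhao")
# ['_li', 'lin', 'ins', 'nsh', 'shu', 'huh', 'uha', 'hao', 'ao_']
```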
Next, in the n-gram substring weight calculation step S304, the weight of each n-gram substring in each person-name unit dictionary is calculated from the weights, in the weight-annotated person-name unit dictionaries described above, of all terms containing that n-gram substring.
Specifically, suppose for example that dictionary d_j contains the following weight-annotated person-name units: li: 0.8; lian: 0.9; liang: 1.0. The weight of the trigram substring "_li" in dictionary d_j can then be calculated as 0.8 + 0.9 + 1.0 = 2.7. The weights in each person-name unit dictionary of all n-gram substrings of the person name can be calculated in the same way. It should be understood that this weight calculation is merely an example and not a limitation, and those skilled in the art may determine the weights of the n-gram substrings in other ways as needed.
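The worked example for "_li" can be reproduced with the sketch below. The function name `ngram_weight` is hypothetical, and padding each dictionary term with "_" before testing containment is an assumption made here so that "_li" matches terms beginning with "li"; the extra term "wang" is illustrative only.

```python
from typing import Dict


def ngram_weight(ngram: str, dictionary: Dict[str, float], pad: str = "_") -> float:
    """Sum the weights of all dictionary terms whose padded form contains ngram."""
    return sum(
        weight
        for term, weight in dictionary.items()
        if ngram in f"{pad}{term}{pad}"
    )


# Dictionary d_j from the example, plus one non-matching term for contrast.
d_j = {"li": 0.8, "lian": 0.9, "liang": 1.0, "wang": 0.7}
weight_li = ngram_weight("_li", d_j)  # 0.8 + 0.9 + 1.0, up to floating-point rounding
```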
Then, in the recognition step S306, the language to which the person name belongs is identified according to the weights of all n-gram substrings of the person name in each person-name unit dictionary. As an example, if the sum of the weights of all n-gram substrings of the person name is largest in dictionary d_j, the person name can be considered to belong to the language represented by dictionary d_j.
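A minimal sketch of this weight-sum decision follows. The helper names, the padding of terms with "_", and the two toy dictionaries with their weights are assumptions for illustration; they are not data from the disclosure.

```python
from typing import Dict, List


def split_ngrams(name: str, n: int = 3, pad: str = "_") -> List[str]:
    padded = f"{pad}{name}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_weight(ngram: str, dictionary: Dict[str, float], pad: str = "_") -> float:
    # Weight of an n-gram = sum of the weights of the terms that contain it.
    return sum(w for term, w in dictionary.items() if ngram in f"{pad}{term}{pad}")


def identify_language(name: str, dictionaries: Dict[str, Dict[str, float]], n: int = 3) -> str:
    """Return the language whose dictionary gives the largest n-gram weight sum."""
    def score(lang: str) -> float:
        return sum(ngram_weight(g, dictionaries[lang]) for g in split_ngrams(name, n))
    return max(dictionaries, key=score)


# Toy weight-annotated dictionaries (hypothetical values).
toy = {
    "zh": {"lin": 0.9, "shu": 0.8, "hao": 0.9},
    "en": {"smith": 0.9, "john": 0.8},
}
language = identify_language("linshuhao", toy)
```

With these toy values the trigrams of "linshuhao" only receive weight in the "zh" dictionary, so `language` is "zh". When two dictionaries tie, `max` returns the first one in iteration order, which is a choice of this sketch.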
In addition, the person-name language recognition method 300 according to an embodiment of the present disclosure may preferably further include a sorting step S308.
In the sorting step S308, the rank of each n-gram substring in each person-name unit dictionary is determined according to its weight in that dictionary, and in the recognition step S306 the language to which the person name belongs may be identified according to the ranks of all n-gram substrings of the person name in each person-name unit dictionary.
Preferably, in the sorting step S308, the rank of the n-gram substrings in each person-name unit dictionary may be determined in descending order of their weights in that dictionary. Table 1 below gives an example of ranks determined in descending order of weight.
Table 1
Preferably, in the recognition step S306, the sum of the ranks of all n-gram substrings of the person name is calculated for each person-name unit dictionary, and the language represented by the dictionary with the smallest sum is determined as the language to which the person name belongs.
Specifically, again taking the person name "linshuhao" as an example, the rank sums of the person name in dictionaries D1 to D4 are as follows:
Distance(D1, "linshuhao") = Order_D1(_li) + Order_D1(lin) + ... + Order_D1(ao_)
Distance(D2, "linshuhao") = Order_D2(_li) + Order_D2(lin) + ... + Order_D2(ao_)
Distance(D3, "linshuhao") = Order_D3(_li) + Order_D3(lin) + ... + Order_D3(ao_)
Distance(D4, "linshuhao") = Order_D4(_li) + Order_D4(lin) + ... + Order_D4(ao_)
In this example, because ranks are assigned in descending order of weight, the language of the person name "linshuhao" should be the language represented by the dictionary with the smallest rank sum. That is, Language = Dx if Distance(Dx, "linshuhao") = Min(Distance(D1, "linshuhao"), Distance(D2, "linshuhao"), Distance(D3, "linshuhao"), Distance(D4, "linshuhao")).
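The rank-sum Distance can be sketched as below. Several points are assumptions made for this example and are not specified in the text: ranks are assigned only to n-grams that occur in at least one term of a dictionary, an n-gram absent from a dictionary receives a shared penalty rank one larger than the biggest rank table, ties in weight are broken alphabetically, and the toy dictionaries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def split_ngrams(name: str, n: int = 3, pad: str = "_") -> List[str]:
    padded = f"{pad}{name}{pad}"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def rank_table(dictionary: Dict[str, float], n: int = 3) -> Dict[str, int]:
    """Rank every n-gram seen in the dictionary's terms; rank 1 = largest weight."""
    weights: Dict[str, float] = defaultdict(float)
    for term, w in dictionary.items():
        for gram in set(split_ngrams(term, n)):  # each term counts once per n-gram
            weights[gram] += w
    ordered = sorted(weights, key=lambda g: (-weights[g], g))  # descending weight
    return {gram: rank for rank, gram in enumerate(ordered, start=1)}


def identify_by_rank(name: str, dictionaries: Dict[str, Dict[str, float]], n: int = 3) -> str:
    """Return the language whose dictionary gives the smallest rank sum (Distance)."""
    tables = {lang: rank_table(d, n) for lang, d in dictionaries.items()}
    unseen_rank = 1 + max(len(t) for t in tables.values())  # assumed penalty

    def distance(lang: str) -> int:
        return sum(tables[lang].get(g, unseen_rank) for g in split_ngrams(name, n))

    return min(tables, key=distance)


toy = {
    "zh": {"lin": 0.9, "shu": 0.8, "hao": 0.9},
    "en": {"smith": 0.9, "john": 0.8},
}
language = identify_by_rank("linshuhao", toy)
```

Because only ranks enter the sum, the absolute scale of the weights in each dictionary does not affect the decision.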
Alternatively, Distance may instead be computed as the mean of the ranks of all substrings of the person name "linshuhao" in each dictionary, with the language of the dictionary having the smallest mean taken as the person-name language recognition result. Or the product of the normalized weights of all substrings of the person name may be computed for each dictionary, with the language of the dictionary having the largest product taken as the recognition result.
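The two alternative decision rules can be sketched as follows, taking the per-dictionary ranks or normalized weights of a name's substrings as already computed. The function names and the numeric inputs are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, List


def pick_by_mean_rank(ranks: Dict[str, List[int]]) -> str:
    """Smallest mean rank wins."""
    return min(ranks, key=lambda d: mean(ranks[d]))


def pick_by_weight_product(norm_weights: Dict[str, List[float]]) -> str:
    """Largest product of normalized weights wins."""
    return max(norm_weights, key=lambda d: math.prod(norm_weights[d]))


# Hypothetical values for one name in two dictionaries.
by_rank = pick_by_mean_rank({"D1": [1, 2, 3], "D2": [4, 5, 6]})
by_product = pick_by_weight_product({"D1": [0.5, 0.4], "D2": [0.1, 0.9]})
```

Since a given name yields the same number of substrings in every dictionary, the mean-rank rule selects the same dictionary as the rank-sum rule. For the product rule, a single substring with weight zero drives the whole product to zero, so a practical version would likely need some smoothing; that is an observation about this sketch, not something the text specifies.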
As can be seen, if person-name language recognition relied directly on the raw weights alone, errors could arise because the weights in the individual dictionaries have not been normalized to a common scale. The processing in the sorting step above can therefore improve recognition accuracy.
It should be understood, however, that the above sorting algorithm is merely an example and not a limitation, and those skilled in the art may devise other sorting algorithms in accordance with the principles of the present disclosure. For example, ranks may also be assigned in ascending order of weight, in which case the person name is determined to belong to the language represented by the dictionary with the largest rank sum.
It should also be understood that the sorting step is optional. For example, instead of sorting, the weights in all person-name unit dictionaries may be normalized, and person-name language recognition may then be performed using the normalized weights.
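One possible normalization is sketched below. The text does not say which normalization is intended; dividing each weight by the total weight of its dictionary, so that every dictionary sums to 1, is an assumption, as is the function name.

```python
from typing import Dict


def normalize_dictionary(dictionary: Dict[str, float]) -> Dict[str, float]:
    """Scale the weights of one dictionary so that they sum to 1."""
    total = sum(dictionary.values())
    if total == 0:
        return dict(dictionary)  # nothing to scale
    return {term: weight / total for term, weight in dictionary.items()}


normalized = normalize_dictionary({"li": 0.8, "lian": 0.9, "liang": 1.0})
```

After this step, weights from dictionaries of different sizes lie on a comparable scale and can be summed per language in the recognition step.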
It should be noted that although the person-name language recognition method according to an embodiment of the present disclosure has been described above with reference to FIG. 3, those skilled in the art may modify, combine, or otherwise vary the above processing in accordance with the principles of the present disclosure.
Next, a functional configuration example of the person-name language recognition device according to an embodiment of the present disclosure is described with reference to FIG. 4. FIG. 4 is a block diagram showing this functional configuration example.
As shown in FIG. 4, the person-name language recognition device 400 according to an embodiment of the present disclosure may include a person-name division unit 402, an n-gram substring weight calculation unit 404, and a recognition unit 406.
The person-name division unit 402 may be configured to divide the input person name into n-gram substrings, where an n-gram substring is a unit consisting of n consecutive characters of the person name and n is an integer greater than or equal to 2.
The n-gram substring weight calculation unit 404 may be configured to calculate the weight of each n-gram substring in each person-name unit dictionary from the weights, in the weight-annotated person-name unit dictionaries described above, of all terms containing that n-gram substring.
The recognition unit 406 may be configured to identify the language to which the person name belongs according to the weights of all n-gram substrings of the person name in each person-name unit dictionary.
Preferably, the person-name language recognition device 400 may further include a sorting unit 408.
The sorting unit 408 may be configured to determine the rank of each n-gram substring in each person-name unit dictionary according to its weight in that dictionary, and the recognition unit 406 may further identify the language to which the person name belongs according to the ranks of all n-gram substrings of the person name in each person-name unit dictionary.
Preferably, the sorting unit 408 may further determine the rank of the n-gram substrings in each person-name unit dictionary in descending order of their weights in that dictionary, and the recognition unit 406 may further calculate, for each person-name unit dictionary, the sum of the ranks of all n-gram substrings of the person name and determine the language represented by the dictionary with the smallest sum as the language to which the person name belongs.
It should be understood that the person-name language recognition device described with reference to FIG. 4 is the device embodiment corresponding to the person-name language recognition method described above. For anything not described in detail in the device embodiment, refer to the corresponding description in the method embodiment; it is not repeated here.
It should also be noted that although an example of the functional configuration of the person-name language recognition device according to an embodiment of the present disclosure has been described above with reference to FIG. 4, this is merely an example and not a limitation. Those skilled in the art may combine and/or omit the functional modules described in the above embodiment and/or add one or more functional modules as actually needed, and such variations should be considered to fall within the scope of the present disclosure.
It should be understood that the machine-executable instructions in the storage medium and the program product according to embodiments of the present disclosure can also carry out the person-name unit dictionary expansion method and the person-name language recognition method described above. For anything not described in detail here, refer to the corresponding earlier description; it is not repeated here.
Accordingly, a storage medium carrying the above program product in which machine-executable instructions are stored is also included in the present disclosure. The storage medium includes, but is not limited to, a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick, and the like.
It should also be noted that the above series of processes and devices may be implemented by software and/or firmware. In that case, the program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure, such as the general-purpose personal computer 500 shown in FIG. 5, which can perform various functions when various programs are installed.
In FIG. 5, a central processing unit (CPU) 501 executes various processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores, as needed, data required when the CPU 501 executes the various processes.
The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output interface 505 is also connected to the bus 504.
The following components are connected to the input/output interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker and the like; the storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, and the like. The communication section 509 performs communication processing via a network such as the Internet.
A drive 510 is also connected to the input/output interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 510 as needed, so that a computer program read from it is installed into the storage section 508 as needed.
When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 511.
Those skilled in the art should understand that such a storage medium is not limited to the removable medium 511 shown in FIG. 5, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 511 include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disc read-only memory (CD-ROM) and digital versatile disc (DVD)), magneto-optical disks (including MiniDisc (MD) (registered trademark)), and semiconductor memories. Alternatively, the storage medium may be the ROM 502, a hard disk contained in the storage section 508, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
It should also be noted that the steps of the above series of processes may naturally be performed in chronological order in the order described, but need not be. Some steps may be performed in parallel or independently of one another.
Although the present disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made without departing from the spirit and scope of the present disclosure as defined by the appended claims. Moreover, the terms "include", "comprise", and any other variants thereof in the embodiments of the present disclosure are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
According to the embodiments of the present disclosure, the following supplementary notes are also disclosed:
1. A person-name unit dictionary expansion method, comprising:
a term initial weight calculation step of calculating the initial weight of each term in person-name unit dictionaries of a predetermined plurality of languages, based on the number of person-name unit dictionaries in which the term appears;
a person-name unit matching step of dividing each of a plurality of person names serving as training samples into person-name units, matching the resulting person-name units against the terms in the person-name unit dictionaries of the predetermined plurality of languages, and determining each matched unit as a matched person-name unit and each unmatched unit as an unmatched person-name unit;
a person-name weight calculation step of determining the weight, in each person-name unit dictionary, of a person name containing the matched person-name unit, according to the weights of the matched person-name unit in the person-name unit dictionaries of the predetermined plurality of languages;
an unmatched person-name unit processing step of calculating the weight of the unmatched person-name unit in each person-name unit dictionary according to the weights, in each person-name unit dictionary, of all person names containing the unmatched person-name unit, and adding the unmatched person-name unit as a term to the person-name unit dictionaries that contain the matched person-name units of all person names containing the unmatched person-name unit;
a matched person-name unit weight update step of updating the weight of the matched person-name unit in each person-name unit dictionary according to the weights, in each person-name unit dictionary, of all person names containing the matched person-name unit; and
repeating the processing in the person-name unit matching step, the person-name weight calculation step, the unmatched person-name unit processing step, and the matched person-name unit weight update step until the weight change of every term in the person-name unit dictionaries of the predetermined plurality of languages is less than a predetermined threshold, thereby obtaining person-name unit dictionaries in which every term is annotated with a weight.
2. The method according to Supplementary Note 1, further comprising:
a normalization step of normalizing the weights of the person names in each person-name unit dictionary calculated in the person-name weight calculation step,
wherein, in the unmatched person-name unit processing step and the matched person-name unit weight update step, the normalized person-name weights are used in the respective calculations.
3. A person-name language recognition method, comprising:
a person-name division step of dividing an input person name into n-gram substrings, wherein an n-gram substring is a unit consisting of n consecutive characters of the person name and n is an integer greater than or equal to 2;
an n-gram substring weight calculation step of calculating the weight of each n-gram substring in each person-name unit dictionary from the weights, in the weight-annotated person-name unit dictionaries obtained according to Supplementary Note 1 or 2, of all terms containing that n-gram substring; and
a recognition step of identifying the language to which the person name belongs according to the weights of all n-gram substrings of the person name in each person-name unit dictionary.
4. The method according to Supplementary Note 3, further comprising:
a sorting step of determining the rank of each n-gram substring in each person-name unit dictionary according to its weight in that dictionary,
wherein, in the recognition step, the language to which the person name belongs is identified according to the ranks of all n-gram substrings of the person name in each person-name unit dictionary.
5. The method according to Supplementary Note 4, wherein, in the sorting step, the rank of the n-gram substrings in each person-name unit dictionary is determined in descending order of their weights in that dictionary,
and wherein, in the recognition step, the sum of the ranks of all n-gram substrings of the person name is calculated for each person-name unit dictionary, and the language represented by the person-name unit dictionary with the smallest sum is determined as the language to which the person name belongs.
6. A person-name language recognition device, comprising:
a person-name division unit configured to divide an input person name into n-gram substrings, wherein an n-gram substring is a unit consisting of n consecutive characters of the person name and n is an integer greater than or equal to 2;
an n-gram substring weight calculation unit configured to calculate the weight of each n-gram substring in each person-name unit dictionary from the weights, in the weight-annotated person-name unit dictionaries obtained according to Supplementary Note 1 or 2, of all terms containing that n-gram substring; and
a recognition unit configured to identify the language to which the person name belongs according to the weights of all n-gram substrings of the person name in each person-name unit dictionary.
7. The device according to Supplementary Note 6, further comprising:
a sorting unit configured to determine the rank of each n-gram substring in each person-name unit dictionary according to its weight in that dictionary,
wherein the recognition unit further identifies the language to which the person name belongs according to the ranks of all n-gram substrings of the person name in each person-name unit dictionary.
8. The device according to Supplementary Note 7, wherein the sorting unit further determines the rank of the n-gram substrings in each person-name unit dictionary in descending order of their weights in that dictionary,
and wherein the recognition unit further calculates, for each person-name unit dictionary, the sum of the ranks of all n-gram substrings of the person name, and determines the language represented by the person-name unit dictionary with the smallest sum as the language to which the person name belongs.
Claims (8)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410221701.8A CN105095322A (en) | 2014-05-23 | 2014-05-23 | Personnel name unit dictionary expansion method, personnel name language recognition method, personnel name unit dictionary expansion device and personnel name language recognition device |
JP2015102946A JP2015225662A (en) | 2014-05-23 | 2015-05-20 | Personal name unit dictionary extension method, personal name language recognition method, and personal name language recognition device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410221701.8A CN105095322A (en) | 2014-05-23 | 2014-05-23 | Personnel name unit dictionary expansion method, personnel name language recognition method, personnel name unit dictionary expansion device and personnel name language recognition device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105095322A (en) | 2015-11-25 |
Family
ID=54575767
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410221701.8A Pending CN105095322A (en) | 2014-05-23 | 2014-05-23 | Personnel name unit dictionary expansion method, personnel name language recognition method, personnel name unit dictionary expansion device and personnel name language recognition device |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP2015225662A (en) |
CN (1) | CN105095322A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106227364A (en) * | 2016-07-28 | 2016-12-14 | 百度在线网络技术(北京)有限公司 | For determining the method and apparatus representing order of name result |
CN108830380A (en) * | 2018-04-11 | 2018-11-16 | 开放智能机器(上海)有限公司 | A kind of training pattern generation method and system based on cloud service |
CN110178139A (en) * | 2016-11-14 | 2019-08-27 | 柯达阿拉里斯股份有限公司 | Use the system and method for the character recognition of the full convolutional neural networks with attention mechanism |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115410185A (en) * | 2022-08-26 | 2022-11-29 | 惠每数科(北京)医疗科技有限公司 | A method for extracting attributes of specific person names and unit names from multimodal data |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080162118A1 (en) * | 2006-12-15 | 2008-07-03 | International Business Machines Corporation | Technique for Searching Out New Words That Should Be Registered in Dictionary For Speech Processing |
JP2009295052A (en) * | 2008-06-06 | 2009-12-17 | Yahoo Japan Corp | Compound word break estimating device, method, and program for estimating break position of compound word |
CN102033879A (en) * | 2009-09-27 | 2011-04-27 | 腾讯科技(深圳)有限公司 | Method and device for identifying Chinese name |
- 2014-05-23: CN201410221701.8A filed, published as CN105095322A (status: Pending)
- 2015-05-20: JP2015102946A filed, published as JP2015225662A (status: Pending)
Non-Patent Citations (3)
Title |
---|
BRUNO POULIQUEN等: "Multilingual person name recognition and transliteration", 《COMPUTER SCIENCE》 * |
张仰森 等: "基于姓氏驱动的中国姓名自动识别方法", 《计算机工程与应用》 * |
童毅见: "基于平行语料库的英语人名译名识别", 《大学英语》 * |
Also Published As
Publication number | Publication date |
---|---|
JP2015225662A (en) | 2015-12-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104239300B (en) | The method and apparatus that semantic key words are excavated from text | |
WO2021068339A1 (en) | Text classification method and device, and computer readable storage medium | |
US11379668B2 (en) | Topic models with sentiment priors based on distributed representations | |
CN103678418B (en) | Information processing method and message processing device | |
WO2019091026A1 (en) | Knowledge base document rapid search method, application server, and computer readable storage medium | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN106960001B (en) | A kind of entity link method and system of term | |
JP5379138B2 (en) | Creating an area dictionary | |
Desai | A review on knowledge discovery using text classification techniques in text mining | |
US12190621B2 (en) | Generating weighted contextual themes to guide unsupervised keyphrase relevance models | |
CN108319583A (en) | Method and system for extracting knowledge from Chinese language material library | |
CN112633000A (en) | Method and device for associating entities in text, electronic equipment and storage medium | |
JP2018010514A (en) | Bilingual dictionary creation device, bilingual dictionary creation method, and bilingual dictionary creation program | |
CN104881399B (en) | Event recognition method and system based on probability soft logic PSL | |
CN108717459A (en) | A kind of mobile application defect positioning method of user oriented comment information | |
CN102999538B (en) | Personage's searching method and equipment | |
CN105608075A (en) | Related knowledge point acquisition method and system | |
CN105095322A (en) | Personnel name unit dictionary expansion method, personnel name language recognition method, personnel name unit dictionary expansion device and personnel name language recognition device | |
Saini et al. | Intrinsic plagiarism detection system using stylometric features and DBSCAN | |
CN110968693A (en) | A computational method for multi-label text classification based on ensemble learning | |
JP6495124B2 (en) | Term semantic code determination device, term semantic code determination model learning device, method, and program | |
Bettiche et al. | Opinion mining in social networks for Algerian dialect | |
US12361027B2 (en) | Iterative sampling based dataset clustering | |
CN108733702B (en) | Method, device, electronic equipment and medium for extracting upper and lower relation of user query | |
US20130238607A1 (en) | Seed set expansion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20151125 |