CN107402933A - Entity polyphone disambiguation method and entity polyphone disambiguation equipment - Google Patents

Entity polyphone disambiguation method and entity polyphone disambiguation equipment

Info

Publication number
CN107402933A
Authority
CN
China
Prior art keywords
entity
pronunciation
attribute
predetermined
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610342051.1A
Other languages
Chinese (zh)
Inventor
房璐
缪庆亮
孟遥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN201610342051.1A priority Critical patent/CN107402933A/en
Priority to JP2017100185A priority patent/JP2017208097A/en
Publication of CN107402933A publication Critical patent/CN107402933A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/907Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An entity polyphone disambiguation method and an entity polyphone disambiguation device are disclosed. The method includes: an entity recognition step of identifying, from input text, at least one entity that includes a polyphonic character; and a pronunciation determination step of, for each of the at least one entity, linking the entity to a corresponding entity in a Linked Open Data (LOD) dataset and determining the pronunciation of the entity based on at least one attribute of the corresponding entity whose attribute value contains a pronunciation and/or on a pronunciation associated with the corresponding entity. According to embodiments of the present disclosure, the pronunciation of an entity can be found in Linked Open Data, so that the pronunciation of polyphonic characters in the entity can be disambiguated.

Description

Entity polyphone disambiguation method and entity polyphone disambiguation device

Technical field

The present disclosure relates to the field of information processing and, more specifically, to an entity polyphone disambiguation method and an entity polyphone disambiguation device that can find the pronunciation of an entity in Linked Open Data and thereby disambiguate the pronunciation of polyphonic characters in the entity.

Background

TTS (Text To Speech) technology, also known as text-to-speech conversion, is representative of current speech synthesis and refers to using a computer to convert arbitrary text into speech. Because input text must be converted into its corresponding pronunciation, polyphone disambiguation is a core problem of text-to-speech conversion. Whether polyphonic characters are converted correctly greatly affects how well the user understands the synthesized speech. If polyphone disambiguation is highly accurate, the synthesized speech is easier for users to understand and sounds more natural and fluent.

Chinese and Japanese contain a large number of polyphonic characters, so determining their pronunciation is a key problem in speech synthesis for Chinese or Japanese text. At present there are two main approaches to polyphone disambiguation: rules that are summarized and formulated manually, and machine learning. The rule-based approach is labor-intensive, and in some cases the pronunciation of a polyphonic character follows no rule at all, so that even a person cannot determine it; in Japanese, for example, the same kanji may be pronounced differently in different people's names. The machine learning approach often requires a large manually annotated corpus, which is likewise time-consuming and laborious.

Linked Data is a set of best practices for using the Web to create semantic links between different data sources. Linked Data uses Uniform Resource Identifiers (URIs) to identify resources (which can be understood as entities), so each entity is unique, and it provides the metadata of each resource, namely its attributes and attribute values, in the form of triples. Linked Data that is published on the Internet is called Linked Open Data (LOD); commonly used large-scale LOD datasets include DBpedia, Freebase, and others. For example, DBpedia is a structured dataset derived from Wikipedia. When people edit the Wikipedia page of an entity, they often also give its pronunciation, but not in any fixed form. In LOD, some resources have attributes such as pronunciation, and each resource has a unique identifier. We therefore consider using LOD to disambiguate polyphonic resources.

Summary of the invention

A brief summary of the present disclosure is given below in order to provide a basic understanding of certain aspects of the disclosure. It should be understood, however, that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical parts of the disclosure, nor to limit its scope. Its sole purpose is to present some concepts of the disclosure in simplified form as a prelude to the more detailed description given later.

In view of the above problems, an object of the present disclosure is to provide an entity polyphone disambiguation method and an entity polyphone disambiguation device that can find the pronunciation of an entity in Linked Open Data and use it as the disambiguation result, thereby disambiguating the pronunciation of polyphonic characters in the entity.

According to one aspect of the present disclosure, an entity polyphone disambiguation method is provided, including: an entity recognition step, which may identify, from input text, at least one entity that includes a polyphonic character; and a pronunciation determination step, which may, for each of the at least one entity, link the entity to a corresponding entity in a Linked Open Data dataset, and may determine the pronunciation of the entity based on at least one attribute of the corresponding entity whose attribute value contains a pronunciation and/or on a pronunciation associated with the corresponding entity.

According to another aspect of the present disclosure, an entity polyphone disambiguation device is also provided, including: an entity recognition unit, which may be configured to identify, from input text, at least one entity that includes a polyphonic character; and a pronunciation determination unit, which may be configured to, for each of the at least one entity, link the entity to a corresponding entity in a Linked Open Data dataset, and to determine the pronunciation of the entity based on at least one attribute of the corresponding entity whose attribute value contains a pronunciation and/or on a pronunciation associated with the corresponding entity.

According to other aspects of the present disclosure, computer program code and a computer program product for implementing the above method according to the present disclosure are also provided, as well as a computer-readable storage medium on which that computer program code is recorded.

Other aspects of embodiments of the present disclosure are set out in the following description, in which the detailed description serves to fully disclose preferred embodiments of the present disclosure without limiting them.

Brief description of the drawings

The present disclosure can be better understood by referring to the detailed description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to denote the same or similar parts. The drawings, together with the detailed description below, are included in and form part of this specification, and serve to further illustrate preferred embodiments of the present disclosure and to explain its principles and advantages. In the drawings:

Fig. 1 is a flowchart showing an example flow of the entity polyphone disambiguation method according to an embodiment of the present disclosure;

Fig. 2 is a diagram showing an example of one entity in a Linked Open Data dataset;

Fig. 3 is a diagram showing an example of another entity in a Linked Open Data dataset;

Fig. 4 is a diagram showing an example of yet another entity in a Linked Open Data dataset;

Fig. 5 is a block diagram showing an example functional configuration of the entity polyphone disambiguation device according to an embodiment of the present disclosure; and

Fig. 6 is a block diagram showing an example structure of a personal computer as an information processing device that can be employed in embodiments of the present disclosure.

Detailed description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that many implementation-specific decisions must be made in developing any such actual embodiment in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. It should also be understood that, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.

It should also be noted that, to avoid obscuring the present disclosure with unnecessary detail, the drawings show only the device structures and/or processing steps closely related to the solution according to the present disclosure, and omit other details of little relevance to it.

Embodiments according to the present disclosure are described in detail below with reference to the accompanying drawings.

First, an example flow of the entity polyphone disambiguation method according to an embodiment of the present disclosure is described with reference to Fig. 1, which is a flowchart of that example flow. As shown in Fig. 1, the entity polyphone disambiguation method according to an embodiment of the present disclosure may include an entity recognition step S102 and a pronunciation determination step S104.

First, in the entity recognition step S102, at least one entity that includes a polyphonic character may be identified from the input text.

Specifically, in the entity recognition step S102, entities in the input text may be identified using named entity recognition. This is only an example and not a limitation; those skilled in the art may also use other techniques to identify entities in the input text. For example, in the Japanese sentence “世界最強の選手が集うATPツアー·ファイナルに錦織圭(日清食品)が初出場”, the person name “錦織圭” and the organization name “日清食品” can be identified, where “錦織圭” is an entity that includes the polyphonic word “錦織”.

In the pronunciation determination step S104, for each of the at least one entity, the entity may be linked to a corresponding entity in a Linked Open Data dataset, and the pronunciation of the entity may be determined based on at least one attribute of the corresponding entity whose attribute value contains a pronunciation and/or on a pronunciation associated with the corresponding entity.

In the present disclosure, by way of example and not limitation, the LOD dataset is DBpedia. The LOD dataset may also be Freebase or another dataset.

For “錦織圭” and “日清食品” identified in the entity recognition step S102, entity linking can be used to link each of them to the corresponding entity in the LOD dataset. For example, “錦織圭” can be linked to the corresponding DBpedia entity “http://ja.dbpedia.org/resource/錦織圭”, and the pronunciation of the entity “錦織圭” can be determined based on at least one attribute of that corresponding entity whose attribute value contains a pronunciation and/or on a pronunciation associated with that corresponding entity. Likewise, “日清食品” can be linked to the corresponding DBpedia entity “http://ja.dbpedia.org/resource/日清食品”, and the pronunciation of the entity “日清食品” can be determined in the same way. Because each entity in LOD is unique, the pronunciation obtained is also unambiguous.

Preferably, the at least one attribute may include at least one first predetermined attribute whose attribute value is directly a pronunciation. For an entity in an LOD dataset, the pronunciation is found in attribute values, and in some cases an attribute value is itself the pronunciation of the entity. Fig. 2 is a diagram showing an example of one entity in a Linked Open Data dataset, namely the entity “http://ja.dbpedia.org/resource/錦織淳”. For example, the attribute value “にしこおりあつし” of the attribute “http://ja.dbpedia.org/property/各国語表記” shown in Fig. 2 is directly the pronunciation of “錦織淳”. Fig. 3 is a diagram showing an example of another entity in a Linked Open Data dataset, namely the entity “http://ja.dbpedia.org/resource/錦織一清”. For example, the attribute value “にしきおりかずきよ” of the attribute “http://xmlns.com/foaf/0.1/name” shown in Fig. 3 is directly the pronunciation of “錦織一清”; the attribute value “にしきおりかずきよ” of the attribute “http://ja.dbpedia.org/property/ふりがな” shown in Fig. 3 is likewise directly the pronunciation of “錦織一清”. However, the attributes that represent pronunciation often differ from entity to entity, and some general-purpose attributes do not always represent pronunciation, so these attributes need to be selected.
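The lookup described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the in-memory `TRIPLES` list stands in for a real LOD dataset and contains only the attribute/value pairs quoted in the text for Fig. 2 and Fig. 3; `link_entity` is a hypothetical, deliberately naive linker that assumes the surface form equals the resource name; and the contents and order of `FIRST_PREDETERMINED` are assumptions.

```python
from typing import List, Optional, Tuple

RESOURCE_PREFIX = "http://ja.dbpedia.org/resource/"

# Toy stand-in for an LOD dataset: (entity, attribute, value) triples
# taken from the examples quoted in the text for Fig. 2 and Fig. 3.
TRIPLES: List[Tuple[str, str, str]] = [
    (RESOURCE_PREFIX + "錦織淳", "http://ja.dbpedia.org/property/各国語表記", "にしこおりあつし"),
    (RESOURCE_PREFIX + "錦織一清", "http://xmlns.com/foaf/0.1/name", "にしきおりかずきよ"),
    (RESOURCE_PREFIX + "錦織一清", "http://ja.dbpedia.org/property/ふりがな", "にしきおりかずきよ"),
]

# Assumed list of first predetermined attributes, in lookup order.
FIRST_PREDETERMINED = [
    "http://ja.dbpedia.org/property/ふりがな",
    "http://ja.dbpedia.org/property/各国語表記",
]


def link_entity(surface: str) -> str:
    """Hypothetical naive linker: the surface form is used as the resource name."""
    return RESOURCE_PREFIX + surface


def direct_pronunciation(surface: str) -> Optional[str]:
    """Return the value of the first predetermined attribute found, if any."""
    uri = link_entity(surface)
    values = {attr: val for subj, attr, val in TRIPLES if subj == uri}
    for attr in FIRST_PREDETERMINED:
        if attr in values:
            return values[attr]
    return None
```

An entity with no such attribute in the toy data yields `None`, which is the case that has to be handled by extracting the pronunciation from inside another attribute value.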

Preferably, the at least one first predetermined attribute may be obtained as follows: obtain the name of each entity in the LOD dataset; enumerate all pronunciations of that entity as candidate pronunciations, based on all dictionary pronunciations of each character in its name; if the entity has an attribute whose attribute value exactly matches any one of the entity's candidate pronunciations, select that attribute as a candidate attribute; and, among all candidate attributes selected for all entities in the LOD dataset, select at least one candidate attribute whose probability of representing a pronunciation is greater than a predetermined threshold as the at least one first predetermined attribute.

Specifically, to select the attributes that represent pronunciation, we first obtain the name of each entity in the LOD dataset and, based on all dictionary pronunciations of each character in the name, enumerate all possible pronunciations of the name as candidate pronunciations. The candidate pronunciations of the entity are compared one by one with the attribute values of that entity. If one of the candidate pronunciations matches one of the attribute values, the corresponding attribute is selected as a candidate attribute, and the attribute value is the pronunciation of the entity. Then, for all candidate attributes selected for all entities in the LOD dataset, the probability that each represents a pronunciation is calculated. A candidate attribute whose probability of representing a pronunciation is greater than the predetermined threshold is retained as a first predetermined attribute.
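The candidate-generation and matching steps can be sketched as below. The per-character reading table `READINGS` is a toy dictionary with hypothetical entries chosen for illustration only; a real system would use a full pronunciation lexicon. The function names are likewise assumptions.

```python
from itertools import product
from typing import Dict, List, Set

# Toy per-character reading dictionary (hypothetical entries, illustration only).
READINGS: Dict[str, List[str]] = {
    "錦": ["にしき", "にし"],
    "織": ["おり", "こおり", "こり", "ごり"],
    "淳": ["あつし", "じゅん"],
}


def candidate_pronunciations(name: str, readings: Dict[str, List[str]]) -> Set[str]:
    """Enumerate every combination of per-character readings for the name."""
    # Characters missing from the dictionary are kept unchanged.
    per_char = [readings.get(ch, [ch]) for ch in name]
    return {"".join(parts) for parts in product(*per_char)}


def candidate_attributes(
    name: str, attributes: Dict[str, str], readings: Dict[str, List[str]]
) -> List[str]:
    """Return the attributes whose value exactly matches a candidate pronunciation."""
    candidates = candidate_pronunciations(name, readings)
    return [attr for attr, value in attributes.items() if value in candidates]
```

With the attribute/value pair quoted for Fig. 2, `candidate_attributes("錦織淳", {"http://ja.dbpedia.org/property/各国語表記": "にしこおりあつし"}, READINGS)` selects that attribute, because “にし” + “こおり” + “あつし” is one of the enumerated candidates.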

Preferably, the probability that a candidate attribute represents a pronunciation may be the ratio of the number of times the attribute value of the candidate attribute is a pronunciation to the total number of occurrences of the candidate attribute in the LOD dataset.

If a candidate attribute is denoted by a, the probability P(a) that the candidate attribute a represents a pronunciation is given by formula (1):

P(a) = (number of times the attribute value of a is a pronunciation) / (total number of occurrences of attribute a)    (1)

Preferably, the predetermined threshold can be determined by those skilled in the art based on experience or experiment.
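Formula (1) and the threshold test can be sketched as follows. The input format, a stream of `(attribute, is_pronunciation)` pairs with one pair per occurrence of an attribute in the dataset, is an assumption made for this sketch, as are the function name and the example threshold.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def select_first_predetermined(
    observations: Iterable[Tuple[str, bool]], threshold: float
) -> Tuple[Dict[str, float], List[str]]:
    """Compute P(a) per attribute and keep attributes with P(a) > threshold."""
    total: Counter = Counter()  # occurrences of each attribute in the dataset
    hits: Counter = Counter()   # occurrences whose value matched a pronunciation
    for attr, is_pronunciation in observations:
        total[attr] += 1
        if is_pronunciation:
            hits[attr] += 1
    probs = {attr: hits[attr] / total[attr] for attr in total}
    selected = sorted(attr for attr, p in probs.items() if p > threshold)
    return probs, selected
```

For example, an attribute that matched a pronunciation in 3 of 3 occurrences gets P(a) = 1.0, while one that matched in 1 of 4 occurrences gets P(a) = 0.25; with an assumed threshold of 0.5 only the first is kept.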

According to an embodiment of the present disclosure, for the example shown in Fig. 2, it can be determined that the attribute value “にしこおりあつし” of the attribute “http://ja.dbpedia.org/property/各国語表記” is directly the pronunciation of “錦織淳”.

According to an embodiment of the present disclosure, for the example shown in Fig. 3, it can be determined that the attribute value “にしきおりかずきよ” of the attribute “http://xmlns.com/foaf/0.1/name” is directly the pronunciation of “錦織一清”. It can also be determined that the attribute value “にしきおりかずきよ” of the attribute “http://ja.dbpedia.org/property/ふりがな” is directly the pronunciation of “錦織一清”.

Preferably, the at least one attribute may further include at least one second predetermined attribute whose attribute value contains a pronunciation that can be extracted using at least one pronunciation extraction template. Besides the case in which an attribute value is itself the pronunciation of the entity, in some cases the pronunciation is contained within an attribute value, and the position at which it appears generally follows a regular pattern. In that case, pronunciation extraction templates can be used to determine the pronunciation of the entity. Fig. 4 is a diagram showing an example of yet another entity in a Linked Open Data dataset, namely the entity “http://ja.dbpedia.org/resource/錦織圭”. Unlike the examples shown in Fig. 2 and Fig. 3, the attributes in Fig. 4 do not include any attribute whose value is directly the pronunciation of the entity; that is, the pronunciation of “錦織圭” cannot be determined directly from the attribute values in Fig. 4. However, the pronunciation of “錦織圭” is contained in the attribute value of the attribute “http://www.w3.org/2000/01/rdf-schema#comment”. In this case, the pronunciation of the entity in Fig. 4 can be obtained using a pronunciation extraction template.

Preferably, the at least one pronunciation extraction template may be generated as follows: for each entity in the LOD dataset that has any of the at least one first predetermined attribute, the pronunciation of the entity may be determined from the attribute value of that first predetermined attribute, and the regularity of the position at which that pronunciation appears in the attribute values of the entity's other attributes that contain the pronunciation may be determined, so that the at least one pronunciation extraction template is generated automatically from all entities in the LOD dataset that have any of the at least one first predetermined attribute.

In an LOD dataset, in some cases the pronunciation is not only directly the value of one attribute but also appears in other attribute values; that is, those other attribute values also contain the pronunciation of the entity. In other cases the pronunciation of the entity is not directly an attribute value at all but is only contained within an attribute value, at a position that follows a regular pattern. In such cases, pronunciation extraction templates can be generated automatically to match these pronunciations.

To generate pronunciation extraction templates, a template training corpus must first be collected. Specifically, for each attribute in the attribute list consisting of the obtained at least one first predetermined attribute, each entity in the LOD dataset that has this attribute is looked up, the pronunciation of the entity is determined from the value of this attribute, and the entity's other attribute values that contain this pronunciation are collected as the template training corpus. For example, for the entity “http://ja.dbpedia.org/resource/錦織淳” shown in Fig. 2, the attribute “http://ja.dbpedia.org/property/各国語表記” is an attribute whose value is directly the pronunciation, so the pronunciation of this entity is known to be “にしこおりあつし”. Examining the other attribute values shows that the value of the attribute “http://www.w3.org/2000/01/rdf-schema#comment” contains the pronunciation “にしこおりあつし”, so that attribute value is taken as template training corpus. Similarly, for the entity “http://ja.dbpedia.org/resource/錦織一清” shown in Fig. 3, the attribute “http://xmlns.com/foaf/0.1/name” is an attribute whose value is directly the pronunciation, so the pronunciation of this entity is known to be “にしきおりかずきよ”. Examining the other attribute values shows that the value of the attribute “http://www.w3.org/2000/01/rdf-schema#comment” contains the pronunciation “にしきおりかずきよ”, so that attribute value is also taken as template training corpus.
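Corpus collection as described above can be sketched like this. The dictionary-of-dictionaries layout of `ENTITIES` and the function name are assumptions made for the sketch; the attribute URIs and the comment text are the ones quoted in this document for the Fig. 2 entity.

```python
from typing import Dict, List, Tuple

COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"
KAKKOKUGO = "http://ja.dbpedia.org/property/各国語表記"

# Toy dataset: entity URI -> {attribute: value}, using values quoted in the text.
ENTITIES: Dict[str, Dict[str, str]] = {
    "http://ja.dbpedia.org/resource/錦織淳": {
        KAKKOKUGO: "にしこおりあつし",
        COMMENT: "錦織淳(にしこおりあつし、1945年7月30日-)は、日本の弁護士·政治家。",
    },
}


def collect_template_corpus(
    entities: Dict[str, Dict[str, str]], first_predetermined: List[str]
) -> List[Tuple[str, str, str]]:
    """Collect (pronunciation, attribute, value) items for template training."""
    corpus = []
    for attrs in entities.values():
        # Pronunciation comes from the first direct-pronunciation attribute present.
        pron = next((attrs[a] for a in first_predetermined if a in attrs), None)
        if pron is None:
            continue
        for attr, value in attrs.items():
            # Keep other attribute values that contain the pronunciation.
            if attr not in first_predetermined and pron in value:
                corpus.append((pron, attr, value))
    return corpus
```

Each corpus item keeps the pronunciation alongside the containing value, since template generation needs to know where in the value the pronunciation sits.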

Second, the regularity of the position at which the pronunciation occurs in the attribute values of the entity's other attributes that contain the pronunciation is determined, so that pronunciation extraction templates can be generated automatically. For each training corpus, the characters within a window of length N before and after the pronunciation string can be extracted, and the digits and Chinese characters (kanji) among the extracted characters can be generalized. This determines where the pronunciation occurs in the attribute value and yields one candidate template. For example, as described above, the attribute value of the attribute "http://www.w3.org/2000/01/rdf-schema#comment" shown in Figure 2 is one template training corpus. From this corpus, the string "錦織淳(にしこおりあつし、1945年7月30日-)は", which includes the pronunciation string, can be extracted, and the digits and kanji in it generalized. This determines where the pronunciation "にしこおりあつし" occurs in the attribute value of "http://www.w3.org/2000/01/rdf-schema#comment" and yields one candidate template. Likewise, the attribute value of the attribute "http://www.w3.org/2000/01/rdf-schema#comment" shown in Figure 3 is also a template training corpus. From it, the string "錦織一清(にしきおりかずきよ、1965年5月22日-)", which includes the pronunciation string, can be extracted, and the digits and kanji in it generalized. This determines where the pronunciation "にしきおりかずきよ" occurs in the attribute value and yields another candidate template. To simplify the description, only the entities shown in Figures 2 and 3 are used as examples above; in practice, the above processing can be performed on all entities in the LOD data set that include any of the at least one first predetermined attribute.
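The window-based step described above can be sketched in Python. This is only an illustration under stated assumptions, not the disclosed implementation: the placeholder tokens `<K>`, `<D>` and `<PRON>`, the function names, and the window size are hypothetical, since the text only says that digits and Chinese characters are generalized.

```python
import re

# Hypothetical placeholder tokens; the source does not specify a notation.
KANJI = re.compile(r"[\u4e00-\u9fff]+")   # runs of CJK ideographs
DIGITS = re.compile(r"[0-9０-９]+")        # runs of ASCII or full-width digits


def generalize(text):
    """Replace each run of kanji with <K> and each run of digits with <D>."""
    return DIGITS.sub("<D>", KANJI.sub("<K>", text))


def candidate_template(value, pronunciation, window):
    """Build one candidate template from the context around the pronunciation.

    Returns None when the pronunciation does not occur in the value.
    """
    pos = value.find(pronunciation)
    if pos < 0:
        return None
    end = pos + len(pronunciation)
    left = value[max(0, pos - window):pos]
    right = value[end:end + window]
    return generalize(left) + "<PRON>" + generalize(right)


fig2 = "錦織淳(にしこおりあつし、1945年7月30日-)は"
fig3 = "錦織一清(にしきおりかずきよ、1965年5月22日-)"
print(candidate_template(fig2, "にしこおりあつし", 4))
print(candidate_template(fig3, "にしきおりかずきよ", 4))
```

With a window of 4 characters, both example values produce the same candidate in this sketch, which is the kind of regularity the later ranking step relies on.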

Alternatively, for each entity found to include a first predetermined attribute, the characters in the attribute values of the entity's other attributes that contain the pronunciation can be generalized, and the common subset of the generalized sentences can be extracted. This determines where the pronunciation occurs in those attribute values and yields a candidate template. Again taking Figures 2 and 3 as examples, the attribute value "錦織淳(にしこおりあつし、1945年7月30日-)は、日本の弁護士·政治家。元衆議院議員(1期)。島根県出雲市(旧平田市)出身。" of the attribute "http://www.w3.org/2000/01/rdf-schema#comment" shown in Figure 2 can be generalized to obtain one generalized structure. In addition, the attribute value "錦織一清(にしきおりかずきよ、1965年5月22日-)はジャニーズ事務所に所属するグループ「少年隊」のリーダー。愛称は、ファンからは「ニッキ」、メンバー内では「ニシキ」。小学校5年の時にオーディションを受け、江戸川区立平井南小学校6年の1977年7月に事務所に入所。東京都出身。少年隊のイメージカラーは赤。" of the attribute "http://www.w3.org/2000/01/rdf-schema#comment" shown in Figure 3 can be generalized to obtain another generalized structure. The common subset of these two generalized structures is then extracted, which determines the regularity of the position at which the pronunciation occurs and yields a candidate template. To simplify the description, only the entities shown in Figures 2 and 3 are used as examples above; in practice, the above processing can be performed on all entities in the LOD data set that include any of the at least one first predetermined attribute.
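One possible reading of the "common subset" of two generalized structures is the context shared immediately to the left and right of the pronunciation. The Python sketch below follows that reading. It is an assumption rather than the disclosed algorithm, and the single-character placeholders `漢` and `0`, the `{PRON}` slot, and all function names are hypothetical.

```python
import re

KANJI = re.compile(r"[\u4e00-\u9fff]+")
DIGITS = re.compile(r"[0-9]+")


def generalize(text):
    # Single-character placeholders keep the character-wise comparison simple.
    return DIGITS.sub("0", KANJI.sub("漢", text))


def common_prefix(a, b):
    """Longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def common_template(examples):
    """examples: non-empty list of (attribute value, known pronunciation) pairs."""
    left = right = None
    for value, pron in examples:
        before, _, after = value.partition(pron)
        before, after = generalize(before), generalize(after)
        if left is None:
            left, right = before, after
        else:
            # Common suffix of the left contexts, common prefix of the right ones.
            left = common_prefix(left[::-1], before[::-1])[::-1]
            right = common_prefix(right, after)
    return f"{left}{{PRON}}{right}"


examples = [
    ("錦織淳(にしこおりあつし、1945年7月30日-)は、日本の弁護士·政治家。", "にしこおりあつし"),
    ("錦織一清(にしきおりかずきよ、1965年5月22日-)はジャニーズ事務所に所属する", "にしきおりかずきよ"),
]
print(common_template(examples))
```

In this sketch the shared structure covers the name, the opening parenthesis, the pronunciation slot, and the generalized birth date, while the entity-specific remainder of each sentence is dropped.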

Finally, all generated candidate templates are sorted, and the candidate templates whose number of occurrences exceeds a certain threshold are selected as the final pronunciation extraction templates. In this way, the at least one pronunciation extraction template can be generated automatically from all entities in the LOD data set that include any of the at least one first predetermined attribute.
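The ranking and thresholding step can be illustrated with a short Python sketch. The function name, the template strings, and the threshold value are assumptions made for illustration; the text does not specify how the threshold is chosen.

```python
from collections import Counter


def select_templates(candidates, min_count):
    """Keep candidate templates that occur more than min_count times,
    ordered from most to least frequent."""
    counts = Counter(candidates)
    return [tpl for tpl, n in counts.most_common() if n > min_count]


# Hypothetical candidates collected from many entities.
candidates = ["<K>(<PRON>、<D>", "<K>(<PRON>、<D>", "<K>(<PRON>、<D>",
              "<K>、<PRON>)", "<K>、<PRON>)", "<PRON>:<K>"]
print(select_templates(candidates, 1))
```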

Preferably, the attribute values of the at least one first predetermined attribute and of the at least one second predetermined attribute may be string-typed attribute values.

Preferably, for each of the at least one entity, if the attributes of the corresponding entity to which the entity is linked include one of the at least one first predetermined attribute, the attribute value of that first predetermined attribute may be taken as the pronunciation of the entity. If the attributes of the corresponding entity do not include any of the at least one first predetermined attribute, the at least one pronunciation extraction template may be used to determine the pronunciation of the entity.

Specifically, it is checked whether the attributes of the corresponding entity to which the entity is linked include any first predetermined attribute. If so, that is, if the corresponding entity has an attribute whose value is directly the pronunciation, the attribute value of that first predetermined attribute can be taken as the pronunciation of the entity. Take the entity "錦織淳", which includes polyphonic characters, as an example. Its corresponding entity is "http://ja.dbpedia.org/resource/錦織淳" shown in Figure 2. Because this corresponding entity has the attribute "http://ja.dbpedia.org/property/各国語表記", whose value is directly the pronunciation, the attribute value "にしこおりあつし" of "http://ja.dbpedia.org/property/各国語表記" can be taken directly as the pronunciation of "錦織淳". Similarly, take the entity "錦織一清", which also includes polyphonic characters. Its corresponding entity is "http://ja.dbpedia.org/resource/錦織一清" shown in Figure 3. Because this corresponding entity has the attributes "http://xmlns.com/foaf/0.1/name" and "http://ja.dbpedia.org/property/ふりがな", whose values are directly the pronunciation, the attribute value "にしきおりかずきよ" of either of these attributes can be taken directly as the pronunciation of "錦織一清".

If, on the other hand, the attributes of the corresponding entity to which the entity is linked do not include any first predetermined attribute, that is, the corresponding entity has no attribute whose value is directly the pronunciation, the pronunciation extraction templates can be used to determine the pronunciation of the entity. For the entity "http://ja.dbpedia.org/resource/錦織圭" shown in Figure 4, no attribute has a value that is directly the pronunciation, so the pronunciation extraction templates are needed to determine the pronunciation.

Preferably, using the at least one pronunciation extraction template to determine the pronunciation of one of the at least one entity may include: matching the at least one pronunciation extraction template against the string-typed attribute value of at least one attribute of the corresponding entity to which that entity is linked, and taking the matched string as the pronunciation of that entity.

Specifically, if the attributes of the corresponding entity to which the entity is linked do not include any first predetermined attribute, the pronunciation extraction templates can be matched against the string-typed attribute values of the corresponding entity, and the matched string can be taken as the pronunciation of the entity.

Take the entity "錦織圭", which includes polyphonic characters, as an example. Its corresponding entity is "http://ja.dbpedia.org/resource/錦織圭" shown in Figure 4. Because the attributes of this corresponding entity do not include any first predetermined attribute (that is, none of its attributes has a value that is directly the pronunciation), the pronunciation extraction templates are needed to determine the pronunciation. Specifically, the pronunciation extraction templates generated above, using the attribute values of the entities shown in Figures 2 and 3 as template training corpora, are matched against the string-typed attribute values of the entity "http://ja.dbpedia.org/resource/錦織圭". Suppose that matching a pronunciation extraction template against the attribute value of the attribute "http://www.w3.org/2000/01/rdf-schema#comment" yields the string "にしこりけい"; this matched string can then be taken as the pronunciation of the entity "錦織圭".
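The two-stage lookup described above (direct attribute first, template matching as a fallback) might be sketched in Python as follows. Several things here are assumptions: the entity is modelled as a plain dictionary, the template is expressed as a hypothetical regular expression with a named group `pron`, and the comment text used for "錦織圭" is an illustrative stand-in, because the actual Figure 4 value is not reproduced in this section.

```python
import re

COMMENT = "http://www.w3.org/2000/01/rdf-schema#comment"

# First predetermined attributes named in the examples of Figures 2 and 3.
DIRECT_ATTRS = [
    "http://xmlns.com/foaf/0.1/name",
    "http://ja.dbpedia.org/property/ふりがな",
    "http://ja.dbpedia.org/property/各国語表記",
]

# Hypothetical template: hiragana between "(" and "、".
TEMPLATES = [re.compile(r"\((?P<pron>[\u3041-\u3096]+)、")]


def determine_pronunciation(attrs, direct_attrs, templates):
    """Return the pronunciation of a linked entity, or None if not found."""
    # Stage 1: an attribute whose value is directly the pronunciation.
    for name in direct_attrs:
        if name in attrs:
            return attrs[name]
    # Stage 2: match templates against string-typed attribute values.
    for value in attrs.values():
        if not isinstance(value, str):
            continue
        for template in templates:
            match = template.search(value)
            if match:
                return match.group("pron")
    return None


issei = {"http://ja.dbpedia.org/property/ふりがな": "にしきおりかずきよ"}
kei = {COMMENT: "錦織圭(にしこりけい、1989年12月29日-)は"}  # stand-in value
print(determine_pronunciation(issei, DIRECT_ATTRS, TEMPLATES))
print(determine_pronunciation(kei, DIRECT_ATTRS, TEMPLATES))
```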

The above describes in detail how the pronunciation of an entity is determined based on at least one attribute of the corresponding entity whose attribute value contains the pronunciation.

In addition, as described in the entity recognition step S102, the pronunciation of an entity can be determined based on a pronunciation associated with the corresponding entity. Specifically, a separate pronunciation attribute can be created for the corresponding entity, and after the pronunciation has been found in the LOD data set as described above, it can be stored as the attribute value of that separate pronunciation attribute. For example, for the entity "http://ja.dbpedia.org/resource/錦織淳" shown in Figure 2, a separate pronunciation attribute is created and the pronunciation "にしこおりあつし" is stored as its value. For the entity "http://ja.dbpedia.org/resource/錦織一清" shown in Figure 3, a separate pronunciation attribute is created and the pronunciation "にしきおりかずきよ" is stored as its value. For the entity "http://ja.dbpedia.org/resource/錦織圭" shown in Figure 4, a separate pronunciation attribute is created and the pronunciation "にしこりけい" is stored as its value. These pronunciation attributes and their values can be stored locally and can also be published on the network. Then, once an entity has been linked to the corresponding entity in the LOD data set, its pronunciation can be obtained by querying the value of the corresponding entity's pronunciation attribute. For example, if the entity "錦織淳" is recognized in the input text, then after it is linked to the corresponding entity "http://ja.dbpedia.org/resource/錦織淳" in the LOD data set, querying the value of that entity's pronunciation attribute gives the pronunciation "にしこおりあつし" of "錦織淳". Likewise, if the entity "錦織圭" is recognized in the input text, then after it is linked to the corresponding entity "http://ja.dbpedia.org/resource/錦織圭" in the LOD data set, querying the value of that entity's pronunciation attribute gives the pronunciation "にしこりけい" of "錦織圭".
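Storing and querying such a dedicated pronunciation attribute can be sketched with an in-memory dictionary. The attribute URI `http://example.org/property/pronunciation` and both function names are hypothetical; the text does not name the attribute or prescribe a storage format.

```python
# Hypothetical URI for the separate pronunciation attribute.
PRON_ATTR = "http://example.org/property/pronunciation"


def store_pronunciation(dataset, entity_uri, pronunciation):
    """Record a pronunciation that was already found for the entity."""
    dataset.setdefault(entity_uri, {})[PRON_ATTR] = pronunciation


def lookup_pronunciation(dataset, entity_uri):
    """Return the stored pronunciation of a linked entity, or None."""
    return dataset.get(entity_uri, {}).get(PRON_ATTR)


dataset = {}
store_pronunciation(dataset, "http://ja.dbpedia.org/resource/錦織淳", "にしこおりあつし")
store_pronunciation(dataset, "http://ja.dbpedia.org/resource/錦織一清", "にしきおりかずきよ")
store_pronunciation(dataset, "http://ja.dbpedia.org/resource/錦織圭", "にしこりけい")
print(lookup_pronunciation(dataset, "http://ja.dbpedia.org/resource/錦織圭"))
```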

As the above description shows, the entity polyphone disambiguation method according to the embodiment of the present disclosure links the entity whose polyphonic characters are to be pronounced to the corresponding entity in the LOD data set and finds the pronunciation in the relevant attribute values of that corresponding entity. Because each entity in the LOD is unique, the pronunciation obtained is unambiguous, and the pronunciation of polyphonic characters in entities is thereby disambiguated.

Corresponding to the above method embodiments, the present disclosure also provides the following device embodiments.

Figure 5 is a block diagram showing an example of the functional configuration of an entity polyphone disambiguation device 500 according to an embodiment of the present disclosure.

As shown in Figure 5, the entity polyphone disambiguation device 500 according to an embodiment of the present disclosure may include an entity recognition unit 502 and a pronunciation determination unit 504. Examples of the functional configuration of each unit are described next.

The entity recognition unit 502 may recognize, from the input text, at least one entity that includes polyphonic characters.

For the specific method of recognizing entities that include polyphonic characters from the input text, see the description at the corresponding position in the above method embodiments; it is not repeated here.

For each of the at least one entity, the pronunciation determination unit 504 may link the entity to the corresponding entity in a linked open data (LOD) data set, and may determine the pronunciation of the entity based on at least one attribute of the corresponding entity whose attribute value contains the pronunciation and/or on a pronunciation associated with the corresponding entity.

The entities recognized by the entity recognition unit 502 can each be linked to the corresponding entity in the LOD data set using entity linking techniques. The pronunciation can then be determined based on at least one attribute of the corresponding entity whose attribute value contains the pronunciation and/or on a pronunciation associated with the corresponding entity. Because each entity in the LOD is unique, the pronunciation obtained is unambiguous.

Preferably, the at least one attribute may include at least one first predetermined attribute whose attribute value is directly the pronunciation. For entities in the LOD data set, the pronunciation is found in attribute values, and in some cases an attribute value is directly the pronunciation of the entity. However, the attributes that represent the pronunciation often differ from entity to entity, and some general-purpose attributes do not always represent the pronunciation, so these attributes need to be selected.

Preferably, the at least one first predetermined attribute may be obtained as follows: the name of each entity in the LOD data set is obtained; all pronunciations of the entity are listed as candidate pronunciations according to all dictionary pronunciations of each character in the entity's name; if the entity has an attribute whose attribute value exactly matches any of the entity's candidate pronunciations, that attribute is selected as a candidate attribute; and, among all candidate attributes selected for all entities in the LOD data set, at least one candidate attribute whose probability of representing the pronunciation is greater than a predetermined threshold is selected as the at least one first predetermined attribute.

Specifically, to select the attributes that represent the pronunciation, the name of each entity in the LOD data set is first obtained, and all possible pronunciations of each name are listed as candidate pronunciations according to all dictionary pronunciations of each character in the name. The candidate pronunciations of an entity are compared one by one with the entity's attribute values. If a candidate pronunciation matches an attribute value, the corresponding attribute is selected as a candidate attribute, and its value is the pronunciation of the entity. Then, for all candidate attributes selected for all entities in the LOD data set, the probability that each represents the pronunciation is calculated. If this probability is greater than a predetermined threshold, the candidate attribute is retained as a first predetermined attribute. In other words, at least one candidate attribute whose probability of representing the pronunciation is greater than the predetermined threshold is selected as the at least one first predetermined attribute.
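Enumerating candidate pronunciations and picking out candidate attributes can be sketched in Python with `itertools.product`. The reading dictionary below is a toy example with only two characters, and the attribute names and function names are hypothetical; a real system would use a full pronunciation dictionary.

```python
from itertools import product

# Toy per-character reading dictionary (an assumption for illustration).
READINGS = {
    "一": ["いち", "かず"],
    "清": ["きよ", "せい"],
}


def candidate_pronunciations(name, readings):
    """All combinations of the dictionary readings of each character."""
    per_char = [readings.get(ch, [ch]) for ch in name]
    return {"".join(parts) for parts in product(*per_char)}


def candidate_attributes(name, attrs, readings):
    """Attributes whose value exactly matches some candidate pronunciation."""
    candidates = candidate_pronunciations(name, readings)
    return [attr for attr, value in attrs.items() if value in candidates]


attrs = {"prop:ふりがな": "かずきよ", "prop:label": "一清"}
print(sorted(candidate_pronunciations("一清", READINGS)))
print(candidate_attributes("一清", attrs, READINGS))
```

The number of candidates is the product of the per-character reading counts, so a long name with many polyphonic characters can produce a large candidate set.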

Preferably, the probability that a candidate attribute represents the pronunciation may be the ratio of the number of times the candidate attribute's value is a pronunciation to the total number of occurrences of the candidate attribute in the LOD data set.
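The ratio defined above, followed by the threshold test, can be written as a small Python sketch. The counts, attribute names, and threshold value are made up for illustration and do not come from the source.

```python
def pronunciation_probability(hits, totals):
    """hits[a]: times attribute a held a pronunciation;
    totals[a]: total occurrences of attribute a in the data set."""
    return {a: hits.get(a, 0) / n for a, n in totals.items() if n > 0}


def select_first_predetermined(hits, totals, threshold):
    """Candidate attributes whose probability exceeds the threshold."""
    probs = pronunciation_probability(hits, totals)
    return sorted(a for a, p in probs.items() if p > threshold)


# Hypothetical counts for two candidate attributes.
hits = {"prop:ふりがな": 9, "foaf:name": 2}
totals = {"prop:ふりがな": 10, "foaf:name": 10}
print(select_first_predetermined(hits, totals, 0.5))
```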

Preferably, the above predetermined threshold can be determined by those skilled in the art based on experience or experiments.

For a specific example of obtaining the at least one first predetermined attribute, see the description at the corresponding position in the above method embodiments; it is not repeated here.

Preferably, the at least one attribute may further include at least one second predetermined attribute whose attribute value contains a pronunciation that can be extracted using at least one pronunciation extraction template. Besides the case where an attribute value is directly the pronunciation of the entity, in some cases the pronunciation is contained within an attribute value, and the position at which it occurs usually follows a regular pattern. In such cases, pronunciation extraction templates can be used to determine the pronunciation of the entity.

Preferably, the at least one pronunciation extraction template may be generated as follows: for each entity in the LOD data set that includes any of the at least one first predetermined attribute, the pronunciation of the entity can be determined from the attribute value of that first predetermined attribute, and the regularity of the position at which this pronunciation occurs in the attribute values of the entity's other attributes that contain the pronunciation can be determined. The at least one pronunciation extraction template is thereby generated automatically from all entities in the LOD data set that include any of the at least one first predetermined attribute.

In the LOD data set, the pronunciation is in some cases not only directly the value of one attribute but also present within other attribute values; that is, those other attribute values also contain the pronunciation of the entity. In other cases, the pronunciation of the entity is not directly an attribute value at all but is only contained within attribute values, at positions that follow a regular pattern. In these cases, pronunciation extraction templates can be generated automatically to match such pronunciations.

For the specific method of generating the pronunciation extraction templates, see the description at the corresponding position in the above method embodiments; it is not repeated here.

Preferably, the attribute values of the at least one first predetermined attribute and of the at least one second predetermined attribute may be string-typed attribute values.

Preferably, for each of the at least one entity, if the attributes of the corresponding entity to which the entity is linked include one of the at least one first predetermined attribute, the attribute value of that first predetermined attribute may be taken as the pronunciation of the entity. If the attributes of the corresponding entity do not include any of the at least one first predetermined attribute, the at least one pronunciation extraction template may be used to determine the pronunciation of the entity.

Specifically, it is checked whether the attributes of the corresponding entity to which the entity is linked include any first predetermined attribute. If so, the corresponding entity has an attribute whose value is directly the pronunciation, and the attribute value of that first predetermined attribute can be taken as the pronunciation of the entity. If the attributes of the corresponding entity do not include any first predetermined attribute, the corresponding entity has no attribute whose value is directly the pronunciation, and the pronunciation extraction templates can be used to determine the pronunciation of the entity.

Preferably, using the at least one pronunciation extraction template to determine the pronunciation of one of the at least one entity may include: matching the at least one pronunciation extraction template against the string-typed attribute value of at least one attribute of the corresponding entity to which that entity is linked, and taking the matched string as the pronunciation of that entity.

Specifically, if the attributes of the corresponding entity to which the entity is linked do not include any first predetermined attribute, the pronunciation extraction templates can be matched against the string-typed attribute values of the corresponding entity, and the matched string can be taken as the pronunciation of the entity.

For a specific example of obtaining the pronunciation of an entity based on the attributes of the corresponding entity to which it is linked, see the description at the corresponding position in the above method embodiments; it is not repeated here.

In addition, as described for the entity recognition unit 502, the pronunciation of an entity can also be determined based on a pronunciation associated with the corresponding entity. Specifically, a separate pronunciation attribute can be created for the corresponding entity, and after the pronunciation has been found in the LOD data set, it can be stored as the attribute value of that separate pronunciation attribute. If an entity with polyphonic characters is recognized in the input text, then after the entity is linked to the corresponding entity in the LOD data set, its pronunciation can be obtained by querying the value of the corresponding entity's pronunciation attribute.

For a specific example of determining the pronunciation of an entity based on the pronunciation associated with the corresponding entity, see the description at the corresponding position in the above method embodiments; it is not repeated here.

As the above description shows, the entity polyphone disambiguation device according to the embodiment of the present disclosure links the entity whose polyphonic characters are to be pronounced to the corresponding entity in the LOD data set and finds the pronunciation in the relevant attribute values of that corresponding entity. Because each entity in the LOD is unique, the pronunciation obtained is unambiguous, and the pronunciation of polyphonic characters in entities is thereby disambiguated.

It should be noted that although the functional configuration of the entity polyphone disambiguation device according to the embodiment of the present disclosure has been described above, this is only an example and not a limitation. Those skilled in the art can modify the above embodiments according to the principles of the present disclosure, for example by adding, deleting, or combining the functional modules in the embodiments, and all such modifications fall within the scope of the present disclosure.

It should also be noted that the device embodiments here correspond to the above method embodiments. For content not described in detail in the device embodiments, see the description at the corresponding position in the method embodiments; it is not repeated here.

It should be understood that the machine-executable instructions in the storage medium and the program product according to the embodiments of the present disclosure can also be configured to execute the above entity polyphone disambiguation method. For content not described in detail here, see the earlier description at the corresponding position; it is not repeated here.

Accordingly, a storage medium carrying the above program product that includes machine-executable instructions is also included in the disclosure of the present invention. Such storage media include, but are not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.

It should also be noted that the above series of processes and devices can be implemented by software and/or firmware. In that case, the program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware configuration, for example the general-purpose personal computer 600 shown in Figure 6, and the computer can execute various functions once the various programs are installed.

In Figure 6, a central processing unit (CPU) 601 executes various processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as needed, the data required when the CPU 601 executes the various processes.

The CPU 601, the ROM 602, and the RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.

The following components are connected to the input/output interface 605: an input section 606, including a keyboard, a mouse, and the like; an output section 607, including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; the storage section 608, including a hard disk and the like; and a communication section 609, including a network interface card such as a LAN card, a modem, and the like. The communication section 609 performs communication processing via a network such as the Internet.

A drive 610 is also connected to the input/output interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it is installed into the storage section 608 as needed.

In the case where the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 611.

Those skilled in the art should understand that such a storage medium is not limited to the removable medium 611 shown in FIG. 6, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 611 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608, or the like, which stores the program and is distributed to the user together with the device containing it.

The preferred embodiments of the present disclosure have been described above with reference to the accompanying drawings, but the present disclosure is of course not limited to the above examples. Those skilled in the art may arrive at various alterations and modifications within the scope of the appended claims, and it should be understood that these naturally fall within the technical scope of the present disclosure.

For example, a plurality of functions included in one unit in the above embodiments may be implemented by separate devices. Alternatively, a plurality of functions implemented by a plurality of units in the above embodiments may each be implemented by separate devices. In addition, one of the above functions may be implemented by a plurality of units. Needless to say, such configurations are included in the technical scope of the present disclosure.

In this specification, the steps described in the flowcharts include not only processes performed in time series in the stated order, but also processes performed in parallel or individually and not necessarily in time series. Furthermore, even for steps processed in time series, the order may of course be changed as appropriate.

In addition, the technology according to the present disclosure may also be configured as follows.

Supplementary Note 1. An entity polyphone disambiguation method, comprising:

an entity recognition step of recognizing, from input text, at least one entity that includes a polyphonic character; and

a pronunciation determination step of, for each entity of the at least one entity, linking the entity to a corresponding entity in a data set of Linked Open Data (LOD), and determining the pronunciation of the entity based on at least one attribute of the corresponding entity whose attribute value contains a pronunciation and/or a pronunciation associated with the corresponding entity.
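
The two steps above can be sketched roughly as follows. This is only an illustration under strong assumptions: a tiny in-memory dictionary stands in for the LOD data set, linking is done by exact name match, and the names `POLYPHONIC_CHARS`, `TOY_LOD`, `recognize_entities`, and `determine_pronunciation` are hypothetical rather than taken from the disclosure. Real entity recognition and entity linking would be considerably more involved.

```python
# Hypothetical toy data: characters with more than one reading, and a tiny
# stand-in for an LOD data set keyed by entity name.
POLYPHONIC_CHARS = {"重", "乐", "长"}
TOY_LOD = {
    "重庆": {"pinyin": "chong qing", "type": "city"},
    "乐山": {"pinyin": "le shan", "type": "city"},
}


def recognize_entities(text, known_entities):
    """Return known entity names that occur in the text and contain a polyphonic character."""
    return [
        name
        for name in known_entities
        if name in text and any(ch in POLYPHONIC_CHARS for ch in name)
    ]


def determine_pronunciation(entity, lod, pronunciation_attr="pinyin"):
    """Link the entity to the LOD entry of the same name and read its pronunciation attribute."""
    linked = lod.get(entity)  # naive linking by exact name match (an assumption)
    if linked is None:
        return None
    return linked.get(pronunciation_attr)


text = "我从重庆出发去乐山。"
entities = recognize_entities(text, TOY_LOD)
result = {e: determine_pronunciation(e, TOY_LOD) for e in entities}
```

Here `result` maps each recognized entity to the pronunciation stored for its linked entry, or to `None` when no entry or attribute is found.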

Supplementary Note 2. The entity polyphone disambiguation method according to Supplementary Note 1, wherein the at least one attribute includes at least one first predetermined attribute whose attribute value is directly a pronunciation.

Supplementary Note 3. The entity polyphone disambiguation method according to Supplementary Note 2, wherein the at least one attribute further includes at least one second predetermined attribute whose attribute value contains a pronunciation that can be extracted using at least one pronunciation extraction template.

Supplementary Note 4. The entity polyphone disambiguation method according to Supplementary Note 2, wherein the at least one first predetermined attribute is obtained by:

obtaining the name of each entity in the data set of the LOD;

listing, as candidate pronunciations, all pronunciations of the entity according to all dictionary pronunciations of each character in the name of the entity;

if the attributes of the entity include an attribute whose attribute value exactly matches any one of the candidate pronunciations of the entity, selecting that attribute as a candidate attribute; and

selecting, from among all candidate attributes selected for all entities in the data set of the LOD, at least one candidate attribute whose probability of representing a pronunciation is greater than a predetermined threshold as the at least one first predetermined attribute.
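
A minimal sketch of this selection procedure is given below. The per-character reading dictionary `CHAR_READINGS`, the sample `DATASET`, the attribute names, and the threshold value are all hypothetical; tone marks are omitted and candidate pronunciations are joined with spaces purely for simplicity.

```python
from collections import Counter
from itertools import product

# Hypothetical per-character reading dictionary (tone marks omitted).
CHAR_READINGS = {
    "重": ["zhong", "chong"],
    "庆": ["qing"],
    "乐": ["le", "yue"],
    "山": ["shan"],
}

# Hypothetical LOD-like data set: entity name -> {attribute: string value}.
DATASET = {
    "重庆": {"pinyin": "chong qing", "label": "重庆", "alias": "chong qing"},
    "乐山": {"pinyin": "le shan", "label": "乐山", "alias": "Leshan City"},
}


def candidate_pronunciations(name):
    """All combinations of the dictionary readings of each character in the name."""
    readings = [CHAR_READINGS.get(ch, [ch]) for ch in name]
    return {" ".join(combo) for combo in product(*readings)}


def select_first_predetermined_attributes(dataset, threshold):
    hits = Counter()    # times an attribute value exactly matched a candidate pronunciation
    totals = Counter()  # times the attribute occurred in the data set
    for name, attrs in dataset.items():
        candidates = candidate_pronunciations(name)
        for attr, value in attrs.items():
            totals[attr] += 1
            if value in candidates:
                hits[attr] += 1
    # Only attributes that matched at least once are candidate attributes.
    return {attr for attr in hits if hits[attr] / totals[attr] > threshold}


selected = select_first_predetermined_attributes(DATASET, threshold=0.8)
```

In this toy data, `alias` matches a candidate pronunciation for only one of its two occurrences, so it becomes a candidate attribute but falls below the assumed threshold of 0.8, whereas `pinyin` matches every time.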

Supplementary Note 5. The entity polyphone disambiguation method according to Supplementary Note 4, wherein the probability that a candidate attribute represents a pronunciation is the ratio of the number of times the attribute value of the candidate attribute is a pronunciation to the total number of occurrences of the candidate attribute in the data set of the LOD.
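
Expressed as a formula, with notation introduced here purely for illustration (the symbols $a$, $N_{\mathrm{pron}}$, $N_{\mathrm{total}}$, and $\theta$ do not appear in the original text), the criterion described above can be written as:

```latex
P(a) = \frac{N_{\mathrm{pron}}(a)}{N_{\mathrm{total}}(a)},
\qquad
a \text{ is selected as a first predetermined attribute if } P(a) > \theta
```

Here $N_{\mathrm{pron}}(a)$ is the number of times the value of candidate attribute $a$ is a pronunciation, $N_{\mathrm{total}}(a)$ is the total number of occurrences of $a$ in the data set, and $\theta$ is the predetermined threshold. For instance, with the purely hypothetical counts $N_{\mathrm{pron}}(a) = 190$ and $N_{\mathrm{total}}(a) = 200$, $P(a) = 0.95$, which would pass a threshold of $\theta = 0.9$.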

Supplementary Note 6. The entity polyphone disambiguation method according to Supplementary Note 3, wherein the at least one pronunciation extraction template is generated by:

for each entity in the data set of the LOD that includes any first predetermined attribute of the at least one first predetermined attribute, determining the pronunciation of the entity according to the attribute value of that first predetermined attribute of the entity;

determining the regularity of the positions at which the pronunciation appears in the attribute values of other attributes of the entity that contain the pronunciation; and

automatically generating the at least one pronunciation extraction template according to all entities in the data set of the LOD that include any first predetermined attribute of the at least one first predetermined attribute.
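
One possible way to approximate this template generation is sketched below. The text does not specify how the positional regularity is measured, so the approach here is an assumption: the few characters immediately before the known pronunciation and the single character after it are treated as a context pair, and pairs seen at least `min_support` times become regular-expression templates. The sample data, the attribute names, and the parameters `context` and `min_support` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical data: "pinyin" plays the role of a first predetermined attribute.
DATASET = {
    "重庆": {"pinyin": "chong qing", "abstract": "重庆(拼音:chong qing)是中国的直辖市。"},
    "乐山": {"pinyin": "le shan", "abstract": "乐山(拼音:le shan)位于四川省。"},
    "长沙": {"pinyin": "chang sha", "abstract": "长沙,读作chang sha,是湖南省省会。"},
}


def generate_templates(dataset, first_attr, context=4, min_support=2):
    """Derive regex templates from the text surrounding known pronunciations."""
    support = Counter()
    for attrs in dataset.values():
        pron = attrs.get(first_attr)
        if not pron:
            continue
        for attr, value in attrs.items():
            if attr == first_attr:
                continue
            pos = value.find(pron)
            if pos < 0:
                continue
            left = value[max(0, pos - context):pos]
            right = value[pos + len(pron):pos + len(pron) + 1]
            if left and right:
                support[(left, right)] += 1
    # Keep only context pairs that recur often enough to count as a regularity.
    return [
        re.compile(re.escape(left) + r"(.+?)" + re.escape(right))
        for (left, right), count in support.items()
        if count >= min_support
    ]


templates = generate_templates(DATASET, "pinyin")
```

With this data, only the parenthesized context shared by two entities meets the support threshold, so a single template is produced; the context seen once for the third entity is discarded.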

Supplementary Note 7. The entity polyphone disambiguation method according to Supplementary Note 3, wherein the attribute value of the at least one first predetermined attribute and the attribute value of the at least one second predetermined attribute are attribute values of a character string type.

Supplementary Note 8. The entity polyphone disambiguation method according to Supplementary Note 3, wherein, for each entity of the at least one entity:

if the attributes of the corresponding entity to which the entity is linked include one first predetermined attribute of the at least one first predetermined attribute, the attribute value of that first predetermined attribute is used as the pronunciation of the entity; and

if the attributes of the corresponding entity to which the entity is linked do not include any first predetermined attribute of the at least one first predetermined attribute, the at least one pronunciation extraction template is used to determine the pronunciation of the entity.
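
The two branches above amount to a simple preference order, which might be sketched as follows. The function names, the attribute name `pinyin`, and the stand-in extractor are hypothetical; the extractor is passed in as a parameter so that any template-based implementation could be substituted.

```python
def choose_pronunciation(linked_attrs, first_attrs, template_extractor):
    """Prefer a first predetermined attribute; otherwise fall back to templates."""
    for attr in first_attrs:
        if attr in linked_attrs:
            return linked_attrs[attr], "first_predetermined_attribute"
    return template_extractor(linked_attrs), "template"


def toy_extractor(attrs):
    # Stand-in for template-based extraction: the text between "拼音:" and ")".
    text = attrs.get("abstract", "")
    if "拼音:" not in text:
        return None
    return text.split("拼音:", 1)[1].split(")", 1)[0]


with_attr = choose_pronunciation({"pinyin": "chong qing"}, ["pinyin"], toy_extractor)
without_attr = choose_pronunciation(
    {"abstract": "乐山(拼音:le shan)位于四川省。"}, ["pinyin"], toy_extractor
)
```

Returning the source of the pronunciation alongside the value is a design choice made for this sketch only; it makes it easy to see which branch was taken.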

Supplementary Note 9. The entity polyphone disambiguation method according to Supplementary Note 8, wherein using the at least one pronunciation extraction template to determine the pronunciation of one entity of the at least one entity includes:

matching the at least one pronunciation extraction template against the string-type attribute values of at least one attribute of the corresponding entity to which the one entity is linked, and using the matched character string as the pronunciation of the one entity.
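
As an illustration of this matching step, the sketch below represents each pronunciation extraction template as a regular expression whose first capture group is the pronunciation. Representing templates as regular expressions is an assumption made here, and the two patterns in `TEMPLATES` are hypothetical examples rather than templates from the disclosure.

```python
import re

# Hypothetical pronunciation extraction templates; group 1 captures the pronunciation.
TEMPLATES = [
    re.compile(r"\(拼音:(.+?)\)"),
    re.compile(r"读作(.+?),"),
]


def extract_pronunciation(linked_attrs, templates=TEMPLATES):
    """Return the first string matched by any template in a string-typed attribute value."""
    for value in linked_attrs.values():
        if not isinstance(value, str):
            continue  # only string-typed attribute values are matched
        for template in templates:
            match = template.search(value)
            if match:
                return match.group(1).strip()
    return None


first = extract_pronunciation({"population": 3000000, "abstract": "乐山(拼音:le shan)位于四川省。"})
second = extract_pronunciation({"abstract": "长沙,读作chang sha,是湖南省省会。"})
```

Non-string values such as the numeric `population` above are skipped, and `None` is returned when no template matches.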

Supplementary Note 10. An entity polyphone disambiguation device, comprising:

an entity recognition unit configured to recognize, from input text, at least one entity that includes a polyphonic character; and

a pronunciation determination unit configured to, for each entity of the at least one entity, link the entity to a corresponding entity in a data set of Linked Open Data (LOD), and determine the pronunciation of the entity based on at least one attribute of the corresponding entity whose attribute value contains a pronunciation and/or a pronunciation associated with the corresponding entity.

Supplementary Note 11. The entity polyphone disambiguation device according to Supplementary Note 10, wherein the at least one attribute includes at least one first predetermined attribute whose attribute value is directly a pronunciation.

Supplementary Note 12. The entity polyphone disambiguation device according to Supplementary Note 11, wherein the at least one attribute further includes at least one second predetermined attribute whose attribute value contains a pronunciation that can be extracted using at least one pronunciation extraction template.

Supplementary Note 13. The entity polyphone disambiguation device according to Supplementary Note 11, wherein the at least one first predetermined attribute is obtained by:

obtaining the name of each entity in the data set of the LOD;

listing, as candidate pronunciations, all pronunciations of the entity according to all dictionary pronunciations of each character in the name of the entity;

if the attributes of the entity include an attribute whose attribute value exactly matches any one of the candidate pronunciations of the entity, selecting that attribute as a candidate attribute; and

selecting, from among all candidate attributes selected for all entities in the data set of the LOD, at least one candidate attribute whose probability of representing a pronunciation is greater than a predetermined threshold as the at least one first predetermined attribute.

Supplementary Note 14. The entity polyphone disambiguation device according to Supplementary Note 13, wherein the probability that a candidate attribute represents a pronunciation is the ratio of the number of times the attribute value of the candidate attribute is a pronunciation to the total number of occurrences of the candidate attribute in the data set of the LOD.

Supplementary Note 15. The entity polyphone disambiguation device according to Supplementary Note 12, wherein the at least one pronunciation extraction template is generated by:

for each entity in the data set of the LOD that includes any first predetermined attribute of the at least one first predetermined attribute, determining the pronunciation of the entity according to the attribute value of that first predetermined attribute of the entity;

determining the regularity of the positions at which the pronunciation appears in the attribute values of other attributes of the entity that contain the pronunciation; and

automatically generating the at least one pronunciation extraction template according to all entities in the data set of the LOD that include any first predetermined attribute of the at least one first predetermined attribute.

Supplementary Note 16. The entity polyphone disambiguation device according to Supplementary Note 12, wherein the attribute value of the at least one first predetermined attribute and the attribute value of the at least one second predetermined attribute are attribute values of a character string type.

Supplementary Note 17. The entity polyphone disambiguation device according to Supplementary Note 12, wherein, for each entity of the at least one entity:

if the attributes of the corresponding entity to which the entity is linked include one first predetermined attribute of the at least one first predetermined attribute, the attribute value of that first predetermined attribute is used as the pronunciation of the entity; and

if the attributes of the corresponding entity to which the entity is linked do not include any first predetermined attribute of the at least one first predetermined attribute, the at least one pronunciation extraction template is used to determine the pronunciation of the entity.

Supplementary Note 18. The entity polyphone disambiguation device according to Supplementary Note 17, wherein using the at least one pronunciation extraction template to determine the pronunciation of one entity of the at least one entity includes:

matching the at least one pronunciation extraction template against the string-type attribute values of at least one attribute of the corresponding entity to which the one entity is linked, and using the matched character string as the pronunciation of the one entity.

Claims (10)

1. An entity polyphone disambiguation method, comprising:
an entity recognition step of recognizing, from input text, at least one entity that includes a polyphonic character; and
a pronunciation determination step of, for each entity of the at least one entity, linking the entity to a corresponding entity in a data set of Linked Open Data (LOD), and determining the pronunciation of the entity based on at least one attribute of the corresponding entity whose attribute value contains a pronunciation and/or a pronunciation associated with the corresponding entity.
2. The entity polyphone disambiguation method according to claim 1, wherein the at least one attribute includes at least one first predetermined attribute whose attribute value is directly a pronunciation.
3. The entity polyphone disambiguation method according to claim 2, wherein the at least one attribute further includes at least one second predetermined attribute whose attribute value contains a pronunciation that can be extracted using at least one pronunciation extraction template.
4. The entity polyphone disambiguation method according to claim 2, wherein the at least one first predetermined attribute is obtained by:
obtaining the name of each entity in the data set of the LOD;
listing, as candidate pronunciations, all pronunciations of the entity according to all dictionary pronunciations of each character in the name of the entity;
if the attributes of the entity include an attribute whose attribute value exactly matches any one of the candidate pronunciations of the entity, selecting that attribute as a candidate attribute; and
selecting, from among all candidate attributes selected for all entities in the data set of the LOD, at least one candidate attribute whose probability of representing a pronunciation is greater than a predetermined threshold as the at least one first predetermined attribute.
5. The entity polyphone disambiguation method according to claim 4, wherein the probability that a candidate attribute represents a pronunciation is the ratio of the number of times the attribute value of the candidate attribute is a pronunciation to the total number of occurrences of the candidate attribute in the data set of the LOD.
6. The entity polyphone disambiguation method according to claim 3, wherein the at least one pronunciation extraction template is generated by:
for each entity in the data set of the LOD that includes any first predetermined attribute of the at least one first predetermined attribute, determining the pronunciation of the entity according to the attribute value of that first predetermined attribute of the entity;
determining the regularity of the positions at which the pronunciation appears in the attribute values of other attributes of the entity that contain the pronunciation; and
automatically generating the at least one pronunciation extraction template according to all entities in the data set of the LOD that include any first predetermined attribute of the at least one first predetermined attribute.
7. The entity polyphone disambiguation method according to claim 3, wherein the attribute value of the at least one first predetermined attribute and the attribute value of the at least one second predetermined attribute are attribute values of a character string type.
8. The entity polyphone disambiguation method according to claim 3, wherein, for each entity of the at least one entity:
if the attributes of the corresponding entity to which the entity is linked include one first predetermined attribute of the at least one first predetermined attribute, the attribute value of that first predetermined attribute is used as the pronunciation of the entity; and
if the attributes of the corresponding entity to which the entity is linked do not include any first predetermined attribute of the at least one first predetermined attribute, the at least one pronunciation extraction template is used to determine the pronunciation of the entity.
9. The entity polyphone disambiguation method according to claim 8, wherein using the at least one pronunciation extraction template to determine the pronunciation of one entity of the at least one entity includes:
matching the at least one pronunciation extraction template against the string-type attribute values of at least one attribute of the corresponding entity to which the one entity is linked, and using the matched character string as the pronunciation of the one entity.
10. An entity polyphone disambiguation device, comprising:
an entity recognition unit configured to recognize, from input text, at least one entity that includes a polyphonic character; and
a pronunciation determination unit configured to, for each entity of the at least one entity, link the entity to a corresponding entity in a data set of Linked Open Data, and determine the pronunciation of the entity based on at least one attribute of the corresponding entity whose attribute value contains a pronunciation and/or a pronunciation associated with the corresponding entity.
CN201610342051.1A 2016-05-20 2016-05-20 Entity polyphone disambiguation method and entity polyphone disambiguation equipment Pending CN107402933A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610342051.1A CN107402933A (en) 2016-05-20 2016-05-20 Entity polyphone disambiguation method and entity polyphone disambiguation equipment
JP2017100185A JP2017208097A (en) 2016-05-20 2017-05-19 Ambiguity avoidance method of polyphonic entity and ambiguity avoidance device of polyphonic entity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610342051.1A CN107402933A (en) 2016-05-20 2016-05-20 Entity polyphone disambiguation method and entity polyphone disambiguation equipment

Publications (1)

Publication Number Publication Date
CN107402933A true CN107402933A (en) 2017-11-28

Family

ID=60388995

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610342051.1A Pending CN107402933A (en) 2016-05-20 2016-05-20 Entity polyphone disambiguation method and entity polyphone disambiguation equipment

Country Status (2)

Country Link
JP (1) JP2017208097A (en)
CN (1) CN107402933A (en)

Also Published As

Publication number Publication date
JP2017208097A (en) 2017-11-24

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20171128