CN114328600A - Method, device, equipment and storage medium for determining standard data element - Google Patents
- Publication number
- CN114328600A (CN202111674685.4A)
- Authority
- CN
- China
- Prior art keywords
- data element
- standard data
- standard
- candidate
- target
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application provides a method, apparatus, device and storage medium for determining a standard data element. The method includes: obtaining at least one piece of feature information of a target data element; based on each piece of feature information of the target data element, determining, from a standard data element library, a first data element set matching the target data element, and determining, from a historical benchmarking record library, a second data element set corresponding to the target data element; determining a recommendation score for each second standard data element in the second data element set according to its standardization count; and, combining the matching degree of each first standard data element in the first data element set with the recommendation score of each second standard data element in the second data element set, determining, from the first data element set and the second data element set, at least one first candidate standard data element for standardizing the target data element. The solution of the present application can improve the accuracy of standard data element matching.
Description
Technical Field
The present application relates to the technical field of data processing, and more particularly to a method, apparatus, device and storage medium for determining a standard data element.
Background
With the continuing development of informatization and digitization, enterprises can organize their data on the basis of a data middle platform.
Data elements need to be standardized during the data organization stage. Standardizing a data element means converting it into a standard data element that conforms to a standard specification such as an industry standard or a national standard. During standardization it is therefore necessary to determine the standard data element that matches a given data element. At present it is difficult to accurately match, from a standard data element library, the standard data element suited to a data element, so the matching accuracy for standard data elements is low.
Summary of the Invention
In view of the above problem, the present application provides a method, apparatus, device and storage medium for determining a standard data element, so as to improve the accuracy of standard data element matching. The specific solutions are as follows:
In a first aspect of the present application, a method for determining a standard data element is provided, including:
obtaining at least one piece of feature information of a target data element to be standardized;
based on each piece of feature information of the target data element, determining, from a standard data element library, a first data element set matching the target data element, the first data element set including: each first standard data element in the standard data element library that matches the target data element, and the matching degree between that first standard data element and the target data element;
based on each piece of feature information of the target data element, determining, from a historical benchmarking record library, a second data element set corresponding to the target data element, wherein the historical benchmarking record library stores: feature information of non-standard data elements that have been standardized in the past, at least one standard data element into which each non-standard data element was standardized, and the standardization count, that is, the number of times the non-standard data element was standardized into each of those standard data elements; and the second data element set includes: each second standard data element into which the target data element was standardized in the past, and the standardization count corresponding to that second standard data element;
determining a recommendation score for the second standard data element according to the standardization count of the second standard data element in the second data element set, wherein the higher the standardization count of a second standard data element, the higher its recommendation score, and the recommendation score indicates how suitable the second standard data element is as the standard data element of the target data element;
combining the matching degree of each first standard data element in the first data element set with the recommendation score of each second standard data element in the second data element set, determining, from the first data element set and the second data element set, at least one first candidate standard data element for standardizing the target data element.
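The text only requires that a second standard data element with a higher standardization count receives a higher recommendation score; it does not fix a formula. The following is a minimal sketch of one possible mapping, assuming the score is the element's share of all recorded standardizations. The function name `recommendation_scores` and the identifiers `std_1` and `std_2` are hypothetical.

```python
def recommendation_scores(counts: dict[str, int]) -> dict[str, float]:
    """Map each second standard data element to a recommendation score.

    Assumption (not from the source): the score is the element's share of
    the total standardization count, so a larger count always yields a
    larger score.
    """
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


# Hypothetical history: the target was standardized to std_1 twice, std_2 five times.
scores = recommendation_scores({"std_1": 2, "std_2": 5})
print(scores)
```

Any other monotonically increasing mapping (for example a logarithm of the count) would satisfy the same requirement.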
In one possible implementation, combining the matching degree of each first standard data element in the first data element set with the recommendation score of each second standard data element in the second data element set, and determining, from the first data element set and the second data element set, at least one first candidate standard data element for standardizing the target data element, includes:
combining the matching degree of each first standard data element in the first data element set with the recommendation score of each second standard data element in the second data element set to determine a first comprehensive score for each standard data element in the first data element set and the second data element set;
generating a first candidate standardization list for standardizing the target data element, the first candidate standardization list including: at least one first candidate standard data element with a relatively high first comprehensive score, determined from the first data element set and the second data element set.
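As an illustration of this implementation, the sketch below merges the two sets into one ranked candidate list. How the matching degree and recommendation score are combined is not specified in this passage, so the weighted sum, the weights `w_match` and `w_history`, the cutoff `top_k`, and the element names are all assumptions made for the example.

```python
def first_candidate_list(
    match_degree: dict[str, float],
    rec_score: dict[str, float],
    w_match: float = 0.5,    # hypothetical weight for the library match
    w_history: float = 0.5,  # hypothetical weight for the history score
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Rank every element of the first and second sets by a combined score.

    Assumption: an element missing from one set contributes 0 for that part.
    """
    names = set(match_degree) | set(rec_score)
    combined = {
        n: w_match * match_degree.get(n, 0.0) + w_history * rec_score.get(n, 0.0)
        for n in names
    }
    # Highest score first; ties are broken by name for a stable order.
    ranked = sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


candidates = first_candidate_list({"A": 0.9, "B": 0.6}, {"B": 0.8, "C": 0.2}, top_k=2)
print(candidates)
```

In this example element B appears in both sets, so it outranks A even though A has the higher matching degree on its own.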
In another possible implementation, the first candidate standardization list further includes: the first comprehensive score of each first candidate standard data element;
the method further includes:
based on the at least one piece of feature information of the target data element, constructing a feature word-segment set of the target data element, the feature word-segment set including at least one feature word segment obtained by segmenting the at least one piece of feature information;
based on the feature word-segment set, determining a third data element set in the standard data element library that is similar to the target data element, the third data element set including: at least one third standard data element whose feature information set has a relatively high similarity to the feature word-segment set, and the first similarity between the feature information set of that third standard data element and the feature word-segment set; the feature information set of a standard data element includes every piece of feature information of that standard data element;
based on the first similarity corresponding to each third standard data element and the first comprehensive score of each first candidate standard data element in the first candidate standardization list, determining, from the third data element set and the first candidate standardization list, at least one target standard data element for standardizing the target data element.
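The passage above does not say how feature information is segmented or how the first similarity is computed. The sketch below is one way to picture it, assuming a simple split on whitespace, underscores and hyphens in place of a real word segmenter, and Jaccard overlap as the similarity. `GMSFHM` is the English code cited in the description; `XM` and all feature strings are hypothetical.

```python
import re


def feature_tokens(features: list[str]) -> set[str]:
    """Split feature strings into lowercase word segments.

    Assumption: a plain regex split stands in for a real segmenter.
    """
    tokens: set[str] = set()
    for text in features:
        tokens.update(t.lower() for t in re.split(r"[\s_\-]+", text) if t)
    return tokens


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap, used here as an assumed stand-in for the first similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def third_set(
    target_features: list[str], library: dict[str, list[str]], top_k: int = 2
) -> list[tuple[str, float]]:
    """Return the top_k library elements most similar to the target's word segments."""
    target = feature_tokens(target_features)
    scored = [(name, jaccard(target, feature_tokens(f))) for name, f in library.items()]
    scored.sort(key=lambda kv: (-kv[1], kv[0]))
    return scored[:top_k]


library = {
    "GMSFHM": ["citizen id number", "string"],
    "XM": ["person name", "string"],  # hypothetical entry
}
print(third_set(["id_number", "string"], library))
```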
In a second aspect of the present application, an apparatus for determining a standard data element is provided, including:
an information obtaining unit, configured to obtain at least one piece of feature information of a target data element to be standardized;
a first set determining unit, configured to determine, based on each piece of feature information of the target data element, a first data element set matching the target data element from a standard data element library, the first data element set including: each first standard data element in the standard data element library that matches the target data element, and the matching degree between that first standard data element and the target data element;
a second set determining unit, configured to determine, based on each piece of feature information of the target data element, a second data element set corresponding to the target data element from a historical benchmarking record library, wherein the historical benchmarking record library stores: feature information of non-standard data elements that have been standardized in the past, at least one standard data element into which each non-standard data element was standardized, and the standardization count, that is, the number of times the non-standard data element was standardized into each of those standard data elements; and the second data element set includes: each second standard data element into which the target data element was standardized in the past, and the standardization count corresponding to that second standard data element;
a recommendation scoring unit, configured to determine a recommendation score for the second standard data element according to the standardization count of the second standard data element in the second data element set, wherein the higher the standardization count of a second standard data element, the higher its recommendation score, and the recommendation score indicates how suitable the second standard data element is as the standard data element of the target data element;
a first data element determining unit, configured to combine the matching degree of each first standard data element in the first data element set with the recommendation score of each second standard data element in the second data element set, and to determine, from the first data element set and the second data element set, at least one first candidate standard data element for standardizing the target data element.
In a third aspect of the present application, a computer device is provided, including: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the method for determining a standard data element in any embodiment of the present application.
In a fourth aspect of the present application, a storage medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for determining a standard data element in any embodiment of the present application.
With the above technical solution, the present application not only searches the standard data element library for first standard data elements that match the target data element, but also uses the feature information of the target data element to determine each second standard data element into which the target data element was standardized in the past, together with the number of times it was standardized into each of them. Because the standardization count of a second standard data element reflects how many times the target data element has historically been standardized into that element, the recommendation score derived from that count reflects how suitable the second standard data element is for standardizing the target data element. On this basis, the present application combines the matching degree between each first standard data element and the target data element with the recommendation score of each second standard data element to determine the standard data elements for standardizing the target data element. This takes into account both how well the feature information of the target data element matches the standard data elements and how the target data element has been standardized in the past, so the standard data element suited to standardizing the target data element can be determined, and therefore recommended, more accurately.
Description of Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the present application. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a schematic flowchart of a method for determining a standard data element provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a storage structure in which the historical benchmarking record library stores the standard data elements into which a non-standard data element was historically standardized, provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for determining a standard data element provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of one implementation of constructing a feature word-segment set based on at least one piece of feature information of a target data element, provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of one implementation of determining an overall score of a candidate standard data element, provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of an implementation principle framework of the method for determining a standard data element provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for determining a standard data element provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of the architecture of a computer device to which the solution of an embodiment of the present application is applicable.
Detailed Description
Different industries, or different fields within the same industry, may name the same data element differently. When data is to be managed in a unified way using technologies such as a data middle platform, the data elements must be standardized uniformly. This requires determining at least one standard data element that can be used to standardize a given data element, so that the user can select one of them as the standard data element for that data element.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
Through research on the matching process for standard data elements, the inventors of the present application found that, for a data element to be standardized, its feature information can be used to match a standard data element from the standard data element library. However, this approach can only match a standard data element that is strongly related to the feature information of the data element; if the feature information of the data element to be standardized consists of common, widely used words, it is difficult to accurately match a suitable standard data element.
To match standard data elements more accurately and to reduce the cases in which no standard data element can be matched, the present application, in addition to matching standard data elements from the standard data element library based on the feature information of the target data element, also uses that feature information to look up, in a pre-built historical benchmarking record library, the standard data elements into which the target data element was standardized in the past. The results from the standard data element library and from the historical benchmarking record library are combined to determine the standard data elements for standardizing the target data element. Because matching results from several dimensions are considered together, the standard data element suited to standardizing the target data element can be determined more accurately.
The method for determining a standard data element provided by the present application is described below with reference to the flowcharts.
FIG. 1 shows a schematic flowchart of the method for determining a standard data element provided by an embodiment of the present application. The method of this embodiment may include:
S101: obtain at least one piece of feature information of the target data element to be standardized.
A data element is a unit of data maintained by different industries. For example, a data element may be a name, an ID card number, a postal code, and so on.
For ease of distinction, the present application refers to the data element that needs to be standardized as the target data element.
The feature information of the target data element may include the name of the data element, such as one or both of its English code and its Chinese name. The feature information may also be descriptive information that characterizes the data element, such as one or more of its data type, its explanatory description, and the industry to which it belongs.
In the present application, the at least one piece of feature information of the target data element includes at least the name of the data element, such as its English code, and may further include other features of the data element, without limitation.
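The feature information listed in S101 can be pictured as a small record in which only the name is mandatory. The following is an illustrative sketch only; the class name `DataElementFeatures`, its field names, and the sample values are assumptions, not part of the application.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElementFeatures:
    """Feature information of a data element (field names are hypothetical)."""

    english_code: str                   # the name is required, e.g. the English code
    chinese_name: Optional[str] = None  # optional second form of the name
    data_type: Optional[str] = None
    description: Optional[str] = None
    industry: Optional[str] = None

    def present(self) -> dict[str, str]:
        """Return only the features that were actually supplied."""
        return {k: v for k, v in asdict(self).items() if v is not None}


target = DataElementFeatures(english_code="sfzh", data_type="string")
print(target.present())
```

Keeping absent features as `None` lets later matching steps iterate over only the features that were provided.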
S102: based on each piece of feature information of the target data element, determine, from the standard data element library, a first data element set matching the target data element.
The first data element set includes: each first standard data element in the standard data element library that matches the target data element, and the matching degree between that first standard data element and the target data element.
The standard data element library may store the feature information of each of a plurality of standard data elements. The feature information of a standard data element may likewise include identifiers such as its English code and Chinese name, and may also include one or more of its description, the industry to which it belongs, its representation format, its category, and so on.
For example, the standard data element library may record the feature information of each standard data element in the form of Table 1 below.
Table 1
Each row of Table 1 corresponds to the feature information of one standard data element.
Every standard data element has an English code and a Chinese name. For example, the first row of Table 1 records the standard data element whose English code is "GMSFHM" and whose Chinese name is "公民身份证号" (citizen ID number). The standard data element library also holds, for each standard data element, its description, synonyms, data type, the object category it represents, its representation format, and its code length label.
The object category may include: person, road, time, certificate, vehicle, and so on, without limitation.
The representation format is the number of digits used to represent the standard data element; for example, the representation format of a citizen ID number is 18 digits.
Because the representation format is too fine-grained and easily causes confusion when matching data elements to be standardized, the present application also proposes using a code length label to indicate the length of a standard data element. Code length labels fall into two broad classes: fixed code length labels and dynamic code length labels. A fixed code length label indicates that the standard data element has a specific code length (for example an ID card number, license plate number, or mobile phone number), type value length, time field length, and the like. A dynamic code length label indicates that the code length varies within a certain range (for example a person's name, an address, or a description).
Dynamic code length labels can be further divided into narrow codes and wide codes. The dividing threshold can be set from the length distribution of data element samples; for example, a narrow code has a code length of at most 10 (length <= 10), and a wide code has a code length greater than 10 (length > 10).
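The code length label can be illustrated with a small classifier over sample values. This is a sketch under assumptions: the text gives 10 as an example boundary between narrow and wide codes but does not say which statistic of the length distribution is compared with it, so the maximum sample length is used here. The function name and label strings are hypothetical.

```python
def code_length_label(sample_values: list[str], narrow_limit: int = 10) -> str:
    """Classify a data element by the lengths of its sample values.

    Assumptions: all samples sharing one length means a fixed code length;
    otherwise the maximum observed length decides narrow versus wide.
    """
    lengths = {len(v) for v in sample_values}
    if not lengths:
        raise ValueError("at least one sample value is required")
    if len(lengths) == 1:
        return "fixed"
    return "dynamic-narrow" if max(lengths) <= narrow_limit else "dynamic-wide"


print(code_length_label(["a" * 18, "b" * 18]))                    # same length every time
print(code_length_label(["Li Lei", "Han Meimei"]))                # short, varying lengths
print(code_length_label(["12 Main St", "Room 5, 88 Long Road"]))  # longer, varying lengths
```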
It can be understood that, for each piece of feature information of the target data element, standard data elements can be matched separately from the standard data element library, so that the standard data elements matched by each of the at least one piece of feature information are finally obtained. The present application does not limit the specific way in which standard data elements are matched from the standard data element library.
For ease of distinction, the present application refers to a standard data element matched from the standard data element library as a first standard data element.
A first standard data element matching the target data element may be: a standard data element whose feature information has a matching degree with one or more pieces of feature information of the target data element that exceeds a set threshold; or one of a set number of standard data elements whose feature information has the highest matching degree with the feature information of the target data element.
The matching degree between a first standard data element and the target data element is the similarity between the feature information of the first standard data element and the feature information of the target data element.
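The two selection rules described above (matching degree above a set threshold, or a set number of best matches) can be sketched as follows. The application leaves the matching method open, so the use of `difflib.SequenceMatcher` on English codes as the matching degree is purely an assumption for illustration. `GMSFHM` comes from the description; `XM` is a hypothetical code.

```python
from difflib import SequenceMatcher
from typing import Optional


def first_set(
    target_code: str,
    library_codes: list[str],
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> list[tuple[str, float]]:
    """Select first standard data elements together with their matching degree.

    Assumption: the matching degree is a character-level similarity ratio
    between English codes; the source does not prescribe a metric.
    """
    scored = [
        (code, SequenceMatcher(None, target_code.lower(), code.lower()).ratio())
        for code in library_codes
    ]
    scored.sort(key=lambda kv: (-kv[1], kv[0]))
    if threshold is not None:  # rule 1: matching degree exceeds a set threshold
        scored = [(c, s) for c, s in scored if s > threshold]
    if top_n is not None:      # rule 2: a set number of best matches
        scored = scored[:top_n]
    return scored


print(first_set("gmsfhm", ["GMSFHM", "XM"], threshold=0.8))
print(first_set("gmsfhm", ["GMSFHM", "XM"], top_n=1))
```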
S103,基于目标数据元的各特征信息,从历史对标记录库中确定出目标数据元对应的第二数据元集合。S103, based on each feature information of the target data element, determine a second data element set corresponding to the target data element from the historical benchmarking record library.
该历史对标记录库中存储有:历史上被标准化过的非标准数据元的特征信息,非标准数据元历史上被标准化后的至少一个标准数据元,以及,非标准数据元历史上分别被标准化为各标准数据元的标准化次数。The historical benchmarking record library stores: characteristic information of the non-standard data elements that have been standardized in the history, at least one standard data element that has been standardized in the history of the non-standard data elements, and, respectively, the non-standard data elements that have been historically standardized Normalization is the number of normalization times for each standard data element.
A non-standard data element here is a data element obtained in the past that needed to be standardized.
The standard data element to which a non-standard data element was historically standardized may be the standard data element that a user chose in the past as the standardization target for that non-standard data element. Accordingly, the historical benchmarking record library can be built from records of the standard data elements that different users have selected in the past for data elements to be standardized.
It can be understood that, because business scenarios differ, users may select different standard data elements for the same data element. In addition, misoperation and other causes may lead a user to select different standard data elements for the same data element in different standardization runs. Consequently, each non-standard data element in the historical benchmarking record library may correspond to multiple standard data elements, and the standardization count of each of those standard data elements may differ.
For example, the historical benchmarking record library may record, for the English code (or the Chinese name) of each non-standard data element, at least one corresponding standard data element and the standardization count of each such standard data element. In this library, every standard data element associated with a non-standard data element is one that a user has chosen in the past as the standardization target for that non-standard data element.
For example, the historical benchmarking record library may record each non-standard data element and its historical standardization information in the form shown in FIG. 2.
As FIG. 2 shows, a non-standard data element can correspond to one or more standard data elements to which it has been standardized in the past, and the number following each standard data element is the number of times the non-standard data element was standardized to that standard data element. In FIG. 2, for instance, the non-standard data element was standardized to standard data element 1 twice and to standard data element 2 five times.
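The record layout described for FIG. 2 can be pictured as a nested mapping from a non-standard data element to its standard data elements and their counts. The following is a minimal sketch only: the class name `HistoricalBenchmarkLibrary`, its methods, and the English code `usr_nm` are hypothetical, and keying by exact English code is an assumption made for illustration.

```python
from collections import defaultdict


class HistoricalBenchmarkLibrary:
    """Hypothetical in-memory stand-in for the historical benchmarking record library."""

    def __init__(self):
        # English code of a non-standard data element -> {standard data element: count}
        self._records = defaultdict(dict)

    def record_selection(self, english_code, standard_element):
        """Record that a user standardized `english_code` to `standard_element` once."""
        counts = self._records[english_code]
        counts[standard_element] = counts.get(standard_element, 0) + 1

    def lookup(self, english_code):
        """Return the standard data elements and counts for a code (empty if unknown)."""
        return dict(self._records.get(english_code, {}))


# Reproduce the counts given for FIG. 2: element 1 chosen twice, element 2 five times.
library = HistoricalBenchmarkLibrary()
for _ in range(2):
    library.record_selection("usr_nm", "standard data element 1")
for _ in range(5):
    library.record_selection("usr_nm", "standard data element 2")

second_set = library.lookup("usr_nm")
print(second_set)
```

Looking up a code that was never standardized returns an empty mapping, which corresponds to the case where no second standard data element can be found.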
It can be understood that, depending on which kinds of feature information of non-standard data elements are recorded in the historical benchmarking record library, the present application can compare the corresponding kinds of feature information of the target data element with the feature information of each non-standard data element in the library. For example, if the library records at least the English codes of non-standard data elements, then at least the English code of the target data element can be matched against the English codes of the non-standard data elements in the library to obtain the second data element set.
In the present application, the second data element set includes each second standard data element to which the target data element has been standardized in the past, together with the standardization count of each second standard data element.
For ease of distinction, a standard data element matched from the historical benchmarking record library is called a second standard data element. It can be understood, however, that standard data elements matched from the historical benchmarking record library may duplicate those matched from the standard data element library, so a given second standard data element may be the same standard data element as a given first standard data element.
S104: determine the recommendation score of each second standard data element according to its standardization count in the second data element set.
The higher the standardization count of a second standard data element, the higher its recommendation score.
It can be understood that a higher standardization count means users have selected the second standard data element as the standard data element more often, which indicates that it is better suited to standardizing the corresponding non-standard data element. Therefore, the more times a second standard data element has been used for standardization, the better suited it is to standardizing the target data element, and the higher its recommendation score. It follows that the recommendation score of a second standard data element characterizes how suitable it is as the standard data element for the target data element.
It can be understood that, for a second standard data element corresponding to a non-standard data element, the more recently it was last selected as the standardization target, the more likely it is to be selected again for that non-standard data element; that is, the better suited it is to standardizing that non-standard data element (or the target data element). Accordingly, in the present application the historical benchmarking record library also stores the time at which each non-standard data element was most recently standardized to each of its standard data elements.
For ease of distinction, the time at which a non-standard data element was most recently standardized to a given standard data element is called the latest standardization time of that standard data element with respect to the non-standard data element.
Accordingly, the second data element set may also include the latest standardization time of each second standard data element. On this basis, the present application may determine the recommendation score of a second standard data element from both its standardization count and its latest standardization time in the second data element set. The higher the standardization count and the shorter the interval between the latest standardization time and the current time, the higher the recommendation score.
The recommendation score of a second standard data element can be determined in many ways, and no limitation is imposed here.
For ease of understanding, one possible way of calculating the recommendation score of a second standard data element is described below as an example:
If only one second standard data element is determined, its recommendation score may be set to 100;
If multiple second standard data elements are determined:
For the second standard data element with the highest standardization count among them, the recommendation score may be set to 100;
For each second standard data element whose standardization count is not the highest, the recommendation score Y may be calculated by Formula 1 below:
$Y = 119.91\,X^{-0.286}\,e^{-\lambda\,\max(0,\,t-d)}$ (Formula 1);
where X is the difference between the standardization count of the second standard data element and the maximum standardization count, the maximum standardization count being the largest standardization count among the multiple second standard data elements matched for the target data element; λ is a preset rate of time-based score decay, which can be set as needed; t is the number of days between the latest standardization time of the second standard data element and the current day; and d is a preset number of days within which no decay is applied. For example, if the effect of time on the recommendation score is to be ignored for the first 7 days, d is set to 7.
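The scoring rules and Formula 1 can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name, the input layout (count, days since last standardization), and the value λ = 0.05 are hypothetical; X is taken as the positive difference between the maximum count and the element's count; and elements tied for the maximum count are all given 100. The value is left exactly as Formula 1 produces it, with no clamping.

```python
import math


def recommendation_scores(history, lam=0.05, grace_days=7):
    """Score second standard data elements from (count, days_since_last) pairs.

    history: {standard data element: (standardization count, days since last use)}
    lam: decay rate (lambda in Formula 1); 0.05 is an arbitrary illustrative value.
    grace_days: d in Formula 1, the window in which no time decay is applied.
    """
    if not history:
        return {}
    if len(history) == 1:
        # A single matched element is given the full score.
        return {name: 100.0 for name in history}

    max_count = max(count for count, _ in history.values())
    scores = {}
    for name, (count, days_since_last) in history.items():
        if count == max_count:
            scores[name] = 100.0  # assumption: ties for the maximum all get 100
            continue
        x = max_count - count  # assumption: X is the positive difference
        decay = math.exp(-lam * max(0, days_since_last - grace_days))
        scores[name] = 119.91 * x ** -0.286 * decay
    return scores


# Counts from FIG. 2 (2 and 5); the day values are made up for the example.
example = recommendation_scores(
    {"standard data element 1": (2, 30), "standard data element 2": (5, 3)}
)
print(example)
```

With these inputs the less frequently chosen element is penalized twice: once by the power-law term in X and once by the time decay, since its last use falls outside the 7-day window.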
Of course, this is only one example of how the recommendation score can be calculated; other ways of calculating it are equally applicable to the present application, and no limitation is imposed here.
S105: based on the matching degree of each first standard data element in the first data element set and the recommendation score of each second standard data element in the second data element set, determine, from the first and second data element sets, at least one first candidate standard data element for standardizing the target data element.
For ease of distinction, a standard data element selected from the first and second data element sets is called a first candidate standard data element.
It can be understood that both the matching degree and the recommendation score of a standard data element essentially reflect how suitable that standard data element is for standardizing the target data element. Moreover, in many cases the standard data elements in the first and second data element sets overlap. Accordingly, by considering whether each standard data element appears in both sets, together with at least one of its matching degree and recommendation score, the standard data elements in the two sets that are more suitable, or more likely to be selected by the user, for standardization can be determined as candidate standard data elements.
The first candidate standard data element can be determined in many ways, and no limitation is imposed here.
For example, in one possible implementation, a first comprehensive score can be determined for each standard data element in the first and second data element sets from the matching degrees of the first standard data elements and the recommendation scores of the second standard data elements. A first candidate standardization list for standardizing the target data element is then generated from the first comprehensive scores.
The first candidate standardization list includes at least one first candidate standard data element, determined from the first and second data element sets, that has a relatively high first comprehensive score.
Optionally, the first candidate standardization list may also include the first comprehensive score of each first candidate standard data element. The first candidate standard data elements can then be sorted by first comprehensive score and output so that the user can select the required standard data element.
For any standard data element in the first and second data element sets: if it appears in both sets, its first comprehensive score depends on both its matching degree and its recommendation score; if it belongs to only the first or only the second data element set, its first comprehensive score depends only on its matching degree or only on its recommendation score, respectively.
The first comprehensive score can be determined in many ways. One possible implementation is described below as an example:
First, the matching degree and the recommendation score can be brought into the same numeric range. Take the case where both are expressed on a 100-point scale. For any first standard data element, if the matching degree is a probability value in the range 0 to 1, the matching degree can first be normalized.
For example, the matching degree score of a first standard data element can be normalized by Formula 2 below to obtain the normalized matching degree score':
where max(score) is the maximum matching degree among all first standard data elements in the first data element set, and min(score) is the minimum matching degree among them. Of course, if the normalized matching degree was already used as the matching degree when the matching degree of the first standard data element was determined, this normalization step need not be performed.
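Formula 2 itself does not appear in the text above; only its terms max(score) and min(score) are defined. The sketch below therefore assumes an ordinary min-max normalization scaled to the 100-point range, which is consistent with those definitions but is not confirmed by the source. The function name and the handling of the degenerate case where all matching degrees are equal are likewise assumptions.

```python
def normalize_matching_degrees(degrees, top=100.0):
    """Min-max normalize matching degrees in [0, 1] to the range [0, top].

    degrees: {first standard data element: raw matching degree}
    Assumed form: score' = (score - min(score)) / (max(score) - min(score)) * top
    """
    if not degrees:
        return {}
    low, high = min(degrees.values()), max(degrees.values())
    if high == low:
        # Assumption: when every element matches equally well, give all the top score.
        return {name: top for name in degrees}
    return {name: (s - low) / (high - low) * top for name, s in degrees.items()}


normalized = normalize_matching_degrees({"A": 0.2, "B": 0.5, "C": 0.8})
print(normalized)
```

Under this assumed form the best-matching element always lands on 100 and the worst on 0, so the normalized matching degree becomes directly comparable with a recommendation score on the same scale.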
Second, for any standard data element in the first and second data element sets, and for any one score score(i) among its matching degree (after normalization) and its recommendation score, the boosted score Item_score(i) can be calculated by Formula 3 below:
Item_score(i) = score(i) + (max_score - score(i)) * hitBoost (Formula 3);
where max_score is a preset top score, which may be 100 when both the normalized matching degree and the recommendation score use a 100-point scale. hitBoost is a preset boost weight; the same boost weight may be used for the matching degree and the recommendation score, or separate boost weights may be set for each as needed.
Finally, for any standard data element in the first and second data element sets, the average of all of its boosted scores can be computed and used as its first comprehensive score.
For example, if a standard data element belongs to both the first and second data element sets, it has both a matching degree and a recommendation score, so its first comprehensive score is the average of the boosted score of its matching degree and the boosted score of its recommendation score. If a standard data element belongs to only the first or only the second data element set, its first comprehensive score is simply the boosted score of its matching degree or of its recommendation score, respectively.
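The boosting of Formula 3 and the averaging step can be sketched together as below. The function names, the single shared boost weight of 0.2, and the sample scores are hypothetical; the inputs are assumed to be already on a 100-point scale.

```python
def boosted(score, hit_boost, max_score=100.0):
    """Formula 3: move a score toward max_score by the fraction hit_boost."""
    return score + (max_score - score) * hit_boost


def first_comprehensive_scores(match_scores, rec_scores, hit_boost=0.2):
    """Average the boosted scores each element has, then rank from high to low.

    match_scores: {element: normalized matching degree} for the first set
    rec_scores:   {element: recommendation score} for the second set
    """
    combined = {}
    for name in set(match_scores) | set(rec_scores):
        # An element in both sets contributes two boosted scores, otherwise one.
        parts = [
            boosted(source[name], hit_boost)
            for source in (match_scores, rec_scores)
            if name in source
        ]
        combined[name] = sum(parts) / len(parts)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


candidate_list = first_comprehensive_scores(
    match_scores={"A": 90.0, "B": 60.0},
    rec_scores={"B": 100.0, "C": 50.0},
)
print(candidate_list)
```

Here element B appears in both sets, so its score is the mean of two boosted values, while A and C each carry a single boosted value. The sorted result plays the role of the first candidate standardization list.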
It can be understood that matching standard data elements directly from the standard data element library based on the feature information of the target data element is a relatively precise matching approach. In practice, therefore, the user can choose whether to enable an option for directly recommending a single standard data element. If the user enables this option and the same single standard data element is matched from both the standard data element library and the historical benchmarking record library, that standard data element can be determined directly as the first candidate standard data element, and the first comprehensive score need not be calculated.
As can be seen from the above, the present application not only searches the standard data element library for first standard data elements that match the target data element, but also determines, based on the feature information of the target data element, the second standard data elements to which the target data element has been standardized in the past and the standardization count for each. Because the standardization count reflects how many times the target data element has been standardized to a given second standard data element, the recommendation score derived from it reflects how suitable that second standard data element is for standardizing the target data element. On this basis, the present application combines the matching degree between each first standard data element and the target data element with the recommendation score of each second standard data element to determine the standard data elements for standardizing the target data element. This takes into account both how well the feature information of the target data element matches the standard data elements and how the target data element has been standardized in the past, so that suitable standard data elements can be determined, and therefore recommended, more accurately.
In addition, because the target data element is a non-standard data element, the format or content of its feature information is not standardized either, so it may be impossible to match a standard data element accurately, or at all, from the standard data element library based on that feature information. However, because the historical benchmarking record library records the feature information of non-standard data elements, it is easier to locate the matching non-standard data element in that library based on the feature information of the target data element, and thereby obtain the standard data elements to which the matching non-standard data element was standardized in the past. This reduces the cases in which no standard data element can be determined because the feature information of the target data element is not covered by the standard data element library.
It can be understood that, in the above embodiments, the first candidate standard data elements are obtained by matching the feature information of the target data element directly against the standard data element library and the historical benchmarking record library, that is, by exact matching based on the feature information of the target data element.
In practice, however, for various reasons such as non-standard feature information of the target data element, it may be impossible to match any standard data element from the standard data element library or the historical benchmarking record library. Moreover, if the target data element has never been standardized before, no previously used standard data element can be matched from the historical benchmarking record library.
Accordingly, to increase the probability of matching a standard data element from the standard data element library based on the feature information of the target data element, the present application also provides a way of fuzzy-matching the standard data element library based on that feature information. The standard data elements obtained by the exact matching described above and by fuzzy matching can then be combined to determine the candidate standard data elements for standardizing the target data element.
Further, the present application may also match, based on the feature information of the target data element, the standard data elements corresponding to different non-standard data elements in the historical benchmarking record library to obtain matched standard data elements, and then combine the standard data elements obtained by exact matching and fuzzy matching to finally determine the standard data element for standardizing the target data element.
The process of determining standard data elements by combining exact matching with fuzzy matching is described below with reference to a flowchart.
FIG. 3 shows another schematic flowchart of a method for determining a standard data element according to the present application. The method of this embodiment may include:
S301: obtain at least one piece of feature information of the target data element to be standardized.
S302: based on the feature information of the target data element, determine from the standard data element library a first data element set that matches the target data element.
The first data element set includes each first standard data element in the standard data element library that matches the target data element, together with the matching degree between each first standard data element and the target data element.
S303: based on the feature information of the target data element, determine from the historical benchmarking record library a second data element set corresponding to the target data element.
The second data element set includes each second standard data element to which the target data element has been standardized in the past, together with the standardization count of each second standard data element.
S304: determine the recommendation score of each second standard data element according to its standardization count in the second data element set.
The higher the standardization count of a second standard data element, the higher its recommendation score.
S305: based on the matching degree of each first standard data element in the first data element set and the recommendation score of each second standard data element in the second data element set, determine a first comprehensive score for each standard data element in the first and second data element sets.
S306: generate, from the first comprehensive scores of the standard data elements, a first candidate standardization list for standardizing the target data element.
The first candidate standardization list includes at least one first candidate standard data element, determined from the first and second data element sets, that has a relatively high first comprehensive score, together with the first comprehensive score of each first candidate standard data element.
Steps S302 to S306 above are in effect the process of exactly matching the standard data element library and the historical benchmarking record library based on the feature information of the target data element to obtain candidate standard data elements.
For steps S301 to S306, refer to the description of the preceding embodiments; the details are not repeated here.
S307: build a feature token set for the target data element based on its at least one piece of feature information.
The feature token set includes at least one feature token obtained by segmenting the at least one piece of feature information.
The purpose of segmenting the feature information is to reduce the cases in which a piece of feature information, taken as a whole, fails to match standard data elements with similar features.
For example, in one possible implementation, the at least one piece of feature information of the target data element is first combined into a text; the text is then segmented, and the resulting feature tokens make up the feature token set.
Further, the format of the feature information of the target data element may be non-standard, which makes it harder to hit the feature information of standard data elements in the standard data element library. Therefore, before combining the feature information of the target data element into a text, the present application also standardizes the feature information, and only then performs segmentation and builds the feature token set.
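The combine-then-segment step can be sketched as below. The source does not name a segmenter, so this sketch substitutes a simple regular-expression split that only suits English codes delimited by spaces, underscores or similar separators; Chinese names would need a real word segmenter. The function name and sample inputs are hypothetical.

```python
import re


def build_feature_tokens(feature_infos):
    """Combine feature information into one text and split it into feature tokens.

    Assumption: tokens are lowercase runs of ASCII letters and digits. This is a
    stand-in for whatever segmenter an actual implementation would use.
    """
    text = " ".join(feature_infos).lower()
    tokens = []
    for token in re.findall(r"[a-z0-9]+", text):
        if token not in tokens:  # keep first occurrence, preserve order
            tokens.append(token)
    return tokens


feature_tokens = build_feature_tokens(["user_name", "User Name ID"])
print(feature_tokens)
```

Splitting into tokens lets a partial overlap such as `name` still hit a standard data element even when the full string `user_name` has no exact counterpart.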
FIG. 4 is a schematic flowchart of one implementation of building the feature token set from the at least one piece of feature information of the target data element. The process includes:
S41: for each piece of feature information of the target data element, determine the standard feature that matches it from the multiple standard features stored in a standard feature library, and standardize the feature information based on that standard feature.
The standard feature library may be a corpus built from standard feature names collected across different scenarios and industries, so it stores the canonical names of different kinds of feature information.
In one example, the standard features in the standard feature library may be further divided into three types: trunk, branch and leaf.
For a trunk-type (Trunk) standard feature, when the feature information of the target data element matches the standard feature, the feature information is replaced entirely by the standard feature. Take the definition of "姓名" (name) as an example: in different business scenarios the Chinese-name field of a data element may read "名字" (given name), "小名" (pet name), "花名" (alias) or "绰号" (nickname). These descriptions differ but are semantically the same, so all of them are replaced by the single standard semantic feature "姓名" (name).
A branch-type (Branch) standard feature is a feature that carries a modifying noun. When matching standard features against the feature information of the target data element, it can be checked whether the feature information contains a part that matches a standard feature of this type; if so, that part is replaced by the standard feature. For example, for the feature information "籍贯地行政区划中文名" (Chinese name of the administrative division of the place of origin), the inclusion rule matches the standard features "市县代码" (city/county code) and "名称" (name), so only "行政区划" (administrative division) is replaced by "市县代码" and "中文名" (Chinese name) by "名称", giving the standardized form "籍贯地市县代码名称" (city/county code name of the place of origin).
For a leaf-type (Leaf) standard feature, once the feature information of the target data element matches a standard feature of this type, the suffix word in the feature information can be expanded. For example, "xxxid" matches the ending "id" and is expanded to "xxxid主键" ("xxxid primary key").
Table 2 below lists information about each type of standard feature in the standard feature library.
In Table 2, the second column gives the name of each standard feature, and the first column gives the non-standard feature descriptions that correspond to that standard feature. If the feature information of the target data element matches a non-standard description of a standard feature, the feature information of the target data element can be standardized based on that standard feature.
The part-of-speech tag indicates the category of each standard feature.
In the present application, standard feature matching may be performed for each piece of feature information of the target data element in the order Trunk, Branch, Leaf. Specifically, the method first checks whether the trunk-type standard features include one that matches the feature information; if so, the feature information is standardized based on the matched standard feature. If no trunk-type standard feature matches, the method goes on to check whether the branch-type standard features include a match; if there is still no match, it checks whether the leaf-type standard features include a match.
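The Trunk, Branch, Leaf matching order described above can be sketched as follows. This is only an illustrative sketch: the rule tables `TRUNK`, `BRANCH`, and `LEAF` and all function names are hypothetical, populated with the examples given in the text, and are not the contents of the actual standard feature library.

```python
from typing import Dict, Optional

# Hypothetical rule tables (assumption): non-standard wording -> standard feature.
TRUNK: Dict[str, str] = {"名字": "姓名", "小名": "姓名", "花名": "姓名", "绰号": "姓名"}
BRANCH: Dict[str, str] = {"行政区划": "市县代码", "中文名": "名称"}
LEAF: Dict[str, str] = {"id": "id主键"}  # matched suffix -> expanded suffix


def match_trunk(feature: str) -> Optional[str]:
    """Trunk: the whole feature is replaced by the standard feature."""
    return TRUNK.get(feature)


def match_branch(feature: str) -> Optional[str]:
    """Branch: only the contained parts that match are replaced."""
    result = feature
    for alias, standard in BRANCH.items():
        result = result.replace(alias, standard)
    return result if result != feature else None


def match_leaf(feature: str) -> Optional[str]:
    """Leaf: a matched suffix is expanded."""
    for suffix, expanded in LEAF.items():
        if feature.endswith(suffix):
            return feature[: -len(suffix)] + expanded
    return None


def standardize_feature(feature: str) -> str:
    """Try Trunk, then Branch, then Leaf; keep the feature if nothing matches."""
    for matcher in (match_trunk, match_branch, match_leaf):
        standardized = matcher(feature)
        if standardized is not None:
            return standardized
    return feature
```

The first matcher that succeeds wins, so a feature that matches a trunk rule is never passed on to the branch or leaf rules.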
It can be understood that step S41 ultimately yields at least one piece of standardized feature information for the target data element.
S42: Combine the at least one piece of feature information of the target data element into a text.
S43: Segment the text into words to obtain a feature word-segment set consisting of at least one feature word segment obtained from the text.
For example, suppose the target data element "付款金额" (payment amount) has the English code "payNum" and the Chinese name "付款数额". After the feature information of the target data element is standardized and cleaned, it is combined into the text "付款数额金额额度窄码payNum". Segmenting this text yields multiple feature word segments, which together form the feature word-segment set, for example {付款款数数额金额额度窄码paynum}. The feature word-segment set can be regarded as a vector composed of the feature word segments.
The text may be segmented in various ways; for example, the present application may use the IK segmentation method. This is not limited here.
S308: Based on the feature word-segment set, determine a third data element set in the standard data element library that is similar to the target data element.
The third data element set includes at least one third standard data element whose feature information set has a relatively high similarity to the feature word-segment set, together with the first similarity between the feature information set of each third standard data element and the feature word-segment set.
The feature information set of a standard data element comprises all the feature information of that standard data element.
The similarity between the feature word-segment set and the feature information set of a standard data element may be computed with any commonly used similarity algorithm, for example the vector space model (VSM) or the BIM model. This is not limited here.
It can be understood that the feature word-segment set contains the word segments obtained by combining the at least one piece of feature information and segmenting the result. Selecting standard data elements by the similarity between this set and the set of feature information of each standard data element is not an exact match, but it reduces the cases in which no standard data element can be found when each piece of feature information of the target data element is matched directly against the feature information of the standard data elements.
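As an illustration of the vector space model mentioned above, the following sketch computes a cosine similarity between two bags of word segments using raw term frequencies. The function name and the use of unweighted term counts are assumptions made for the example; the text does not specify the term weighting.

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Cosine similarity of two token bags under a term-frequency vector space model."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the terms that both vectors share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

For instance, `cosine_similarity(["付款", "金额"], ["付款", "数额"])` shares one of two terms on each side, so the two sets are scored as partially similar rather than as a failed exact match.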
It can be understood that the standard data element library holds a large number of standard data elements and a large amount of feature information, which makes it difficult to retrieve the third standard data elements quickly. The present application may therefore build, in advance, a standard data element index library corresponding to the standard data element library.
The standard data element index library may store a data element index file for each standard data element. The data element index file of a standard data element stores the feature information of that standard data element as recorded in the standard data element library.
In one example, the present application may use the Lucene engine to index every standard data element entry in the standard data element library, as follows:
First, for each standard data element, the English code of the standard data element is used as the file name of its data element index file, and feature information such as the "English code", "Chinese name", "synonyms", and "description" of the standard data element is used as the content of the data element index file.
Second, Lucene builds an index for the data element index file of each standard data element. The file name and the file content of the data element index file are written into the index library as different fields of the index.
Index queries against the standard data element index library may use the IK word segmentation engine.
With the standard data element index library in place, this embodiment can work directly on the data element index files stored for the different standard data elements, and determine from the index library a third data element set whose data element index files are similar to the feature word-segment set.
Correspondingly, the third data element set includes at least one third standard data element whose data element index file has a relatively high similarity to the feature word-segment set, together with the first similarity between the data element index file of each third standard data element and the feature word-segment set.
It can be understood that, because the standard data element index library stores the features of each standard data element as the content of its data element index file, each data element index file can be queried directly with the feature word-segment set. This speeds up matching, so that the third data element set can be determined relatively quickly.
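The idea of the index library can be sketched with a plain in-memory inverted index. This sketch stands in for Lucene and does not use the Lucene API; the class `DataElementIndex`, its methods, and the sample English codes in the usage note are hypothetical, and the content is assumed to have been segmented already.

```python
from collections import defaultdict
from typing import Dict, List, Set


class DataElementIndex:
    """Toy inverted index: word segment -> English codes of standard data elements."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, english_code: str, content_tokens: List[str]) -> None:
        """Index one data element; the English code plays the role of the file name."""
        for token in content_tokens:
            self._postings[token.lower()].add(english_code)

    def search(self, query_tokens: List[str]) -> Dict[str, int]:
        """Return, per data element, how many distinct query segments it contains."""
        hits: Dict[str, int] = defaultdict(int)
        for token in {t.lower() for t in query_tokens}:
            for code in self._postings.get(token, ()):
                hits[code] += 1
        return dict(hits)
```

A query only touches the postings of its own word segments, which is why an index of this kind is faster than scanning the feature information of every standard data element. The hit counts are a crude stand-in for the first similarity; a real engine would apply its own relevance scoring.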
It should be noted that this embodiment is described using the example in which fuzzy matching is performed on both the standard data element library and the historical benchmarking record library based on the feature word-segment set of the target data element. It can be understood, however, that if only the third data element set is matched from the feature word-segment set, the at least one target standard data element for standardizing the target data element may instead be determined from the third data element set and the first candidate standardization list, by combining the first similarity of each third standard data element in the third data element set with the first comprehensive score of each first candidate standard data element in the first candidate standardization list. In that case the subsequent operations need not be performed.
S309: Based on the feature word-segment set, determine a fourth data element set in the historical benchmarking record library that is similar to the target data element.
The fourth data element set includes at least one fourth standard data element for which the feature information of the corresponding non-standard data elements has a relatively high similarity to the feature word-segment set, together with the second similarity between that feature information and the feature word-segment set.
In step S309, for each standard data element that appears in the historical benchmarking record library, the feature information of every non-standard data element that corresponds to that standard data element can be treated as one set. The similarity between this set and the feature word-segment set is then computed, and at least one fourth standard data element is finally determined for which the feature information of the corresponding non-standard data elements has a relatively high similarity to the feature word-segment set.
In one possible implementation, to determine the fourth standard data elements more quickly, the embodiment of the present application may build a historical benchmarking record index library in advance, based on the at least one standard data element that corresponds to the feature information of each non-standard data element in the historical benchmarking record library.
The historical benchmarking record index library stores a historical benchmarking index file for each standard data element that appears in the historical benchmarking record library. The historical benchmarking index file of a standard data element contains the feature information of each non-standard data element that corresponds to that standard data element, as determined from the standard data elements recorded in the historical benchmarking record library for the feature information of the different non-standard data elements.
Correspondingly, a fourth data element set whose historical benchmarking index files are similar to the feature word-segment set can be determined directly from the historical benchmarking record index library, which speeds up retrieval of the fourth data element set.
In this possible implementation, the fourth data element set includes at least one fourth standard data element whose historical benchmarking index file has a relatively high similarity to the feature word-segment set, together with the second similarity between the historical benchmarking index file of each fourth standard data element and the feature word-segment set.
S310: Based on the first similarity of each third standard data element in the third data element set and the second similarity of each fourth standard data element in the fourth data element set, determine a second comprehensive score for each standard data element in the third data element set and the fourth data element set.
The process of determining the second comprehensive score may be similar to the process of determining the first comprehensive score.
For example, the first similarity and the second similarity may first each be standardized according to Formula 2. Next, according to Formula 3, the post-stimulation score corresponding to the first similarity of each standard data element and the post-stimulation score corresponding to the second similarity of each standard data element are calculated. Finally, for each standard data element in the third data element set and the fourth data element set, the average of its post-stimulation score for the first similarity and its post-stimulation score for the second similarity is calculated; this average is the second comprehensive score of the standard data element.
If a standard data element belongs only to the third data element set or only to the fourth data element set, the post-stimulation score that it lacks (for the first similarity or for the second similarity) can be regarded as zero. On that basis, the second comprehensive score of the standard data element is in effect the post-stimulation score of the first similarity or of the second similarity that it does have.
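The combination step above can be sketched as follows. Formula 2 and Formula 3 are not reproduced here, so the sketch takes them as caller-supplied functions `normalize` and `stimulate`; these parameter names, and the function name itself, are hypothetical. The sketch follows the averaging rule literally, so an element found in only one set receives half of its single post-stimulation score; the text does not make clear whether that or the unhalved score is intended.

```python
from typing import Callable, Dict

Scores = Dict[str, float]


def second_comprehensive_scores(
    first_similarity: Scores,
    second_similarity: Scores,
    normalize: Callable[[Scores], Scores],  # stands in for Formula 2 (assumption)
    stimulate: Callable[[float], float],    # stands in for Formula 3 (assumption)
) -> Scores:
    """Average the two post-stimulation scores; a missing score counts as zero."""
    stim_first = {k: stimulate(v) for k, v in normalize(first_similarity).items()}
    stim_second = {k: stimulate(v) for k, v in normalize(second_similarity).items()}
    return {
        code: (stim_first.get(code, 0.0) + stim_second.get(code, 0.0)) / 2
        for code in stim_first.keys() | stim_second.keys()
    }
```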
S311: Generate a second candidate standardization list for standardizing the target data element.
The second candidate standardization list includes at least one second candidate standard data element with a relatively high second comprehensive score, determined from the third data element set and the fourth data element set, together with the second comprehensive score of each second candidate standard data element.
In this embodiment, steps S307 to S311 determine another possible candidate standardization list by fuzzy matching. This compensates for cases in which the precise matching of steps S302 to S306 finds no candidate standard data element or matches poorly.
It can be understood that steps S302 to S306 and steps S307 to S311 may be executed in parallel; alternatively, steps S302 to S306 may be executed first and steps S307 to S311 afterwards, or steps S307 to S311 first and steps S302 to S306 afterwards.
S312: Combining the first comprehensive score of each first candidate standard data element in the first candidate standardization list with the second comprehensive score of each second candidate standard data element in the second candidate standardization list, determine an overall score for each candidate standard data element in the first candidate standardization list and the second candidate standardization list.
It can be understood that the overall score of a candidate standard data element is determined from whether the candidate appears in both lists and from at least one of its first comprehensive score and second comprehensive score. The overall score reflects the overall suitability of the candidate standard data element for standardizing the target data element.
The overall score may be determined in several ways.
In one possible implementation, weights are set for the first comprehensive score and the second comprehensive score. Then, for each candidate standard data element in the first candidate standardization list and the second candidate standardization list, the weighted sum of its first comprehensive score and second comprehensive score is calculated, and the result is used as the overall score of the candidate standard data element.
If a candidate standard data element belongs only to the first candidate standardization list or only to the second candidate standardization list, the first or second comprehensive score that it lacks can be regarded as zero in the weighted sum.
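The weighted-sum variant can be sketched as follows. The function name and the default weights of 0.5 are hypothetical placeholders; the text does not state concrete weight values.

```python
from typing import Dict


def overall_scores(
    first_list: Dict[str, float],
    second_list: Dict[str, float],
    w_first: float = 0.5,   # assumed weight of the first comprehensive score
    w_second: float = 0.5,  # assumed weight of the second comprehensive score
) -> Dict[str, float]:
    """Weighted sum of the two comprehensive scores; a missing score counts as zero."""
    return {
        code: w_first * first_list.get(code, 0.0) + w_second * second_list.get(code, 0.0)
        for code in first_list.keys() | second_list.keys()
    }
```

Because a missing score contributes zero, a candidate that appears in both lists tends to outrank a candidate with a similar score that appears in only one.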
In another possible implementation, for each candidate standard data element in the first candidate standardization list and the second candidate standardization list, a pre-trained weight evaluation model is used to determine a first weight and a second weight for the candidate, based on the feature information of the candidate standard data element and at least one of its first comprehensive score and second comprehensive score.
If a candidate standard data element has only a first comprehensive score and no second comprehensive score, its second comprehensive score can be regarded as zero; if it has only a second comprehensive score, its first comprehensive score can be regarded as zero. The first comprehensive score, the second comprehensive score, and the reward and punishment score of the candidate standard data element are then all input into the weight evaluation model to predict the two weights.
The first weight is the share of weight given to the first comprehensive score of the candidate standard data element, and the second weight is the share of weight given to its second comprehensive score.
The weight evaluation model is trained on the feature information, first comprehensive scores, and second comprehensive scores of multiple data element samples, each labeled with whether it was selected by a user. The data element samples may be standard data elements that were previously recommended to users, and the label of a data element sample indicates whether the user chose that data element as the standard data element.
The weight evaluation model may be obtained by training a model such as a random forest or eXtreme Gradient Boosting (XGBoost).
S313: Based on the overall score of each candidate standard data element in the first candidate standardization list and the second candidate standardization list, determine from the two lists at least one target standard data element for standardizing the target data element.
For example, a set number of candidate standard data elements with the highest overall scores may be selected as target standard data elements and recommended to the user for selection.
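Selecting a set number of top-scoring candidates can be sketched with the standard library. The function name `select_targets` is hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def select_targets(overall: Dict[str, float], count: int) -> List[Tuple[str, float]]:
    """Return the `count` candidates with the highest overall scores, best first."""
    return heapq.nlargest(count, overall.items(), key=lambda item: item[1])
```

For example, `select_targets({"a": 0.9, "b": 0.2, "c": 0.5}, 2)` returns the entries for "a" and "c", in that order, as the recommendation list shown to the user.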
In this embodiment, in addition to matching standard data elements directly from the standard data element library and the historical benchmarking records using the feature information of the target data element to be standardized, the text formed by combining the pieces of feature information of the target data element is segmented, and fuzzy matching is performed on the standard data element library and the historical benchmarking records based on the feature word-segment set formed from the resulting word segments. This allows the standard data elements that match the target data element to be found more comprehensively, and reduces the cases in which no standard data element can be matched because the feature information of the target data element is not covered by the standard data element library.
Moreover, determining the target standard data element that is finally used to standardize the target data element from the comprehensive scores of the standard data elements matched in both ways helps select the target standard data element more comprehensively and accurately, and improves how well the determined target standard data element suits the target data element.
It can be understood that a standard data element suitable for standardizing a target data element generally belongs to the same entity category as the target data element. On this basis, the present application may also take into account whether the entity category of each candidate standard data element in the first candidate standardization list and the second candidate standardization list is consistent with the entity category of the target data element when determining the overall score of the candidate standard data element.
FIG. 5 is a schematic flowchart of one implementation of determining the overall score of candidate standard data elements in the present application. The method of this embodiment may include:
S501: Based on at least one piece of feature information of the target data element and a pre-built entity rule base, determine the entity category to which the target data element belongs.
The entity category of a data element represents the type of object that the data element describes.
The entity rule base records the feature matching rules that the feature information of each entity category must satisfy. If at least one piece of feature information of the target data element conforms to the feature matching rule of an entity category in the entity rule base, the target data element belongs to that entity category.
In one example, the data recorded in the entity rule base may take the form shown in Table 3 below:
Table 3
In Table 3, each entity name corresponds to an entity type (also called an entity category) and to the regular expression that feature information belonging to that entity type must satisfy. Taking the ID card number as an example, the entity type of the entity "ID card number" is "person". If the feature information of the target data element to be standardized matches the regular expression for the ID card number, the entity type of the target data element is "person".
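A lookup against an entity rule base of this form can be sketched as follows. The rule entries are hypothetical stand-ins for the rows of Table 3: the ID card pattern is a simplified 18-character check rather than a full validity test, and the second rule and its "device" category are invented purely for illustration.

```python
import re
from typing import Iterable, List, Optional, Pattern, Tuple

# Hypothetical rule base (assumption): (entity name, entity category, pattern).
ENTITY_RULES: List[Tuple[str, str, Pattern[str]]] = [
    ("ID card number", "person", re.compile(r"\d{17}[\dXx]")),
    ("IPv4 address", "device", re.compile(r"(\d{1,3}\.){3}\d{1,3}")),
]


def entity_category(features: Iterable[str]) -> Optional[str]:
    """Return the category of the first rule that any feature value fully matches."""
    for feature in features:
        for _name, category, pattern in ENTITY_RULES:
            if pattern.fullmatch(feature):
                return category
    return None  # no rule matched; the category stays unknown
```

The same function can be applied to a candidate standard data element when the standard data element library does not store its entity category.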
S502: Determine the entity category of each candidate standard data element in the first candidate standardization list and the second candidate standardization list.
It can be understood that, if the standard data element library stores the entity type of each standard data element, the entity category of a candidate standard data element can be looked up directly in the standard data element library.
If the standard data element library does not store the entity type of the standard data elements, the entity category of a candidate standard data element can be matched from the entity rule base according to the feature information of the candidate standard data element.
S503: For each candidate standard data element in the first candidate standardization list and the second candidate standardization list, determine the reward and punishment score of the candidate standard data element.
If the entity category of the candidate standard data element is the same as that of the target data element, the reward and punishment score of the candidate is positive, for example a preset positive value. If the entity category of the candidate standard data element differs from that of the target data element, the reward and punishment score of the candidate is negative, for example a preset negative value.
S504: Combining the first comprehensive score and the reward and punishment score of each first candidate standard data element in the first candidate standardization list with the second comprehensive score and the reward and punishment score of each second candidate standard data element in the second candidate standardization list, determine the overall score of each candidate standard data element in the first candidate standardization list and the second candidate standardization list.
For example, the overall score may be determined from whether the candidate standard data element belongs to both candidate standardization lists, its comprehensive scores, and the value of its reward and punishment score. For instance, for a candidate standard data element that is contained in both the first candidate standardization list and the second candidate standardization list and whose reward and punishment score is positive, the higher its first comprehensive score and second comprehensive score, the higher its overall score.
In one example, for each candidate standard data element in the first candidate standardization list and the second candidate standardization list, a pre-trained weight evaluation model is used to determine a first weight, a second weight, and a third weight for the candidate, based on the feature information of the candidate standard data element and at least two of its first comprehensive score, second comprehensive score, and reward and punishment score.
The first weight and the second weight are the shares of weight given to the first comprehensive score and the second comprehensive score of the candidate standard data element, as described above. The third weight is the share of weight given to the reward and punishment score of the candidate standard data element.
As before, the weight evaluation model is trained on the feature information, first comprehensive scores, second comprehensive scores, and reward and punishment scores of multiple data element samples, each labeled with whether it was selected by a user.
Correspondingly, for each candidate standard data element in the first candidate standardization list and the second candidate standardization list, the overall score of the candidate is determined from its first weight, second weight, and third weight together with its first comprehensive score, second comprehensive score, and reward and punishment score.
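The reward and punishment score and the three-weight combination can be sketched as follows. The preset values of +1.0 and -1.0 and the function names are assumptions, and the sketch combines the three terms as a plain weighted sum, which the text does not spell out. The weights are passed in directly; in the described scheme they would come from the trained weight evaluation model, which is not modeled here.

```python
from typing import Tuple


def reward_punishment_score(
    candidate_category: str,
    target_category: str,
    reward: float = 1.0,       # assumed preset positive value
    punishment: float = -1.0,  # assumed preset negative value
) -> float:
    """Positive when the entity categories agree, negative otherwise."""
    return reward if candidate_category == target_category else punishment


def overall_score(
    first_score: float,
    second_score: float,
    rp_score: float,
    weights: Tuple[float, float, float],
) -> float:
    """Weighted sum of the two comprehensive scores and the reward and punishment score."""
    w_first, w_second, w_third = weights
    return w_first * first_score + w_second * second_score + w_third * rp_score
```

A candidate found only in the first list would be scored with `second_score=0.0`, consistent with the convention that a missing comprehensive score counts as zero.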
It can be understood that, once the overall scores of the candidate standard data elements have been determined, at least one target standard data element for standardizing the target data element can likewise be determined based on those overall scores.
In the embodiment of FIG. 5, the overall score of a candidate standard data element takes into account whether the entity type of the candidate standard data element is consistent with that of the target data element. As a result, the overall score more accurately reflects how suitable the candidate standard data element is as the standard data element of the target data element, which in turn helps determine more accurately the standard data element used to standardize the target data element.
To facilitate understanding of the solution of the present application, it is described below with reference to an implementation block diagram. FIG. 6 shows a schematic diagram of the implementation framework of the method for determining a standard data element provided by an embodiment of the present application.
As can be seen from FIG. 6, after obtaining the feature information of the target data element to be standardized, the present application performs exact matching and fuzzy matching separately based on that feature information.
On the one hand, in the exact matching process, the present application uses the feature information of the target data element to match, from the standard data element library and from the historical benchmarking record library respectively, standard data elements whose feature information is similar to that of the target data element.
In the present application, the at least one piece of feature information of the target data element may consist entirely of one or more pieces of feature information input by the user, or may be multiple pieces of feature information mined on the basis of at least one piece of feature information input by the user.
For example, the feature information of the target data element that the present application can obtain may include at least one of the English code of the target data element and the Chinese name of the target data element.
The at least one piece of feature information of the target data element may further include: a description of the target data element, the historical standard data elements to which the target data element has been standardized in the past, and information about reference data elements of the same category as the target data element.
The English code, Chinese name, description and reference data element information of the target data element may be input by the user.
The historical standard data element corresponding to the target data element may be input by the user, or may be queried from the user-side historical standardization records according to the English code or Chinese name of the target data element.
Further, to improve the accuracy of standard data element matching, the present application may also analyze the obtained feature information of the target data element to derive its code length label category, entity name and entity category.
The code length label category of the target data element may be determined from the lengths of the reference data elements corresponding to the target data element.
For example, if the difference between the maximum length and the minimum length among the reference data elements is zero, the code length label category of the target data element can be determined to be fixed code.
If the length of every reference data element is less than or equal to the narrow code confidence value (a length threshold, for example 10), the code length label category of the target data element is narrow code.
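The two rules above can be sketched as a small function. The function name, the string labels, the handling of an empty input, and the choice to test the fixed code rule before the narrow code rule are assumptions made for illustration; only the two conditions and the example threshold of 10 come from the text.

```python
from typing import Optional, Sequence

NARROW_CODE_THRESHOLD = 10  # example value given in the text


def code_length_label(reference_lengths: Sequence[int],
                      narrow_threshold: int = NARROW_CODE_THRESHOLD) -> Optional[str]:
    """Return a code length label for the target data element.

    reference_lengths holds the lengths of the reference data elements.
    Assumption: the fixed code rule is checked first; None means that
    neither rule applies (or that no reference data elements were given).
    """
    if not reference_lengths:
        return None
    # Fixed code: maximum length minus minimum length is zero.
    if max(reference_lengths) - min(reference_lengths) == 0:
        return "fixed"
    # Narrow code: every length is at or below the threshold.
    if all(length <= narrow_threshold for length in reference_lengths):
        return "narrow"
    return None
```

For instance, reference values that are all 18 characters long yield `"fixed"`, while lengths of 3, 7 and 10 yield `"narrow"`.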
The entity name of the target data element may be matched from the entity rule base based on other feature information of the target data element, or may be identified by named entity recognition (NER).
The entity category of the target data element can be obtained by querying the entity rule base, as described in the foregoing embodiments.
On this basis, the present application can use the English code, Chinese name, historical standard data element information and entity type of the target data element to match, from the standard data element library, standard data elements with similar feature information, and obtain the matching degree between the feature information of each matched standard data element and the feature information of the target data element.
Meanwhile, based on the English code (or Chinese name) of the target data element, non-standard data elements whose English code matches that of the target data element can be matched from the historical benchmarking record library, and the standard data elements to which those non-standard data elements have historically been standardized, together with the number of standardizations for each, can be determined. The recommendation score of each standard data element can then be determined from the number of standardizations of the standard data elements matched from the historical benchmarking records.
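A minimal sketch of the history lookup described above follows. The record layout (pairs of English code and standard data element identifier), the case-insensitive equality match, and the use of each element's share of the total count as its recommendation score are all assumptions; the text only requires that a larger number of standardizations gives a higher score.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def recommendation_scores(target_code: str,
                          history: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Score standard data elements by how often the target code was mapped to them.

    history is a hypothetical stand-in for the historical benchmarking record
    library: (english_code_of_non_standard_element, standard_element_id) pairs.
    """
    counts = Counter(
        standard_id
        for code, standard_id in history
        if code.lower() == target_code.lower()  # assumed matching rule
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumed scoring: share of all standardizations; more counts, higher score.
    return {standard_id: n / total for standard_id, n in counts.items()}
```

Given three records for `cust_nm`, two mapped to one standard data element and one to another, the first receives a score of 2/3 and the second 1/3.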
As shown in the exact matching part of FIG. 6, by combining one or both of the matching degree and the recommendation score of each standard data element matched from the standard data element library and the historical benchmarking record library, the exact matching part finally determines the first standardization list, which contains the first comprehensive scores of multiple first candidate standard data elements.
For the specific implementation of exact matching in FIG. 6, see the description of steps S302 to S306 in the embodiment of FIG. 3 above.
On the other hand, in the fuzzy matching process, the present application constructs a feature word segmentation set of the target data element from its feature information, searches the standard data element index library and the historical benchmarking record index library separately, and combines the standard data elements found in the two index libraries to generate the second standardization list. The second standardization list includes the second comprehensive scores of multiple second candidate standard data elements.
In addition, for each candidate standard data element in the first standardization list and the second standardization list obtained by exact matching and fuzzy matching, the present application may also determine the reward and punishment score of the candidate standard data element through the process shown in the embodiment of FIG. 5.
As shown in FIG. 6, if the candidate standard data element and the target data element belong to the same entity category, the reward and punishment score of the candidate standard data element is in effect a reward score, which is positive; otherwise, it is a penalty score, which is negative.
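The sign rule for the reward and punishment score can be written as a short sketch. The magnitudes `1.0` and `-1.0` and the function name are assumptions; the text fixes only the signs.

```python
def reward_penalty_score(candidate_category: str,
                         target_category: str,
                         reward: float = 1.0,
                         penalty: float = -1.0) -> float:
    """Return a positive reward when the entity categories match, else a negative penalty.

    The default magnitudes are placeholders; the source states only that the
    reward score is positive and the penalty score is negative.
    """
    if reward <= 0 or penalty >= 0:
        raise ValueError("reward must be positive and penalty must be negative")
    return reward if candidate_category == target_category else penalty
```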
On this basis, the present application performs a weighted score integration of the first comprehensive score, second comprehensive score, and reward and punishment score of each candidate standard data element to obtain the overall score of each candidate standard data element; for details, see the description of step S504 in the embodiment of FIG. 5.
Based on the overall scores of the candidate standard data elements, at least one target standard data element suitable for recommendation for standardizing the target data element can be determined, along with the ranking of the at least one target standard data element.
In another aspect, corresponding to the method for determining a standard data element provided by the embodiments of the present application, the present application further provides an apparatus for determining a standard data element.
FIG. 7 shows a schematic structural diagram of an apparatus for determining a standard data element provided by an embodiment of the present application. The apparatus of this embodiment may include:
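an information obtaining unit 701, configured to obtain at least one piece of feature information of a target data element to be standardized;
a first set determining unit 702, configured to determine, from the standard data element library and based on each piece of feature information of the target data element, a first data element set matching the target data element, the first data element set including: each first standard data element in the standard data element library that matches the target data element, and the matching degree between the first standard data element and the target data element;
a second set determining unit 703, configured to determine, from the historical benchmarking record library and based on each piece of feature information of the target data element, a second data element set corresponding to the target data element, wherein the historical benchmarking record library stores: feature information of non-standard data elements that have been standardized in the past, at least one standard data element to which each non-standard data element has been standardized, and the number of times the non-standard data element has been standardized to each standard data element; the second data element set includes: each second standard data element to which the target data element has been standardized in the past, and the number of standardizations corresponding to the second standard data element;
a recommendation scoring unit 704, configured to determine the recommendation score of the second standard data element according to the number of standardizations of the second standard data element in the second data element set, wherein the greater the number of standardizations of the second standard data element, the higher its recommendation score, and the recommendation score characterizes how suitable the second standard data element is as the standard data element of the target data element;
a first data element determining unit 705, configured to determine, from the first data element set and the second data element set, at least one first candidate standard data element for standardizing the target data element, by combining the matching degree of each first standard data element in the first data element set and the recommendation score of each second standard data element in the second data element set.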
In a possible implementation, the historical benchmarking record library further stores: the most recent time at which the non-standard data element was standardized to each standard data element;
correspondingly, the second data element set determined by the second set determining unit further includes the most recent standardization time corresponding to the second standard data element;
the recommendation scoring unit includes:
a recommendation scoring subunit, configured to determine the recommendation score of the second standard data element according to the number of standardizations and the most recent standardization time of the second standard data element in the second data element set, wherein the greater the number of standardizations of the second standard data element and the shorter the time elapsed between its most recent standardization time and the current time, the higher its recommendation score.
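One way to satisfy both monotonicity requirements of the recommendation scoring subunit is sketched below. The specific formula (a logarithm of the count multiplied by an exponential half-life decay), the 180-day half-life and the function name are assumptions chosen for illustration; the source does not specify a formula.

```python
import math
from datetime import datetime


def recency_weighted_score(count: int,
                           last_standardized: datetime,
                           now: datetime,
                           half_life_days: float = 180.0) -> float:
    """Score that grows with the count and shrinks as the last use gets older.

    Assumed formula: log(1 + count) * 0.5 ** (age_in_days / half_life_days).
    """
    # Clamp negative ages (timestamps in the future) to zero.
    age_days = max((now - last_standardized).total_seconds() / 86400.0, 0.0)
    decay = 0.5 ** (age_days / half_life_days)
    return math.log1p(count) * decay
```

With equal counts, a standard data element last used a month ago scores higher than one last used two years ago; with equal recency, the element with more standardizations scores higher.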
In yet another possible implementation, the first data element determining unit includes:
a first comprehensive scoring unit, configured to determine the first comprehensive score of each standard data element in the first data element set and the second data element set, by combining the matching degree of each first standard data element in the first data element set and the recommendation score of each second standard data element in the second data element set;
a first list generating unit, configured to generate a first candidate standardization list for standardizing the target data element, the first candidate standardization list including at least one first candidate standard data element, determined from the first data element set and the second data element set, that has a relatively high first comprehensive score.
In yet another possible implementation, the first candidate standardization list generated by the first list generating unit further includes the first comprehensive score of the first candidate standard data element;
the apparatus further includes:
a word segmentation set construction unit, configured to construct a feature word segmentation set of the target data element based on the at least one piece of feature information of the target data element, the feature word segmentation set including at least one feature word segment obtained by segmenting the at least one piece of feature information;
a third set determining unit, configured to determine, based on the feature word segmentation set, a third data element set in the standard data element library that is similar to the target data element, the third data element set including: at least one third standard data element whose feature information set has a relatively high similarity to the feature word segmentation set, and a first similarity between the feature information set of the third standard data element and the feature word segmentation set; the feature information set of a standard data element includes each piece of feature information of that standard data element;
a target determining unit, configured to determine, from the third data element set and the first candidate standardization list, at least one target standard data element for standardizing the target data element, based on the first similarity corresponding to each third standard data element and the first comprehensive score of each first candidate standard data element in the first candidate standardization list.
In yet another possible implementation, the apparatus further includes:
a fourth set determining unit, configured to determine, based on the feature word segmentation set and before the target determining unit determines the at least one target standard data element for standardizing the target data element, a fourth data element set in the historical benchmarking record library that is similar to the target data element, the fourth data element set including: at least one fourth standard data element whose corresponding non-standard data element has feature information with a relatively high similarity to the feature word segmentation set, and a second similarity between the feature information of the non-standard data element corresponding to the fourth standard data element and the feature word segmentation set;
correspondingly, the target determining unit includes:
a second comprehensive scoring unit, configured to determine the second comprehensive score of each standard data element in the third data element set and the fourth data element set, based on the first similarity corresponding to each third standard data element in the third data element set and the second similarity corresponding to each fourth standard data element in the fourth data element set;
a second list generating unit, configured to generate a second candidate standardization list for standardizing the target data element, the second candidate standardization list including: at least one second candidate standard data element, determined from the third data element set and the fourth data element set, that has a relatively high second comprehensive score, and the second comprehensive score of the second candidate standard data element;
an overall scoring unit, configured to determine the overall score of each candidate standard data element in the first candidate standardization list and the second candidate standardization list, by combining the first comprehensive score of each first candidate standard data element in the first candidate standardization list and the second comprehensive score of each second candidate standard data element in the second candidate standardization list;
a target comprehensive determining unit, configured to determine, from the first candidate standardization list and the second candidate standardization list, at least one target standard data element for standardizing the target data element, based on the overall score of each candidate standard data element in the two lists.
In yet another possible implementation, the apparatus further includes:
a category query unit, configured to determine, before the overall scoring unit determines the overall score of each candidate standard data element in the first candidate standardization list and the second candidate standardization list, the entity category to which the target data element belongs, based on at least one piece of feature information of the target data element and a pre-built entity rule base, the entity rule base recording the feature matching rules that the feature information of each entity category must satisfy;
a category determining unit, configured to determine the entity category of each candidate standard data element in the first candidate standardization list and the second candidate standardization list;
a reward and punishment score determining unit, configured to determine, for each candidate standard data element in the first candidate standardization list and the second candidate standardization list, the reward and punishment score of the candidate standard data element, wherein if the entity category of the candidate standard data element is the same as that of the target data element, the reward and punishment score of the candidate standard data element is positive; otherwise, it is negative;
correspondingly, the overall scoring unit is specifically configured to determine the overall score of each candidate standard data element in the first candidate standardization list and the second candidate standardization list, by combining the first comprehensive score and the reward and punishment score of each first candidate standard data element in the first candidate standardization list with the second comprehensive score and the reward and punishment score of each second candidate standard data element in the second candidate standardization list.
In yet another possible implementation, the overall scoring unit includes:
a weight determining subunit, configured to determine, for each candidate standard data element in the first candidate standardization list and the second candidate standardization list, the first weight, the second weight and the third weight corresponding to the candidate standard data element by using a pre-trained weight evaluation model, based on the feature information of the candidate standard data element and at least two of its first comprehensive score, second comprehensive score, and reward and punishment score;
an overall scoring subunit, configured to determine, for each candidate standard data element in the first candidate standardization list and the second candidate standardization list, the overall score of the candidate standard data element, by combining its first weight, second weight and third weight with its first comprehensive score, second comprehensive score, and reward and punishment score;
wherein the first weight is the weight proportion of the first comprehensive score of the candidate standard data element, the second weight is the weight proportion of its second comprehensive score, and the third weight is the weight proportion of its reward and punishment score;
the weight evaluation model is trained using the feature information, first comprehensive scores, second comprehensive scores, and reward and punishment scores of multiple data element samples, each labeled with whether it was selected by the user.
In yet another possible implementation, the third set determining unit is specifically configured to determine, from the standard data element index library and based on the data element index files corresponding to the different standard data elements stored in it, a third data element set whose data element index files are similar to the feature word segmentation set;
wherein the data element index file of a standard data element stores the feature information of that standard data element as recorded in the standard data element library;
the third data element set includes: at least one third standard data element whose data element index file has a relatively high similarity to the feature word segmentation set, and the first similarity between the data element index file of the third standard data element and the feature word segmentation set.
In yet another possible implementation, the fourth set determining unit is specifically configured to determine, from the historical benchmarking record index library and based on the historical benchmarking index files corresponding to the different standard data elements stored in it, a fourth data element set whose historical benchmarking index files are similar to the feature word segmentation set;
wherein the historical benchmarking index file corresponding to a standard data element includes: the feature information of each non-standard data element corresponding to that standard data element, determined based on the at least one standard data element corresponding to the feature information of the different non-standard data elements in the historical benchmarking database;
the fourth data element set includes: at least one fourth standard data element whose historical benchmarking index file has a relatively high similarity to the feature word segmentation set, and the second similarity between the historical benchmarking index file of the fourth standard data element and the feature word segmentation set.
In yet another possible implementation, the word segmentation set construction unit includes:
a text combining unit, configured to combine the at least one piece of feature information of the target data element into a text;
a text segmentation unit, configured to segment the text to obtain a feature word segmentation set consisting of at least one feature word segment obtained from the text.
In yet another possible implementation, the apparatus further includes:
a feature cleaning unit, configured to, before the text combining unit combines the at least one piece of feature information of the target data element into a text and for each piece of feature information of the target data element, determine the standard feature matching that feature information from among multiple standard features stored in a standard feature library, and standardize the feature information based on the matched standard feature, to obtain at least one piece of standardized feature information corresponding to the target data element.
The apparatus for determining a standard data element provided by the embodiments of the present application can be applied to a computer device. The computer device may be a standalone server, or a server in a cloud platform or distributed system. Optionally, FIG. 8 shows a block diagram of the hardware structure of a computer device to which the solution of the present application is applicable. Referring to FIG. 8, the hardware structure of the computer device may include: at least one processor 801, at least one communication interface 802, at least one memory 803 and at least one communication bus 804;
In the embodiment of the present application, there is at least one each of the processor 801, the communication interface 802, the memory 803 and the communication bus 804, and the processor 801, the communication interface 802 and the memory 803 communicate with one another through the communication bus 804;
The processor 801 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
The memory 803 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory;
The memory stores a program that the processor can invoke, the program being used to:
obtain at least one piece of feature information of a target data element to be standardized;
determine, from the standard data element library and based on each piece of feature information of the target data element, a first data element set matching the target data element, the first data element set including: each first standard data element in the standard data element library that matches the target data element, and the matching degree between the first standard data element and the target data element;
determine, from the historical benchmarking record library and based on each piece of feature information of the target data element, a second data element set corresponding to the target data element, wherein the historical benchmarking record library stores: feature information of non-standard data elements that have been standardized in the past, at least one standard data element to which each non-standard data element has been standardized, and the number of times the non-standard data element has been standardized to each standard data element; the second data element set includes: each second standard data element to which the target data element has been standardized in the past, and the number of standardizations corresponding to the second standard data element;
determine the recommendation score of the second standard data element according to the number of standardizations of the second standard data element in the second data element set, wherein the greater the number of standardizations of the second standard data element, the higher its recommendation score, and the recommendation score characterizes how suitable the second standard data element is as the standard data element of the target data element;
determine, from the first data element set and the second data element set, at least one first candidate standard data element for standardizing the target data element, by combining the matching degree of each first standard data element in the first data element set and the recommendation score of each second standard data element in the second data element set.
Optionally, for the refined and extended functions of the program, refer to the description above.
An embodiment of the present application further provides a storage medium, which may store a program suitable for execution by a processor, the program being used for:
obtaining at least one piece of characteristic information of a target data element to be standardized;
based on each piece of characteristic information of the target data element, determining, from a standard data element library, a first data element set matching the target data element, the first data element set including: each first standard data element in the standard data element library that matches the target data element, and the matching degree between that first standard data element and the target data element;
based on each piece of characteristic information of the target data element, determining, from a historical benchmarking record library, a second data element set corresponding to the target data element, wherein the historical benchmarking record library stores: characteristic information of non-standard data elements that have been standardized in the past; at least one standard data element to which each such non-standard data element has been standardized; and the standardization count, that is, the number of times the non-standard data element has been standardized to each of those standard data elements; and wherein the second data element set includes: each second standard data element to which the target data element has historically been standardized, and the standardization count corresponding to that second standard data element;
determining a recommendation score for each second standard data element according to its standardization count in the second data element set, wherein the higher the standardization count of a second standard data element, the higher its recommendation score, and the recommendation score characterizes how suitable the second standard data element is as the standard data element for the target data element;
combining the matching degree of each first standard data element in the first data element set with the recommendation score of each second standard data element in the second data element set, and determining, from the first data element set and the second data element set, at least one first candidate standard data element for standardizing the target data element.
Optionally, for the refined and extended functions of the program, refer to the description above.
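The final combining step can be sketched as follows. This is a hedged example only: the function name `select_candidates`, the weighted sum, the default weights, and `top_k` are hypothetical choices, since the passage above does not fix how matching degrees and recommendation scores are combined. Both inputs are assumed to be dictionaries keyed by standard data element name with values in [0, 1].

```python
def select_candidates(first_set, second_scores, w_match=0.5, w_history=0.5, top_k=3):
    """Rank standard data elements drawn from both sets and return the best ones.

    first_set:     element -> matching degree against the target data element
    second_scores: element -> recommendation score derived from history
    An element missing from one set contributes 0.0 for that term (assumption).
    """
    combined = {}
    for element in set(first_set) | set(second_scores):
        combined[element] = (
            w_match * first_set.get(element, 0.0)
            + w_history * second_scores.get(element, 0.0)
        )
    # Highest combined score first; ties are broken by name for determinism.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical inputs: one element appears in both sets, two appear in only one.
first_set = {"CustomerName": 0.7, "ClientTitle": 0.9}
history_scores = {"CustomerName": 0.8, "ContactName": 0.2}

candidates = select_candidates(first_set, history_scores, top_k=2)
for element, score in candidates:
    print(element, round(score, 2))
```

In this example an element supported by both the library match and the history outranks an element with a higher matching degree but no history, which is the kind of effect that combining the two sets is meant to produce.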
Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a progressive manner, each focusing on its differences from the other embodiments. The embodiments may be combined as required, and identical or similar parts of the embodiments may be understood by reference to one another.
The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (14)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111674685.4A CN114328600B (en) | 2021-12-31 | 2021-12-31 | Method, device, equipment and storage medium for determining standard data element |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114328600A (en) | 2022-04-12 |
CN114328600B (en) | 2024-11-29 |
Family
ID=81020556
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111674685.4A Active CN114328600B (en) | 2021-12-31 | 2021-12-31 | Method, device, equipment and storage medium for determining standard data element |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114328600B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117455630A (en) * | 2023-09-12 | 2024-01-26 | 南通尚轩金属制品有限公司 | A data processing method for non-standard parts of building materials |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110196834A (en) * | 2019-05-21 | 2019-09-03 | 厦门市美亚柏科信息股份有限公司 | Benchmarking method and system for data items, files, and databases |
CN111858567A (en) * | 2020-06-18 | 2020-10-30 | 南京市江宁区信息化管理服务中心 | Method and system for cleaning government affair data through standard data elements |
US11010675B1 (en) * | 2017-03-14 | 2021-05-18 | Wells Fargo Bank, N.A. | Machine learning integration for a dynamically scaling matching and prioritization engine |
CN113392133A (en) * | 2021-06-29 | 2021-09-14 | 浪潮软件科技有限公司 | Intelligent data identification method based on machine learning |
WO2021184995A1 (en) * | 2020-03-19 | 2021-09-23 | 华为技术有限公司 | Data processing method and data standard management system |
Non-Patent Citations (2)
Title |
---|
李敏: "一种标准数据元与数据项匹配算法", 《电脑知识与技术》, vol. 12, no. 01, 1 March 2016 (2016-03-01) * |
李辉: "电子政务法人单位基础信息共享的标准化研究", 《中国优秀硕士论文全文数据库 信息科技辑》, no. 1, 15 December 2011 (2011-12-15) * |
Also Published As
Publication number | Publication date |
---|---|
CN114328600B (en) | 2024-11-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||