CN114490612A

CN114490612A - Data self-cleaning method, device, electronic device and storage medium

Info

Publication number: CN114490612A
Application number: CN202210107401.1A
Authority: CN
Inventors: 刘立力; 顾超
Original assignee: Hunan MgtvCom Interactive Entertainment Media Co Ltd
Current assignee: Hunan MgtvCom Interactive Entertainment Media Co Ltd
Priority date: 2022-01-28
Filing date: 2022-01-28
Publication date: 2022-05-13

Abstract

The invention provides a data self-cleaning method, device, electronic device and storage medium. For a target data table and target data to be processed, field information of the target data table is obtained. Because the field information represents the first fields in the target data table and the field order among the first fields, once each piece of data in the target data has been parsed to determine its second fields and their field values, a mapping operation can be performed on the first fields and the second fields according to the field order, thereby determining the field value of each piece of data for each first field. In this way, the fields of the unstructured data are automatically kept in the same order as the fields in the data table, which guarantees the accuracy of the data. Even if fields are later added to the data table, the added fields can be determined in real time from the field information and kept consistent with the fields in the data table. Cleaning is therefore self-adaptive, which greatly reduces the manual checking workload and improves efficiency.

Description

Data self-cleaning method, device, electronic device and storage medium

Technical field

The invention relates to the technical field of data cleaning in big data ETL (Extract-Transform-Load), and more particularly to a data self-cleaning method, device, electronic device and storage medium.

Background

For a big data Hive data warehouse, data cleaning is the first step in building the warehouse. Its basic function is to convert unstructured data into structured data according to the fields of the data tables in the Hive data warehouse (i.e. Hive tables), providing the rawest data for subsequent analysis and statistics. In data cleaning, automatically and accurately loading the fields of unstructured data into the corresponding fields of a Hive table is the most important step.

At present, data cleaning mostly parses unstructured data by means of manually specified UDF (User Defined Function) output fields. This requires the UDF output fields to correspond one-to-one with the fields in the data table, in the field order of the data table; once a field is misplaced, the data of the entire data table is wrong. However, a large Hive data warehouse may contain hundreds of data tables, and each data table may have hundreds of fields. Checking the UDF output fields therefore requires a great deal of manual work, and whenever fields are later added to a data table, modifying the UDF output fields and validating the data again is also a considerable workload.

Summary of the invention

In view of this, to solve the above problems, the present invention provides a data self-cleaning method, device, electronic device and storage medium. The technical solutions are as follows:

A data self-cleaning method, the method comprising:

determining a target data table and target data to be processed, wherein the target data is unstructured data corresponding to the target data table and contains at least one piece of data;

acquiring field information corresponding to the target data table, wherein the field information can represent the first fields in the target data table and the field order among the first fields;

for each piece of data in the target data, performing a parsing operation on the piece of data to determine the second fields in the piece of data and the field values corresponding to the second fields; and performing a mapping operation on the first fields and the second fields according to the field order, to determine the field values of the piece of data for the first fields.

Preferably, acquiring the field information corresponding to the target data table comprises:

obtaining a metadata information table corresponding to the target data table, wherein the metadata information table contains at least ordered field metadata information;

reading the ordered field metadata information in sequence, to determine the field matched by the field metadata information currently read;

writing the determined fields in sequence into a predetermined ordered field list, wherein the ordered field list is used to store fields, and the order of the fields in it is the same as the order of the field metadata information in the metadata information table.

Preferably, the method further comprises:

outputting the ordered field list.

Preferably, performing the parsing operation on the piece of data comprises:

extracting the key-value pairs in the piece of data, wherein the key of a key-value pair represents a field and the value of the key-value pair represents a field value;

establishing the correspondence between the field and the field value in each key-value pair.

Preferably, performing the mapping operation on the first fields and the second fields according to the field order comprises:

determining, according to the field order, the target field among the second fields that matches each first field;

determining the field value corresponding to the target field, and writing the determined field values in sequence into a predetermined ordered data list, wherein the ordered data list is used to store field values, and the order of the field values in it is the same as the field order.

Preferably, the method further comprises:

outputting the ordered data list.

A data self-cleaning device, the device comprising:

a determination module, configured to determine a target data table and target data to be processed, wherein the target data is unstructured data corresponding to the target data table and contains at least one piece of data;

an acquisition module, configured to acquire field information corresponding to the target data table, wherein the field information can represent the first fields in the target data table and the field order among the first fields;

a cleaning module, configured to, for each piece of data in the target data, perform a parsing operation on the piece of data to determine the second fields in the piece of data and the field values corresponding to the second fields, and to perform a mapping operation on the first fields and the second fields according to the field order, to determine the field values of the piece of data for the first fields.

Preferably, the acquisition module is specifically configured to:

obtain a metadata information table corresponding to the target data table, wherein the metadata information table contains at least ordered field metadata information; read the ordered field metadata information in sequence, to determine the field matched by the field metadata information currently read; and write the determined fields in sequence into a predetermined ordered field list, wherein the ordered field list is used to store fields, and the order of the fields in it is the same as the order of the field metadata information in the metadata information table.

An electronic device, comprising at least one memory and at least one processor, wherein the memory stores a program, the processor calls the program stored in the memory, and the program is used to implement the data self-cleaning method described above.

A storage medium storing computer-executable instructions, wherein the computer-executable instructions are used to execute the data self-cleaning method described above.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The present invention provides a data self-cleaning method, device, electronic device and storage medium. For a target data table and target data to be processed, the field information of the target data table can be obtained. Because the field information represents the first fields in the target data table and the field order among them, after a parsing operation on each piece of data in the target data has determined its second fields and their field values, a mapping operation can be performed on the first fields and the second fields according to the field order, thereby determining the field value of each piece of data for each first field. Based on the present invention, the fields of unstructured data are automatically kept in the same order as the fields in the data table, which guarantees the accuracy of the data. Even if fields are later added to the data table, the added fields can be determined in real time from the field information and kept consistent with the fields in the data table. Cleaning is therefore self-adaptive, which greatly reduces the manual checking workload and improves efficiency.

Description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a flowchart of the data self-cleaning method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of part of the data self-cleaning method provided by an embodiment of the present invention;

FIG. 3 is a flowchart of another part of the data self-cleaning method provided by an embodiment of the present invention;

FIG. 4 is a flowchart of yet another part of the data self-cleaning method provided by an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of the data self-cleaning device provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

To make the above objects, features and advantages of the present invention easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

An embodiment of the present invention provides a data self-cleaning method. The flowchart of the method is shown in FIG. 1, and the method includes the following steps:

S10: determine a target data table and target data to be processed, where the target data is unstructured data corresponding to the target data table and contains at least one piece of data.

In this embodiment of the present invention, the target data table is the data table to be processed in the Hive data warehouse, i.e. the Hive table to be processed. The target data is the unstructured data to be processed that corresponds to the target data table. It may be business data or log data, which is not limited in this embodiment.

In addition, the target data contains one or more pieces of data, and each piece of data in the target data corresponds to one data record in the target data table. That is, after data cleaning, each piece of data in the target data has a corresponding data record in the target data table; the record contains the field values of that piece of data under each field of the target data table, and any two data records in the target data table have the same fields.

It should be noted that the data self-cleaning solution in this embodiment can be applied to a Hive UDF. A Hive UDF is the UDF (User Defined Function) mentioned above, i.e. a user-defined Hive function. When the functions built into Hive cannot fully meet business requirements, user-defined functions are needed to handle them. Similar to plug-ins, they allow processing methods to be customized in the Hive data warehouse based on the business logic to be handled.

S20: acquire field information corresponding to the target data table, where the field information can represent the first fields in the target data table and the field order among the first fields.

In this embodiment of the present invention, the field information corresponding to the target data table can be acquired. The field information represents the fields in the target data table (i.e. the first fields) and the field order among the first fields.

For example, suppose the target data table is a user data table containing three fields, "name", "age" and "gender". Following the order in which a user reads the table, for instance from left to right, the three fields are "name", "age" and "gender" in turn, so the field order among the three fields in the user data table is "name→age→gender".

Thus, acquiring the field information of the user data table makes it possible to determine, on the one hand, that the table contains the three fields "name", "age" and "gender", and on the other hand the order among those three fields, namely "name→age→gender".

In some embodiments, the field information of a data table can be obtained by automatically reading the Hive metadata information. In a specific implementation, step S20, "acquire field information corresponding to the target data table", may use the following steps, whose flowchart is shown in FIG. 2:

S201: obtain a metadata information table corresponding to the target data table, where the metadata information table contains at least ordered field metadata information.

In this embodiment of the present invention, when the target data table is created through DDL, its field information is stored in the corresponding metadata information table. Because the fields in the target data table are ordered, the part of the field information corresponding to each field (i.e. the field metadata information) is also stored in the metadata information table in order.

It should be noted that DDL (data definition language) is the definition language used to create Hive tables. DDL is mainly used for initialization work such as defining or changing the structure of a table (TABLE), its data types, and the links and constraints between tables, and is mostly used when creating tables. Continuing with the user data table, suppose the three fields "name", "age" and "gender" are identified by "name", "age" and "sex" respectively, where the "name" field identified by "name" is of character type ("string"), the "age" field identified by "age" is of numeric type ("number"), and the "gender" field identified by "sex" is of character type ("string"). The corresponding DDL is then "create table user(name string,age number,sex string)".

Continuing with the DDL "create table user(name string,age number,sex string)" as an example, this DDL creates the user data table, and the field metadata information of the three fields "name", "age" and "gender" is stored in the metadata information table corresponding to the user data table in the field order "name→age→gender". Specifically, as shown in the following table, the identifier "name" and its character type "string" are stored first, then the identifier "age" and its numeric type "number", and finally the identifier "sex" and its character type "string". In other words, the metadata information table corresponding to the user data table contains the field metadata information of three fields in order: the identifier "name" and character type "string" for the "name" field, the identifier "age" and numeric type "number" for the "age" field, and the identifier "sex" and character type "string" for the "gender" field.

It should be noted that, in the following table, "table" denotes the data table, "field" denotes the field, and "type" denotes the data type.

table    field    type
user     name     string
user     age      number
user     sex      string

It should also be noted that, in the above example, the field metadata information of each field contains only the data type, and accordingly the metadata information table contains only the field identifier and the field data type. This is merely an example. In practice, the metadata information table stores data describing the attributes of the data table, including but not limited to the field identifiers, the field data types, and the storage location of the data table.
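The relationship between the DDL and the ordered field metadata can be illustrated with a minimal Python sketch. It is not how Hive itself records metadata: `parse_ddl_columns` is a hypothetical toy helper that handles only the flat "create table t(col type, ...)" form used in the example, and the list of dictionaries merely stands in for the metadata information table.

```python
import re


def parse_ddl_columns(ddl):
    """Return (table, [(field, type), ...]) for a flat CREATE TABLE statement.

    Hypothetical helper: it handles only the simple form used in the example
    and is not a general Hive DDL parser.
    """
    match = re.fullmatch(
        r"\s*create\s+table\s+(\w+)\s*\((.*)\)\s*", ddl, re.IGNORECASE
    )
    if match is None:
        raise ValueError("unsupported DDL: " + ddl)
    table, body = match.group(1), match.group(2)
    columns = []
    for column in body.split(","):
        name, data_type = column.split()
        columns.append((name, data_type))
    return table, columns


table, columns = parse_ddl_columns(
    "create table user(name string,age number,sex string)"
)
# Rows of the stand-in metadata information table, kept in DDL order.
metadata_rows = [{"table": table, "field": f, "type": t} for f, t in columns]
```

Because the rows are appended while walking the column list from left to right, their order is the field order "name→age→sex" of the DDL.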

S202: read the ordered field metadata information in sequence, to determine the field matched by the field metadata information currently read.

In this embodiment of the present invention, the metadata information table corresponding to the target data table contains at least ordered field metadata information, i.e. the field metadata information of a single field or of multiple ordered fields. The main purpose of reading the metadata information table is therefore to read the field metadata information of the target data table: based on information about the target data table (such as its identifier), the UDF can read the field metadata information of each field from the metadata information table corresponding to the target data table.

Moreover, because the field metadata information is stored in the metadata information table in order when the target data table is created through DDL, the field metadata information read from the metadata information table is also ordered. The field matched by the field metadata information currently read can therefore be determined. Continuing with the user data table, its metadata information table stores the field metadata information of the three fields "name", "age" and "gender" in turn, so the field corresponding to the field metadata information currently read can be determined. For example, if the identifier "name" and the character type "string" are currently read, the matching field is "name".

S203: write the determined fields in sequence into a predetermined ordered field list, where the ordered field list is used to store fields, and the order of the fields in it is the same as the order of the field metadata information in the metadata information table.

In this embodiment of the present invention, once the field matching the field metadata information currently read has been determined, the field can be stored in the ordered field list. The order of the fields in the ordered field list is the same as the order of the field metadata information in the metadata information table. By automatically reading the Hive metadata information, the fields of the unstructured data are thus kept in the same order as the fields in the data table. When fields are later added to the data table, no further field checking is needed, so the process is self-adaptive and efficiency improves.
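Steps S201 to S203 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the in-memory list `metadata_rows` stands in for the metadata information table, `build_ordered_field_list` is a hypothetical name, and the added "occupation" column is only an illustrative field extension.

```python
def build_ordered_field_list(metadata_rows, table):
    """Read ordered field metadata and collect the matching fields in order."""
    ordered_fields = []
    for row in metadata_rows:  # S202: rows are read in their stored order
        if row["table"] == table:
            ordered_fields.append(row["field"])  # S203: append preserves order
    return ordered_fields


# S201: stand-in for the metadata information table of the user data table.
metadata_rows = [
    {"table": "user", "field": "name", "type": "string"},
    {"table": "user", "field": "age", "type": "number"},
    {"table": "user", "field": "sex", "type": "string"},
]
fields_before = build_ordered_field_list(metadata_rows, "user")

# Simulated field extension: the new column appears in the metadata, and the
# ordered field list picks it up without any change to the function.
metadata_rows.append({"table": "user", "field": "occupation", "type": "string"})
fields_after = build_ordered_field_list(metadata_rows, "user")
```

The function never names a column, so the self-adaptive behaviour described above follows directly: the field list changes only because the metadata changed.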

In some other embodiments, the ordered field list may further be output to tell the UDF the output order of each field. This is a required step in a Hive UDF: the ordered field list is passed to the UDF's field output method.

S30: for each piece of data in the target data, perform a parsing operation on the piece of data to determine the second fields in the piece of data and the field values corresponding to the second fields; and perform a mapping operation on the first fields and the second fields according to the field order, to determine the field values of the piece of data for the first fields.

In this embodiment of the present invention, the target data contains at least one piece of data, and each piece of data corresponds to one data record in the target data table. The following piece of data in the target data is taken as an example:

"http://user.info?name=lucky&age=30&sex=female" is one piece of data in the target data. Performing a parsing operation on it determines the fields in the piece of data (i.e. the second fields) and the field values corresponding to the second fields. That is, the piece of data contains the three fields "name", "age" and "gender" together with their field values: "lucky" for the identifier "name", "30" for the identifier "age", and "female" for the identifier "sex".

The second fields in the piece of data can then be mapped to the first fields in the target data table to determine the data record corresponding to the piece of data in the target data table, namely a record whose "name" field has the value "lucky", whose "age" field has the value "30", and whose "gender" field has the value "female".

It should be noted that, when performing the parsing operation on each piece of data in the target data, the storage form of the piece of data may be taken into account to select the corresponding parsing method. For example, log data is mostly stored in text format, mainly in the form k1=v1&k2=v2 or in JSON, and the log data is parsed with the parsing method corresponding to its storage form.
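Selecting a parsing method by storage form might look like the following Python sketch. The detection rule is an assumption made for illustration (a record beginning with "{" is treated as JSON, anything else as k1=v1&k2=v2 text), and `parse_record` is a hypothetical name; the source does not specify how the storage form is recognised.

```python
import json
from urllib.parse import parse_qsl


def parse_record(raw):
    """Parse one raw record into a dict mapping field to field value.

    Assumption: a record starting with "{" is a JSON object; any other record
    is treated as k1=v1&k2=v2 text.
    """
    text = raw.strip()
    if text.startswith("{"):
        # JSON form: normalise keys and values to strings.
        return {str(k): str(v) for k, v in json.loads(text).items()}
    # k1=v1&k2=v2 form: keep fields whose value is empty.
    return dict(parse_qsl(text, keep_blank_values=True))


from_text = parse_record("name=lucky&age=30&sex=female")
from_json = parse_record('{"name": "lucky", "age": 30, "sex": "female"}')
```

Both storage forms yield the same field-to-value map, so the later mapping step does not need to know which form a record arrived in.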

In a specific implementation, "perform a parsing operation on the piece of data" in step S30 may use the following steps, whose flowchart is shown in FIG. 3:

S301: extract the key-value pairs in the piece of data, where the key of a key-value pair represents a field and the value of the key-value pair represents a field value.

In this embodiment of the present invention, unstructured data parsing mainly means parsing log data. A piece of data can be parsed into k, v form, where one k, v group is one key-value pair. For example, the piece of data "http://user.info?name=lucky&age=30&sex=female" contains three key-value pairs: "k=name,v=lucky", "k=age,v=30" and "k=sex,v=female".

S302: establish the correspondence between the field and the field value in each key-value pair.

In this embodiment of the present invention, the key-value pairs extracted from a piece of data can be converted into a map structure, which establishes the correspondence between the field and the field value in each key-value pair.
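For the URL-style record used in the example, steps S301 and S302 can be sketched with Python's standard `urllib.parse` module. The function names `extract_key_value_pairs` and `to_field_map` are hypothetical, and a Python `dict` is used here as the map structure.

```python
from urllib.parse import parse_qsl, urlsplit


def extract_key_value_pairs(record):
    """S301: extract the key-value pairs from a URL-style record."""
    query = urlsplit(record).query
    return parse_qsl(query, keep_blank_values=True)


def to_field_map(pairs):
    """S302: build the field-to-field-value correspondence as a map."""
    return dict(pairs)


pairs = extract_key_value_pairs("http://user.info?name=lucky&age=30&sex=female")
field_map = to_field_map(pairs)
```

Here `pairs` holds the three (k, v) groups in the order they appear in the record, and `field_map` allows each field value to be looked up by its field identifier.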

Further, the second fields and their field values (or the map structure) obtained by the above parsing operation are unordered, and the number of second fields is often greater than the number of first fields in the target data table, while only the first fields are needed by the data table. The field values of the first fields therefore have to be extracted, based on the first fields, from the unordered second fields and their field values. To this end, "perform a mapping operation on the first fields and the second fields according to the field order" in step S30 may use the following steps, whose flowchart is shown in FIG. 4:

S303: determine, according to the field order, the target field among the second fields that matches each first field.

In this embodiment of the present invention, the first field currently to be processed is determined in turn according to the field order among the first fields in the target data table, and the field matching that first field, i.e. the target field, is then determined among the second fields obtained by parsing a piece of data.

Continuing with the user data table as the target data table, assume the field order among its three fields "name", "age" and "gender" is "name→age→gender". Target fields are matched for the three fields "name", "age" and "gender" in turn according to this field order. The following takes the "name" field as the first field being processed:

Suppose a piece of data is "http://user.info?name=lucky&age=30&sex=female&occupation=farmer". Parsing determines that the piece of data contains the four fields "name", "age", "gender" and "occupation" together with their field values: "lucky" for the identifier "name", "30" for the identifier "age", "female" for the identifier "sex", and "farmer" for the identifier "occupation". The number of second fields in this piece of data is clearly greater than the number of first fields in the user data table. By matching the identifier of the first field against the identifiers of the four second fields, the second field that matches the first field can be determined: the "name" field among the first fields matches the "name" field among the second fields.

S304: determine the field value corresponding to the target field, and write the determined field values in turn into an ordered data list that has been determined; the ordered data list is used to store field values, and the order of the field values in it is the same as the field order.

In this embodiment of the present invention, the description again takes the user data table as the target data table and the "name" field as the first field:

After the matching "name" field among the second fields is determined as the target field, the field value of the "name" field in the target data can be written into the ordered data list. In this way, following the field order "name→age→gender" of the target data table, the values "lucky", "30" and "female" from the piece of data "http://user.info?name=lucky&age=30&sex=female&occupation=farmer" are written into the ordered data list in turn, so the order of the data in the ordered data list is consistent with the field order (or with the field order in the ordered field list). Moreover, the values extracted from a piece of data are also obtained in field order, which both ensures that the correct values are retrieved and ensures that the output data follows the field order of the data table exactly, so that no value is output in the wrong position.
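A minimal sketch of building the ordered data list is given below. The function name `build_ordered_data_list`, the `identifiers` mapping and the choice to emit `None` for a first field with no matching second field are assumptions made for illustration; the text above does not specify how a missing field is handled.

```python
def build_ordered_data_list(field_order, parsed, identifiers=None, default=None):
    """Return field values in the table's field order.

    field_order: first fields, in the order defined by the data table.
    parsed:      {second_field_identifier: field_value} for one piece of data.
    identifiers: optional {first_field: identifier used in the raw data}.
    default:     value emitted when no second field matches (an assumption).
    """
    identifiers = identifiers or {}
    ordered_data = []
    for first_field in field_order:
        key = identifiers.get(first_field, first_field)
        ordered_data.append(parsed.get(key, default))
    return ordered_data


parsed = {"name": "lucky", "age": "30", "sex": "female", "occupation": "farmer"}
row = build_ordered_data_list(["name", "age", "gender"], parsed, {"gender": "sex"})
```

Because the loop is driven by the table's field order rather than by the order in which fields appear in the raw data, the output position of each value cannot drift when the raw data reorders or adds fields.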

In some other embodiments, the ordered data list may further be output to inform the UDF of the output order of the field values. This is a necessary step in a hive UDF: the ordered data list is passed to the UDF's processing method, which outputs it via forward. Note that forward is the output function prescribed by hive UDF; to use a UDF, this function must be used as hive UDF prescribes, otherwise the hive UDF cannot be used.

On this basis, after the ordered field list and the ordered data list are passed to the UDF, the UDF can be registered with the hive platform, so that the hive platform calls the UDF to perform the cleaning and writes the data into the data table.

Thus, the data self-cleaning method provided by this embodiment of the present invention reads the metadata information of the data table through a hive UDF and can automatically keep the fields of the unstructured data in the same order as the fields of the data table, ensuring the accuracy of the data. Moreover, when fields are subsequently added to the data table, the added fields can be read in real time and kept consistent with the field order of the data table. The cleaning is therefore self-adaptive, which greatly reduces the workload of manual checking and improves efficiency.
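The self-adaptive behaviour can be illustrated with the sketch below. It is not a hive UDF: `TABLE_METADATA` is a hypothetical in-memory stand-in for the table's metadata, `read_field_order` and `clean` are illustrative names, and for simplicity the table fields are assumed to use the same identifiers as the raw data.

```python
# Hypothetical stand-in for the data table's metadata; a real deployment
# would read the field order from the platform's metadata store instead.
TABLE_METADATA = {"user": ["name", "age", "sex"]}


def read_field_order(table):
    """Return the table's fields in their current order (re-read on every call)."""
    return list(TABLE_METADATA[table])


def clean(table, parsed_record):
    """Emit the record's field values in the table's current field order."""
    return [parsed_record.get(field) for field in read_field_order(table)]


parsed = {"name": "lucky", "age": "30", "sex": "female", "occupation": "farmer"}
before = clean("user", parsed)

# The table is later extended with a new field; the cleaning code is unchanged.
TABLE_METADATA["user"].append("occupation")
after = clean("user", parsed)
```

Since the field order is read from the metadata at cleaning time instead of being hard-coded, the added "occupation" field is picked up without any change to the cleaning logic.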

Based on the data self-cleaning method provided by the above embodiments, an embodiment of the present invention may further provide a device for executing the data self-cleaning method. A schematic structural diagram of the device is shown in FIG. 5. The device includes:

a determination module 10, configured to determine a target data table and target data to be processed, where the target data is unstructured data corresponding to the target data table and contains at least one piece of data;

an acquisition module 20, configured to acquire field information corresponding to the target data table, where the field information represents the first fields in the target data table and the field order among the first fields;

a cleaning module 30, configured to, for each piece of data in the target data, perform a parsing operation on the piece of data to determine the second fields in the piece of data and the field values corresponding to the second fields, and to perform a mapping operation on the first fields and the second fields according to the field order to determine the field values of the piece of data that correspond to the first fields.

Optionally, the acquisition module 20 is specifically configured to:

acquire a metadata information table corresponding to the target data table, where the metadata information table contains at least ordered field metadata information; read the ordered field metadata information in turn to determine the field that matches the field metadata information currently read; and write the determined fields in turn into an ordered field list that has been determined, where the ordered field list is used to store fields and the order of the fields in it is the same as the order of the field metadata information in the metadata information table.
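The steps performed by the acquisition module can be sketched as follows. The layout of a metadata row (a dictionary with hypothetical `field` and `position` keys) is an assumption for illustration; the text above only requires that the field metadata information be ordered. Sorting by position is a defensive step in case the rows are returned out of order.

```python
def build_ordered_field_list(metadata_rows):
    """Build the ordered field list from field metadata rows.

    Each row is assumed to look like {"field": <name>, "position": <int>};
    this layout is hypothetical and depends on the metadata store in use.
    """
    ordered_fields = []
    # Read the field metadata in order and append each matched field in turn.
    for row in sorted(metadata_rows, key=lambda r: r["position"]):
        ordered_fields.append(row["field"])
    return ordered_fields


metadata_rows = [
    {"field": "age", "position": 1},
    {"field": "name", "position": 0},
    {"field": "gender", "position": 2},
]
ordered_fields = build_ordered_field_list(metadata_rows)
```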

Optionally, the acquisition module 20 is further configured to:

output the ordered field list.

Optionally, the cleaning module 30, when performing the parsing operation on the piece of data, is specifically configured to:

extract the key-value pairs in the piece of data, where the key in a key-value pair represents a field and the value in the key-value pair represents a field value; and establish the correspondence between the field and the field value in each key-value pair.

Optionally, the cleaning module 30, when performing the mapping operation on the first fields and the second fields according to the field order, is specifically configured to:

determine, among the second fields and according to the field order, the target field that matches the first field; and determine the field value corresponding to the target field and write the determined field values in turn into an ordered data list that has been determined, where the ordered data list is used to store field values and the order of the field values in it is the same as the field order.

Optionally, the cleaning module 30 is further configured to:

output the ordered data list.

It should be noted that, for the detailed functions of each module in this embodiment of the present invention, reference may be made to the corresponding parts of the method embodiments disclosed above; they are not repeated here.

Based on the data self-cleaning method provided by the above embodiments, an embodiment of the present invention further provides an electronic device. The electronic device includes at least one memory and at least one processor; the memory stores a program, the processor calls the program stored in the memory, and the program is used to implement the data self-cleaning method.

Based on the data self-cleaning method provided by the above embodiments, an embodiment of the present invention further provides a storage medium. The storage medium stores computer-executable instructions, and the computer-executable instructions are used to execute the data self-cleaning method.

The data self-cleaning method, device, electronic device and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar among the embodiments, reference may be made from one to another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

It should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Furthermore, the terms "include", "comprise" and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

The above description of the disclosed embodiments enables a person skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data self-cleaning method, characterized in that the method comprises:

determining a target data table and target data to be processed, wherein the target data is unstructured data corresponding to the target data table and contains at least one piece of data;

acquiring field information corresponding to the target data table, wherein the field information represents a first field in the target data table and a field order among the first fields;

for each piece of data in the target data, performing a parsing operation on the piece of data to determine a second field in the piece of data and a field value corresponding to the second field; and performing a mapping operation on the first field and the second field according to the field order, to determine the field value of the piece of data that corresponds to the first field.

2. The method according to claim 1, wherein the acquiring the field information corresponding to the target data table comprises:

obtaining a metadata information table corresponding to the target data table, where the metadata information table at least contains ordered field metadata information;

reading the ordered field metadata information in turn to determine the field that matches the field metadata information currently read;

writing the determined fields in turn into an ordered field list that has been determined, wherein the ordered field list is used to store fields, and the order of the fields in it is the same as the order of the field metadata information in the metadata information table.

3. The method according to claim 2, wherein the method further comprises:

outputting the ordered field list.

4. The method according to claim 1, wherein the performing a parsing operation on the piece of data comprises:

extracting the key-value pairs in the piece of data, wherein the key in a key-value pair represents a field and the value in the key-value pair represents a field value;

establishing a correspondence between the field and the field value in each key-value pair.

5. The method according to claim 1, wherein the performing a mapping operation on the first field and the second field according to the field order comprises:

determining a target field matching the first field in the second field according to the field order;

determining the field value corresponding to the target field, and writing the determined field values in turn into an ordered data list that has been determined, wherein the ordered data list is used to store field values, and the order of the field values in it is the same as the field order.

6. The method according to claim 5, wherein the method further comprises:

outputting the ordered data list.

7. A data self-cleaning device, wherein the device comprises:

a determination module, configured to determine a target data table to be processed and target data, where the target data is unstructured data corresponding to the target data table and contains at least one piece of data;

an acquisition module, configured to acquire field information corresponding to the target data table, where the field information can represent a first field in the target data table and a field order between the first fields;

a cleaning module, configured to, for each piece of data in the target data, perform a parsing operation on the piece of data to determine a second field in the piece of data and a field value corresponding to the second field; and perform a mapping operation on the first field and the second field according to the field order, to determine the field value of the piece of data that corresponds to the first field.

8. The device according to claim 7, wherein the acquisition module is specifically used for:

acquiring a metadata information table corresponding to the target data table, wherein the metadata information table contains at least ordered field metadata information; reading the ordered field metadata information in turn to determine the field that matches the field metadata information currently read; and writing the determined fields in turn into an ordered field list that has been determined, wherein the ordered field list is used to store fields, and the order of the fields in it is the same as the order of the field metadata information in the metadata information table.

9. An electronic device, characterized in that the electronic device comprises: at least one memory and at least one processor; the memory stores a program, the processor calls the program stored in the memory, and the program is used to implement the data self-cleaning method according to any one of claims 1-6.

10. A storage medium, wherein the storage medium stores computer-executable instructions, and the computer-executable instructions are used to execute the data self-cleaning method according to any one of claims 1-6.