WO2020073687A1

WO2020073687A1 - Columnar storage method and apparatus for streaming data, device, and storage medium

Info

Publication number: WO2020073687A1
Application number: PCT/CN2019/092893
Authority: WO
Inventors: 陈俊峰
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2018-10-11
Filing date: 2019-06-26
Publication date: 2020-04-16
Anticipated expiration: 2021-04-11
Also published as: CN109542889B; CN109542889A

Abstract

The present application relates to the field of streaming data storage, in particular, to a columnar storage method and apparatus for streaming data, a device, and a storage medium. The columnar storage method for streaming data comprises: reading data from a real-time message system to obtain data to be processed; parsing the data to be processed to obtain structured data; converting the structured data into data in Row format, and every time a group of structured data is converted into data in Row format, storing the data in Row format into a memory; and forming multiple rows of data in Row format stored in the memory into data in Dataset<Row> format, and writing same to a file system in a columnar storage format. According to the present application, streaming data in a real-time message system is processed by means of Spark Streaming, thereby solving the problem that streaming data in a real-time message system cannot be stored in a columnar storage format at present, greatly improving the speed of subsequent processing of a large amount of data, and reducing the time for converting a row storage structure into a columnar storage structure.

Description

Streaming data column storage method, device, equipment and storage medium

本申请要求于2018年10月11日提交中国专利局、申请号为201811182661.5、发明名称为“流式数据列存储方法、装置、设备和存储介质”的中国专利申请的优先权，其全部内容通过引用结合在申请中。This application requires the priority of the Chinese patent application submitted to the China Patent Office on October 11, 2018, with the application number 201811182661.5 and the invention titled "Streaming Data Column Storage Method, Apparatus, Equipment, and Storage Media", all of which are approved by The reference is incorporated in the application.

Technical field

本申请涉及流式数据存储领域，尤其涉及一种流式数据列存储方法、装置、设备和存储介质。The present application relates to the field of streaming data storage, and in particular to a method, device, device, and storage medium for streaming data column storage.

Background technique

近年来，随着互联网的迅猛发展，数据的快速增长成了许多行业共同面临的机遇与挑战。在当今网络环境下，大量数据源是实时的，不间断的，要求对用户的响应时间也是实时的。这些数据以流式的形式被采集、计算与查询，如实时消息系统，对流入的数据均采取流式方式处理。其每时每刻都有各式各样的、海量的网络数据流入，流入速度各异，且数据结构复杂多样，包括二进制文件、文本文件、压缩文件等。对于此类系统，需要底层存储系统能够支持：对流入的数据以统一格式存储，对上层应用提供统一接口，方便检索，并且对实时性也有一定要求。In recent years, with the rapid development of the Internet, the rapid growth of data has become a common opportunity and challenge faced by many industries. In today's network environment, a large number of data sources are real-time and uninterrupted, and the response time to users is also required to be real-time. These data are collected, calculated and queried in a streaming form, such as a real-time messaging system, and all incoming data are processed in a streaming manner. It has various and massive network data inflows at all times, with different inflow speeds, and complex and diverse data structures, including binary files, text files, and compressed files. For such systems, the underlying storage system needs to support: store incoming data in a unified format, provide a unified interface for upper-layer applications, facilitate retrieval, and have certain requirements for real-time performance.

针对现今的大数据趋势，涌现了一批大数据处理平台，比如kafka，flume等。具体为前置应用把消息通过流式的方式输入到消息队列中，然后消息队列再通过某种形式把这些数据写入到磁盘，比如hdfs，或者本地磁盘。In response to today's big data trends, a number of big data processing platforms have emerged, such as kafka and flume. Specifically, the front-end application streams the messages into the message queue, and then the message queue writes these data to the disk in some form, such as hdfs, or the local disk.

发明人意识到由于实时消息系统的流式处理形式，使得消息最终都是以行存储的形式写入磁盘，比如json，或者普通文本。而在大数据处理中，很多情况下需要数据以列存储的形式进行保存，这时候传统的flume等工具就无法满足需求。The inventor realized that due to the streaming form of the real-time messaging system, the messages are eventually written to disk in the form of line storage, such as json, or ordinary text. In big data processing, in many cases, the data needs to be stored in the form of column storage. At this time, traditional tools such as flume cannot meet the needs.

发明内容Summary of the invention

有鉴于此，有必要针对现有实时消息系统中的数据均是以行存储的形式写入文件系统，而不是以列存储的形式写入文件系统，提供一种流式数据列存储方法、装置、设备和存储介质。In view of this, it is necessary to provide a method and device for streaming data column storage for the data in the existing real-time messaging system is written to the file system in the form of row storage, not in the file system , Equipment and storage media.

一种流式数据列存储方法，包括如下步骤：A streaming data column storage method includes the following steps:

从实时消息系统中读取数据，得到待处理数据；Read the data from the real-time messaging system to get the data to be processed;

对所述待处理数据进行解析，得到结构化数据；Analyze the data to be processed to obtain structured data;

将所述结构化数据转换为Row格式数据，每将一组所述结构化数据转换为Convert the structured data to Row format data, each time a set of the structured data is converted to

Row格式数据后，即存入内存中；After the Row format data is stored in the memory;

将所述内存中存入的多行所述Row格式数据组成Dataset<Row>格式数据，Combine multiple rows of Row format data stored in the memory into a Dataset <Row> format data,

通过列存储的格式写入文件系统。Write to the file system in a column storage format.

在其中一个实施例中，所述从实时消息系统中读取数据，得到待处理数据，包括：In one of the embodiments, the reading data from the real-time messaging system to obtain the data to be processed includes:

获取所述实时消息系统的访问权限，并连接到所述实时消息系统；Obtain the access authority of the real-time messaging system and connect to the real-time messaging system;

设定执行周期，按照所述执行周期从所述实时消息系统中读取数据。Set an execution cycle, and read data from the real-time messaging system according to the execution cycle.

在其中一个实施例中，所述对所述待处理数据进行解析，得到结构化数据，包括对所述待处理数据的格式进行判断后，按照判断结果采用不同的方法进行解析，具体包括：In one of the embodiments, the parsing the data to be processed to obtain structured data includes judging the format of the data to be processed, and using different methods for parsing according to the judgment result, specifically including:

若所述待处理数据为json格式，则调用FastJSON将所述json格式的待处理数据解析为所述结构化数据；If the data to be processed is in json format, FastJSON is called to parse the data to be processed in json format into the structured data;

若所述待处理数据为csv格式，则根据所述待处理数据的内容，并通过DataFrame()方法给所述csv格式的待处理数据添加结构化信息，得到所述结构化数据。If the to-be-processed data is in the csv format, then according to the content of the to-be-processed data, and using the DataFrame () method to add structured information to the to-be-processed data in the csv format to obtain the structured data.

在其中一个实施例中，所述将所述内存中存入的多行所述Row格式数据组成Dataset<Row>格式数据，通过列存储的格式写入文件系统，包括：In one embodiment, the rows of the Row format data stored in the memory form a Dataset <Row> format data, and are written into the file system in a column storage format, including:

通过数据框架的方法将多行所述Row格式数据组成所述Dataset<Row>格式数据；Form the Dataset <Row> format data in multiple rows of the Row format data through the data frame method;

通过parquet()将所述Dataset<Row>格式数据转换为parquet格式数据，并使用spark.read()将parquet格式数据写入文件系统。Convert the Dataset <Row> format data to parquet format data through parquet (), and write the parquet format data to the file system using spark.read ().

在其中一个实施例中，所述设定执行周期，按照所述执行周期从所述实时消息系统中读取数据，包括：In one of the embodiments, the setting an execution cycle, and reading data from the real-time messaging system according to the execution cycle includes:

从所述实时消息系统中第一条数据所在位置开始读取；Start reading from the location of the first piece of data in the real-time messaging system;

接收读取完毕的指令，停止读取，并记录读取完毕的位置；Receive the completed reading instruction, stop reading, and record the completed reading position;

获取上次读取完毕的位置，从上次读取完毕的位置开始读取，直到接收到读取完毕的指令，停止读取，并记录读取完毕的位置。Obtain the position where the last reading was completed, start reading from the position where the last reading was completed, until the instruction to complete the reading is received, stop reading, and record the position where the reading was completed.

在其中一个实施例中，所述若所述待处理数据为json格式，则调用FastJSON将所述json格式的待处理数据解析为所述结构化数据，包括：In one of the embodiments, if the to-be-processed data is in json format, calling FastJSON to parse the to-be-processed data in json format into the structured data includes:

提取所述json格式的待处理数据的字段信息；Extract the field information of the data to be processed in the json format;

根据所述字段信息对所述json格式的待处理数据进行排序，得到所述结构化数据。Sort the data to be processed in the json format according to the field information to obtain the structured data.

在其中一个实施例中，所述将所述内存中存入的多行所述Row格式数据组成Dataset<Row>格式数据，通过列存储的格式写入文件系统之后，还包括：In one of the embodiments, the rows of the Row format data stored in the memory form a Dataset <Row> format data, and after being written to the file system in a column storage format, the method further includes:

根据所述待处理数据的列信息对存储路径进行分割；Divide the storage path according to the column information of the data to be processed;

调用partitionBy()函数，将所述待处理数据中列名相同的列，按照所述列中不同的值存储于不同目录。The partitionBy () function is called, and the columns with the same column names in the data to be processed are stored in different directories according to different values in the columns.

一种流式数据列存储装置，包括如下模块：A streaming data column storage device includes the following modules:

数据获取模块，设置为从实时消息系统中读取数据，得到待处理数据；The data acquisition module is set to read data from the real-time messaging system to obtain data to be processed;

数据解析模块，设置为对所述待处理数据进行解析，得到结构化数据；A data analysis module configured to analyze the data to be processed to obtain structured data;

数据转换模块，设置为将所述结构化数据转换为Row格式数据，每将一组所述结构化数据转换为Row格式数据后，即存入内存中；The data conversion module is configured to convert the structured data into Row format data, and each time a group of the structured data is converted into Row format data, it is stored in the memory;

数据存储模块，设置为将所述内存中存入的多行所述Row格式数据组成Dataset<Row>格式数据，通过列存储的格式写入文件系统。The data storage module is configured to compose multiple rows of Row format data stored in the memory into a Dataset <Row> format data, and write the data into a file system in a column storage format.

一种计算机设备，包括存储器和处理器，所述存储器中存储有计算机可读指令，所述计算机可读指令被一个或多个所述处理器执行时，使得一个或多个所述处理器执行上述流式数据列存储方法的步骤。A computer device includes a memory and a processor, and the memory stores computer-readable instructions. When the computer-readable instructions are executed by one or more of the processors, the one or more of the processors are executed The steps of the above streaming data column storage method.

一种存储有计算机可读指令的存储介质，所述计算机可读指令被一个或多个处理器执行时，使得一个或多个所述处理器执行上述流式数据列存储方法的步骤。A storage medium storing computer-readable instructions, when the computer-readable instructions are executed by one or more processors, causes the one or more processors to perform the steps of the above streaming data column storage method.

本申请通过Spark Streaming对实时消息系统中的流式数据进行处理，解决了当前无法把实时消息系统中的流式数据保存为列存储格式的问题，极大地提高了后续对大量数据处理的速度，也节省了把行存储结构转换为列存储结构的时间，使用Spark Streaming作为计算框架，极大地利用分布式计算提高了转换和存储性能。This application uses Sparking to process the streaming data in the real-time messaging system, which solves the problem that the streaming data in the real-time messaging system cannot be saved as a column storage format, which greatly improves the speed of subsequent processing of large amounts of data. It also saves the time to convert the row storage structure to the column storage structure. Using Spark Streaming as the computing framework, it greatly uses distributed computing to improve the conversion and storage performance.

BRIEF DESCRIPTION

附图仅用于示出优选实施方式的目的，而并不认为是对本申请的限制。The drawings are only for the purpose of showing the preferred embodiments, and are not considered to limit the present application.

图1为本申请的一种流式数据列存储方法的整体流程图；FIG. 1 is an overall flowchart of a streaming data column storage method of this application;

图2为本申请的一种流式数据列存储方法中的数据获取过程的示意图；2 is a schematic diagram of a data acquisition process in a streaming data column storage method of this application;

图3为本申请的一种流式数据列存储方法中的数据解析过程的示意图；3 is a schematic diagram of a data parsing process in a streaming data column storage method of this application;

图4为本申请的一种流式数据列存储方法中的数据存储过程的示意图；4 is a schematic diagram of a data storage process in a streaming data column storage method of this application;

图5为本申请的一种流式数据列存储装置的结构图。FIG. 5 is a structural diagram of a streaming data column storage device of the present application.

detailed description

本技术领域技术人员可以理解，除非特意声明，这里使用的单数形式“一”、“一个”、“所述”和“该”也可包括复数形式。应该进一步理解的是，本申请的说明书中使用的措辞“包括”是指存在所述特征、整数、步骤、操作、元件和/或组件，但是并不排除存在或添加一个或多个其他特征、整数、步骤、操作、元件、组件和/或它们的组。Those skilled in the art can understand that unless specifically stated, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the specification of this application refers to the presence of the described features, integers, steps, operations, elements and / or components, but does not exclude the presence or addition of one or more other features, Integers, steps, operations, elements, components, and / or their groups.

图1为本申请的一种流式数据列存储方法的整体流程图，如图1所示，一种流式数据列存储方法，包括以下步骤：FIG. 1 is an overall flowchart of a streaming data column storage method of the present application. As shown in FIG. 1, a streaming data column storage method includes the following steps:

步骤S1，从实时消息系统中读取数据，得到待处理数据。Step S1: Read data from the real-time messaging system to obtain data to be processed.

其中，本申请的一种流式数据列存储装置主要包括Spark Streaming程序，本申请主要依托Spark Streaming对实时消息系统的流式数据进行处理，从而实现将实时消息系统中的流式数据转换为列存储的形式写入文件系统。实时消息系统中的数据为流式数据，实时消息系统也是流式数据的处理组件。Among them, a streaming data column storage device of the present application mainly includes the Spark Streaming program. The present application mainly relies on Spark Streaming to process the streaming data of the real-time messaging system, thereby realizing the conversion of the streaming data in the real-time messaging system into columns The storage format is written to the file system. The data in the real-time messaging system is streaming data, and the real-time messaging system is also a processing component of streaming data.

其中，Spark Streaming包括数据获取模块、数据解析模块、数据转换模块和数据存储模块。Among them, Spark Streaming includes data acquisition module, data analysis module, data conversion module and data storage module.

上述步骤执行时，数据获取模块的其中一个子模块每隔一段时间发出数据获取指令，另一个子模块则接收上述数据获取指令，接收到数据获取指令后，执行指令，从实时消息系统中读取数据，得到待处理数据。When the above steps are executed, one of the sub-modules of the data acquisition module issues a data acquisition instruction at regular intervals, and the other sub-module receives the data acquisition instruction. After receiving the data acquisition instruction, it executes the instruction and reads it from the real-time messaging system Data to get the data to be processed.

步骤S2，对所述待处理数据进行解析，得到结构化数据。Step S2: Analyze the data to be processed to obtain structured data.

上述步骤执行时，数据获取模块将得到的待处理数据发送给数据解析模块，数据解析模块对待处理数据进行解析，并根据待处理数据的不同格式采用不同的方法进行解析。从实时消息系统中获取的数据，其数据结构复杂多样，包括二进制文件、文本文件、压缩文件等各种格式的数据。数据解析模块接收到不同格式的待处理数据后，采用不同方法进行解析，最终将待处理数据统一解析为结构化数据，再发送给数据转换模块。When the above steps are executed, the data acquisition module sends the obtained data to be processed to the data parsing module. The data parsing module parses the data to be processed, and uses different methods to parse according to different formats of the data to be processed. The data obtained from the real-time messaging system has a complex and diverse data structure, including binary files, text files, compressed files and other data in various formats. After receiving the data to be processed in different formats, the data parsing module adopts different methods for parsing, and finally parses the data to be processed into structured data, and then sends it to the data conversion module.

步骤S3，将所述结构化数据转换为Row格式数据，每将一组所述结构化数据转换为Row格式数据后，即存入内存中。Step S3, the structured data is converted into Row format data, and each time a group of the structured data is converted into Row format data, it is stored in the memory.

上述步骤执行时，数据转换模块将数据解析模块发来的结构化数据转换为Row格式数据，暂时存入数据存储模块中。When the above steps are performed, the data conversion module converts the structured data sent from the data analysis module to Row format data, and temporarily stores it in the data storage module.

在其中一个优选的实施例中，通过spark.createRow()将解析后的结构化数据转换为Row格式数据。In one of the preferred embodiments, the parsed structured data is converted into Row format data through spark.createRow ().

其中，Row格式是Spark Streaming自带的一种格式，并且Row格式为一种带列信息的数据结构，其实质为一行数据。Among them, the Row format is a format that comes with Spark Streaming, and the Row format is a data structure with column information, which is essentially a row of data.

步骤S4，将所述内存中存入的多行所述Row格式数据组成Dataset<Row>格式数据，通过列存储的格式写入文件系统。In step S4, a plurality of rows of Row format data stored in the memory are combined into a Dataset <Row> format data, and written into the file system in a column storage format.

上述步骤执行时，经过数据转换模块转换好的Row格式数据暂时存入数据存储模块，每隔一段时间，把累加的多行Row格式数据组成Dataset<Row>格式数据，一次性以列存储的格式写入文件系统。When the above steps are executed, the Row format data converted by the data conversion module is temporarily stored in the data storage module, and every once in a while, the accumulated multi-row Row format data is composed into a Dataset <Row> format data, which is stored in a column format at a time Write to the file system.

其中，Dataset<Row>格式是Spark Streaming自带的一种格式，Dataset<Row>格式是很多行Row格式组成的矩阵,是Row格式数据的有序集合，Dataset<Row>即为列信息结构，将Dataset<Row>格式的数据转换为列存储的格式即是数据以列的形式存在。Among them, the Dataset <Row> format is a format that comes with Spark Streaming. The Dataset <Row> format is a matrix composed of many row Row formats. It is an ordered collection of Row format data. Dataset <Row> is a column information structure. Converting Dataset <Row> format data to column storage format means that the data exists in the form of columns.

本实施例，通过Spark Streaming对实时消息系统中的流式数据进行解析处理，将每一条数据转换为Spark Streaming中的Row格式数据，并将多行Row格式数据合并累加暂时放到数据存储模块中，再组成Dataset<Row>格式一次性写入文件系统，解决了当前流式数据无法列存储的问题，使用Spark Streaming作为计算框架，提高了数据的转换和存储性能。In this embodiment, the streaming data in the real-time messaging system is parsed through Spark Streaming, and each piece of data is converted into Row format data in Spark Streaming, and the multiple rows of Row format data are combined and temporarily placed in the data storage module , And then form the Dataset <Row> format one-time write file system, which solves the problem that current streaming data cannot be stored in columns. Using Spark Streaming as a computing framework, improves the data conversion and storage performance.

在一个实施例中，图2为本申请的一种流式数据列存储方法中的数据获取过程的示意图，如图2所示，一种流式数据列存储方法的数据获取过程，包括如下步骤：In one embodiment, FIG. 2 is a schematic diagram of a data acquisition process in a streaming data column storage method of the present application. As shown in FIG. 2, a data acquisition process of a streaming data column storage method includes the following steps :

步骤S101，获取所述实时消息系统的访问权限，并连接到所述实时消息系统。Step S101: Acquire the access right of the real-time messaging system and connect to the real-time messaging system.

上述步骤执行时，通过使用远程连接权限的用户名和密码获取所述实时消息系统的访问权限，并通过Hibernate对象关系映射框架与所述实时消息系统进行连接。When the above steps are performed, the access rights of the real-time messaging system are obtained by using the user name and password of the remote connection authority, and connected to the real-time messaging system through the Hibernate object relationship mapping framework.

步骤S102，设定执行周期，按照所述执行周期从所述实时消息系统中读取数据。Step S102: Set an execution cycle, and read data from the real-time messaging system according to the execution cycle.

上述步骤执行时，设定Spark Streaming程序的执行周期，并将执行周期作为参数值传入Spark Streaming程序。When the above steps are executed, the execution cycle of the Spark Streaming program is set, and the execution cycle is passed into the Spark Streaming program as the parameter value.

在其中一个优选的实施例中，也可以将执行周期作为固定值写在程序中，设定在Spark Streaming程序的配置参数中。In one of the preferred embodiments, the execution cycle may also be written as a fixed value in the program and set in the configuration parameters of the Spark Streaming program.

在其中一个优选的实施例中，执行周期可以设置为每次读取时间间隔相同，也可以根据实时消息系统数据流入速度的快慢设置为读取时间间隔不相同。In one of the preferred embodiments, the execution cycle may be set to be the same for each read time interval, or may be set to be different for the read time interval according to the speed of the data inflow rate of the real-time messaging system.

本实施例，将执行周期作为参数值传入Spark Streaming程序中，比较灵活，将执行周期作为固定值写在程序中，可以确保数值的安全性较高，而且执行周期可以根据实时消息系统数据流入速度的快慢灵活设置。In this embodiment, the execution cycle is passed into the Spark Streaming program as a parameter value, which is more flexible. Writing the execution cycle as a fixed value in the program can ensure high value security, and the execution cycle can be based on the real-time messaging system data inflow The speed can be set flexibly.

在一个实施例中，图3为本申请的一种流式数据列存储方法中的数据解析过程的示意图，如图3所示，一种流式数据列存储方法的数据解析过程，包括如下步骤：In one embodiment, FIG. 3 is a schematic diagram of a data parsing process in a streaming data column storage method of the present application. As shown in FIG. 3, a data parsing process of a streaming data column storage method includes the following steps :

步骤S201，若所述待处理数据为json格式，则调用FastJSON将所述json格式的待处理数据解析为所述结构化数据。Step S201, if the data to be processed is in json format, FastJSON is called to parse the data to be processed in json format into the structured data.

上述步骤执行时，若从实时消息系统中获取的数据为json格式，使用相关库进行对其进行解析，在其中一个优选的实施例中，使用FastJSON对其进行解析。When the above steps are performed, if the data obtained from the real-time messaging system is in json format, it is parsed using a related library. In one of the preferred embodiments, it is parsed using FastJSON.

具体的，若从实时消息系统中获取的数据为{"id"：0，"name"："Alice"，"age"：21}的json格式数据，其结构包括3个字段，分别为id、name和age，分别代表id、姓名和年龄。使用Fastjson对其进行解析之后，则会解析为一个包含id，name，age的结构化数据，之后，再将解析之后的数据转换为Spark Streaming自带的Row格式数据。Specifically, if the data obtained from the real-time messaging system is {"id": 0, "name": "Alice", "age": 21} json format data, its structure includes 3 fields, namely id, name and age represent id, name and age, respectively. After using Fastjson to parse it, it will be parsed into a structured data containing id, name, and age, and then the parsed data will be converted to Row format data that comes with Spark Streaming.

步骤S202，若所述待处理数据为csv格式，则根据所述待处理数据的内容，并通过DataFrame()方法给所述csv格式的待处理数据添加结构化信息，得到所述结构化数据。Step S202, if the data to be processed is in the csv format, then according to the content of the data to be processed, and using the DataFrame () method to add structured information to the data to be processed in the csv format to obtain the structured data.

其中，不同于json和avro等格式的数据，csv格式的数据一般只包含数据信息，不包含结构信息。如上述步骤S201中提到的{"id"：0，"name"："Alice"，"age"：21}的json格式数据，如果是csv格式，则其数据内容只有0，Alice，21。这样格式的数据，无法通过数据内容确定每一列所表示的意思，需要根据用户对数据的认知，设定第一列为id，第二列为姓名，第三列为年龄，即根据数据的内容自行添加结构化信息，将数据解析为结构化数据，再将数据转换为Row格式数据。Among them, unlike data in json and avro formats, data in csv format generally contains only data information, not structure information. As mentioned in the above step S201, {"id": 0, "name": "Alice", "age": 21} json format data, if it is in csv format, its data content is only 0, Alice, 21. In this format of data, the meaning of each column cannot be determined by the content of the data. It is necessary to set the first column as id, the second column as the name, and the third column as the age according to the user's perception of the data. The content adds structured information by itself, parses the data into structured data, and then converts the data to Row format data.

上述步骤执行时，通过spark.createDataFrame(RowJavaRDD,type)方法来给数据添加结构化信息。其中，RowJavaRDD指的是数据信息，type为结构信息。When the above steps are executed, the spark.createDataFrame (RowJavaRDD, type) method is used to add structured information to the data. Among them, RowJavaRDD refers to data information, and type is structural information.

本实施例，对不同格式的数据采用不同的解析方法，使数据统一解析为结构化数据，再将数据转换为Row格式数据，节省了数据处理的时间，且提高了数据处理的准确度。In this embodiment, different analysis methods are used for data in different formats, so that the data is uniformly parsed into structured data, and then the data is converted into Row format data, which saves data processing time and improves the accuracy of data processing.

在一个实施例中，图4为本申请的一种流式数据列存储方法中的数据存储过程的示意图，如图4所示，一种流式数据列存储方法的数据存储过程，包括如下步骤：In one embodiment, FIG. 4 is a schematic diagram of a data storage process in a streaming data column storage method of the present application. As shown in FIG. 4, a data storage process of a streaming data column storage method includes the following steps :

步骤S301，通过数据框架的方法将多行所述Row格式数据组成所述Dataset<Row>格式数据。In step S301, multiple rows of the Row format data are combined into the Dataset <Row> format data by a data frame method.

在其中一个优选的实施例中，使用spark.createDataFrame(RowJavaRDD,type)将多行Row格式数据组成Dataset<Row>格式数据，其中，RowJavaRDD表示数据信息，type表示结构信息。In one of the preferred embodiments, spark.createDataFrame (RowJavaRDD, type) is used to compose multiple rows of Row format data into Dataset <Row> format data, where RowJavaRDD represents data information and type represents structure information.

步骤S302，通过parquet()将所述Dataset<Row>格式数据转换为parquet格式数据，并使用spark.read()将parquet格式数据写入文件系统。Step S302: Convert the Dataset <Row> format data to parquet format data through parquet (), and write the parquet format data to the file system using spark.read ().

上述步骤执行时，使用spark.read().parquet(filename)将Dataset<Row>格式数据以parquet格式，写入文件系统。When the above steps are executed, use spark.read (). Parquet (filename) to write Dataset <Row> format data to the file system in parquet format.

上述步骤执行时，如果要以parquet格式进行列存储，使用parquet()将Dataset<Row>格式数据转换为parquet格式，具体的，使用parquet(filename)将Dataset<Row>格式数据转换为parquet格式，parquet是一种支持列式存储的文件格式。When the above steps are performed, if you want to store columns in parquet format, use parquet () to convert Dataset <Row> format data to parquet format. Specifically, use parquet (filename) to convert Dataset <Row> format data to parquet format. Parquet is a file format that supports columnar storage.

上述步骤执行时，使用spark.read()将转换后的parquet格式的数据，写入文件系统。When the above steps are executed, use spark.read () to write the converted parquet format data to the file system.

本步骤中，还可以通过其他列存储格式将数据写入文件系统。In this step, data can also be written to the file system through other column storage formats.

文件系统包括本地文件(file://)和HDFS(hdfs://)，亦可包括其他spark所支持的其他文件系统，比如亚马逊S3(s3://)。一般是通过文件名来制定，比如要写入hdfs根目录下的data文件夹，则可以设置为hdfs:///data/。The file system includes local files (file: //) and HDFS (hdfs: //), and can also include other file systems supported by other sparks, such as Amazon S3 (s3: //). It is generally formulated by the file name. For example, if you want to write to the data folder in the root directory of hdfs, you can set it to hdfs: /// data /.

本实施例，通过使用spark.createDataFrame(RowJavaRDD,type)将多行Row格式数据组成Dataset<Row>格式数据，使流式数据转换为列数据结构，为后续数据列存储打好基础。使用spark.read()将组成的Dataset<Row>格式的数据写入文件系统，实现了流式数据以列存储的格式写入文件系统。In this embodiment, by using spark.createDataFrame (RowJavaRDD, type), multiple rows of Row format data are combined into a Dataset <Row> format data, so that the streaming data is converted into a column data structure, which lays a solid foundation for subsequent data column storage. Use spark.read () to write the data in the format of Dataset <Row> to the file system, and realize that the streaming data is written to the file system in the format of column storage.

在一个实施例中，按照所述执行周期从所述实时消息系统中读取数据，包括如下具体步骤：In one embodiment, reading data from the real-time messaging system according to the execution cycle includes the following specific steps:

从所述实时消息系统中第一条数据所在位置开始读取。Start reading from the location of the first piece of data in the real-time messaging system.

接收读取完毕的指令，停止读取，并记录读取完毕的位置。Receive the read completed command, stop reading, and record the completed position.

上述步骤执行时，当程序为首次读取数据时，则从实时消息系统中第一条数据所在位置开始读取，直到读取时产生的最新数据读取完，此时，会接收到读取完毕的指令，则停止读取，Spark Streaming自动记录下读取完毕的位置。When the above steps are executed, when the program is to read data for the first time, it will start reading from the location of the first piece of data in the real-time messaging system until the latest data generated during reading is read. At this time, the read will be received After the completed instruction, the reading is stopped, and Spark Streaming automatically records the position where the reading is completed.

其中，首次读取数据是指第一次启动程序时的读取，Spark Streaming程序为永久运行，如果不暂停，可以一直运行下去。由于实时消息系统的数据是源源不断的被写入的，所以每次数据读取完毕时，由Spark Streaming记录下每次读取完毕的位置，以便下次读取。Among them, the first reading data refers to the reading when the program is started for the first time. The Spark Streaming program runs permanently. If it is not suspended, it can continue to run. Because the data of the real-time messaging system is written continuously, every time the data is read, Spark Streaming records the location of each read for the next reading.

以后每次读取数据时，获取上次读取完毕的位置，从上次读取完毕的位置开始读取，直到接收到读取完毕的指令，停止读取，并记录读取完毕的位置。Each time the data is read in the future, the position where the last reading was completed is obtained, and the reading is started from the position where the last reading was completed, until the instruction to read the completion is received, the reading is stopped, and the position where the reading is completed is recorded.

本实施例，每次读取完毕，都会记录下读取完毕的位置，便于下次读取，且不易出错，提高了数据获取的速度和质量。In this embodiment, each time the reading is completed, the position where the reading is completed is recorded, which is convenient for the next reading, and is not easy to make mistakes, which improves the speed and quality of data acquisition.

在一个实施例中，调用FastJSON将所述json格式的待处理数据解析为所述结构化数据，包括如下具体步骤：In one embodiment, calling FastJSON to parse the to-be-processed data in the json format into the structured data includes the following specific steps:

从实时消息系统中获取的数据为{"age"：21，"id"：0，"name"："Alice"，}的json格式数据，使用FastJSON将数据的字段信息提取出来，分别为age、id、和name，分别代表年龄、id和姓名。再根据字段信息对待处理数据进行排序，比如，排好序的数据结构为{"id"，"name"，"age"}，则{"id"，"name"，"age"}即为结构化数据。The data obtained from the real-time messaging system is {"age": 21, "id": 0, "name": "Alice",} json format data, using FastJSON to extract the field information of the data, respectively, age, id, and name represent age, id, and name, respectively. Then sort the data to be processed according to the field information. For example, if the sorted data structure is {"id", "name", "age"}, then {"id", "name", "age"} is the structure化数据。 Data.

在一个实施例中，数据转换模块根据需要决定是否对解析后的结构化数据进行修改。若存储时需要按照年月日进行存储，若从实时消息系统中获取的数据含有时间戳，如“2017-09-21 08:16:05.011”，而存储时需要按照年月日进行存储，则需要把时间戳中的年月日信息提取出来。In one embodiment, the data conversion module decides whether to modify the parsed structured data as needed. If the storage needs to be stored according to the year, month and day, if the data obtained from the real-time messaging system contains a time stamp, such as "2017-09-21 08: 16: 05.011", and the storage needs to be stored according to the year, month and day, then The year, month, and day information in the time stamp needs to be extracted.

在一个实施例中，可以根据待处理数据的列信息对存储路径进行分割，通过partitionBy()函数，将所述待处理数据中列名相同的列，按照所述列中不同的值存储于不同目录。上述步骤执行时，通过spark.read().partitionBy()对存储路径进行分割，比如参数填写为newDf.write().mode(SaveMode.Append).partitionBy("stream","year","month","day","hour").orc("orc")，指的是根据stream，year，month，day字段来进行路径的分割。In one embodiment, the storage path may be divided according to the column information of the data to be processed, and the column with the same column name in the data to be processed may be stored in different columns according to different values in the column through the partitionBy () function table of Contents. When the above steps are executed, the storage path is divided by spark.read (). PartitionBy (), for example, fill in the parameters as newDf.write (). Mode (SaveMode.Append) .partitionBy ("stream", "year", "month "," day "," hour "). orc (" orc ") refers to the path segmentation based on the stream, year, month, and day fields.

partitionBy是分析性函数的一部分，它和聚合函数groupBy不同的地方在于它能返回一个分组中的多条记录，而聚合函数一般只有一条反映统计值的记录，partitionBy用于给结果集分组，如果没有指定那么它把整个结果集作为一个分组，partitionBy返回的是分组里的每一条数据，并且可以对分组数据进行排序操作。PartitionBy is part of the analytical function. It differs from the aggregate function groupBy in that it can return multiple records in a group. The aggregate function generally has only one record that reflects the statistical value. partitionBy is used to group the result set. Specify that it treats the entire result set as a group, and partitionBy returns each piece of data in the group, and can sort the grouped data.

本实施例，通过使用partitionBy函数实现了对存储路径的分割，方便了后续对大量数据的处理。In this embodiment, the partitioning of the storage path is realized by using the partitionBy function, which facilitates the subsequent processing of a large amount of data.

一种流式数据列存储装置，如图5所示，包括如下模块：A streaming data column storage device, as shown in FIG. 5, includes the following modules:

在一个实施例中，提出了一种计算机设备，包括存储器和处理器，存储器中存储有计算机可读指令，计算机可读指令被一个或多个处理器执行时，使得一个或多个处理器执行计算机可读指令时实现上述各实施例中所述的流式数据列存储方法的步骤。In one embodiment, a computer device is proposed, which includes a memory and a processor. The memory stores computer-readable instructions. When the computer-readable instructions are executed by one or more processors, the one or more processors are executed. The computer-readable instructions implement the steps of the streaming data column storage method described in the foregoing embodiments.

在一个实施例中，提出了一种存储有计算机可读指令的存储介质，计算机可读指令被一个或多个处理器执行时，使得一个或多个处理器执行上述各实施例中所述的流式数据列存储方法的步骤。其中，所述存储介质可以为非易失性存储介质。In one embodiment, a storage medium storing computer-readable instructions is provided. When the computer-readable instructions are executed by one or more processors, the one or more processors execute the above-mentioned embodiments. Steps of streaming data column storage method. Wherein, the storage medium may be a non-volatile storage medium.

本领域普通技术人员可以理解上述实施例的各种方法中的全部或部分步骤是可以通过程序来指令相关的硬件来完成，该程序可以存储于一计算机可读存储介质中，存储介质可以包括：只读存储器(ROM，Read Only Memory)、随机存取存储器(RAM，Random Access Memory)、磁盘或光盘等。Those of ordinary skill in the art may understand that all or part of the steps in the various methods of the above embodiments may be completed by instructing relevant hardware through a program. The program may be stored in a computer-readable storage medium, and the storage medium may include: Read only memory (ROM, Read Only Memory), random access memory (RAM, Random Access Memory), magnetic disk or optical disk, etc.

以上所述实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above-mentioned embodiments can be arbitrarily combined. To simplify the description, all possible combinations of the technical features in the above-mentioned embodiments are not described. However, as long as there is no contradiction in the combination of these technical features, All should be considered within the scope of this description.

以上所述实施例仅表达了本申请一些示例性实施例，其描述较为具体和详细，但并不能因此而理解为对本申请专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本申请构思的前提下，还可以做出若干变形和改进，这些都属于本申请的保护范围。因此，本申请专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only express some exemplary embodiments of the present application, and their descriptions are more specific and detailed, but they should not be construed as limiting the patent scope of the present application. It should be noted that, for those of ordinary skill in the art, without departing from the concept of the present application, a number of modifications and improvements can also be made, which all fall within the protection scope of the present application. Therefore, the protection scope of the patent of this application shall be subject to the appended claims.

Claims

A streaming data column storage method includes the following steps:

Read the data from the real-time messaging system to get the data to be processed;

Analyze the data to be processed to obtain structured data;

Convert the structured data into Row format data, and each time a set of the structured data is converted into Row format data, it is stored in the memory;

A plurality of rows of the Row format data stored in the memory are combined into a Dataset <Row> format data, which is written into the file system in a column storage format.

The streaming data column storage method according to claim 1, said reading data from a real-time messaging system to obtain data to be processed includes:

Obtain the access authority of the real-time messaging system and connect to the real-time messaging system;

Set an execution cycle, and read data from the real-time messaging system according to the execution cycle.

According to the streaming data column storage method according to claim 1, the parsing the data to be processed to obtain structured data includes determining the format of the data to be processed and using different methods according to the judgment result The analysis includes:

If the data to be processed is in json format, FastJSON is called to parse the data to be processed in json format into the structured data;

If the to-be-processed data is in the csv format, then according to the content of the to-be-processed data, and using the DataFrame () method to add structured information to the to-be-processed data in the csv format to obtain the structured data.

The streaming data column storage method according to claim 1, wherein the rows of the Row format data stored in the memory form a Dataset <Row> format data, and are written into a file system in a column storage format, including :

Form the Dataset <Row> format data in multiple rows of the Row format data through the data frame method;

Convert the Dataset <Row> format data to parquet format data through parquet (), and write the parquet format data to the file system using spark.read ().

The streaming data column storage method according to claim 2, wherein the set execution cycle, and reading data from the real-time messaging system according to the execution cycle includes:

Start reading from the location of the first piece of data in the real-time messaging system;

Receive the completed reading instruction, stop reading, and record the completed reading position;

Obtain the position where the last reading was completed, start reading from the position where the last reading was completed, until the instruction to complete the reading is received, stop reading, and record the position where the reading was completed.

According to the streaming data column storage method of claim 3, if the to-be-processed data is in json format, calling FastJSON to parse the to-be-processed data in json format into the structured data includes:

Extract the field information of the data to be processed in the json format;

Sort the data to be processed in the json format according to the field information to obtain the structured data.

According to the streaming data column storage method according to claim 1, the rows of the Row format data stored in the memory form a Dataset <Row> format data, and are written into the file system in a column storage format, Also includes:

Divide the storage path according to the column information of the data to be processed;

The partitionBy () function is called, and the columns with the same column names in the data to be processed are stored in different directories according to different values in the columns.

A streaming data column storage device includes the following modules:

The data acquisition module is set to read data from the real-time messaging system to obtain data to be processed;

A data analysis module configured to analyze the data to be processed to obtain structured data;

The data conversion module is configured to convert the structured data into Row format data, and each time a group of the structured data is converted into Row format data, it is stored in the memory;

The data storage module is configured to compose multiple rows of Row format data stored in the memory into a Dataset <Row> format data, and write the data into a file system in a column storage format.

A computer device includes a memory and a processor, and the memory stores computer-readable instructions. When the computer-readable instructions are executed by one or more of the processors, the one or more of the processors are executed The following steps:

Analyze the data to be processed to obtain structured data;

The computer device of claim 9, when the computer-readable instructions are executed by one or more of the processors, causing the one or more of the processors to further perform the following steps:

The computer device of claim 10, when the computer-readable instructions are executed by one or more of the processors, causing the one or more of the processors to further perform the following steps:

The computer device of claim 11, when the computer-readable instructions are executed by one or more of the processors, causing the one or more of the processors to further perform the following steps:

Extract the field information of the data to be processed in the json format;

A storage medium storing computer-readable instructions, which when executed by one or more processors, causes the one or more processors to perform the following steps:

Analyze the data to be processed to obtain structured data;

The storage medium of claim 16, when the computer-readable instructions are executed by one or more processors, causing the one or more processors to execute the following steps:

The storage medium of claim 16, when the computer-readable instructions are executed by one or more processors, causing the one or more processors to perform further steps as follows:

The storage medium of claim 17, when the computer-readable instructions are executed by one or more processors, causing the one or more processors to execute the following steps: