CN107193494B

CN107193494B - An RDD Persistence Method Based on SSD and HDD Hybrid Storage System

Info

Publication number: CN107193494B
Application number: CN201710358093.9A
Authority: CN
Inventors: 陆克中; 黄泽成; 毛睿; 廖好; 朱金彬; 隋秀峰
Original assignee: Shenzhen University
Current assignee: Baode Network Security System Shenzhen Co ltd
Priority date: 2017-05-19
Filing date: 2017-05-19
Publication date: 2020-05-12
Anticipated expiration: 2037-05-19
Also published as: CN107193494A

Abstract

The invention provides an RDD (Resilient Distributed Dataset) persistence method based on an SSD (solid-state drive) and HDD (hard disk drive) hybrid storage system, which comprises the following steps: the RDD module transmits the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager; the disk block manager transmits the preset persistence level to a device adapter; the device adapter receives the preset persistence level of the data and reads two directory management variables in a configuration file, matches the preset persistence level with a temporary file directory in the corresponding directory management variable according to the preset persistence level of the data, and returns the temporary file directory obtained by matching to the disk block manager; the disk block manager obtains a file name according to the block identifier, obtains a data storage address according to the temporary file directory obtained by matching and the file name, and returns the data storage address to the block manager; and the block manager stores the data in the RDD module in the SSD or the HDD according to the data storage address.

Description

RDD (Resilient Distributed Dataset) persistence method based on an SSD (solid-state drive) and HDD (hard disk drive) hybrid storage system

Technical Field

The invention relates to the technical field of data processing, in particular to an RDD (Resilient Distributed Dataset) persistence method based on an SSD (solid-state drive) and HDD (hard disk drive) hybrid storage system.

Background

In the current big data era, how to manage and analyze massive data and extract valuable information from it within an acceptable time has become a problem that urgently needs to be solved. However, big data, whether in scale, variety, or structure, poses a significant challenge to people's ability to manage data.

Spark is a big data computing framework that is efficient and widely used in industry; it is a general-purpose, fast engine for large-scale data processing. First, Spark provides a unified solution that can be used for complex tasks such as interactive queries, real-time stream processing, and machine learning. Second, Spark divides stages and tasks on the basis of the resilient distributed dataset (RDD), optimizes the execution order of subtasks through an efficient directed acyclic graph (DAG) execution engine, and greatly improves data processing efficiency through in-memory computation. Third, Spark data management relies on multiple data sources such as HDFS and Hive, and Spark in cluster mode scales horizontally to support large-scale data processing. The RDD is the most important concept distinguishing Spark from other big data computing frameworks: it is a read-only distributed dataset with a highly fault-tolerant mechanism. In a Spark application, each RDD is divided into a plurality of partitions, and Spark performs operations on the RDD in units of partitions. Persisting (Persist) an RDD caches the data of its partitions in memory or on a hard disk, so that the intermediate results of a computing task can be read directly by subsequent iterative tasks; repeated computation is avoided and data processing efficiency is greatly improved. In addition, persisting data to the hard disk removes the limit that insufficient memory capacity places on the size of the data set, which enables Spark to process large data.

However, at present, the initial RDD dataset is divided according to a random proportion, and the persistence framework provided by Spark persists data to different storage media according to the proportion, so that persistence on demand cannot be realized.

Disclosure of Invention

The invention aims to solve the technical problem that on-demand persistence cannot be realized in the prior art, and provides an RDD persistence method based on an SSD and HDD hybrid storage system that can realize on-demand persistence.

The embodiment of the invention provides an RDD (Resilient Distributed Dataset) persistence method based on an SSD (solid-state drive) and HDD (hard disk drive) hybrid storage system, which comprises the following steps:

the RDD module transmits the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;

the block manager transmits the block identifier and a preset persistence level to a disk block manager;

the disk block manager transmits the preset persistence level to a device adapter;

the device adapter receives the preset persistence level of the data and reads two directory management variables in a configuration file, matches the preset persistence level with a temporary file directory in the corresponding directory management variable according to the preset persistence level of the data, and returns the temporary file directory obtained by matching to the disk block manager;

the disk block manager obtains a file name according to the block identifier, obtains a data storage address according to the temporary file directory and the file name obtained by matching, and returns the data storage address to the block manager;

and the block manager stores the data in the RDD module in the SSD or the HDD according to the data storage address.

The invention also provides a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method.

Compared with the prior art, the technical scheme of the invention has the following beneficial effect: the data in the RDD module is stored in the SSD or the HDD at the data storage address determined by the preset persistence level, so that on-demand persistence for Spark applications is realized.

Drawings

FIG. 1 is a block diagram of one embodiment of a distributed computing system according to the present invention.

FIG. 2 is a flow chart of one embodiment of a data processing method of the distributed computing system of the present invention.

FIG. 3 is a flow chart of one embodiment of the RDD persistence method based on the SSD and HDD hybrid storage system of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting the invention.

Specifically, the emergence of the solid-state drive (SSD) brings a new opportunity for improving the performance of storage systems; the SSD has advantages such as low power consumption, low latency, and small size. Unlike the conventional hard disk drive (HDD), which addresses data by moving a mechanical arm, the SSD is built entirely on semiconductor chips and therefore has good random-access performance. However, because SSDs are expensive and have a limited lifespan, completely replacing HDDs with SSDs would significantly increase costs for the industry. In order to make reasonable use of the high performance of SSDs and the low price of HDDs, heterogeneous data centers based on hybrid storage of SSDs and HDDs are widely researched and applied.

As shown in fig. 1, the distributed computing system according to an embodiment of the present invention includes a Spark platform module 1 and a hybrid storage module 2, where the hybrid storage module 2 includes an SSD unit 21 and an HDD unit 22, and the Spark platform module 1 is connected to the SSD unit 21 and the HDD unit 22 respectively;

the Spark platform module 1 uses the big data processing framework Spark as its calculation engine and sends the processed data to the SSD unit 21 or the HDD unit 22 for storage; the Spark platform module 1 is further configured to receive a query instruction, and to fetch and output the data corresponding to the query instruction from the SSD unit 21 or the HDD unit 22.

The Spark platform module is respectively connected with the SSD unit and the HDD unit, so that the processed data is sent to the SSD unit or the HDD unit for storage, and accurate mapping and storage of the data can be realized.

In a specific implementation, the Spark platform module 1 includes a first API (application programming interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22. The Spark platform module 1 is connected to the SSD unit 21 through the first API and to the HDD unit 22 through the second API, so as to perform data transmission. The Spark platform module 1 may expose the structural features of the hybrid storage system to the user through the first API and the second API. The storage medium is selected by calling the first API or the second API; that is, calling the first API stores the data in the SSD unit 21, and calling the second API stores the data in the HDD unit 22.
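As a minimal sketch of how two interfaces could expose the choice of medium, the following Python fragment models the first and second APIs as two methods of one object. The class name `HybridStore`, the method names `persist_to_ssd` and `persist_to_hdd`, and the use of plain directories to stand in for the SSD unit and the HDD unit are assumptions made for illustration; none of them is taken from the embodiment or from Spark.

```python
import os


class HybridStore:
    """Hypothetical stand-in for a platform module with one API per storage unit."""

    def __init__(self, ssd_dir: str, hdd_dir: str) -> None:
        # Each directory is assumed to be mounted on the corresponding device.
        self.ssd_dir = ssd_dir
        self.hdd_dir = hdd_dir

    def _write(self, directory: str, name: str, payload: bytes) -> str:
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            handle.write(payload)
        return path

    def persist_to_ssd(self, name: str, payload: bytes) -> str:
        """First API (assumed name): store the data in the SSD unit."""
        return self._write(self.ssd_dir, name, payload)

    def persist_to_hdd(self, name: str, payload: bytes) -> str:
        """Second API (assumed name): store the data in the HDD unit."""
        return self._write(self.hdd_dir, name, payload)
```

The caller, not the storage layer, decides the medium here: choosing which method to call is the whole selection mechanism.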

In a specific implementation, the SSD unit 21 and the HDD unit 22 are persistent storage units in the same layer. The processed data specifically includes RDD partition data. The Spark platform module is further used for persisting the RDD partition data into the SSD unit or the HDD unit according to a preset partition proportion value.
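The embodiment does not say how the preset partition proportion value is applied. One possible reading, sketched below under that stated assumption, is that a fraction of the partitions goes to the SSD unit and the remainder to the HDD unit. The function name `split_by_proportion` and the rule of taking the first partitions for the SSD are hypothetical.

```python
from typing import List, Sequence, Tuple


def split_by_proportion(
    partition_ids: Sequence[int], ssd_fraction: float
) -> Tuple[List[int], List[int]]:
    """Assign the first floor(n * ssd_fraction) partitions to the SSD, the rest to the HDD."""
    if not 0.0 <= ssd_fraction <= 1.0:
        raise ValueError("ssd_fraction must lie in [0, 1]")
    ids = list(partition_ids)
    cut = int(len(ids) * ssd_fraction)  # rounds down
    return ids[:cut], ids[cut:]
```

For four partitions and a proportion of 0.25, one partition is assigned to the SSD and three to the HDD.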

In a specific implementation, the Spark platform module 1 is further configured to persist the RDD partition data into the SSD unit or the HDD unit according to the heat of the RDD partition data, that is, how frequently the data is accessed. An SSD can effectively increase I/O bandwidth and reduce access latency, while HDDs still provide substantial storage efficiency for data with lower performance requirements. A large amount of the data collected and captured by a data center is seldom accessed afterwards; this is called cold data, and it accounts for about 90% of global data. The remaining 10% is accessed frequently after being collected and captured and is called hot data. Clearly, storing all of the data on high-performance, low-latency storage devices is unreasonable and prohibitively expensive. Therefore, by combining the SSD unit 21 and the HDD unit 22 in a reasonable manner according to the heat of the RDD partition data, the hybrid storage system can greatly improve performance while keeping cost under control.

In a specific implementation, the distributed computing system further includes a capacity monitoring module connected to the hybrid storage module 2. The capacity monitoring module is configured to monitor the remaining capacity of the hybrid storage module 2 and to output alarm information when the remaining capacity is smaller than a preset threshold. The specific value of the preset threshold can be determined according to the capacity of the hybrid storage module 2, and the alarm information can be output by, for example, controlling a loudspeaker to sound or an alarm lamp to flash. When the remaining capacity of the hybrid storage module 2 is too low, the alarm reminds a worker to transfer the stored data or replace a storage hard disk in time, which improves the reliability of data storage.
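The threshold check performed by the capacity monitoring module can be sketched as follows. This is an illustration only: the function names, the per-unit check (the embodiment speaks of the remaining capacity of the hybrid storage module as a whole), and the replaceable `probe` parameter are assumptions. `shutil.disk_usage` is the Python standard-library call that reports the free space of the file system holding a path.

```python
import shutil
from typing import Callable, Dict, List


def free_bytes(path: str) -> int:
    """Free space, in bytes, of the file system that holds `path`."""
    return shutil.disk_usage(path).free


def low_capacity_units(
    unit_paths: Dict[str, str],
    threshold_bytes: int,
    probe: Callable[[str], int] = free_bytes,
) -> List[str]:
    """Return the names of the storage units whose free space is below the threshold."""
    return [name for name, path in unit_paths.items() if probe(path) < threshold_bytes]
```

A caller would raise the alarm (sound, lamp, or message) whenever the returned list is not empty. Passing a different `probe` makes the check testable without real devices.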

The present invention also provides a data processing method of a distributed computing system according to an embodiment, as shown in fig. 2, the data processing method includes the following steps:

step S21, the Spark platform module, using the big data processing framework Spark as its calculation engine, sends the processed data to the SSD unit or the HDD unit for storage;

step S22, the Spark platform module receives the query instruction, and acquires data corresponding to the query instruction from the SSD unit or the HDD unit and outputs the data.
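The query side of step S22 can be pictured as a lookup across both storage units. The sketch below assumes, purely for illustration, that each unit is a directory and that the query instruction names a stored file; the function name `fetch` is hypothetical.

```python
import os
from typing import Iterable, Optional


def fetch(name: str, directories: Iterable[str]) -> Optional[bytes]:
    """Return the stored bytes for `name` from the first directory that holds it."""
    for directory in directories:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                return handle.read()
    return None  # the data is in neither unit
```

Listing the SSD directory first means that data present on the faster device is found without touching the HDD.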

In a specific implementation, the data processing method further includes the following step: monitoring the remaining capacity of the hybrid storage module 2 through a capacity monitoring module, and outputting alarm information when the remaining capacity is smaller than a preset threshold. The specific value of the preset threshold can be determined according to the capacity of the hybrid storage module 2, and the alarm information can be output by, for example, controlling a loudspeaker to sound or an alarm lamp to flash. When the remaining capacity of the hybrid storage module 2 is too low, the alarm reminds a worker to transfer the stored data or replace a storage hard disk in time, which improves the reliability of data storage.

In a specific implementation, the RDD partition data is persisted through a call on the RDD. The operation of persisting the RDD is started by the iterator method of the RDD, and fig. 3 shows the persistence flow of the RDD data. In addition, two things are needed to persist RDD partition data: the partition data and an address. The partition data is stored in the RDD module, whereas the address must be computed. The address has the form path/file name. The path is stored in a configuration file and is obtained by mapping the preset persistence level of the partition data onto the configuration file; the file name is generated from the block identifier.
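The address computation described above can be sketched as two small functions: one reads the two directory variables from configuration text, the other joins a directory and a file name. The key names `example.ssd.dir` and `example.hdd.dir`, the `key = value` file format, and the function names are assumptions for illustration; the source does not name the variables or specify the file format.

```python
import posixpath
from typing import Dict

# Hypothetical key names; the source does not name the two variables.
SSD_KEY = "example.ssd.dir"
HDD_KEY = "example.hdd.dir"


def read_directory_variables(config_text: str) -> Dict[str, str]:
    """Parse `key = value` lines and map each persistence level to its directory."""
    values: Dict[str, str] = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return {"SSD_ONLY": values[SSD_KEY], "HDD_ONLY": values[HDD_KEY]}


def storage_address(directory: str, file_name: str) -> str:
    """The address is path/file name."""
    return posixpath.join(directory, file_name)
```

With a configuration that sets the SSD variable to `/mnt/ssd/tmp`, a block whose generated file name is `rdd_3_0` (an example name) and whose level is SSD_ONLY would be stored at `/mnt/ssd/tmp/rdd_3_0`.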

The invention provides an embodiment of an RDD persistence method based on an SSD and HDD hybrid storage system, which realizes the persistence of RDD partition data on the basis of an optimized Spark framework.

In this method, the data in the RDD module is stored in the SSD or the HDD at the data storage address determined by the preset persistence level, so that on-demand persistence for Spark applications is realized. That is, when the preset persistence level is SSD_ONLY, the data in the RDD module is stored in the SSD, and when the preset persistence level is HDD_ONLY, the data in the RDD module is stored in the HDD.

Specifically, as shown in fig. 3, the steps of the persistence method are as follows:

step 1, the RDD module calls the doPutIterator method of the block manager BlockManager through its iterator method, to transmit the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the block manager BlockManager;

step 2, the doPutIterator method of the block manager BlockManager calls the getFile method of the disk block manager DiskBlockManager, and transmits the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the disk block manager DiskBlockManager;

step 3, the getFile method of the disk block manager DiskBlockManager calls the getAccurateDir method of the device adapter DeviceAdapter to transfer the preset persistence level to the device adapter DeviceAdapter;

step 4, the device adapter DeviceAdapter reads two directory management variables in the configuration file, specifically, the two directory management variables include an SSD directory management variable and an HDD directory management variable;

step 5, the device adapter DeviceAdapter matches the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data. That is, the device adapter DeviceAdapter obtains the preset persistence level from the upper layer and the configuration, namely the SSD directory management variable and the HDD directory management variable, from the lower layer, and completes the mapping between the preset persistence level and the temporary file directory: the getAccurateDir method reads the configuration file, which contains the two variables, and then selects between them according to the received preset persistence level. If the preset persistence level is SSD_ONLY, the SSD directory management variable is matched; if the preset persistence level is HDD_ONLY, the HDD directory management variable is matched. At this point the specific storage directory for persisting the RDD data has been obtained;

step 6, the device adapter DeviceAdapter returns the temporary file directory obtained by matching, which is the specific storage directory, to the disk block manager DiskBlockManager;

step 7, the disk block manager DiskBlockManager obtains a file name fileName according to the block identifier blockId, and obtains the data storage address according to the temporary file directory obtained by matching and the fileName. That is, the temporary file directory is the storage path, and the complete address, namely the data storage address, is directory/fileName; the fileName consists of the prefix rdd_ followed by numeric indexes (Index) that increase sequentially;

step 8, the disk block manager DiskBlockManager returns the data storage address to the block manager BlockManager;

step 9, after the block manager BlockManager obtains the data storage address of the RDD, it calls the writeFunc method of the block storage module DiskStore to complete the data storage task.
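The nine steps above can be sketched as a small call chain. The Python classes below borrow the component names from the description (BlockManager, DiskBlockManager, DeviceAdapter, DiskStore), but they are simplified stand-ins written for illustration, not the code of Spark or of the optimized framework. The snake_case method names, the constructor arguments that replace reading a configuration file, and the use of the block identifier itself as the file name are assumptions.

```python
import os
from typing import BinaryIO, Callable, Iterable

SSD_ONLY = "SSD_ONLY"
HDD_ONLY = "HDD_ONLY"


class DeviceAdapter:
    """Maps a persistence level to a temporary file directory (steps 4 to 6)."""

    def __init__(self, ssd_dir: str, hdd_dir: str) -> None:
        # Stand-in for the two directory management variables of the configuration file.
        self._dirs = {SSD_ONLY: ssd_dir, HDD_ONLY: hdd_dir}

    def get_accurate_dir(self, level: str) -> str:
        if level not in self._dirs:
            raise ValueError(f"unsupported persistence level: {level!r}")
        return self._dirs[level]


class DiskBlockManager:
    """Builds the data storage address from directory and file name (steps 3, 7, 8)."""

    def __init__(self, adapter: DeviceAdapter) -> None:
        self._adapter = adapter

    def get_file(self, block_id: str, level: str) -> str:
        directory = self._adapter.get_accurate_dir(level)
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, block_id)  # address = directory/fileName


class DiskStore:
    """Writes a block to the address it is given (step 9)."""

    def put(self, path: str, write_func: Callable[[BinaryIO], None]) -> None:
        with open(path, "wb") as out:
            write_func(out)


class BlockManager:
    """Entry point called from the RDD side (steps 1 and 2)."""

    def __init__(self, disk_block_manager: DiskBlockManager, disk_store: DiskStore) -> None:
        self._disk_block_manager = disk_block_manager
        self._disk_store = disk_store

    def do_put_iterator(self, block_id: str, records: Iterable[bytes], level: str) -> str:
        path = self._disk_block_manager.get_file(block_id, level)

        def write_func(out: BinaryIO) -> None:
            for record in records:
                out.write(record)

        self._disk_store.put(path, write_func)
        return path
```

The point of the layering is that only the device adapter knows which directory belongs to which device; the block manager and the disk store handle an address without knowing whether it lies on the SSD or the HDD.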

In a specific implementation, the RDD persistence method further comprises the following steps:

judging whether the heat of the data in the RDD module is greater than a first preset value;

if so, the preset persistence level of the data in the RDD module is SSD_ONLY;

and if not, the preset persistence level of the data in the RDD module is HDD_ONLY.

That is, the preset persistence level of the data is set according to the heat of the data in the RDD partition, so that the SSD unit 21 and the HDD unit 22 are combined in a reasonable manner; the hybrid storage system can thus greatly improve performance while keeping cost under control.
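The heat test reduces to a single comparison, sketched below. How heat is measured is not specified in the description; the example assumes, hypothetically, that it is an access count per partition. The function names are illustrative.

```python
from typing import Dict


def choose_persistence_level(heat: float, first_preset_value: float) -> str:
    """Return SSD_ONLY for data hotter than the threshold, HDD_ONLY otherwise."""
    return "SSD_ONLY" if heat > first_preset_value else "HDD_ONLY"


def levels_for(access_counts: Dict[str, int], first_preset_value: float) -> Dict[str, str]:
    """Choose a level for every partition from its (assumed) access count."""
    return {
        block_id: choose_persistence_level(count, first_preset_value)
        for block_id, count in access_counts.items()
    }
```

Because the comparison is strict, data whose heat equals the first preset value is persisted to the HDD.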

That is, by means of the optimized Spark persistence framework, on-demand persistence of Spark data is achieved. Furthermore, the user can call an SSD persistence-oriented API provided by the optimized Spark framework to persist the partition data of the high-heat RDD into the SSD, so that the Spark performance is effectively improved.

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of fig. 3 described above.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, those skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided that they do not contradict one another.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method for RDD persistence based on SSD and HDD hybrid storage system, characterized in that: the method comprises the following steps:

The RDD module transmits the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager, where the preset persistence level is SSD_ONLY or HDD_ONLY;

The block manager passes the block identifier and the preset persistence level to the disk block manager;

the disk block manager passing the preset persistence level to the device adapter;

The device adapter receives the preset persistence level of the data and reads two directory management variables in the configuration file, the two directory management variables including the SSD directory management variable and the HDD directory management variable; according to the preset persistence level of the data, the device adapter matches the preset persistence level with the temporary file directory in the corresponding directory management variable, and returns the temporary file directory obtained by matching to the disk block manager;

The disk block manager obtains a file name according to the block identifier, and obtains a data storage address according to the temporary file directory and the file name obtained by matching, and returns the data storage address to the block manager;

The block manager stores the data in the RDD module in the SSD or HDD according to the data storage address.

2. The RDD persistence method of claim 1, wherein the step in which the RDD module transmits the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager is specifically:

The RDD module calls the doPutIterator method of the block manager through the Iterator method to transfer the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager.

3. The RDD persistence method according to claim 1, wherein the step of the block manager passing the block identifier and the preset persistence level to the disk block manager is specifically:

The block manager calls the getFile method of the disk block manager, and transmits the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the disk block manager.

4. The RDD persistence method according to claim 1, wherein the steps in which the disk block manager obtains the file name according to the block identifier and transmits the preset persistence level to the device adapter are specifically:

The disk block manager obtains the file name according to the block identifier by the getFile method;

The disk block manager calls the getAccurateDir method of the device adapter to transfer the preset persistence level to the device adapter.

5. The RDD persistence method according to claim 1, wherein the steps in which the device adapter receives the preset persistence level of the data and reads two directory management variables in the configuration file, matches the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data, and returns the temporary file directory obtained by matching to the disk block manager are specifically:

The device adapter uses the getAccurateDir method to match the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data;

The device adapter returns the matching temporary file directory to the disk block manager through the getAccurateDir method.

6. The RDD persistence method according to claim 5, wherein the two directory management variables include SSD directory management variables and HDD directory management variables.

7. The RDD persistence method according to claim 6, wherein the step in which the device adapter matches, by the getAccurateDir method, the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data is specifically:

When the preset persistence level of the data is SSD_ONLY, the preset persistence level of the data is mapped to the temporary file directory in the SSD directory management variable;

When the preset persistence level of the data is HDD_ONLY, the preset persistence level of the data is mapped to the temporary file directory in the HDD directory management variable.

8. The RDD persistence method according to claim 1, wherein the step of the block manager storing the data in the RDD module in the SSD or the HDD according to the data storage address, comprises:

After the block manager obtains the data storage address of the RDD, it calls the writeFunc method of the block storage module to store the data in the RDD module in the SSD or HDD.

9. The RDD persistence method according to claim 1, characterized in that: the RDD persistence method further comprises the following steps:

Determining whether the heat of the data in the RDD module is greater than the first preset value;

If so, the preset persistence level of the data in the RDD module is SSD_ONLY;

If not, the preset persistence level of the data in the RDD module is HDD_ONLY.

10. A computer-readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the method according to any one of claims 1-9 are implemented.