CN107193494A - RDD (Resilient Distributed Dataset) persistence method based on an SSD (solid-state drive) and HDD (hard disk drive) hybrid storage system - Google Patents
- Publication number
- CN107193494A (application number CN201710358093.9A)
- Authority
- CN
- China
- Prior art keywords
- rdd
- data
- persistence
- manager
- default
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- All entries fall under G (Physics), G06 (Computing; calculating or counting), G06F (Electric digital data processing).
- G06F3/064: Management of blocks (under G06F3/0638, Organizing or formatting or addressing of data)
- G06F11/3034: Monitoring arrangements specially adapted to the computing system or computing system component being monitored, where the computing system component is a storage system, e.g. DASD based or network based
- G06F11/325: Display of status information by lamps or LEDs (under G06F11/32, Monitoring with visual or acoustical indication of the functioning of the machine)
- G06F16/182: Distributed file systems (under G06F16/10, File systems; file servers)
- G06F3/0643: Management of files (under G06F3/0638, Organizing or formatting or addressing of data)
- G06F3/068: Hybrid storage device (under G06F3/0668, Interfaces specially adapted for storage systems adopting a particular infrastructure)
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Quality & Reliability (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an RDD persistence method based on an SSD and HDD hybrid storage system, comprising: an RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to a block manager; a disk block manager passes the preset persistence level to a device adapter; the device adapter receives the preset persistence level of the data, reads two directory management variables from a configuration file, matches the preset persistence level to the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager; the disk block manager derives a file name from the block identifier, obtains a data storage address from the matched temporary file directory and the file name, and returns the data storage address to the block manager; the block manager stores the data of the RDD module on the SSD or the HDD according to the data storage address.
Description
Technical field
The invention relates to the technical field of data processing, and in particular to an RDD persistence method based on an SSD and HDD hybrid storage system.
Background
In the current era of big data, managing, analyzing and extracting valuable information from massive data within a useful time frame has become an urgent problem. Whether in scale, variety or structure, big data poses a huge challenge to people's ability to handle data.
Spark is an efficient big data computing framework that is widely used in industry, and a general-purpose, fast engine for large-scale data processing. First, Spark provides a unified solution for complex tasks such as interactive queries, real-time stream processing and machine learning. Second, Spark divides work into stages and tasks by means of the Resilient Distributed Dataset (RDD), optimizes the execution order of subtasks with an efficient directed acyclic graph (DAG) execution engine, and greatly improves processing efficiency through in-memory computing. Third, Spark's data management relies on multiple data sources such as HDFS and Hive, and Spark in cluster mode scales horizontally to support large-scale data processing. The RDD is the most important concept distinguishing Spark from other big data computing frameworks: it is a read-only distributed dataset with a highly fault-tolerant mechanism. In a Spark application, each RDD is divided into multiple partitions, and Spark operates on RDDs in units of partitions. Persisting RDD partition data to memory or disk caches the intermediate results of computing tasks so that subsequent iterative tasks can read them directly, which avoids recomputation and greatly improves processing efficiency. In addition, persisting data to disk removes the limit that insufficient memory capacity places on dataset size, so that Spark can handle big data with ease.
However, the initial RDD dataset is currently split according to a random ratio, and the persistence framework provided by Spark persists the data to different storage media according to that ratio, so on-demand persistence cannot be achieved.
Summary of the invention
The invention aims to solve the technical problem that on-demand persistence cannot be achieved in the prior art, and provides an RDD persistence method based on an SSD and HDD hybrid storage system that can achieve on-demand persistence.
An embodiment of the invention provides an RDD persistence method based on an SSD and HDD hybrid storage system, the method comprising the following steps:
the RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;
the block manager passes the block identifier and the preset persistence level to the disk block manager;
the disk block manager passes the preset persistence level to the device adapter;
the device adapter receives the preset persistence level of the data, reads two directory management variables from the configuration file, matches the preset persistence level to the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;
the disk block manager derives a file name from the block identifier, obtains the data storage address from the matched temporary file directory and the file name, and returns the data storage address to the block manager;
the block manager stores the data of the RDD module on the SSD or the HDD according to the data storage address.
The invention also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the above method are carried out.
Compared with the prior art, the technical solution of the invention has the following beneficial effect: the data of the RDD module is stored on the SSD or the HDD at a data storage address determined by the preset persistence level, which gives Spark applications on-demand persistence.
Description of drawings
FIG. 1 is a schematic structural diagram of an embodiment of the distributed computing system of the invention.
FIG. 2 is a flowchart of an embodiment of the data processing method of the distributed computing system of the invention.
FIG. 3 is a flowchart of an embodiment of the RDD persistence method of the invention based on an SSD and HDD hybrid storage system.
Detailed description
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and should not be construed as limiting it.
Specifically, the emergence of the solid-state drive (SSD) has brought new opportunities for improving storage system performance. SSDs offer low power consumption, low latency and small size. Unlike a traditional enterprise hard disk drive (HDD), which locates data by moving a mechanical arm, an SSD is built entirely on semiconductor chips and therefore has good random-access performance. However, because SSD capacity is expensive and SSD lifetime is limited, replacing HDDs entirely with SSDs would greatly increase cost. To make reasonable use of the high performance of SSDs and the low price of HDDs, heterogeneous data centers based on hybrid SSD and HDD storage have been widely studied and deployed.
A distributed computing system according to one embodiment of the invention, as shown in FIG. 1, includes a Spark platform module 1 and a hybrid storage module 2. The hybrid storage module 2 includes an SSD unit 21 and an HDD unit 22, and the Spark platform module 1 is connected to the SSD unit 21 and to the HDD unit 22.
The Spark platform module 1 uses the big data processing framework Spark as its computing engine and sends the processed data to the SSD unit 21 or the HDD unit 22 for storage. The Spark platform module 1 also receives query instructions, fetches the data corresponding to a query instruction from the SSD unit 21 or the HDD unit 22, and outputs it.
Because the Spark platform module is connected to both the SSD unit and the HDD unit, and the processed data is sent to one or the other for storage, data can be mapped and stored precisely.
In a specific implementation, the Spark platform module 1 includes a first API (application programming interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22. The Spark platform module 1 is connected to the SSD unit 21 through the first API and to the HDD unit 22 through the second API for data transfer. Through the first API and the second API, the Spark platform module 1 can expose the structure of the hybrid storage system to the user. The storage medium is selected by calling the first API or the second API: calling the first API stores data in the SSD unit 21, and calling the second API stores data in the HDD unit 22.
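The two-API arrangement can be sketched as follows. This is a minimal illustration and not the patent's implementation: the class names `StorageApi` and `SparkPlatformModule`, the `write`/`read`/`describe` methods and the temporary directories standing in for the SSD and HDD units are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


class StorageApi:
    """Hypothetical per-medium API: reads and writes blobs under one root directory."""

    def __init__(self, medium: str, root: Path):
        self.medium = medium
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, data: bytes) -> Path:
        path = self.root / name
        path.write_bytes(data)
        return path

    def read(self, name: str) -> bytes:
        return (self.root / name).read_bytes()


class SparkPlatformModule:
    """Holds the first (SSD) and second (HDD) API; choosing an API chooses the medium."""

    def __init__(self, ssd_root: Path, hdd_root: Path):
        self.first_api = StorageApi("SSD", ssd_root)
        self.second_api = StorageApi("HDD", hdd_root)

    def describe(self) -> dict:
        # Expose the structure of the hybrid storage system to the caller.
        return {api.medium: str(api.root) for api in (self.first_api, self.second_api)}


# Temporary directories stand in for the SSD and HDD mount points (assumption).
base = Path(tempfile.mkdtemp())
platform = SparkPlatformModule(base / "ssd", base / "hdd")
platform.first_api.write("part-0", b"hot")    # stored on the "SSD" side
platform.second_api.write("part-1", b"cold")  # stored on the "HDD" side
```

Keeping one API object per medium makes the choice of medium explicit at the call site, which is the property the paragraph above describes.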
In a specific implementation, the SSD unit 21 and the HDD unit 22 are persistent storage units of the same tier. The processed data specifically includes RDD partition data. The Spark platform module is also used to persist RDD partition data to the SSD unit or the HDD unit according to a preset partition ratio.
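One way to read "persist according to a preset partition ratio" is sketched below. The patent does not say which partitions go to which medium, so sending the first share of partitions to the SSD is an assumption, as are the function name `assign_by_ratio` and the `ssd_ratio` parameter.

```python
def assign_by_ratio(partition_ids, ssd_ratio):
    """Map each partition id to "SSD" or "HDD" so that roughly ssd_ratio of them land on SSD."""
    if not 0.0 <= ssd_ratio <= 1.0:
        raise ValueError("ssd_ratio must be between 0 and 1")
    ssd_count = round(len(partition_ids) * ssd_ratio)
    # Assumption: the leading partitions go to SSD, the rest to HDD.
    return {
        pid: ("SSD" if position < ssd_count else "HDD")
        for position, pid in enumerate(partition_ids)
    }


# Eight partitions with a quarter of them placed on SSD.
placement = assign_by_ratio(list(range(8)), 0.25)
```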
In a specific implementation, the Spark platform module 1 is also used to persist RDD partition data to the SSD unit or the HDD unit according to how hot the RDD partition data is. SSDs effectively increase I/O bandwidth and reduce access latency, while HDDs can still store large volumes of data with lower storage performance requirements efficiently. A large share of the data collected and captured by data centers is accessed infrequently; this is called cold data and, according to the description, accounts for about 90% of global data. The remaining 10% is accessed frequently after collection and is called hot data. Storing all data on high-performance, low-latency devices is clearly unreasonable because the cost is extremely high. Therefore the SSD unit 21 and the HDD unit 22 are combined according to the hotness of the RDD partition data, so that the hybrid storage system delivers a large performance gain while keeping cost under control.
In a specific implementation, the distributed computing system further includes a capacity monitoring module connected to the hybrid storage module 2. The capacity monitoring module monitors the remaining capacity of the hybrid storage module 2 and outputs an alarm when the remaining capacity falls below a preset threshold. The value of the preset threshold can be chosen according to the capacity of the hybrid storage module 2, and the alarm can take the form of sounding a speaker or flashing an alarm lamp. Raising an alarm when the remaining capacity of the hybrid storage module 2 is too low reminds staff to migrate stored data or replace storage disks in time, which improves the reliability of data storage.
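A capacity check of this kind can be sketched with the standard-library call `shutil.disk_usage`. The function names, the use of a free-space fraction as the threshold, and the callback used as the alarm are assumptions for the example; the patent only specifies that an alarm is raised when remaining capacity drops below a preset threshold.

```python
import shutil
from types import SimpleNamespace


def remaining_fraction(path, usage_fn=shutil.disk_usage):
    """Return free/total for the file system that holds `path`."""
    usage = usage_fn(path)
    return usage.free / usage.total if usage.total else 0.0


def check_capacity(path, threshold, alarm, usage_fn=shutil.disk_usage):
    """Call `alarm` with a message and return True when free space is below `threshold`."""
    fraction = remaining_fraction(path, usage_fn)
    if fraction < threshold:
        alarm(f"low capacity on {path}: {fraction:.0%} free")
        return True
    return False


# Demo with a fake usage function so the result does not depend on the host disk.
fake_usage = lambda _path: SimpleNamespace(total=1000, used=950, free=50)
messages = []
alarmed = check_capacity("/mnt/hdd", threshold=0.10, alarm=messages.append, usage_fn=fake_usage)
```

In a deployment the `alarm` callback would drive the speaker or alarm lamp mentioned above; here it only collects the message in a list.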
The invention also provides a data processing method for the distributed computing system of one embodiment. As shown in FIG. 2, the data processing method includes the following steps:
Step S21: the Spark platform module uses the big data processing framework Spark as its computing engine and sends the processed data to the SSD unit or the HDD unit for storage;
Step S22: the Spark platform module receives a query instruction, fetches the data corresponding to the query instruction from the SSD unit or the HDD unit, and outputs it.
Because the Spark platform module is connected to both the SSD unit and the HDD unit, and the processed data is sent to one or the other for storage, data can be mapped and stored precisely.
In a specific implementation, the data processing method further includes the following step: the capacity monitoring module monitors the remaining capacity of the hybrid storage module and outputs an alarm when the remaining capacity falls below a preset threshold. The value of the preset threshold can be chosen according to the capacity of the hybrid storage module 2, and the alarm can take the form of sounding a speaker or flashing an alarm lamp. Raising an alarm when the remaining capacity of the hybrid storage module 2 is too low reminds staff to migrate stored data or replace storage disks in time, which improves the reliability of data storage.
In a specific implementation of the data processing method, the Spark platform module 1 likewise includes a first API corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22, connects to each unit through the corresponding API for data transfer, exposes the structure of the hybrid storage system to the user through the two APIs, and selects the storage medium by calling the first API or the second API.
In a specific implementation, the SSD unit 21 and the HDD unit 22 are persistent storage units of the same tier, the processed data specifically includes RDD partition data, and the Spark platform module persists RDD partition data to the SSD unit or the HDD unit according to a preset partition ratio.
In a specific implementation, the Spark platform module 1 also persists RDD partition data to the SSD unit or the HDD unit according to how hot the data is. Infrequently accessed cold data (about 90% of global data according to the description) suits the HDD, which stores large volumes of data with lower performance requirements efficiently, while frequently accessed hot data (the remaining 10%) benefits from the higher I/O bandwidth and lower access latency of the SSD. Storing all data on high-performance, low-latency devices would be extremely expensive, so combining the SSD unit 21 and the HDD unit 22 according to data hotness delivers a large performance gain while keeping cost under control.
In a specific implementation, RDD partition data is persisted by calling RDD.persist(StorageLevel.SSD_ONLY), which also sets the preset persistence level of the partition data to SSD_ONLY. The operation of persisting the RDD is started by the RDD.iterator method, and FIG. 3 shows the persistence flow for RDD data. Persisting RDD partition data requires two things: the partition data and an address. The partition data is already held in the RDD module, while the address has to be computed as address = path/file name. The path is stored in the configuration file and is looked up by mapping the preset persistence level of the partition data onto the configuration file; the file name is generated from the block identifier.
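The address computation (address = path/file name) can be sketched as a small lookup. The configuration keys, the mount paths and the helper names below are hypothetical; only the level names SSD_ONLY and HDD_ONLY and the "rdd_" plus index file name pattern come from the description.

```python
from pathlib import PurePosixPath

# Hypothetical configuration: two directory variables, one per medium.
CONFIG = {
    "ssd.local.dir": "/mnt/ssd/spark-tmp",
    "hdd.local.dir": "/mnt/hdd/spark-tmp",
}

# Which configuration variable each preset persistence level maps to.
LEVEL_TO_KEY = {
    "SSD_ONLY": "ssd.local.dir",
    "HDD_ONLY": "hdd.local.dir",
}


def file_name(index: int) -> str:
    """File name derived from the block's numeric index, following the "rdd_" + Index pattern."""
    return f"rdd_{index}"


def storage_address(level: str, index: int, config=CONFIG) -> str:
    """address = path / file name, with the path chosen by the persistence level."""
    directory = config[LEVEL_TO_KEY[level]]
    return str(PurePosixPath(directory) / file_name(index))


ssd_address = storage_address("SSD_ONLY", 3)
hdd_address = storage_address("HDD_ONLY", 4)
```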
The invention provides an RDD persistence method of one embodiment based on an SSD and HDD hybrid storage system. The persistence method builds on an optimized Spark framework to persist RDD partition data, and includes the following steps:
the RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;
the block manager passes the block identifier and the preset persistence level to the disk block manager;
the disk block manager passes the preset persistence level to the device adapter;
the device adapter receives the preset persistence level of the data, reads two directory management variables from the configuration file, matches the preset persistence level to the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;
the disk block manager derives a file name from the block identifier, obtains the data storage address from the matched temporary file directory and the file name, and returns the data storage address to the block manager;
the block manager stores the data of the RDD module on the SSD or the HDD according to the data storage address.
The invention thus stores the data of the RDD module on the SSD or the HDD at a data storage address determined by the preset persistence level, which gives Spark applications on-demand persistence. That is, when the preset persistence level is SSD_ONLY the data of the RDD module is stored on the SSD, and when the preset persistence level is HDD_ONLY the data is stored on the HDD.
具体的,如图3所示,所述持久化方法的步骤如下:Specifically, as shown in Figure 3, the steps of the persistence method are as follows:
步骤1,所述RDD模块通过Iterator方法调用块管理器BlockManager的doPutIterator方法将RDD模块中的块标识blockId和RDD模块中数据的预设持久化级别传递给块管理器BlockManager;Step 1, the RDD module calls the doPutIterator method of the block manager BlockManager through the Iterator method to pass the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the block manager BlockManager;
步骤2,所述块管理器BlockManager的doPutIterator方法调用磁盘块管理器的getFile方法,将RDD模块中的块标识blockId和RDD模块中数据的预设持久化级别传递给磁盘块管理器DiskBlockManager;Step 2, the doPutIterator method of the block manager BlockManager calls the getFile method of the disk block manager, and the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module are passed to the disk block manager DiskBlockManager;
步骤3,所述磁盘块管理器DiskBlockManager的getFile方法调用设备适配器的getAccurateDir方法将所述预设持久化级别传递给设备适配器DeviceAdapter;Step 3, the getFile method of the disk block manager DiskBlockManager calls the getAccurateDir method of the device adapter to pass the preset persistence level to the device adapter DeviceAdapter;
步骤4,所述设备适配器DeviceAdapter读取配置文件中两个目录管理变量,具体的,所述两个目录管理变量包括SSD目录管理变量和HDD目录管理变量;Step 4, the device adapter DeviceAdapter reads two directory management variables in the configuration file, specifically, the two directory management variables include SSD directory management variables and HDD directory management variables;
步骤5,所述设备适配器DeviceAdapter根据数据的预设持久化级别进行预设持久化级别和对应目录管理变量中临时文件目录匹配,也就是说所述设备适配器DeviceAdapter可以从上层获取预设持久化级别,可以从下层获取配置文件比如SSD目录管理变量和HDD目录管理变量,可以完成预设持久化级别与临时文件目录,也就是说,getAccurateDir方法读取配置文件,其中配置文件包括两个变量为SSD目录管理变量和HDD目录管理变量,然后根据接收到的预设持久化级别匹配上述两个变量。如果预设持久化级别是SSD_ONLY,则匹配SSD目录管理变量;如果预设持久化级别是HDD_ONLY,则匹配HDD目录管理变量,此时得到了RDD数据持久化的具体存储地址,然后将该地址返回给所述磁盘块管理器DiskBlockManager;Step 5, the device adapter DeviceAdapter matches the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data, that is to say, the device adapter DeviceAdapter can obtain the preset persistence level from the upper layer , can obtain configuration files from the lower layer such as SSD directory management variables and HDD directory management variables, and can complete the preset persistence level and temporary file directory, that is, the getAccurateDir method reads the configuration file, and the configuration file includes two variables for SSD Directory management variable and HDD directory management variable, and then match the above two variables according to the received preset persistence level. If the preset persistence level is SSD_ONLY, match the SSD directory management variable; if the preset persistence level is HDD_ONLY, match the HDD directory management variable. At this time, the specific storage address of the RDD data persistence is obtained, and then the address is returned to the disk block manager DiskBlockManager;
Step 6: the matched temporary file directory, which contains the specific storage address, is returned to the disk block manager DiskBlockManager;
Step 7: the disk block manager DiskBlockManager derives the file name fileName from the block identifier blockId, and combines the matched temporary file directory with that file name to obtain the data storage address. That is, the specific address plus fileName is the complete address at which the RDD data is stored on disk, where fileName = "rdd_" + Index, Index is a numeric index that increases sequentially, and data storage address = directory/file name. The temporary file directory is therefore the save path;
Step 8: the disk block manager DiskBlockManager returns the data storage address to the block manager BlockManager;
Step 9: after obtaining the data storage address of the RDD, the block manager BlockManager calls the writeFunc method of the block storage module DiskStore to complete the data storage task.
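The directory lookup and address construction in steps 3 to 9 can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation (which lives inside Spark): the configuration keys `ssd.dir` and `hdd.dir`, the `persist` helper, and the use of the block identifier directly as the numeric Index are all assumptions made for the example.

```python
import os

SSD_ONLY = "SSD_ONLY"
HDD_ONLY = "HDD_ONLY"


class DeviceAdapter:
    """Maps a preset persistence level to a temporary file directory (steps 4-6)."""

    def __init__(self, config):
        # Hypothetical config keys standing in for the SSD and HDD
        # directory management variables read from the configuration file.
        self._dirs = {
            SSD_ONLY: config["ssd.dir"],
            HDD_ONLY: config["hdd.dir"],
        }

    def get_accurate_dir(self, level):
        # Select the directory that corresponds to the persistence level.
        if level not in self._dirs:
            raise ValueError(f"unsupported persistence level: {level}")
        return self._dirs[level]


class DiskBlockManager:
    """Builds the data storage address from a directory and a file name."""

    def __init__(self, adapter):
        self._adapter = adapter

    def get_file(self, block_id, level):
        directory = self._adapter.get_accurate_dir(level)  # steps 3-6
        file_name = f"rdd_{block_id}"  # step 7; block_id used as Index (assumption)
        return os.path.join(directory, file_name)  # address = directory/file name


def persist(disk_block_manager, block_id, level, data):
    """Stand-in for step 9: write the bytes at the resolved address."""
    path = disk_block_manager.get_file(block_id, level)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

Keeping the level-to-directory mapping inside the adapter means the disk block manager never needs to know which device backs a directory; it only joins the returned directory with the file name.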
In a specific implementation, the RDD persistence method further includes the following steps:
Determine whether the heat of the data in the RDD module is greater than a first preset value;
If so, the preset persistence level of the data in the RDD module is SSD_ONLY;
If not, the preset persistence level of the data in the RDD module is HDD_ONLY.
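The threshold rule above amounts to a single comparison. The sketch below is illustrative only: the source does not say how heat is measured, so the numeric `heat` argument (for example, an access count) and the function name `choose_persistence_level` are assumptions.

```python
SSD_ONLY = "SSD_ONLY"
HDD_ONLY = "HDD_ONLY"


def choose_persistence_level(heat, first_preset_value):
    """Return SSD_ONLY when heat exceeds the first preset value, else HDD_ONLY."""
    # The comparison is strict: data whose heat equals the threshold goes to HDD.
    return SSD_ONLY if heat > first_preset_value else HDD_ONLY


# Hypothetical heat values for three RDDs, with a threshold of 10.
heat_by_rdd = {"rdd_a": 42, "rdd_b": 10, "rdd_c": 3}
levels = {name: choose_persistence_level(h, 10) for name, h in heat_by_rdd.items()}
```

With these example values, only `rdd_a` is assigned SSD_ONLY; `rdd_b` sits exactly at the threshold and therefore falls back to HDD_ONLY.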
That is, the preset persistence level of the data is set according to the heat of the data in the RDD partition, so that the SSD unit 21 and the HDD unit 22 are combined in a reasonable manner. Building a hybrid storage system in this way can substantially improve performance while keeping cost under control.
In other words, the optimized Spark persistence framework enables on-demand persistence of Spark data. Users can call the SSD-oriented persistence API provided by the optimized Spark framework to persist the partition data of high-heat RDDs to SSD, thereby effectively improving Spark performance.
The present invention also provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the steps of the method shown in FIG. 3 above.
In the description of this specification, a reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments or examples described in this specification, as well as the features of those embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710358093.9A CN107193494B (en) | 2017-05-19 | 2017-05-19 | A RDD Persistence Method Based on SSD and HDD Hybrid Storage System |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107193494A (en) | 2017-09-22 |
CN107193494B (en) | 2020-05-12 |
Family
ID=59875380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710358093.9A Active CN107193494B (en) | 2017-05-19 | 2017-05-19 | A RDD Persistence Method Based on SSD and HDD Hybrid Storage System |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107193494B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590003A (en) * | 2017-09-28 | 2018-01-16 | 深圳大学 | Spark task allocation method and system |
WO2018209693A1 (en) * | 2017-05-19 | 2018-11-22 | 深圳大学 | Rdd persistence method based on ssd and hdd hybrid storage system |
CN109375868A (en) * | 2018-09-14 | 2019-02-22 | 网宿科技股份有限公司 | Data storage method, scheduling device, system, equipment and storage medium |
CN112799597A (en) * | 2021-02-08 | 2021-05-14 | 东北大学 | A fault-tolerant method of hierarchical storage for stream data processing |
CN113590536A (en) * | 2021-05-20 | 2021-11-02 | 济南浪潮数据技术有限公司 | Data storage method, system, electronic equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104216988A (en) * | 2014-09-04 | 2014-12-17 | 天津大学 | SSD (Solid State Disk) and HDD(Hard Driver Disk)hybrid storage method for distributed big data |
CN105893541A (en) * | 2016-03-31 | 2016-08-24 | 中国科学院软件研究所 | Streaming data self-adaption persistence method and system based on mixed storage |
Non-Patent Citations (2)
Title |
---|
AWASTHI A: "Hybrid HBase: Leveraging Flash SSDs to Improve Cost per", 《THE 18TH INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (COMAD)》 * |
LUO TIAN: "hStorage-DB: Heterogeneity-aware Data Management to", 《AUGUST 27TH - 31ST 2012, ISTANBUL, TURKEY.》 * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018209693A1 (en) * | 2017-05-19 | 2018-11-22 | 深圳大学 | Rdd persistence method based on ssd and hdd hybrid storage system |
CN107590003A (en) * | 2017-09-28 | 2018-01-16 | 深圳大学 | Spark task allocation method and system |
CN107590003B (en) * | 2017-09-28 | 2020-10-23 | 深圳大学 | Spark task allocation method and system |
CN109375868A (en) * | 2018-09-14 | 2019-02-22 | 网宿科技股份有限公司 | Data storage method, scheduling device, system, equipment and storage medium |
CN109375868B (en) * | 2018-09-14 | 2022-07-08 | 深圳爱捷云科技有限公司 | Data storage method, scheduling device, system, equipment and storage medium |
CN112799597A (en) * | 2021-02-08 | 2021-05-14 | 东北大学 | A fault-tolerant method of hierarchical storage for stream data processing |
CN113590536A (en) * | 2021-05-20 | 2021-11-02 | 济南浪潮数据技术有限公司 | Data storage method, system, electronic equipment and storage medium |
CN113590536B (en) * | 2021-05-20 | 2023-12-29 | 济南浪潮数据技术有限公司 | Data storage method, system, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107193494B (en) | 2020-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11741053B2 (en) | Data management system, method, terminal and medium based on hybrid storage | |
US11989160B2 (en) | Heuristic interface for enabling a computer device to utilize data property-based data placement inside a nonvolatile memory device | |
US11099769B1 (en) | Copying data without accessing the data | |
US9092321B2 (en) | System and method for performing efficient searches and queries in a storage node | |
US8819335B1 (en) | System and method for executing map-reduce tasks in a storage device | |
EP2973018B1 (en) | A method to accelerate queries using dynamically generated alternate data formats in flash cache | |
CN107193494A (en) | RDD (remote data description) persistence method based on SSD (solid State disk) and HDD (hard disk drive) hybrid storage system | |
US20150032938A1 (en) | System and method for performing efficient processing of data stored in a storage node | |
WO2018077292A1 (en) | Data processing method and system, electronic device | |
JP5744707B2 (en) | Computer-implemented method, computer program, and system for memory usage query governor (memory usage query governor) | |
CN105183839A (en) | Hadoop-based storage optimizing method for small file hierachical indexing | |
TW201220197A (en) | for improving the safety and reliability of data storage in a virtual machine based on cloud calculation and distributed storage environment | |
US9424314B2 (en) | Method and apparatus for joining read requests | |
CN116166691B (en) | Data archiving system, method, device and equipment based on data division | |
CN112632069B (en) | Hash table data storage management method, device, medium and electronic equipment | |
CN102014158A (en) | Cloud storage service client high-efficiency fine-granularity data caching system and method | |
CN106354805A (en) | Optimization method and system for searching and caching distribution storage system NoSQL | |
WO2012083754A1 (en) | Method and device for processing dirty data | |
CN104054071A (en) | Method for accessing storage device and storage device | |
US9336135B1 (en) | Systems and methods for performing search and complex pattern matching in a solid state drive | |
CN111291083B (en) | Web page source code data processing method, device and computer equipment | |
WO2024260324A1 (en) | Data cache processing method, apparatus and system | |
CN104052824A (en) | Distributed cache method and system | |
CN107179883B (en) | Spark architecture optimization method of hybrid storage system based on SSD and HDD | |
US9760577B2 (en) | Write-behind caching in distributed file systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20220523. Address after: 518000 east of the fourth floor of plant 1 (Building 1) of Baode technology R & D and production base, gaoxinyuan, Guanlan street, Longhua new area, Shenzhen, Guangdong. Patentee after: Baode network security system (Shenzhen) Co.,Ltd. Address before: 518000 No. 3688 Nanhai Road, Shenzhen, Guangdong, Nanshan District. Patentee before: SHENZHEN University |