CN105183839A

CN105183839A - Hadoop-based storage optimizing method for small file hierarchical indexing

Info

Publication number: CN105183839A
Application number: CN201510557719.XA
Authority: CN
Inventors: 戴彬; 王雄; 张焜
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2015-09-02
Filing date: 2015-09-02
Publication date: 2015-12-23

Abstract

The invention discloses a Hadoop-based storage optimization method for hierarchical indexing of small files. The method allows a large number of small files to be uploaded to the HDFS file system and improves the indexing efficiency of small files stored in HDFS. It mainly comprises the following steps: (1) a WebServer listens for file read and write requests and judges whether the file is smaller than the block size configured for the HDFS file system; (2) files larger than the block size are read or written as normal HDFS files, while files smaller than the block size are classified as small files and sent to the small file processing server; (3) inside the small file processing server, a judging module classifies small files of 1 to 1023 KB as K-level small files and small files of 1 MB to 64 MB as M-level small files; (4) the small file processing server also has a cache, which can prefetch some small files according to file correlation when files are separated, and a cache update strategy is designed to improve file access efficiency.

Description

A storage optimization method for hierarchical indexing of small files based on Hadoop

Technical Field

The invention belongs to the technical field of cloud storage and, more specifically, relates to a Hadoop-based storage optimization method for hierarchical indexing of small files.

Background

Hadoop is a distributed system infrastructure developed by the Apache Foundation and is currently a mainstream cloud storage platform. It consists of one NameNode and multiple DataNodes: the NameNode manages the file system namespace and controls access by external clients, while the DataNodes store the actual data. Users can develop distributed programs without knowing the low-level details of the distribution and can exploit the full power of the cluster for high-speed computation and storage. Hadoop implements a distributed file system, abbreviated HDFS. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive hardware. It also provides high-throughput access to application data, which suits applications with very large data sets.

Hadoop was designed to handle large files. It stores an incoming large file in separate data blocks on different DataNodes and records the file's metadata on the NameNode. To operate on a file, a client first contacts the NameNode, uses the information it returns to locate each block and thus the file, and then issues the corresponding command. Note that all of these operations are performed block by block. For small files Hadoop is far less efficient: each small file holds very little data, yet it can only be stored and handled as a block of its own (far smaller than the default block size), and a massive number of files also means a massive amount of metadata. How to store and manage large numbers of small files efficiently has become a pressing problem.

In the HDFS distributed file system, every piece of file information and file block information is stored as an object in the memory of the NameNode (master node), and each object occupies about 150 bytes. With a large number of small files, the NameNode's memory usage therefore severely limits how far the cluster can scale.
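The memory pressure described above can be made concrete with a little arithmetic. The following is a rough sketch only: it assumes that each file contributes one file object plus one object per block, all at the approximate 150 bytes quoted above, and the file counts and sizes (ten million 64 KB files, 64 MB blocks) are illustrative numbers that do not come from the patent.

```python
import math

BYTES_PER_OBJECT = 150  # approximate per-object cost quoted in the text

def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate NameNode memory: one file object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# Illustrative numbers (assumptions, not from the patent).
SMALL_FILES = 10_000_000
SMALL_FILE_SIZE = 64 * 1024          # 64 KB each
BLOCK_SIZE = 64 * 1024 * 1024        # 64 MB blocks

# Stored individually: every small file occupies its own block.
separate = namenode_bytes(SMALL_FILES)

# Merged: the same data packed into one-block merged files.
merged_files = math.ceil(SMALL_FILES * SMALL_FILE_SIZE / BLOCK_SIZE)
merged = namenode_bytes(merged_files)

print(f"separate: {separate / 1e9:.1f} GB, merged: {merged / 1e6:.1f} MB")
```

Under these assumptions the per-file metadata shrinks by roughly three orders of magnitude once the small files are merged, which is the effect the method described below relies on.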

HDFS accesses a large number of small files far more slowly than it accesses large files of the same total size, because it was designed primarily for streaming access to large files. Reading small files usually causes a large number of hops from DataNode to DataNode to retrieve the files, which is a very inefficient access pattern.

MapReduce (parallel data processing) takes far longer to process a batch of small files than to process large files of the same total size. Each small file occupies one task, so if there are too many small files, most of the processing time is spent starting and releasing tasks. A Map function usually processes one block-sized input; a very large number of small files generates a large number of Map functions, and these occupy a large amount of memory.

For the problems caused by large numbers of small files, Hadoop itself provides three solutions: Hadoop Archive, SequenceFile and CombineFileInputFormat.

Hadoop Archive packs small files into HAR files. It first builds a master index; the master index points to a secondary index, and the secondary index points to the individual small files. Accessing a file in a HAR therefore requires reading two layers of index files as well as the file data itself, so reading a file through a HAR may actually be less efficient than reading it directly from HDFS. Hadoop Archive, SequenceFile and CombineFileInputFormat cannot be applied automatically; an administrator has to apply them manually, which is very inefficient and unsuitable for operations on large numbers of files.

Summary of the Invention

The main purpose of the present invention is to solve the low efficiency of the existing Hadoop distributed file system when storing, reading and writing massive numbers of small files. To that end it proposes a Hadoop-based storage optimization method for hierarchical indexing of small files.

To achieve the above object, the present invention provides a Hadoop-based storage optimization method for hierarchical indexing of small files. Outside the Hadoop distributed file system HDFS, a web server (WebServer) is added to handle file read and write requests, and a small file processing server is added to process small files. The method comprises the following steps:

(1) The WebServer receives a file read or write request and judges whether the file is larger than the preset block size. Files larger than the preset block size are processed through the normal HDFS flow, and files smaller than the preset block size are sent to the small file processing server for subsequent processing;

(2) The small file processing server further classifies the small files: small files of 1 to 1023 KB are classified as K-level small files, and small files between 1 MB and the preset block size are classified as M-level small files;

(3) A two-level index is built for K-level small files, with the primary index stored in the small file processing server and the secondary index stored in the DataNode; a linear index structure is built for M-level small files and stored in the small file processing server;

(4) The cache of the small file processing server holds some of the accessed small files together with their metadata, and updates the cached small files and metadata according to how frequently the small files are accessed.
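Steps (1) and (2) amount to a size-based routing decision, which can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Route` and `route` are hypothetical, the 64 MB default is taken from the upper bound mentioned in the abstract, and two boundary cases that the text leaves open are resolved by assumption (a file exactly equal to the block size follows the normal HDFS path, and every small file below 1 MB counts as K-level).

```python
from enum import Enum

KB = 1024
MB = 1024 * KB
DEFAULT_BLOCK_SIZE = 64 * MB  # assumption: 64 MB, the upper bound named in the abstract

class Route(Enum):
    HDFS = "normal HDFS flow"
    K_LEVEL = "small file server, two-level index"
    M_LEVEL = "small file server, linear index"

def route(size_bytes: int, block_size: int = DEFAULT_BLOCK_SIZE) -> Route:
    """Decide where a file of the given size is handled."""
    if size_bytes >= block_size:
        # Step (1): large files bypass the small file processing server.
        # Treating size == block_size as "large" is an assumption.
        return Route.HDFS
    if size_bytes < MB:
        # Step (2): K-level covers 1 to 1023 KB in the text; sizes just
        # outside that stated range are folded in here by assumption.
        return Route.K_LEVEL
    # Step (2): M-level covers 1 MB up to the preset block size.
    return Route.M_LEVEL

for size in (500 * KB, 5 * MB, 128 * MB):
    print(size, route(size).name)
```

Keeping the block size as a parameter mirrors the text's "preset block size", which need not be 64 MB on a given cluster.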

Further, the WebServer is a web server that listens for file read and write requests, and the small file processing server is a server dedicated to processing small files.

Further, the size of a small file is smaller than the block size of the distributed file system.

Further, the small file processing server further classifies the small files as follows:

1) After the small file processing server receives a request to read or write a small file, the small file classification module classifies small files of 1 to 1023 KB as K-level and small files between 1 MB and the block size as M-level;

2) K-level small files use a two-level index. The primary index is a range partition, with every 1000 records forming one range, and is stored in the small file processing server. The secondary index is the index file that actually records the offset of each small file within the merged large file, and is stored in the DataNode. M-level small files use a linear index structure, which is stored in the small file processing server.

Further, the cache in the small file processing server holds files related to some of the accessed files together with the corresponding metadata, and updates the cached small files and metadata according to a corresponding update strategy.

Further, a file write request is processed as follows:

(1.1) The WebServer determines that the request type is a file write request;

(1.2) The WebServer judges whether the file is larger than the preset block size. Files larger than the preset block size are processed through the normal HDFS flow, and files smaller than the preset block size are sent to the small file processing server for subsequent processing;

(1.3) The small file processing server further classifies the small files: small files of 1 to 1023 KB are classified as K-level small files, and small files between 1 MB and the preset block size are classified as M-level small files;

(1.4) A two-level index is built for K-level small files, with the primary index stored in the small file processing server and the secondary index stored in the DataNode; a linear index structure is built for M-level small files and stored in the small file processing server.
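The write path ends with small files being merged into a large file whose index records where each small file lies. A minimal in-memory sketch of that bookkeeping is given below. The class `MergedFileWriter` and its methods are hypothetical names, the merged file is held in a `BytesIO` buffer instead of being written to HDFS, and the index is a plain dictionary standing in for either index structure described in step (1.4).

```python
import io

class MergedFileWriter:
    """Append small files to one merged buffer and record (offset, length)."""

    def __init__(self, merged_name: str):
        self.merged_name = merged_name
        self.buffer = io.BytesIO()   # stands in for the merged file on HDFS
        self.index = {}              # small file name -> (offset, length)

    def append(self, name: str, data: bytes) -> int:
        """Write one small file at the end of the merged file; return its offset."""
        offset = self.buffer.tell()
        self.buffer.write(data)
        self.index[name] = (offset, len(data))
        return offset

    def extract(self, name: str) -> bytes:
        """Separate one small file from the merged file using its index entry."""
        offset, length = self.index[name]
        return self.buffer.getvalue()[offset:offset + length]

writer = MergedFileWriter("merged-0001")   # hypothetical merged file name
writer.append("a.txt", b"abc")
writer.append("b.txt", b"defg")
print(writer.index["b.txt"], writer.extract("b.txt"))
```

Because only the offset and length are needed to recover a small file, the NameNode sees a single merged file no matter how many small files it contains.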

Further, a file read request is processed as follows:

(2.1) On receiving a read request for a small file, the small file processing server first checks whether the requested small file is in the cache. If it is, the server takes the requested file from the cache and returns it to the client, which completes the read operation; otherwise, step (2.2) is executed;

(2.2) The small file processing server checks whether the metadata of the requested small file is in the cache. If it is, the server interacts directly with the HDFS client, fetches the small file from HDFS and returns it to the user, which completes the read operation; otherwise, step (2.3) is executed;

(2.3) Using the file names of the small file and of the merged file, the small file processing server maps the requested small file to the merged file that contains it, and passes the request for that merged file to the client of the Hadoop distributed file system HDFS;

(2.4) The HDFS client reads the requested merged file from the Hadoop distributed file system HDFS and obtains the metadata and index information of the merged file, which completes the read operation.

Further, the small file processing server uses a small file separation method: it reads the merged file from the Hadoop distributed file system HDFS, separates the requested small file from the merged file and returns it to the user.

Further, the method also includes the following: from the metadata and index information of the merged file obtained in step (2.4), the small file processing server extracts the file name, data block position, offset and file length of each small file to obtain that small file's metadata, and stores each small file and its metadata in the cache.
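The read path in steps (2.1) to (2.4) is a three-tier lookup: the cached file, then the cached metadata, then the index. The sketch below illustrates that order under stated assumptions. `SmallFileReader`, `SmallFileMeta` and the `hdfs_read` callable are hypothetical; the index is a dictionary standing in for the real index lookup, and only the requested file is cached here, whereas the text caches every small file of the merged file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass(frozen=True)
class SmallFileMeta:
    merged_name: str   # merged file that contains the small file
    offset: int        # position inside the merged file
    length: int        # size of the small file in bytes

class SmallFileReader:
    def __init__(self, index: Dict[str, SmallFileMeta],
                 hdfs_read: Callable[[str, int, int], bytes]):
        self.index = index            # stand-in for the index lookup of step (2.3)
        self.hdfs_read = hdfs_read    # hypothetical: (merged_name, offset, length) -> bytes
        self.file_cache: Dict[str, bytes] = {}
        self.meta_cache: Dict[str, SmallFileMeta] = {}

    def read(self, name: str) -> Tuple[bytes, str]:
        """Return the file contents and which tier served the request."""
        # (2.1) The small file itself is cached.
        if name in self.file_cache:
            return self.file_cache[name], "file-cache"
        # (2.2) Only the metadata is cached: go straight to HDFS.
        meta = self.meta_cache.get(name)
        source = "meta-cache"
        if meta is None:
            # (2.3)/(2.4) Map the small file to its merged file via the index.
            meta = self.index[name]
            self.meta_cache[name] = meta
            source = "index"
        data = self.hdfs_read(meta.merged_name, meta.offset, meta.length)
        self.file_cache[name] = data
        return data, source

# Usage with an in-memory stand-in for HDFS.
merged_store = {"m1": b"helloworld"}
reader = SmallFileReader(
    {"a": SmallFileMeta("m1", 0, 5), "b": SmallFileMeta("m1", 5, 5)},
    lambda name, off, length: merged_store[name][off:off + length],
)
print(reader.read("b"))
print(reader.read("b"))
```

The second call is answered from the file cache without touching the stand-in HDFS store, which is the saving the cache is meant to provide.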

Further, the method also includes the following: a 32-bit file access flag for recording file access frequency is added to the header of each small-file metadata prefetch record and each small-file prefetch record. Time is measured in five-minute intervals: if the small file is requested within a five-minute interval, its file access flag is incremented by 1; otherwise the flag is decremented by 1. When the file access flag reaches 0, the small file is removed from the cache.
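The update strategy above can be sketched as a counter that is adjusted once per five-minute interval. This is one possible reading, not the patented implementation: the class `AccessFlagCache` is hypothetical, the flag is assumed to start at 1 when an entry is cached, several accesses within one interval are assumed to count as a single increment, and `tick()` is called explicitly in place of a real timer.

```python
WINDOW_SECONDS = 5 * 60    # five-minute interval from the text
MAX_FLAG = 2**32 - 1       # the access flag is 32 bits wide

class AccessFlagCache:
    def __init__(self):
        # name -> [payload, flag, accessed_in_current_window]
        self.entries = {}

    def put(self, name, payload):
        # Assumption: a newly cached entry starts with flag 1.
        self.entries[name] = [payload, 1, False]

    def get(self, name):
        entry = self.entries.get(name)
        if entry is None:
            return None
        entry[2] = True    # remember that this interval saw an access
        return entry[0]

    def tick(self):
        """Apply the update rule; call once every WINDOW_SECONDS."""
        for name in list(self.entries):
            entry = self.entries[name]
            if entry[2]:
                entry[1] = min(entry[1] + 1, MAX_FLAG)   # accessed: flag + 1
            else:
                entry[1] -= 1                            # idle: flag - 1
            entry[2] = False
            if entry[1] == 0:
                del self.entries[name]                   # flag 0: evict

cache = AccessFlagCache()
cache.put("hot.txt", b"x")
cache.put("cold.txt", b"y")
cache.get("hot.txt")
cache.tick()
print(sorted(cache.entries))   # cold.txt has been evicted
```

With this reading, a file survives in the cache for as many idle intervals as it previously had active ones, so frequently accessed files age out slowly.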

Compared with the prior art, the beneficial effects of the present invention are:

(1) Massive numbers of small files can be uploaded to the HDFS file system and indexed quickly. The small file processing server, which is independent of the HDFS file system, merges, maps and prefetches small files by building the corresponding small file indexes and processing mechanisms. This greatly reduces the overhead on the NameNode and improves the efficiency with which the system handles small files, without changing the existing HDFS system, which preserves the generality of the system. In the small file processing server, classifying small files further and building a suitable index for each class saves index storage space. The cache in the small file processing server holds files related to some of the accessed files together with the corresponding metadata, and an update strategy is defined to improve access efficiency;

(2) Massive numbers of small files are merged into a limited number of large files, and the indexing method for small files is further differentiated, which reduces the memory usage of the NameNode in the Hadoop cluster;

(3) The small file management function is encapsulated as a service, which makes it easier for users to use;

(4) The independent small file processing server does not affect the normal operation of the HDFS file system and improves file access efficiency.

Brief Description of the Drawings

Fig. 1 is a flow chart of the Hadoop-based storage optimization method for hierarchical indexing of small files according to the present invention;

Fig. 2 is a schematic diagram of the small file storage process of the present invention;

Fig. 3 is a schematic diagram of the small file reading process of the present invention;

Fig. 4 is a schematic diagram of the index structure of a K-level index file.

Detailed Description

To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the various embodiments described below can be combined with one another as long as they do not conflict.

Fig. 1 shows the Hadoop-based storage optimization method for hierarchical indexing of small files according to the present invention. Outside the Hadoop distributed file system HDFS, a web server (WebServer) is added to handle file read and write requests, and a small file processing server is added to process small files. The steps are as follows:

(3) A two-level index is built for K-level small files, with the primary index stored in the small file processing server and the secondary index stored in the DataNode; a linear index structure is built for M-level small files and stored in the small file processing server. This arrangement reduces the overhead on the small file processing server;

In the two-level index, the primary index is a range index, and the secondary index is the index file that actually records the offset of each small file within the merged large file. The index structure is shown in Fig. 4. Whenever the secondary index file accumulates more than 1000 records (the default setting is 1000), one range is recorded in the primary index. The range entry contains the index value of the first of these 1000 records, the index value of the last record, the position in the secondary index file at which the first record starts, and the position in the secondary index file at which the last record ends. Each secondary index record contains the index value of the file name, the relative path name of the file, whether the entry is a directory or a file, its position in the merged file, the length of the file, and the name of the merged file.
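The two-level index described above can be sketched with two record types and a range lookup. The sketch rests on several assumptions that the text does not settle: the "index value of the file name" is treated as a sortable integer, secondary records are kept in increasing order of that value, positions in the secondary index are record numbers instead of byte offsets, and a trailing range of fewer than 1000 records is also emitted so that every record is reachable. All names are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional

RANGE_SIZE = 1000  # default number of secondary records per primary range

@dataclass(frozen=True)
class SecondaryRecord:
    index_value: int    # index value of the file name (assumed sortable integer)
    rel_path: str       # relative path name of the file
    is_dir: bool        # directory or file
    offset: int         # position inside the merged file
    length: int         # length of the file
    merged_name: str    # name of the merged file

@dataclass(frozen=True)
class PrimaryRange:
    first_value: int    # index value of the first record in the range
    last_value: int     # index value of the last record in the range
    start_pos: int      # where the range starts in the secondary index
    end_pos: int        # where the range ends (exclusive)

def build_primary(records: List[SecondaryRecord],
                  range_size: int = RANGE_SIZE) -> List[PrimaryRange]:
    """Emit one primary range per block of range_size secondary records."""
    ranges = []
    for start in range(0, len(records), range_size):
        chunk = records[start:start + range_size]
        ranges.append(PrimaryRange(chunk[0].index_value, chunk[-1].index_value,
                                   start, start + len(chunk)))
    return ranges

def lookup(index_value: int, primary: List[PrimaryRange],
           records: List[SecondaryRecord]) -> Optional[SecondaryRecord]:
    """Find the range via the primary index, then scan only that range."""
    i = bisect_left([r.last_value for r in primary], index_value)
    if i == len(primary) or index_value < primary[i].first_value:
        return None
    rng = primary[i]
    for rec in records[rng.start_pos:rng.end_pos]:
        if rec.index_value == index_value:
            return rec
    return None

records = [SecondaryRecord(2 * i, f"dir/f{i}", False, 10 * i, 10, "merged-0001")
           for i in range(2500)]
primary = build_primary(records)
print(len(primary), lookup(2002, primary, records).offset)
```

Only the compact primary ranges need to stay on the small file processing server; the scan inside one range corresponds to reading a bounded slice of the secondary index held on the DataNode.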

The small file processing server is mainly responsible for merging, mapping and prefetching small files. It has three functions:

(1) When a file is written, it builds an index according to the size of the small file and merges the files: small files of 1 to 1023 KB (K-level) use the two-level index structure, and small files between 1 MB and the block size (M-level) use the linear index structure;

(2) When a file is read, it maps the small file to its merged file and separates the required small file from the large file according to the index file;

(3) The small file processing server also has a cache. It can prefetch some small files according to file correlation when files are separated, and a corresponding cache update strategy is designed to improve file access efficiency.

For the HDFS part of the present invention, at the NameNode layer the original relationship between the NameNode and the SecondaryNameNode is maintained, and the NameNode plays the same role as in the original HDFS. At the DataNode layer, the DataNode is divided into two storage areas: the secondary index files of K-level small files are placed at the head of the DataNode, and the remainder still serves as the data storage area.

The independent small file processing server is responsible only for small files: it processes them, classifies them further, caches files related to some of the accessed files together with the corresponding metadata, and defines the update strategy.

Under the architecture of the method, large files and processed small files all wait in a file queue to be written or read. This allows concurrent execution: merging, mapping or reading small files does not interfere with writing or reading large files or already-merged small files, which improves the efficiency of concurrent storage access.

Fig. 2 is the write flow chart of the Hadoop-based storage optimization method for hierarchical indexing of small files. As shown in Fig. 2, the method covers the judgment, classification and writing of small files: the WebServer receives a file write request and judges whether the file is larger than the preset block size; files larger than the preset block size are processed through the normal HDFS flow, and files smaller than the preset block size are sent to the small file processing server for subsequent processing;

Specifically, the write process includes:

(1.1) The WebServer determines the request type; in this step, the detected request type should be a file write request;

(1.2) The WebServer judges whether the file is larger than the preset block size. Files larger than the preset block size are processed through the normal HDFS flow, and files smaller than the preset block size are sent to the small file processing server for subsequent processing; in this step, the requested file is smaller than the preset block size;

(1.4) A two-level index is built for K-level small files, with the primary index stored in the small file processing server and the secondary index stored in the DataNode; a linear index structure is built for M-level small files and stored in the small file processing server. This arrangement reduces the overhead on the small file processing server;

Fig. 3 is the read flow chart of the Hadoop-based storage optimization method for hierarchical indexing of small files. As shown in Fig. 3, the method covers the judgment, classification and reading of small files: the WebServer receives a file read request and judges whether the file is larger than the preset block size; files larger than the preset block size are processed through the normal HDFS flow, and read requests for files smaller than the preset block size are sent to the small file processing server for subsequent processing;

Specifically, the read process includes:

(2.1) On receiving a read request for a small file, the small file processing server first checks whether the requested small file is in the cache. If it is, the server takes the requested file from the cache and returns it to the client, which completes the read operation; otherwise, step (2.2) is executed;

(2.2) The small file processing server checks whether the metadata of the requested small file is in the cache. If it is, the server interacts directly with the HDFS client, fetches the small file from HDFS and returns it to the user, which completes the read operation; otherwise, step (2.3) is executed;

(2.4) The HDFS client reads the requested merged file from the Hadoop distributed file system HDFS and obtains the metadata and index information of the merged file, which completes the read operation;

Specifically, the small file processing server uses a small file separation method: it reads the merged file from the Hadoop distributed file system HDFS, separates the requested small file from the merged file and returns it to the user;

From the metadata and index information of the merged file obtained in step (2.4), the small file processing server extracts the file name, data block position, offset and file length of each small file to obtain that small file's metadata, and stores each small file and its metadata in the cache.

A 32-bit file access flag for recording file access frequency is added to the header of each small-file metadata prefetch record and each small-file prefetch record. Time is measured in five-minute intervals: if the small file is requested within a five-minute interval, its file access flag is incremented by 1; otherwise the flag is decremented by 1. When the file access flag reaches 0, the small file is removed from the cache.

Those skilled in the art will readily understand that the above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A Hadoop-based storage optimization method for hierarchical indexing of small files, characterized in that, outside the Hadoop distributed file system HDFS, a web server (WebServer) for file read and write requests and a small file processing server for processing small files are added, the method comprising the following steps:

(1) The WebServer receives a file read or write request and judges whether the file is larger than the preset block size. Files larger than the preset block size are processed through the normal HDFS flow, and files smaller than the preset block size are sent to the small file processing server for subsequent processing;

(2) The small file processing server further classifies the small files: small files of 1 to 1023 KB are classified as K-level small files, and small files between 1 MB and the preset block size are classified as M-level small files;

(3) A two-level index is built for K-level small files, with the primary index stored in the small file processing server and the secondary index stored in the DataNode; a linear index structure is built for M-level small files and stored in the small file processing server;

(4) The cache of the small file processing server holds some of the accessed small files together with their metadata, and updates the cached small files and metadata according to how frequently the small files are accessed.

2. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1, characterized in that the WebServer is a web server that listens for file read and write requests, and the small file processing server is a server dedicated to processing small files.

3. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that the size of a small file is smaller than the block size of the distributed file system.

4. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that the small file processing server further classifies the small files as follows:

1) After the small file processing server receives a request to read or write a small file, the small file classification module classifies small files of 1 to 1023 KB as K-level and small files between 1 MB and the block size as M-level;

2) K-level small files use a two-level index. The primary index is a range partition, with every 1000 records forming one range, and is stored in the small file processing server. The secondary index is the index file that actually records the offset of each small file within the merged large file, and is stored in the DataNode. M-level small files use a linear index structure, which is stored in the small file processing server.

5. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that the cache in the small file processing server holds files related to some of the accessed files together with the corresponding metadata, and updates the cached small files and metadata according to a corresponding update strategy.

6. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claims 1 and 2, characterized in that a file write request is processed as follows:

(1.1) The network server WebServer determines that the request type is a file write request;

(1.2) The network server WebServer judges whether the file is larger than the preset block size; a file larger than the preset block size is processed according to the normal procedure of the HDFS file system, and a file smaller than the preset block size is sent to the small file processing server for subsequent processing;

(1.3) The small file processing server further classifies the small files, classifying small files of 1 to 1023 KB as K-level small files and small files between 1 MB and the preset block size as M-level small files;

(1.4) A two-level index is established for K-level small files, wherein the primary index is stored in the small file processing server and the secondary index is stored in the DataNode, and a linear index structure is established for M-level small files and stored in the small file processing server.
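The routing and classification in steps (1.2) and (1.3) can be sketched in Python as follows. The function name `classify` and the returned labels are hypothetical. The 64 MB block size follows the upper bound for M-level files given in the abstract. Two boundary cases that the claims do not spell out are handled by assumption: a file exactly equal to the block size is treated as a normal HDFS file, and a file under 1 KB is treated as K-level.

```python
KB = 1024
MB = 1024 * KB
BLOCK_SIZE = 64 * MB  # preset block size; 64 MB per the abstract

def classify(size_bytes, block_size=BLOCK_SIZE):
    """Decide how a write request is handled, based on file size alone."""
    if size_bytes >= block_size:
        return "HDFS"  # normal HDFS write path (equality is an assumption)
    if size_bytes >= MB:
        return "M"     # M-level: linear index on the small file processing server
    return "K"         # K-level: two-level index (files under 1 KB assumed K-level)
```

For example, `classify(512 * KB)` yields `"K"`, `classify(8 * MB)` yields `"M"`, and `classify(128 * MB)` yields `"HDFS"`.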

7. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claims 1 and 2, characterized in that a file read request is processed as follows:

(2.1) The small file processing server receives a read request for a small file and first checks whether the requested small file exists in the cache; if it exists, the small file processing server takes the requested file out of the cache, returns it to the client, and completes the read operation; otherwise, step (2.2) is performed;

(2.2) The small file processing server checks whether the metadata information of the requested small file exists in the cache; if it exists, the small file processing server interacts directly with the HDFS client, fetches the small file from HDFS, returns it to the user, and completes the read operation; otherwise, step (2.3) is performed;

(2.3) According to the mapping between the file name of the small file and that of the merged file, the small file processing server maps the requested small file to its merged file and sends the merged file to the client of the Hadoop distributed file system HDFS;

(2.4) The client of the Hadoop distributed file system HDFS reads the merged file received from the Hadoop distributed file system HDFS to obtain the metadata information and index information of the merged file, thereby completing the read operation.
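The lookup order of steps (2.1) to (2.4) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `read_small_file`, its parameters, and the returned source label are hypothetical, and plain dictionaries stand in for the two caches, the file name mapping, the index information of the merged file, and HDFS itself.

```python
def read_small_file(name, file_cache, meta_cache, name_to_merged, merged_index, hdfs):
    """Return (data, source) for a small file, following steps (2.1)-(2.4).

    All arguments are hypothetical in-memory stand-ins:
      file_cache:     file name -> cached bytes
      meta_cache:     file name -> (merged file name, offset, length)
      name_to_merged: file name -> merged file name
      merged_index:   merged file name -> {file name: (offset, length)}
      hdfs:           merged file name -> bytes of the merged file
    """
    # (2.1) the small file itself is cached
    if name in file_cache:
        return file_cache[name], "file-cache"
    # (2.2) only its metadata is cached: read the byte range directly
    if name in meta_cache:
        merged, offset, length = meta_cache[name]
        return hdfs[merged][offset:offset + length], "metadata-cache"
    # (2.3) map the small file to its merged file by file name
    merged = name_to_merged[name]
    # (2.4) read the merged file and use its index information
    offset, length = merged_index[merged][name]
    return hdfs[merged][offset:offset + length], "merged-file"
```

The returned label is there only to make visible which of the three paths served the request.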

8. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 7, characterized in that the small file processing server uses a small file separation method: it reads the merged file from the Hadoop distributed file system HDFS, separates the requested small file from the merged file, and returns it to the user.

9. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 7, characterized in that the method further comprises: from the metadata information and index information of the merged file obtained in step (2.4), the small file processing server extracts the file name, data block location, offset and file length of each small file to obtain the metadata information of the small file, and stores each small file and its corresponding metadata information in the cache.
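The separation and caching described in claims 8 and 9 can be illustrated with a short Python sketch. The four metadata fields come from claim 9. The record type `SmallFileMeta`, the function `split_and_cache`, the tuple layout of the index entries, and the dictionary used as a cache are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SmallFileMeta:
    """Metadata fields named in claim 9 (the record layout is an assumption)."""
    file_name: str
    block_location: str
    offset: int
    length: int

def split_and_cache(merged_bytes, index_entries, cache):
    """Separate each small file from a merged file and cache it with its metadata.

    index_entries: iterable of (file_name, block_location, offset, length).
    cache:         dict mapping file name -> (data, SmallFileMeta).
    """
    for file_name, block_location, offset, length in index_entries:
        meta = SmallFileMeta(file_name, block_location, offset, length)
        data = merged_bytes[offset:offset + length]  # byte range of this small file
        cache[file_name] = (data, meta)
    return cache
```

Separation here is just a byte-range slice driven by the offset and file length, so it needs no knowledge of the content of the small files.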

10. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that the method further comprises: adding a 32-bit file access identifier, used to record the file access frequency, at the head of the metadata prefetch record and at the head of the small file prefetch record of each small file; with five minutes as the timing period, if a request accesses the small file within the five minutes, the file access identifier is incremented by 1; otherwise, the file access identifier is decremented by 1; and if the file access identifier reaches 0, the small file is removed from the cache.
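The update policy of claim 10 can be sketched in Python as follows. The 32-bit identifier, the five-minute period, and the increment, decrement and eviction rules come from the claim. The class `AccessCountedCache`, the initial identifier value of 1, the saturation at the 32-bit maximum, and the explicit `tick` method (which a timer would call once per period) are assumptions for illustration.

```python
MAX_COUNT = 2**32 - 1     # largest value of a 32-bit access identifier
WINDOW_SECONDS = 5 * 60   # timing period; a scheduler would call tick() this often

class AccessCountedCache:
    def __init__(self):
        # name -> [access identifier, accessed in current window, payload]
        self.entries = {}

    def put(self, name, payload):
        # Assumption: a newly cached file starts with identifier 1.
        self.entries[name] = [1, False, payload]

    def get(self, name):
        entry = self.entries.get(name)
        if entry is None:
            return None
        entry[1] = True  # remember that the file was requested in this window
        return entry[2]

    def tick(self):
        """Apply the update rule at the end of one five-minute window."""
        for name in list(self.entries):
            entry = self.entries[name]
            if entry[1]:
                entry[0] = min(entry[0] + 1, MAX_COUNT)
            else:
                entry[0] -= 1
            entry[1] = False
            if entry[0] == 0:
                del self.entries[name]  # identifier reached 0: evict
```

Driving the policy through an explicit `tick` rather than a real timer keeps the sketch deterministic; under the assumed initial value, a file that is never requested is evicted after one window.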