
CN105183839A - Hadoop-based storage optimization method for hierarchical indexing of small files


Publication number
CN105183839A
Authority
CN
China
Prior art keywords
file
small
files
processing server
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510557719.XA
Other languages
Chinese (zh)
Inventor
戴彬
王雄
张焜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201510557719.XA
Publication of CN105183839A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2272 Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Hadoop-based storage optimization method that uses hierarchical indexing of small files. It allows massive numbers of small files to be uploaded to the HDFS file system and improves the indexing efficiency of small files stored in HDFS. The method mainly comprises the following steps: (1) a WebServer listens for file read and write requests and determines whether the file size is smaller than the block size configured for the HDFS file system; (2) files larger than the block size are read or written as normal HDFS files, while files smaller than the block size are classified as small files and sent to a small file processing server; (3) a judging module inside the small file processing server classifies small files of 1 to 1023 KB as K-level small files and small files of 1 MB to 64 MB as M-level small files; (4) the small file processing server also has a cache which, based on file correlation, prefetches some small files when files are separated, and which applies a cache update policy to improve file access efficiency.

Description

A Hadoop-based storage optimization method for hierarchical indexing of small files

Technical Field

The invention belongs to the technical field of cloud storage and, more specifically, relates to a Hadoop-based storage optimization method for hierarchical indexing of small files.

Background

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It is currently a mainstream cloud storage platform and consists of one NameNode and multiple DataNodes. The NameNode manages the file system namespace and controls access by external clients, while the DataNodes store the actual data. Users can develop distributed programs without understanding the underlying distributed details and can use the full power of the cluster for high-speed computation and storage. Hadoop implements a distributed file system, HDFS for short. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive hardware. It also provides high throughput for access to application data, which suits applications with very large data sets.

Hadoop is designed to handle large files. It stores an incoming large file in different data blocks on different DataNodes and records the file's metadata on the NameNode. To operate on a file, a client first contacts the NameNode, uses the information it returns to locate each block of the file, and then applies the corresponding command. Note that all of these operations are performed block by block. Hadoop is much less efficient for small files. Each small file contains very little data, yet it must be stored and handled as its own block (one far smaller than the default block size), and a massive number of files means a massive amount of metadata. How to store and manage large numbers of small files effectively has become an urgent problem.

In the HDFS distributed file system, every piece of file information and file block information is stored as an object in the memory of the NameNode (master node), and each object occupies about 150 bytes. With a large number of small files, NameNode memory usage severely limits how far the cluster can scale.
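To make the memory pressure concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. It assumes the approximate figure of 150 bytes per object quoted above, one file object per file plus one object per block, and hypothetical file counts chosen only for illustration.

```python
# Rough NameNode memory estimate.
# Assumption: about 150 bytes per namespace object, as quoted in the text.
BYTES_PER_OBJECT = 150


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate NameNode memory: one file object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Hypothetical workload: 10 million small files, each occupying its own block.
small_files = namenode_bytes(10_000_000)

# Hypothetical alternative: the same data merged into 1,000 files of 10 blocks each.
merged_files = namenode_bytes(1_000, blocks_per_file=10)

print(f"small files : {small_files / 1e9:.2f} GB")
print(f"merged files: {merged_files / 1e6:.2f} MB")
```

Under these assumptions the per-file overhead, not the data volume, dominates NameNode memory, which is the motivation for merging small files.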

HDFS accesses a large number of small files far more slowly than it accesses large files holding the same total amount of data. HDFS was designed primarily for streaming access to large files. Reading small files usually requires many hops from DataNode to DataNode to fetch each file, which is a very inefficient access pattern.

MapReduce (parallel data processing) also takes far longer to process a batch of small files than to process large files of the same total size. Each small file occupies one task, so when there are too many small files, most of the processing time is spent starting and releasing tasks. A Map function normally processes one block-sized input. A very large number of small files therefore produces a very large number of Map functions, which together consume a large amount of memory.

Hadoop itself provides three solutions to the problems caused by large numbers of small files: Hadoop Archive (HAR), SequenceFile, and CombineFileInputFormat.

Hadoop Archive packs small files into a HAR file. It first builds a master index, which points to a secondary index, which in turn points to the individual small files. Accessing a file in a HAR therefore requires reading two layers of index files and then the file data itself, so reading a file through a HAR may actually be less efficient than reading it directly from HDFS. Hadoop Archive, SequenceFile, and CombineFileInputFormat cannot be applied automatically. All three must be applied manually by an administrator, which is very inefficient and unsuitable for operations on large numbers of files.

Summary of the Invention

The main purpose of the invention is to solve the low efficiency of the existing Hadoop distributed file system when storing, reading, and writing massive numbers of small files. To this end, it proposes a Hadoop-based storage optimization method for hierarchical indexing of small files.

To achieve this purpose, the invention provides a Hadoop-based storage optimization method for hierarchical indexing of small files. Outside the Hadoop distributed file system HDFS, a web server (WebServer) is added to handle file read and write requests, and a small file processing server is added to process small files. The method comprises the following steps:

(1) The WebServer receives a file read or write request and determines whether the file is larger than the preset block size. Files larger than the preset block size are handled by the normal HDFS process, and files smaller than the preset block size are sent to the small file processing server for subsequent processing;

(2) The small file processing server further classifies the small files: small files between 1 and 1023 KB are K-level small files, and small files between 1 MB and the preset block size are M-level small files;

(3) A two-level index is built for K-level small files, with the primary index stored on the small file processing server and the secondary index stored on the DataNode. A linear index structure is built for M-level small files and stored on the small file processing server;

(4) The cache of the small file processing server holds some of the accessed small files and their metadata, and updates the cached small files and metadata according to how frequently the small files are accessed.
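The routing and classification in steps (1) and (2) can be sketched as a single size check. This is a minimal illustration, not the patented implementation: the function name `classify`, the 64 MB default block size (the abstract mentions 64M), and the handling of boundary cases the text leaves open (files under 1 KB, files between 1023 KB and 1 MB, and files exactly equal to the block size) are all assumptions.

```python
KB = 1024
MB = 1024 * KB
BLOCK_SIZE = 64 * MB  # assumed preset HDFS block size


def classify(size_bytes: int, block_size: int = BLOCK_SIZE) -> str:
    """Return 'hdfs' for the normal HDFS path, or 'K' / 'M' for small files."""
    if size_bytes >= block_size:
        # Assumption: a file exactly one block in size takes the normal path.
        return "hdfs"
    if size_bytes < MB:
        # Assumption: everything below 1 MB is treated as K-level.
        return "K"
    return "M"


for size in (300 * KB, 5 * MB, 200 * MB):
    print(size, classify(size))
```

Keeping the decision in one pure function makes the thresholds easy to change if the cluster uses a different block size.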

Further, the WebServer is a web server that listens for file read and write requests, and the small file processing server is a server dedicated to processing small files.

Further, the size of a small file is smaller than the block size of the distributed file system.

Further, the small file processing server classifies the small files as follows:

1) After the small file processing server receives an operation to read or write a small file, its small file classification module assigns small files of 1 to 1023 KB to the K level and small files between 1 MB and the block size to the M level;

2) K-level small files use a two-level index. The primary index is a range partition with 1000 records per range and is stored on the small file processing server. The secondary index is the index file that actually records the offset of each small file within the merged large file, and it is stored on the DataNode. M-level small files use a linear index structure, which is stored on the small file processing server.

Further, the cache in the small file processing server holds files related to some of the accessed files together with the corresponding metadata, and uses an update policy to refresh the small files and metadata stored in the cache.

Further, a file write request is processed as follows:

(1.1) The WebServer determines that the request type is a file write request;

(1.2) The WebServer determines whether the file is larger than the preset block size. Files larger than the preset block size are handled by the normal HDFS process, and files smaller than the preset block size are sent to the small file processing server for subsequent processing;

(1.3) The small file processing server further classifies the small files: small files between 1 and 1023 KB are K-level small files, and small files between 1 MB and the preset block size are M-level small files;

(1.4) A two-level index is built for K-level small files, with the primary index stored on the small file processing server and the secondary index stored on the DataNode. A linear index structure is built for M-level small files and stored on the small file processing server.

Further, a file read request is processed as follows:

(2.1) When the small file processing server receives a read request for a small file, it first checks whether the requested small file is in the cache. If so, it takes the file from the cache, returns it to the client, and the read is complete. Otherwise, go to step (2.2);

(2.2) The small file processing server checks whether the metadata of the requested small file is in the cache. If so, it interacts directly with the HDFS client, fetches the small file from HDFS, returns it to the user, and the read is complete. Otherwise, go to step (2.3);

(2.3) Using the file names of the small file and the merged file, the small file processing server maps the requested small file to the merged file that contains it and passes the merged file to the client of the Hadoop distributed file system HDFS;

(2.4) The HDFS client reads the requested merged file from HDFS and obtains the metadata and index information of the merged file, which completes the read.
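The three-stage lookup in steps (2.1) to (2.4) can be sketched as follows. This is an illustrative sketch only: `read_small_file`, `lookup_index`, and `read_range` are hypothetical names, the metadata layout (merged file name, offset, length) is an assumption based on the fields the text mentions, and an in-memory dictionary stands in for HDFS.

```python
from typing import Callable, Dict, Tuple

Meta = Tuple[str, int, int]  # assumed layout: (merged file name, offset, length)


def read_small_file(
    name: str,
    file_cache: Dict[str, bytes],
    meta_cache: Dict[str, Meta],
    lookup_index: Callable[[str], Meta],
    read_range: Callable[[str, int, int], bytes],
) -> Tuple[bytes, str]:
    """Return the file content and the stage that served the request."""
    if name in file_cache:                      # step (2.1): file is cached
        return file_cache[name], "file-cache"
    if name in meta_cache:                      # step (2.2): metadata is cached
        merged, offset, length = meta_cache[name]
        return read_range(merged, offset, length), "meta-cache"
    meta = lookup_index(name)                   # steps (2.3)-(2.4): map to merged file
    data = read_range(*meta)
    meta_cache[name] = meta
    file_cache[name] = data
    return data, "index"


# In-memory stand-ins for HDFS and the index (hypothetical data).
merged_store = {"merged_0": b"hello" + b"world!"}
index = {"a.txt": ("merged_0", 0, 5), "b.txt": ("merged_0", 5, 6)}


def read_range(merged: str, offset: int, length: int) -> bytes:
    return merged_store[merged][offset:offset + length]


files, metas = {}, {}
print(read_small_file("b.txt", files, metas, index.__getitem__, read_range))
print(read_small_file("b.txt", files, metas, index.__getitem__, read_range))
```

The first call falls through to the index and fills both caches, so the second call is served from the file cache without touching the merged file.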

Further, the small file processing server uses a small file separation method: it reads the merged file from HDFS, separates the requested small file from the merged file, and returns it to the user.

Further, the method also includes the following: from the metadata and index information of the merged file obtained in step (2.4), the small file processing server extracts the file name, data block location, offset, and file length of each small file to obtain that small file's metadata, and stores each small file and its metadata in the cache.

Further, the method also includes adding a 32-bit file access identifier, which records the file's access frequency, to the header of each small-file metadata prefetch record and each small-file prefetch record. Time is measured in five-minute intervals. If a request to access the small file arrives within a five-minute interval, the file access identifier is incremented by 1. Otherwise, it is decremented by 1. When the file access identifier reaches 0, the small file is removed from the cache.
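The update policy just described can be sketched as a counter per cached entry that is adjusted once per interval. The class below is a minimal sketch under stated assumptions: the name `AccessCountCache` is hypothetical, a new entry starts with an identifier of 1 (the text does not give the initial value), and the caller is expected to invoke `tick()` every five minutes instead of the class running its own timer.

```python
class AccessCountCache:
    """Cache whose entries carry a 32-bit access identifier."""

    MAX_COUNT = 2**32 - 1  # the identifier is described as a 32-bit field

    def __init__(self) -> None:
        self.data = {}        # file name -> cached content
        self.count = {}       # file name -> file access identifier
        self.touched = set()  # names requested during the current interval

    def put(self, name: str, value: bytes) -> None:
        self.data[name] = value
        self.count[name] = 1  # assumed initial value

    def get(self, name: str):
        if name in self.data:
            self.touched.add(name)
            return self.data[name]
        return None

    def tick(self) -> None:
        """Call once at the end of each five-minute interval."""
        for name in list(self.data):
            if name in self.touched:
                self.count[name] = min(self.count[name] + 1, self.MAX_COUNT)
            else:
                self.count[name] -= 1
            if self.count[name] <= 0:
                # Identifier reached 0: remove the entry from the cache.
                del self.data[name]
                del self.count[name]
        self.touched.clear()


cache = AccessCountCache()
cache.put("hot.txt", b"x")
cache.put("cold.txt", b"y")
cache.get("hot.txt")
cache.tick()
print(sorted(cache.data))
```

With these assumptions a prefetched file that is never requested is dropped after one interval, while a frequently requested file accumulates credit and survives several idle intervals.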

Compared with the prior art, the invention has the following beneficial effects:

(1) Massive numbers of small files can be uploaded to the HDFS file system and indexed quickly. The small file processing server, which is independent of HDFS, builds the small file indexes and processing mechanisms needed to merge, map, and prefetch small files. This greatly reduces the overhead on the NameNode and improves the efficiency with which the system handles small files. It also leaves the existing HDFS system unchanged, which preserves the generality of the system. By further classifying small files and building an index suited to each class, the small file processing server saves index storage space. Its cache holds files related to some of the accessed files together with their metadata, and applies an update policy to improve access efficiency;

(2) Massive numbers of small files are merged into a limited number of large files, and the indexing method is differentiated by small file class, which reduces the memory usage of the NameNode in the Hadoop cluster;

(3) Small file management is packaged as a service, which makes it easier for users to use;

(4) The independent small file processing server does not affect the normal operation of the HDFS file system and improves file access efficiency.

Brief Description of the Drawings

Fig. 1 is a flow chart of the Hadoop-based storage optimization method for hierarchical indexing of small files according to the invention;

Fig. 2 is a schematic diagram of the small file storage process of the invention;

Fig. 3 is a schematic diagram of the small file reading process of the invention;

Fig. 4 is a schematic diagram of the index structure of a K-level index file.

Detailed Description

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the invention and do not limit it. In addition, the technical features of the embodiments described below may be combined with one another as long as they do not conflict.

Fig. 1 shows the Hadoop-based storage optimization method for hierarchical indexing of small files according to the invention. Outside the Hadoop distributed file system HDFS, a web server (WebServer) is added to handle file read and write requests, and a small file processing server is added to process small files. The steps are as follows:

(1) The WebServer receives a file read or write request and determines whether the file is larger than the preset block size. Files larger than the preset block size are handled by the normal HDFS process, and files smaller than the preset block size are sent to the small file processing server for subsequent processing;

(2) The small file processing server further classifies the small files: small files between 1 and 1023 KB are K-level small files, and small files between 1 MB and the preset block size are M-level small files;

(3) A two-level index is built for K-level small files, with the primary index stored on the small file processing server and the secondary index stored on the DataNode. A linear index structure is built for M-level small files and stored on the small file processing server. This reduces the overhead on the small file processing server;

(4) The cache of the small file processing server holds some of the accessed small files and their metadata, and updates the cached small files and metadata according to how frequently the small files are accessed.

In the two-level index, the primary index is a range index, and the secondary index is the index file that actually records the offset of each small file within the merged large file. The index structure is shown in Fig. 4. Each time the secondary index file accumulates 1000 records (the default setting), one range is written to the primary index. A range consists of the index value of the first of those 1000 records, the index value of the last record, the position of the first record in the secondary index file, and the end position of the last record in the secondary index file. Each secondary index record contains the index value of the file name, the relative path of the file, whether the entry is a directory or a file, its position in the merged file, the length of the file, and the name of the merged file.
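The two-level structure described above can be sketched with a sorted secondary index and a small primary range index. This is a simplified sketch under several assumptions: the file name itself is used as the index value, the secondary index is kept sorted by that value, only a subset of the listed record fields is modelled, and a trailing partial range is also recorded so that every record is reachable. The names `Entry`, `Range`, `build_primary`, and `lookup` are hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):      # one secondary-index record (subset of fields)
    key: str                  # index value of the file name (assumed: the name)
    merged_file: str          # name of the merged file
    offset: int               # position inside the merged file
    length: int               # length of the small file


class Range(NamedTuple):      # one primary-index record
    first_key: str            # index value of the first record in the range
    last_key: str             # index value of the last record in the range
    start: int                # position of the first record in the secondary index
    end: int                  # position just past the last record


def build_primary(secondary: List[Entry], range_size: int = 1000) -> List[Range]:
    """Build the range index; `secondary` must be sorted by key."""
    primary = []
    for start in range(0, len(secondary), range_size):
        end = min(start + range_size, len(secondary))
        primary.append(
            Range(secondary[start].key, secondary[end - 1].key, start, end)
        )
    return primary


def lookup(key: str, primary: List[Range], secondary: List[Entry]) -> Optional[Entry]:
    """Find the range that may hold `key`, then scan only that range."""
    firsts = [r.first_key for r in primary]
    i = bisect_right(firsts, key) - 1
    if i < 0 or key > primary[i].last_key:
        return None
    for entry in secondary[primary[i].start:primary[i].end]:
        if entry.key == key:
            return entry
    return None


secondary = [Entry(f"f{i:04d}", "merged_0", i * 10, 10) for i in range(2500)]
primary = build_primary(secondary)
print(len(primary), lookup("f1500", primary, secondary))
```

Only the compact primary index has to live in the memory of the small file processing server; a lookup then touches at most one range of the secondary index.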

The small file processing server is mainly responsible for merging, mapping, and prefetching small files. It has three functions:

(1) When a file is written, it builds an index according to the size of the small file and merges the files. Consistent with the classification above, K-level small files of 1 to 1023 KB use the two-level index structure, and M-level small files between 1 MB and the block size use the linear index structure;

(2) When a file is read, it maps the small file to its merged file and uses the index file to separate the requested small file from the large file;

(3) It also has a cache which, based on file correlation, prefetches some small files when files are separated, and which applies a corresponding cache update policy to improve file access efficiency.

For the HDFS part of the invention, the NameNode layer keeps the original relationship between the NameNode and the SecondaryNameNode, and the NameNode plays the same role as in the original HDFS. At the DataNode layer, each DataNode is divided into two storage regions: the secondary index file of the K-level small files is placed at the head of the DataNode, and the remainder continues to serve as the data storage area.

The independent small file processing server is responsible only for small files. It processes and further classifies them, caches files related to some of the accessed files together with their metadata, and applies the update policy.

Under the architecture of this method, large files and processed small files both wait in a file queue to be written or read. This allows concurrent execution: merging, mapping, or reading small files does not interfere with writing or reading large files or already merged small files, which improves the efficiency of concurrent storage access.

Fig. 2 is the write flow chart of the Hadoop-based storage optimization method for hierarchical indexing of small files. As shown in Fig. 2, the method covers the identification, classification, and writing of small files. The WebServer receives a file write request and determines whether the file is larger than the preset block size. Files larger than the preset block size are handled by the normal HDFS process, and files smaller than the preset block size are sent to the small file processing server for subsequent processing.

Specifically, the write process includes:

(1.1) The WebServer determines the request type. In this step, the detected request type should be a file write request;

(1.2) The WebServer determines whether the file is larger than the preset block size. Files larger than the preset block size are handled by the normal HDFS process, and files smaller than the preset block size are sent to the small file processing server for subsequent processing. In this step, the requested file is smaller than the preset block size;

(1.3) The small file processing server further classifies the small files: small files between 1 and 1023 KB are K-level small files, and small files between 1 MB and the preset block size are M-level small files;

(1.4) A two-level index is built for K-level small files, with the primary index stored on the small file processing server and the secondary index stored on the DataNode. A linear index structure is built for M-level small files and stored on the small file processing server. This reduces the overhead on the small file processing server.
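The write path ends with small files being merged and their positions recorded in an index. The sketch below illustrates only that bookkeeping, under assumptions: `merge_small_files` is a hypothetical helper, files are merged by plain concatenation, and each index record is a tuple of (file name, merged file name, offset, length). How the merged data is then written to HDFS is not shown.

```python
from typing import Iterable, List, Tuple

IndexRecord = Tuple[str, str, int, int]  # (name, merged file, offset, length)


def merge_small_files(
    files: Iterable[Tuple[str, bytes]], merged_name: str
) -> Tuple[bytes, List[IndexRecord]]:
    """Concatenate small files and record where each one starts and ends."""
    buffer = bytearray()
    records: List[IndexRecord] = []
    for name, data in files:
        # The offset of this file is the current length of the merged buffer.
        records.append((name, merged_name, len(buffer), len(data)))
        buffer += data
    return bytes(buffer), records


merged, records = merge_small_files(
    [("a.txt", b"abc"), ("b.txt", b"de"), ("c.txt", b"fghi")], "merged_0"
)
print(merged)
for record in records:
    print(record)
```

The returned records are the raw material for either index type: a flat list for M-level files, or the secondary index of the two-level structure for K-level files.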

图3为本发明基于hadoop的小文件分级索引的存储优化方法的读取流程图,具体的,如图3,所述方法包括小文件的判断,分类及读取:网络服务器WebServer接收文件读取请求,并判断文件是否大于预设块大小,大于预设块大小的文件按照HDFS文件系统的正常流程处理,小于预设块大小的文件读取请求被送到小文件处理服务器进行后续处理;Fig. 3 is the reading flowchart of the storage optimization method of the small file hierarchical index based on hadoop in the present invention, concretely, as Fig. 3, described method comprises the judgment of small file, classification and reading: network server WebServer receives file and reads Request, and determine whether the file is larger than the preset block size, the file larger than the preset block size is processed according to the normal flow of the HDFS file system, and the file read request smaller than the preset block size is sent to the small file processing server for subsequent processing;

具体的,所述读取过程包括:Specifically, the reading process includes:

(2.1)小文件处理服务器接收到小文件的读取请求,首先会检查缓存中是否存在需要读取的小文件,若存在,小文件处理服务器将缓存中的读请求文件取出返回给客户,完成读取操作,否则,执行步骤(2.2);(2.1) When the small file processing server receives a read request for a small file, it will first check whether there is a small file that needs to be read in the cache. If it exists, the small file processing server will take out the read request file in the cache and return it to the client. Read operation, otherwise, perform step (2.2);

(2.2)小文件处理服务器会检查缓存中是否存在读请求小文件的元数据信息,若存在,小文件处理服务器将直接和HDFS客户端交互,将小文件从HDFS中取出返回给用户,完成读取操作,否则,执行步骤(2.3);(2.2) The small file processing server will check whether the metadata information of the read requested small file exists in the cache. If it exists, the small file processing server will directly interact with the HDFS client, take the small file from HDFS and return it to the user, and complete the read Take the operation, otherwise, perform step (2.3);

(2.3)根据小文件和合并文件的文件名,小文件处理服务器将收到的请求读取的小文件映射到小文件的合并文件中,并将合并文件送入Hadoop分布式文件系统HDFS的客户端;(2.3) According to the file name of the small file and the combined file, the small file processing server maps the received small file to the combined file of the small file, and sends the combined file to the client of the Hadoop distributed file system HDFS end;

(2.4)Hadoop分布式文件系统HDFS的客户端,将接收到的请求读取的合并文件,从Hadoop分布式文件系统HDFS中读出,得到合并文件的元数据信息与索引信息,从而完成读取操作;(2.4) The client of the Hadoop distributed file system HDFS reads the merged file received from the Hadoop distributed file system HDFS to obtain the metadata information and index information of the merged file, thereby completing the reading operate;

Specifically, the small-file processing server uses a small-file separation method: it reads the merged file from the Hadoop distributed file system (HDFS), separates the requested small file from the merged file, and returns it to the user.

From the metadata and index information of the merged file obtained in step (2.4), the small-file processing server extracts the file name, data block location, offset, and file length of each small file to obtain that small file's metadata, and stores each small file together with its metadata in the cache.
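The separation and caching described above can be sketched as follows. The four metadata fields (file name, data block location, offset, length) come from the text; everything else is an assumption for illustration, including the `SmallFileMeta` class, the function `separate_and_cache`, the representation of the index as a name-to-`(offset, length)` mapping, and the use of a plain string for the block location.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SmallFileMeta:
    name: str            # small-file name
    block_location: str  # where the merged file's data block lives
    offset: int          # byte offset inside the merged file
    length: int          # small-file length in bytes


def separate_and_cache(
    merged_bytes: bytes,
    index: Dict[str, Tuple[int, int]],
    block_location: str,
    file_cache: Dict[str, bytes],
    meta_cache: Dict[str, SmallFileMeta],
    requested: str,
) -> bytes:
    """Split a merged file into its small files, cache every one of them
    with its metadata, and return the content of the requested file."""
    for name, (offset, length) in index.items():
        meta_cache[name] = SmallFileMeta(name, block_location, offset, length)
        file_cache[name] = merged_bytes[offset:offset + length]
    return file_cache[requested]
```

Caching every small file of the merged file, not only the requested one, follows the text's "each small file"; it acts as a prefetch for neighbouring files.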

A 32-bit file access flag that records the file's access frequency is added to the header of both the small-file metadata prefetch record and the small-file prefetch record. Timing uses a five-minute interval: if a request to access the small file arrives within the five minutes, the file access flag is incremented by 1; otherwise, it is decremented by 1. When the file access flag reaches 0, the small file is removed from the cache.
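One way to read this update rule is sketched below. Several points are assumptions, since the source does not fix them: the class name `AccessCountedCache`, an initial flag value of 1 for a newly cached file, one increment per five-minute window regardless of how many accesses occurred in it, and saturation at the 32-bit maximum. The timer is replaced by an explicit `end_window()` call so that the logic can be followed without waiting.

```python
MAX_U32 = 2**32 - 1  # largest value a 32-bit unsigned flag can hold


class AccessCountedCache:
    def __init__(self) -> None:
        self.data = {}        # small-file name -> cached content
        self.count = {}       # small-file name -> access flag
        self.touched = set()  # names accessed in the current window

    def put(self, name: str, value: bytes) -> None:
        self.data[name] = value
        self.count[name] = 1  # assumed starting value

    def get(self, name: str):
        if name in self.data:
            self.touched.add(name)
            return self.data[name]
        return None

    def end_window(self) -> None:
        """Call once per five-minute interval."""
        for name in list(self.data):
            if name in self.touched:
                # Accessed during the window: flag + 1 (saturating).
                self.count[name] = min(self.count[name] + 1, MAX_U32)
            else:
                # Not accessed during the window: flag - 1.
                self.count[name] -= 1
            if self.count[name] == 0:
                # Flag reached 0: drop the file from the cache.
                del self.data[name]
                del self.count[name]
        self.touched.clear()
```

Under these assumptions, a file that is cached and never read again is evicted at the end of its first window, while a file read in one window survives two further idle windows.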

Those skilled in the art will readily understand that the above is merely a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. A Hadoop-based storage optimization method for hierarchical indexing of small files, characterized in that, outside the Hadoop distributed file system (HDFS), a web server (WebServer) for file read and write requests and a small-file processing server for processing small files are added, the method comprising the following steps:
(1) the WebServer receives a file read or write request and determines whether the file is larger than the preset block size; files larger than the preset block size are handled by the normal flow of the HDFS file system, and files smaller than the preset block size are sent to the small-file processing server for subsequent processing;
(2) the small-file processing server further classifies the small files: small files of 1 to 1023 KB are classified as K-level small files, and small files between 1 MB and the preset block size are classified as M-level small files;
(3) a two-level index is built for K-level small files, the first-level index being stored on the small-file processing server and the second-level index on the datanode; a linear index structure is built for M-level small files and stored on the small-file processing server;
(4) the cache of the small-file processing server holds some of the accessed small files together with their metadata, and updates the cached small files and their metadata according to the access frequency of the small files.

2. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1, characterized in that the WebServer is a web server that listens for file read and write requests, and the small-file processing server is a server used for processing small files.

3. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that the size of a small file is smaller than the block size of the distributed file system.

4. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that the small-file processing server further classifies small files as follows:
1) after the small-file processing server receives an operation to read or write a small file, the small-file classification module classifies small files of 1 to 1023 KB as K-level and small files between 1 MB and the block size as M-level;
2) K-level small files use a two-level index: the first-level index is a range partition, with 1000 records per range, and is stored on the small-file processing server; the second-level index is the index file that actually records the offset of each small file within the merged large file, and is stored on the datanode; M-level small files use a linear index structure stored on the small-file processing server.

5. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that the cache of the small-file processing server holds some of the accessed files together with their metadata, and updates the small files and metadata stored in the cache according to a corresponding update policy.

6. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that a file write request is processed as follows:
(1.1) the WebServer determines that the request type is a file write request;
(1.2) the WebServer determines whether the file is larger than the preset block size; files larger than the preset block size are handled by the normal flow of the HDFS file system, and files smaller than the preset block size are sent to the small-file processing server for subsequent processing;
(1.3) the small-file processing server further classifies the small files: small files of 1 to 1023 KB are classified as K-level small files, and small files between 1 MB and the preset block size are classified as M-level small files;
(1.4) a two-level index is built for K-level small files, the first-level index being stored on the small-file processing server and the second-level index on the datanode; a linear index structure is built for M-level small files and stored on the small-file processing server.

7. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that a file read request is processed as follows:
(2.1) when the small-file processing server receives a read request for a small file, it first checks whether the requested small file is in the cache; if it is, the server takes the file from the cache and returns it to the client, completing the read; otherwise, step (2.2) is performed;
(2.2) the small-file processing server checks whether the metadata of the requested small file is in the cache; if it is, the server interacts directly with the HDFS client, fetches the small file from HDFS, and returns it to the user, completing the read; otherwise, step (2.3) is performed;
(2.3) based on the file names of the small file and the merged file, the small-file processing server maps the requested small file to the merged file that contains it, and passes the merged file to the client of the Hadoop distributed file system (HDFS) as the file to be read;
(2.4) the HDFS client reads the requested merged file from HDFS and obtains the metadata and index information of the merged file, completing the read.

8. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 7, characterized in that the small-file processing server uses a small-file separation method: it reads the merged file from the Hadoop distributed file system (HDFS), separates the requested small file from the merged file, and returns it to the user.

9. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 7, characterized in that it further comprises: from the metadata and index information of the merged file obtained in step (2.4), the small-file processing server extracts the file name, data block location, offset, and file length of each small file to obtain that small file's metadata, and stores each small file together with its metadata in the cache.

10. The Hadoop-based storage optimization method for hierarchical indexing of small files according to claim 1 or 2, characterized in that the method further comprises: adding a 32-bit file access flag that records the file's access frequency to the header of both the small-file metadata prefetch record and the small-file prefetch record; timing uses a five-minute interval; if a request to access the small file arrives within the five minutes, the file access flag is incremented by 1; otherwise, it is decremented by 1; when the file access flag reaches 0, the small file is removed from the cache.
CN201510557719.XA 2015-09-02 2015-09-02 Hadoop-based storage optimizing method for small file hierachical indexing Pending CN105183839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510557719.XA CN105183839A (en) 2015-09-02 2015-09-02 Hadoop-based storage optimizing method for small file hierachical indexing

Publications (1)

Publication Number Publication Date
CN105183839A true CN105183839A (en) 2015-12-23

Family

ID=54905921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510557719.XA Pending CN105183839A (en) 2015-09-02 2015-09-02 Hadoop-based storage optimizing method for small file hierachical indexing

Country Status (1)

Country Link
CN (1) CN105183839A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332027A (en) * 2011-10-15 2012-01-25 西安交通大学 A method for associative storage of massive non-independent small files based on Hadoop
CN103856567A (en) * 2014-03-26 2014-06-11 西安电子科技大学 Small file storage method based on Hadoop distributed file system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
樊超 等: "改善Hadoop文件处理效率的技术研究", 《微电子学与计算机》 *

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701233A (en) * 2016-02-18 2016-06-22 焦点科技股份有限公司 Method for optimizing server cache management
CN107092604A (en) * 2016-02-18 2017-08-25 中国移动通信集团河北有限公司 A kind of document handling method and device
CN107092604B (en) * 2016-02-18 2020-03-20 中国移动通信集团河北有限公司 File processing method and device
CN105701233B (en) * 2016-02-18 2018-12-14 南京焦点领动云计算技术有限公司 A method of optimization server buffer management
CN105843841A (en) * 2016-03-07 2016-08-10 青岛理工大学 Small file storage method and system
CN107402924A (en) * 2016-05-19 2017-11-28 普天信息技术有限公司 MR files apply the implementation method and device in HDFS
CN105956183A (en) * 2016-05-30 2016-09-21 广东电网有限责任公司电力调度控制中心 Method and system for multi-stage optimization storage of a lot of small files in distributed database
CN105956183B (en) * 2016-05-30 2019-04-30 广东电网有限责任公司电力调度控制中心 The multilevel optimization's storage method and system of mass small documents in a kind of distributed data base
CN107391280A (en) * 2017-07-31 2017-11-24 郑州云海信息技术有限公司 A kind of reception of small documents and storage method and device
CN109947703A (en) * 2017-11-09 2019-06-28 北京京东尚科信息技术有限公司 File system, file memory method, storage device and computer-readable medium
CN108287869A (en) * 2017-12-20 2018-07-17 江苏省公用信息有限公司 A kind of mass small documents solution based on speedy storage equipment
CN108366217A (en) * 2018-03-14 2018-08-03 成都创信特电子技术有限公司 Monitor video acquisition and storage method
CN108491488B (en) * 2018-03-14 2021-11-30 成都创信特电子技术有限公司 High-speed medium storing method
CN108366217B (en) * 2018-03-14 2021-04-06 成都创信特电子技术有限公司 Monitoring video acquisition and storage method
CN108491488A (en) * 2018-03-14 2018-09-04 成都创信特电子技术有限公司 Media high speed storing method
CN108174136A (en) * 2018-03-14 2018-06-15 成都创信特电子技术有限公司 Cloud disk video coding and storage method
CN108174136B (en) * 2018-03-14 2021-03-02 成都创信特电子技术有限公司 Cloud disk video coding storage method
CN108806773A (en) * 2018-05-21 2018-11-13 上海熙业信息科技有限公司 Medical image cloud storage platform designing method
CN108932287A (en) * 2018-05-22 2018-12-04 广东技术师范学院 A kind of mass small documents wiring method based on Hadoop
CN108932288B (en) * 2018-05-22 2022-04-12 广东技术师范大学 Hadoop-based mass small file caching method
CN108932288A (en) * 2018-05-22 2018-12-04 广东技术师范学院 A kind of mass small documents caching method based on Hadoop
CN110196841A (en) * 2018-06-21 2019-09-03 腾讯科技(深圳)有限公司 The storage method and device of file, querying method and device and server
CN110196841B (en) * 2018-06-21 2023-12-05 腾讯科技(深圳)有限公司 File storage method and device, query method and device and server
CN108959660A (en) * 2018-08-15 2018-12-07 东北大学 A kind of storage method and application method based on HDFS distributed file system
CN108959660B (en) * 2018-08-15 2021-07-27 东北大学 A storage method and using method based on HDFS distributed file system
CN111258955A (en) * 2018-11-30 2020-06-09 北京白山耘科技有限公司 File reading method and system, storage medium and computer equipment
CN111258955B (en) * 2018-11-30 2023-09-19 北京白山耘科技有限公司 File reading method and system, storage medium and computer equipment
CN109947718A (en) * 2019-02-25 2019-06-28 全球能源互联网研究院有限公司 A data storage method, storage platform and storage device
CN109977074A (en) * 2019-03-11 2019-07-05 北京东方国信科技股份有限公司 A kind of lob data processing method and processing device based on HDFS
CN110046161A (en) * 2019-03-18 2019-07-23 平安普惠企业管理有限公司 Method for writing data and device, storage medium, electronic equipment
CN110059025A (en) * 2019-04-22 2019-07-26 北京电子工程总体研究所 A kind of method and system of cache prefetching
CN110515920A (en) * 2019-08-30 2019-11-29 北京浪潮数据技术有限公司 A kind of mass small documents access method and system based on Hadoop
CN110765086B (en) * 2019-10-25 2022-08-02 浪潮电子信息产业股份有限公司 Directory reading method and system for small files, electronic equipment and storage medium
CN110765086A (en) * 2019-10-25 2020-02-07 浪潮电子信息产业股份有限公司 Directory reading method and system for small files, electronic equipment and storage medium
CN113407620A (en) * 2020-03-17 2021-09-17 北京信息科技大学 Data block placement method and system based on heterogeneous Hadoop cluster environment
CN113407620B (en) * 2020-03-17 2023-04-21 北京信息科技大学 Data block placement method and system based on heterogeneous Hadoop cluster environment
CN112463755A (en) * 2020-12-11 2021-03-09 同济大学 Heterogeneous Internet of things big data storage and reading system and method based on HDFS
CN112597104A (en) * 2021-01-11 2021-04-02 武汉飞骥永泰科技有限公司 Small file performance optimization method and system
CN113204520A (en) * 2021-04-28 2021-08-03 武汉大学 Remote sensing data rapid concurrent read-write method based on distributed file system
CN113204520B (en) * 2021-04-28 2023-04-07 武汉大学 Remote sensing data rapid concurrent read-write method based on distributed file system
CN114546962A (en) * 2022-02-17 2022-05-27 桂林理工大学 Hadoop-based distributed storage system for marine bureau ship inspection big data
CN114637733A (en) * 2022-02-25 2022-06-17 广东电网有限责任公司东莞供电局 A small file preprocessing system for deep learning
CN117519612A (en) * 2024-01-06 2024-02-06 深圳市杉岩数据技术有限公司 Mass small file storage system and method based on index online splicing
CN117519612B (en) * 2024-01-06 2024-04-12 深圳市杉岩数据技术有限公司 Mass small file storage system and method based on index online splicing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20151223