CN116089364A - Storage file management method and device, AI platform and storage medium - Google Patents
- Publication number
- CN116089364A (application number CN202310377465.8A)
- Authority
- CN
- China
- Prior art keywords
- directory
- subdirectory
- database
- queue
- storage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/122—File system administration, e.g. details of archiving or snapshots using management policies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/1727—Details of free space management performed by the file system
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application relates to the field of computers and discloses a storage file management method and device, an AI platform, and a storage medium. The method includes: when the latest modification time of a first directory taken from a stack has changed, obtaining the subdirectories of the first directory, the first directory being stored on an AI platform; judging whether a subdirectory is empty; if the subdirectory is not empty, putting it onto the stack; if the subdirectory is empty, putting into a queue the subdirectory and each parent directory of the subdirectory that meets a preset condition, the preset condition being that all subdirectories of the parent directory have been put into the queue; and taking second directories out of the queue until both the stack and the queue are empty, and performing a corresponding operation on the database used to store directories according to whether the latest modification time of the second directory has changed. The application can shorten statistical management time, improve the efficiency of storage operations, and reduce the consumption of AI platform resources.
Description
Technical Field
The present application relates to the field of computers, and in particular to a storage file management method and device, an AI platform, and a computer-readable storage medium.
Background
An AI (Artificial Intelligence) platform is a platform that manages and schedules computing resources (for example, GPUs (Graphics Processing Units) and CPUs (Central Processing Units)) and storage resources, and that can support business scenarios such as AI training and AI inference at scale.
A prominent feature of storage on an AI platform is that the volume of stored files is massive (TB level and above), so an important basic function of the platform is the statistical management of those files. At present there are several ways to do this. The first is to traverse all stored files directly and derive the size of each directory. The second is to use concurrent statistics by partitioning the storage directories. The third relies on a quota function provided by the storage itself, as in file storage systems such as NFS and BeeGFS. The first method continuously consumes storage IO (Input/Output) resources during traversal, as well as the CPU and memory of the business services, which easily causes other file operations on the storage nodes to stall. Its results are also unsatisfactory: with a massive number of files the size statistics lag behind the actual state, so the results contain errors. The second method also consumes a large amount of resources and easily causes other file operations on the storage nodes to stall. The third method places a heavy load on the network, disk IO, CPU, and memory of the storage, increasing the pressure on AI platform storage.
How to solve the above technical problems is therefore a focus for those skilled in the art.
Summary of the Invention
The purpose of this application is to provide a storage file management method and device, an AI platform, and a computer-readable storage medium that reduce resource consumption and shorten statistical management time.
To solve the above technical problems, this application provides a storage file management method, including:
When the latest modification time of a first directory taken from a stack has changed, obtaining the subdirectories of the first directory; the first directory is stored on an AI platform;
Judging whether a subdirectory is empty;
If the subdirectory is not empty, putting the subdirectory onto the stack;
If the subdirectory is empty, putting into a queue the subdirectory and each parent directory of the subdirectory that meets a preset condition; the preset condition is that all subdirectories of the parent directory have been put into the queue;
Taking second directories out of the queue until both the stack and the queue are empty, and performing a corresponding operation on the database used to store directories according to whether the latest modification time of the second directory has changed.
Optionally, performing a corresponding operation on the database used to store directories according to whether the latest modification time of the second directory has changed includes:
When the latest modification time of the second directory has not changed, determining whether the database needs to be updated according to whether the size of the second directory has changed;
When the latest modification time of the second directory has changed, performing a corresponding operation on the database according to whether the second directory is already saved in the database.
Optionally, determining whether the database needs to be updated according to whether the size of the second directory has changed includes:
Judging whether the size of the second directory has changed;
If the size of the second directory has changed, updating the storage path information of the second directory in the database;
If the size of the second directory has not changed, performing no update on the database.
Optionally, performing a corresponding operation on the database according to whether the second directory is already saved in the database includes:
Judging whether the second directory is saved in the database;
If the second directory is saved in the database, updating the storage path information of the second directory in the database;
If the second directory is not saved in the database, inserting the second directory into the database.
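The decision tree described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `dirs` table, its columns, and the function names `open_db` and `sync_directory` are hypothetical, SQLite is used only because it is an embeddable database, and "storage path information" is assumed here to mean the stored modification time and size. A directory that is absent from the database is treated as having a changed modification time.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create the hypothetical table that holds one row per directory."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dirs ("
        "path TEXT PRIMARY KEY, mtime REAL NOT NULL, size INTEGER NOT NULL)"
    )
    return conn

def sync_directory(conn, path, mtime, size):
    """Apply the decision tree to one dequeued directory; return the action taken."""
    row = conn.execute(
        "SELECT mtime, size FROM dirs WHERE path = ?", (path,)
    ).fetchone()
    # Assumption: a directory with no stored row counts as "mtime changed".
    mtime_changed = row is None or row[0] != mtime
    if not mtime_changed:
        if row[1] == size:
            return "skip"                      # mtime and size unchanged: no update
        conn.execute("UPDATE dirs SET size = ? WHERE path = ?", (size, path))
        return "update"                        # mtime unchanged, size changed
    if row is not None:
        conn.execute(
            "UPDATE dirs SET mtime = ?, size = ? WHERE path = ?", (mtime, size, path)
        )
        return "update"                        # mtime changed, row already saved
    conn.execute(
        "INSERT INTO dirs (path, mtime, size) VALUES (?, ?, ?)", (path, mtime, size)
    )
    return "insert"                            # mtime changed, row not yet saved
```

Returning the action name keeps the sketch easy to check; a real service would more likely log it or batch the writes in one transaction.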
Optionally, the method further includes:
Building an index on the tables in the database.
Optionally, the method further includes:
Storing the second directory in sub-tables according to the storage path information.
Optionally, the method further includes:
Judging whether the latest modification time of the second directory has changed.
Optionally, the database is one that can be embedded in a microservice of the AI platform, can be migrated, and requires no configuration or installation.
Optionally, taking the second directory out of the queue includes:
Taking out of the queue, at the same time, second directories that are not parent and child of one another.
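One way to pick directories that can be dequeued together is sketched below. It is an illustration under assumptions: the names `related` and `take_independent_batch` are hypothetical, paths are POSIX-style strings, and "not parent and child of one another" is interpreted conservatively as "neither is an ancestor of the other". The batch stops at the first conflict so that the queue's first-in, first-out order (children before parents) is preserved.

```python
from collections import deque
from pathlib import PurePosixPath

def related(a, b):
    """Return True if the two paths are equal or one is an ancestor of the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents

def take_independent_batch(queue):
    """Pop directories from the head of the queue while each one is unrelated
    to every directory already in the batch."""
    batch = []
    while queue and not any(related(queue[0], d) for d in batch):
        batch.append(queue.popleft())
    return batch
```

Each returned batch could then be handed to parallel workers, since no directory in it depends on the result of another.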
Optionally, the method further includes:
Deleting from the database any directory that exists in the database but no longer exists in the underlying storage; the deleted directories include the parent directory and all subdirectories under it.
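A minimal sketch of this cleanup, assuming a hypothetical `dirs` table with a `path` column and POSIX-style paths. The function name `delete_missing` and the injectable `exists` check are illustrative choices, not part of the source. A prefix comparison via `substr` is used instead of `LIKE` so that `%` or `_` characters in directory names are not treated as wildcards.

```python
import os
import sqlite3

def delete_missing(conn: sqlite3.Connection, exists=os.path.isdir):
    """Delete rows whose directory is gone from storage, together with all rows
    beneath that directory. Returns the number of rows removed."""
    paths = [p for (p,) in conn.execute("SELECT path FROM dirs ORDER BY path")]
    removed = 0
    for path in paths:
        if not exists(path):
            prefix = path.rstrip("/") + "/"
            cur = conn.execute(
                "DELETE FROM dirs WHERE path = ? OR substr(path, 1, ?) = ?",
                (path, len(prefix), prefix),
            )
            removed += cur.rowcount  # 0 if an ancestor's delete already removed it
    return removed
```

The `exists` parameter defaults to a real filesystem check and can be replaced with a stub when the storage is not mounted.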
Optionally, if the subdirectory is not empty, the method further includes:
Recording the information of the parent directory of the subdirectory.
Optionally, if the subdirectory is empty, putting into the queue the subdirectory and each parent directory of the subdirectory that meets the preset condition includes:
If the subdirectory is empty, putting the subdirectory into the queue;
Judging whether all subdirectories of the parent directory of the subdirectory have been put into the queue;
If all subdirectories of the parent directory have been put into the queue, putting the parent directory into the queue, and repeating this upward until a parent directory that does not meet the preset condition is reached.
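The upward cascade can be sketched as follows, under the assumption that the traversal has already recorded each directory's parent and child list. The names `enqueue_with_parents`, `parent_of`, `children_of`, and `queued` are hypothetical bookkeeping structures introduced for illustration.

```python
from collections import deque

def enqueue_with_parents(leaf, parent_of, children_of, queue, queued):
    """Queue an empty subdirectory, then each ancestor in turn for as long as
    all of that ancestor's subdirectories are already in the queue."""
    queue.append(leaf)
    queued.add(leaf)
    parent = parent_of.get(leaf)
    while (
        parent is not None
        and parent not in queued
        and all(child in queued for child in children_of[parent])
    ):
        queue.append(parent)
        queued.add(parent)
        parent = parent_of.get(parent)  # continue one level up
```

For a directory A with two empty subdirectories B1 and B2, calling the function for B1 queues only B1; calling it for B2 queues B2 and then A.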
Optionally, before obtaining the subdirectories of the first directory when the latest modification time of the first directory taken from the stack has changed, the method includes:
Judging whether the latest modification time of the first directory taken from the stack has changed.
Optionally, before judging whether the latest modification time of the first directory taken from the stack has changed, the method further includes:
Putting the first directory onto the stack;
Taking the first directory off the stack.
Optionally, obtaining the subdirectories of the first directory includes:
Traversing the first directory in a cleanable manner to obtain the subdirectories of the first directory.
Optionally, traversing the first directory in a cleanable manner includes:
Opening the first directory with an open function and reading the first directory with a read function.
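The source does not name the open and read functions. As one possible reading, the sketch below uses Python's `os.scandir`, which opens a directory, yields its entries one at a time, and releases the handle when the `with` block exits. Interpreting "cleanable" as "the directory handle is always released" is an assumption, and `list_subdirectories` is a hypothetical name.

```python
import os

def list_subdirectories(path):
    """Open a directory, read its entries one by one, and return the paths of
    its immediate subdirectories in sorted order."""
    subdirs = []
    with os.scandir(path) as entries:      # open the directory
        for entry in entries:              # read one entry at a time
            if entry.is_dir(follow_symlinks=False):
                subdirs.append(entry.path)
    return sorted(subdirs)                 # handle is closed on leaving the with block
```

Reading entries lazily avoids building a full listing of every file in a very large directory just to find its subdirectories.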
Optionally, the method further includes:
When the latest modification time of the first directory taken from the stack has not changed, obtaining the subdirectories of the first directory from the database.
This application also provides a storage file management device, including:
A first acquisition module, configured to obtain the subdirectories of the first directory when the latest modification time of the first directory taken from the stack has changed; the first directory is stored on an AI platform;
A first judging module, configured to judge whether a subdirectory is empty;
A first storing module, configured to put the subdirectory onto the stack if the subdirectory is not empty;
A second storing module, configured to put into the queue, if the subdirectory is empty, the subdirectory and each parent directory of the subdirectory that meets the preset condition; the preset condition is that all subdirectories of the parent directory have been put into the queue;
A removal and processing module, configured to take second directories out of the queue until both the stack and the queue are empty, and to perform a corresponding operation on the database used to store directories according to whether the latest modification time of the second directory has changed.
This application also provides an AI platform, including:
A memory, configured to store a computer program;
A processor, configured to implement the steps of any one of the storage file management methods described above when executing the computer program.
This application also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of any one of the storage file management methods described above are implemented.
The storage file management method provided by this application includes: when the latest modification time of a first directory taken from a stack has changed, obtaining the subdirectories of the first directory, the first directory being stored on an AI platform; judging whether a subdirectory is empty; if the subdirectory is not empty, putting the subdirectory onto the stack; if the subdirectory is empty, putting into a queue the subdirectory and each parent directory of the subdirectory that meets a preset condition, the preset condition being that all subdirectories of the parent directory have been put into the queue; and taking second directories out of the queue until both the stack and the queue are empty, and performing a corresponding operation on the database used to store directories according to whether the latest modification time of the second directory has changed.
In this application, the first directory is held on a stack. The first directory is taken off the stack and, when its latest modification time has changed, its subdirectories are obtained; whether each subdirectory is empty determines whether it is put into the queue or onto the stack. While the queue holds directories, they are taken out until the stack and the queue are both empty, and a corresponding operation is performed on the database according to whether the latest modification time of each directory taken from the queue has changed. A stack is a first-in, last-out data structure and a queue is a first-in, first-out data structure, so by repeating "pop, push, pop and enqueue, dequeue" until the stack and the queue are both empty, this application achieves "bottom-up" storage statistics management of directories. Because entries are continuously removed from the stack and the queue during this process, the system's memory does not overflow, and because only changed directories are counted, the time and platform resources wasted by full statistics are avoided.
On the one hand, therefore, this application occupies very few AI platform resources. It reduces the consumption of storage CPU, memory, IO, and other resources, lowers the storage and load pressure on the AI platform, prevents other file operations on the storage nodes from stalling, improves the training efficiency of artificial intelligence models, reduces the resource consumption and long-term occupation of business modules, and enhances the performance of AI platform storage. On the other hand, this application shortens statistical management time, improves the efficiency of storage operations across the AI platform, shortens model training time, reduces operation and maintenance costs, and improves the market competitiveness of the AI platform.
In addition, this application also provides a device, an AI platform, and a computer-readable storage medium having the above advantages.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a first flowchart of a storage file management method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the stack data structure provided by an embodiment of this application;
FIG. 3 is a schematic diagram of the queue data structure provided by an embodiment of this application;
FIG. 4 is a second flowchart of a storage file management method provided by an embodiment of this application;
FIG. 5 is a third flowchart of a storage file management method provided by an embodiment of this application;
FIG. 6 is a fourth flowchart of a storage file management method provided by an embodiment of this application;
FIG. 7 is a fifth flowchart of a storage file management method provided by an embodiment of this application;
FIG. 8 is a sixth flowchart of a storage file management method provided by an embodiment of this application;
FIG. 9 is a seventh flowchart of a storage file management method provided by an embodiment of this application;
FIG. 10 is a structural block diagram of the storage file management device provided by an embodiment of this application;
FIG. 11 is a structural block diagram of the AI platform provided by an embodiment of this application;
FIG. 12 is a framework diagram of the storage statistics management system on the AI platform provided by an embodiment of this application.
Detailed Description
To help those skilled in the art better understand the solution of this application, the application is described in further detail below with reference to the drawings and specific embodiments. The described embodiments are obviously only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the scope of protection of this application.
As noted in the Background section, because a massive number of files are stored on the AI platform, the related techniques for statistical management of those files have the following drawbacks. First, they consume various AI platform resources and increase the pressure on AI platform storage, and they easily cause other file operations on the storage nodes to stall. Second, statistical management takes a long time.
In view of this, this application provides a storage file management method. Referring to FIG. 1, the method includes:
Step S101: When the latest modification time of the first directory taken from the stack has changed, obtain the subdirectories of the first directory; the first directory is stored on an AI platform.
In this application, the AI platform may be an AI cluster platform.
A stack is a linear storage structure in which data can be accessed from one end only and which follows the "first in, last out" principle.
Given these characteristics, in practice only two operations are usually performed on a stack. The first adds data to the stack and is called "pushing". The second removes data from the stack and is called "popping". The position of the data pushed first is called the "bottom of the stack", and the position of the data pushed last is called the "top of the stack".
For example, as shown in FIG. 2, given n directories a1, a2, ..., an, the directories are pushed onto the stack in that order. When directories are taken off the stack, they come out in the reverse of the order in which they were pushed: directory an is taken out first, then a(n-1), and so on down to a2, and finally a1.
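The push and pop order in this example can be illustrated with a plain Python list used as a stack. The function name `pop_order` is hypothetical and exists only for this illustration.

```python
def pop_order(directories):
    """Push every directory onto a stack, then pop until the stack is empty,
    and return the order in which the directories came off."""
    stack = []
    for d in directories:
        stack.append(d)            # push onto the top of the stack
    order = []
    while stack:
        order.append(stack.pop())  # pop from the top of the stack
    return order
```

Pushing a1, a2, a3 in that order therefore yields a3, a2, a1 on the way out.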
The first directory is held on the stack; taking the first directory from the stack means popping it.
This application is based on the characteristics of files in the Linux operating system. In the file system only files have a size; a directory carries no information about the space it actually occupies. Performing the statistics at directory level in AI platform storage reduces the consumption of storage CPU, memory, IO, and other resources, which lowers the storage and load pressure on the AI platform and enhances the performance of AI platform storage.
Operations that can change the latest modification time of the first directory include, but are not limited to, adding a soft link, adding a hard link, compressing and decompressing, renaming, and adding, deleting, or modifying hidden files.
Adding, deleting, or modifying a file changes the latest modification time of its parent directory, but only of that parent; the latest modification time of the parent's own parent does not change. In other words, a change in a stored directory affects only the directory one level above it and no other directories. Therefore, when computing directory sizes, the statistics must be driven by changes in each directory's latest modification time and must proceed from the bottommost directories toward the outermost directory.
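The claim that a change propagates only one level up can be checked with a small experiment, sketched below for the file-creation case on a POSIX-style file system. The function name and directory names are hypothetical. Both directories are backdated first so that the result does not depend on the file system's timestamp resolution.

```python
import os
import tempfile

def mtime_after_file_creation():
    """Create grandparent/parent, add a file inside parent, and report which
    directory modification times changed as a result."""
    with tempfile.TemporaryDirectory() as root:
        grandparent = os.path.join(root, "grandparent")
        parent = os.path.join(grandparent, "parent")
        os.makedirs(parent)
        # Backdate both directories so a fresh timestamp is clearly different.
        for d in (parent, grandparent):
            os.utime(d, (1_000_000_000, 1_000_000_000))
        before = {d: os.stat(d).st_mtime for d in (parent, grandparent)}
        with open(os.path.join(parent, "data.bin"), "wb") as f:
            f.write(b"x")
        return {
            "parent_changed": os.stat(parent).st_mtime != before[parent],
            "grandparent_changed": os.stat(grandparent).st_mtime != before[grandparent],
        }
```

This is why a top-down check of a single root's modification time is not enough: a change deep in the tree leaves every ancestor above the direct parent untouched.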
It should be noted that this application does not limit the type of the first directory; it depends on the situation. For example, the first directory includes, but is not limited to, a user home directory, a public directory, a dataset directory, or a model directory.
Because the first directory can be a user home directory, a public directory, a dataset directory, a model directory, and so on, the storage file management method in this application can be applied to different AI business scenarios, for example to compute the sizes of public directories and dataset directories.
A subdirectory of the first directory is a directory one level below the first directory. This application does not specifically limit the number of subdirectories under the first directory; it depends on the situation. For example, the first directory may have one subdirectory, or two or more.
Step S102: Judge whether the subdirectory is empty.
It should be noted that judging whether a subdirectory is empty means judging whether the subdirectory itself has subdirectories, that is, whether any directory exists one level below it.
Step S103: If the subdirectory is not empty, put the subdirectory onto the stack.
If a subdirectory of the first directory is not empty, that is, there is a further directory level below it, the subdirectory is pushed onto the stack.
In one embodiment of this application, if the subdirectory is not empty, the method further includes:
Recording the information of the parent directory of the subdirectory, so that the database is not queried repeatedly.
步骤S104:若子目录为空,则将子目录、满足预设条件的子目录的父目录放入队列中;Step S104: if the subdirectory is empty, put the subdirectory and the parent directory of the subdirectory satisfying the preset condition into the queue;
预设条件为父目录的所有子目录全部放入队列中。The default condition is that all subdirectories of the parent directory are put into the queue.
若第一目录的子目录为空,即子目录下没有下一层目录了,即此时从栈中取出的第一目录已经是底层目录,然后将子目录放入队列中,然后将符合预设条件的所有父目录放入队列中。If the subdirectory of the first directory is empty, that is, there is no next directory under the subdirectory, that is, the first directory taken out from the stack at this time is already the bottom directory, and then the subdirectory is put into the queue, and then it will meet the preset All parent directories of the set condition are put into the queue.
例如,当A目录的下一层只有一个B目录,且B目录没有下一层目录时,当将B目录放入队列后,将A目录也放入队列。当A目录的下一层有B1目录和B2目录,且B1目录和B2目录没有下一层目录时,当B1目录和B2目录全部放入队列后,再将A目录也放入队列。For example, when there is only one B directory under the A directory, and the B directory has no next layer directory, after the B directory is put into the queue, the A directory is also put into the queue. When there are B1 directory and B2 directory in the lower layer of A directory, and there is no lower layer directory in B1 directory and B2 directory, after B1 directory and B2 directory are all put into the queue, then A directory is also put into the queue.
A queue, like a stack, is a linear storage structure with strict rules for how data is stored and retrieved. Unlike a stack, a queue is open at both ends: data may enter only at one end and leave only at the other. The end where data enters is usually called the tail of the queue and the end where data leaves is called the head; adding an element is called enqueuing and removing one is called dequeuing.
Data moves through a queue on a first-in, first-out basis: the data that entered the queue first is also the first to leave it.
For example, as shown in FIG. 3, given n directories a1, a2, ..., an, the directories are put into the queue in order: a1 first, then a2, ..., and finally an. When directories are taken out of the queue, the n directories leave in the same order in which they were put in: a1 first, then a2, ..., and finally an.
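The difference between the two structures can be seen in a few lines. The sketch below uses Python's `collections.deque` as the queue and a plain list as the stack; the directory names are placeholders.

```python
from collections import deque

directories = ["a1", "a2", "a3"]

# Queue: enqueue at the tail, dequeue from the head (first in, first out).
queue = deque()
for name in directories:
    queue.append(name)
dequeued = [queue.popleft() for _ in range(len(directories))]

# Stack: push and pop at the same end (first in, last out).
stack = []
for name in directories:
    stack.append(name)
popped = [stack.pop() for _ in range(len(directories))]
```

Here `dequeued` preserves the insertion order while `popped` reverses it, which is why the method descends with a stack and counts upward with a queue.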
Step S105: Take the second directory out of the queue until no directories remain in the stack or the queue, and, according to how the latest modification time of the second directory has changed, perform the corresponding operation on the database used to store directories.
The cycle of "pop from the stack, push onto the stack, pop and enqueue, dequeue" is repeated until both the stack and the queue are empty. Subsequent processing then depends on whether the latest modification time of the second directory taken out of the queue has changed; the specific processing is described in the embodiments below.
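The whole cycle can be sketched on an in-memory tree. This is an illustration under stated assumptions, not the patented implementation: `bottom_up_order` and the `tree` dictionary are hypothetical, the modification-time check on the first directory is omitted, and dequeuing starts only once the stack is exhausted.

```python
from collections import deque

def bottom_up_order(tree, root):
    """Return directories so that every subdirectory precedes its parent.

    `tree` maps each directory to its list of subdirectories and stands in
    for the real storage.
    """
    parent_of = {c: p for p, children in tree.items() for c in children}
    stack, queue, queued, order = [root], deque(), set(), []

    def enqueue(directory):
        queue.append(directory)
        queued.add(directory)
        parent = parent_of.get(directory)
        while parent is not None and all(c in queued for c in tree[parent]):
            queue.append(parent)
            queued.add(parent)
            parent = parent_of.get(parent)

    while stack or queue:
        if stack:
            current = stack.pop()
            if not tree.get(current):       # the root itself has no subdirectories
                enqueue(current)
            for child in tree.get(current, []):
                if tree.get(child):
                    stack.append(child)     # not empty: descend into it later
                else:
                    enqueue(child)          # empty: ready to be counted
        else:
            order.append(queue.popleft())   # stack finished: count upward
    return order

tree = {"/": ["a", "b"], "a": ["a1", "a2"], "b": []}
order = bottom_up_order(tree, "/")
```

Because a parent is enqueued only after all of its subdirectories, its size can be computed from values that are already known when it is dequeued.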
The database stores only directories, not files. Keeping directory information in the database makes it quick and convenient to look up a file directory among a massive number of storage directories.
As one possible implementation, the database is one that can be embedded in a microservice of the AI platform, is portable, and requires no configuration or installation. The AI platform may contain one or more microservices, and the database can be embedded in one or more of them as needed.
Preferably, the database is an SQLite3 database.
SQLite3 is a lightweight database for storage management: an ACID-compliant relational database management system that implements a self-contained, serverless, zero-configuration, transactional SQL database engine. In addition, the SQLite3 database does not interfere with the database used by the AI platform's business modules and does not need to be installed in those modules. It can be migrated conveniently along with the storage and can accompany AI platform upgrades and patches without losing the stored data.
When SQLite3 is operated through multiple data sources, the use of other databases, such as MySQL, is not affected. The SQLite3 database requires no installation or configuration, and with an index, queries over tens of millions of records complete within milliseconds (ms). SQLite3 creates a single binary file (database-name.db) on the operating system; if the AI platform machine goes down, the file only needs to be copied to another machine to remain usable. The database supports a maximum size of 128 TiB. It is therefore well suited to storing storage-directory data in AI scenarios, and the size of a storage directory can be obtained quickly.
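The zero-configuration and single-file properties can be illustrated with Python's built-in `sqlite3` module. This is a sketch only: the table name `directory_info`, its columns, and the file names are assumptions chosen for illustration, and the sample row is made up.

```python
import os
import shutil
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "storage.db")

# Connecting creates the database file on first use; no server or setup is needed.
conn = sqlite3.connect(db_path)
conn.execute(
    "CREATE TABLE directory_info ("
    "path TEXT PRIMARY KEY, mtime REAL, size INTEGER, owner TEXT)"
)
conn.execute(
    "INSERT INTO directory_info VALUES (?, ?, ?, ?)",
    ("/data/train", 1700000000.0, 4096, "alice"),
)
conn.commit()
conn.close()

# Migrating the database amounts to copying the single file.
copy_path = os.path.join(workdir, "storage_copy.db")
shutil.copy(db_path, copy_path)

copied = sqlite3.connect(copy_path)
row = copied.execute(
    "SELECT size, owner FROM directory_info WHERE path = ?", ("/data/train",)
).fetchone()
copied.close()
```

The copy is taken after the connection is committed and closed, so the file on disk is complete when it is duplicated.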
Optionally, in one embodiment of the present application, before the second directory is taken out of the queue, the method further includes:
Determining whether a second directory exists in the queue;
If a second directory exists in the queue, performing the step of taking the second directory out of the queue;
If no second directory exists in the queue, waiting for the stack to finish.
After the stack has finished, directories are dequeued, that is, the second directory is taken out of the queue, so that directories are counted from the bottom up.
In this embodiment the first directory is stored in the stack. The first directory is popped from the stack and, when its latest modification time has changed, its subdirectories are obtained; whether each subdirectory is empty determines whether it is put into the queue or pushed onto the stack. When the queue contains directories, they are taken out of the queue until both the stack and the queue are empty, and the corresponding operation is performed on the database according to whether the latest modification time of each dequeued directory has changed. A stack is a first-in, last-out data structure and a queue is a first-in, first-out data structure, so by repeating "pop from the stack, push onto the stack, pop and enqueue, dequeue" until both are empty, the present application implements bottom-up storage statistics management of directories. During statistics management, entries are continually removed from the stack and the queue, so system memory does not overflow, and only changed directories are counted, which avoids the time and platform resources wasted by a full count.
Therefore, on the one hand, the present application occupies very few AI platform resources, reduces the consumption of CPU, memory, IO, and other resources on AI platform storage, reduces storage and load pressure on the AI platform, prevents operations on other files on the storage node from stalling, improves artificial intelligence model training efficiency, reduces the resource consumption and long-term occupation of business modules, and enhances the storage performance of the AI platform. On the other hand, the present application shortens statistics management time, improves the efficiency of storage operations across the AI platform, shortens model training time, reduces operation and maintenance costs, and improves the market competitiveness of the AI platform.
On the basis of the foregoing embodiment, in one embodiment of the present application, referring to FIG. 4, the storage file management method includes:
Step S201: When the latest modification time of the first directory taken from the stack has changed, obtain the subdirectories of the first directory; the first directory is stored on the AI platform.
Step S202: Determine whether the subdirectory is empty.
Step S203: If the subdirectory is not empty, push the subdirectory onto the stack.
Step S204: If the subdirectory is empty, put the subdirectory, and each parent directory of the subdirectory that satisfies the preset condition, into the queue; the preset condition is that all subdirectories of the parent directory have been put into the queue.
Step S205: Take the second directory out of the queue until no directories remain in the stack or the queue.
Optionally, in one embodiment of the present application, after the second directory is taken out of the queue until no directories remain in the stack or the queue, the method further includes:
Determining whether the latest modification time of the second directory has changed.
Step S206: When the latest modification time of the second directory has not changed, determine whether the database needs to be updated according to whether the size of the second directory has changed.
Operations that do not change the latest modification time of the second directory include modifying file permissions and the like.
As one possible implementation, determining whether the database needs to be updated according to whether the size of the second directory has changed includes:
Step S2061: Determine whether the size of the second directory has changed.
Step S2062: If the size of the second directory has changed, update the storage path information of the second directory in the database.
The storage path information is the directory path of the second directory.
The directory information stored in the database includes the storage path information of the directory, its latest modification time, its size, its owner, and the like.
Step S2063: If the size of the second directory has not changed, do not update the database.
When the size of the second directory has not changed, no processing of the database is needed.
Step S207: When the latest modification time of the second directory has changed, perform the corresponding operation on the database according to whether the second directory is stored in the database.
Operations that can change the latest modification time of the second directory include, but are not limited to, adding a soft link, adding a hard link, compressing and decompressing, renaming, and adding, deleting, or modifying hidden files.
As one possible implementation, performing the corresponding operation on the database according to whether the second directory is stored in the database includes:
Step S2071: Determine whether the second directory is stored in the database.
Step S2072: If the second directory is stored in the database, update the storage path information of the second directory in the database.
When the latest modification time of the second directory has changed and the second directory is already stored in the database, the storage path information of the second directory in the database needs to be updated.
Step S2073: If the second directory is not stored in the database, insert the second directory into the database.
When the latest modification time of the second directory has changed and the second directory is not stored in the database, this is the first statistics traversal, so the second directory needs to be inserted into the database.
The database is updated to ensure that each directory is queried only once.
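The decisions in steps S206 and S207 can be sketched as a single function. This is a hedged illustration: `sync_directory` and the `directory_info` table are hypothetical names, a directory that is absent from the database is treated as changed, and "updating the storage path information" is modelled simply as rewriting the stored row.

```python
import sqlite3

def sync_directory(conn, path, mtime, size):
    """Apply the S206/S207 decision to one dequeued directory and report the action."""
    row = conn.execute(
        "SELECT mtime, size FROM directory_info WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        # Changed and not stored yet: first traversal, so insert (S2073).
        conn.execute(
            "INSERT INTO directory_info (path, mtime, size) VALUES (?, ?, ?)",
            (path, mtime, size),
        )
        return "insert"
    stored_mtime, stored_size = row
    if mtime != stored_mtime or size != stored_size:
        # The modification time changed (S2072) or only the size did (S2062).
        conn.execute(
            "UPDATE directory_info SET mtime = ?, size = ? WHERE path = ?",
            (mtime, size, path),
        )
        return "update"
    return "skip"  # nothing changed, so the database is left alone (S2063)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directory_info (path TEXT PRIMARY KEY, mtime REAL, size INTEGER)"
)
actions = [
    sync_directory(conn, "/data/train", 100.0, 10),  # not stored yet
    sync_directory(conn, "/data/train", 100.0, 10),  # unchanged
    sync_directory(conn, "/data/train", 200.0, 10),  # modification time changed
    sync_directory(conn, "/data/train", 200.0, 25),  # only the size changed
]
```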
It should be noted that, for steps S201 to S205, reference may be made to the foregoing embodiment; the details are not repeated here.
To improve retrieval efficiency, obtain directory size information quickly, and reduce the CPU, IO, and other resources consumed on AI platform storage during statistics, on the basis of any of the above embodiments, in one embodiment of the present application the storage file management method further includes:
Creating an index on the tables in the database.
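A minimal sketch of such an index follows. The text does not say which columns are indexed; indexing a hypothetical `parent` column, so that the subdirectories of a directory can be listed without scanning the whole table, is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directory_info ("
    "path TEXT PRIMARY KEY, parent TEXT, mtime REAL, size INTEGER)"
)
# Index the parent column so that child lookups avoid a full table scan.
conn.execute("CREATE INDEX idx_directory_parent ON directory_info (parent)")
conn.executemany(
    "INSERT INTO directory_info VALUES (?, ?, ?, ?)",
    [
        ("/data", None, 100.0, 30),
        ("/data/a", "/data", 100.0, 10),
        ("/data/b", "/data", 100.0, 20),
    ],
)
children = [
    row[0]
    for row in conn.execute(
        "SELECT path FROM directory_info WHERE parent = ? ORDER BY path", ("/data",)
    )
]
index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'directory_info'"
    )
]
```

The `path` column needs no separate index here, since SQLite already maintains one for a `PRIMARY KEY`.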
Further, in one embodiment of the present application, the storage file management method further includes:
Storing second directories in separate tables according to their storage path information.
Two or more tables are created in the database, and second directories are stored in separate tables according to their storage path information. This avoids the pressure that would otherwise be placed on the AI platform's own database, makes the AI platform better suited to file lookup in large-scale clusters, and allows directory information to be found quickly, improving retrieval efficiency.
When tables have been created in the database, updating the database means updating the tables in the database.
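One way to split directories across tables is sketched below. The text only says the split is based on the storage path; choosing the table from a CRC32 hash of the path, the table count of 4, and the names `table_for`, `save`, and `load_size` are all assumptions made for illustration.

```python
import sqlite3
import zlib

TABLE_COUNT = 4  # hypothetical number of sub-tables

def table_for(path):
    """Pick a sub-table from the path; CRC32 keeps the choice stable across runs."""
    return f"directory_info_{zlib.crc32(path.encode('utf-8')) % TABLE_COUNT}"

conn = sqlite3.connect(":memory:")
for i in range(TABLE_COUNT):
    conn.execute(
        f"CREATE TABLE directory_info_{i} ("
        "path TEXT PRIMARY KEY, mtime REAL, size INTEGER)"
    )

def save(path, mtime, size):
    """Insert or overwrite the row for `path` in its sub-table."""
    conn.execute(
        f"INSERT OR REPLACE INTO {table_for(path)} VALUES (?, ?, ?)",
        (path, mtime, size),
    )

def load_size(path):
    """Return the stored size of `path`, or None if it is not stored."""
    row = conn.execute(
        f"SELECT size FROM {table_for(path)} WHERE path = ?", (path,)
    ).fetchone()
    return row[0] if row else None

save("/data/train", 1700000000.0, 4096)
save("/data/train", 1700000500.0, 8192)   # a later update lands in the same table
stored_size = load_size("/data/train")
```

Because the table name is derived from the path alone, a lookup touches exactly one sub-table.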
Referring to FIG. 5, on the basis of any of the above embodiments, in one embodiment of the present application, the storage file management method includes:
Step S301: When the latest modification time of the first directory taken from the stack has changed, obtain the subdirectories of the first directory; the first directory is stored on the AI platform.
Step S302: Determine whether the subdirectory is empty.
Step S303: If the subdirectory is not empty, push the subdirectory onto the stack.
Step S304: If the subdirectory is empty, put the subdirectory, and each parent directory of the subdirectory that satisfies the preset condition, into the queue; the preset condition is that all subdirectories of the parent directory have been put into the queue.
Step S305: Take second directories that are not parent and child of one another out of the queue at the same time, until no directories remain in the stack or the queue.
Step S306: Determine whether the latest modification time of the second directory has changed.
Step S307: When the latest modification time of the second directory has not changed, determine whether the database needs to be updated according to whether the size of the second directory has changed.
Step S308: When the latest modification time of the second directory has changed, perform the corresponding operation on the database according to whether the second directory is stored in the database.
It should be noted that, for steps S301 to S304 and steps S306 to S308, reference may be made to the foregoing embodiments; the details are not repeated here.
In this embodiment, when second directories are taken out of the queue, those that are not parent and child of one another are taken out at the same time; that is, directory data with no parent-child relationship is counted asynchronously. This speeds up the upward statistics, shortens statistics management time, and improves statistics efficiency.
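Concurrent dequeuing can be sketched with a thread pool. The following is an illustration under assumptions, not the patented implementation: the helper names are hypothetical, directories are identified by slash-separated paths, and the parent-child test is widened to any ancestor relationship so that a directory is never processed alongside something beneath it.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def is_ancestor(a, b):
    """True if directory `a` is a proper ancestor of directory `b`."""
    return a != b and b.startswith(a.rstrip("/") + "/")

def next_batch(queue):
    """Dequeue from the head until the next directory is related to one in the batch."""
    batch = []
    while queue and not any(
        is_ancestor(queue[0], d) or is_ancestor(d, queue[0]) for d in batch
    ):
        batch.append(queue.popleft())
    return batch

def drain(queue, handle, workers=4):
    """Process the queue in batches of unrelated directories; return the batches."""
    batches = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while queue:
            batch = next_batch(queue)
            list(pool.map(handle, batch))  # finish this batch before moving upward
            batches.append(batch)
    return batches

handled = []
queue = deque(["/r/b", "/r/a/a1", "/r/a/a2", "/r/a", "/r"])
batches = drain(queue, handled.append)
```

Waiting for each batch to finish before taking the next one preserves the bottom-up order, so a parent is handled only after all of its subdirectories.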
Referring to FIG. 6, on the basis of any of the above embodiments, in one embodiment of the present application, the storage file management method includes:
Step S401: When the latest modification time of the first directory taken from the stack has changed, obtain the subdirectories of the first directory; the first directory is stored on the AI platform.
Step S402: Determine whether the subdirectory is empty.
Step S403: If the subdirectory is not empty, push the subdirectory onto the stack.
Step S404: If the subdirectory is empty, put the subdirectory, and each parent directory of the subdirectory that satisfies the preset condition, into the queue; the preset condition is that all subdirectories of the parent directory have been put into the queue.
Step S405: Take the second directory out of the queue until no directories remain in the stack or the queue, and, according to how the latest modification time of the second directory has changed, perform the corresponding operation on the database used to store directories.
Step S406: Delete from the database any directory that exists in the database but not in the underlying storage; the deleted directories include the parent directory and all subdirectories under it.
It should be noted that, for steps S401 to S405, reference may be made to the foregoing embodiments; the details are not repeated here.
In this embodiment, directories that no longer exist in the underlying storage but still exist in the database are deleted and cleaned up, which ensures that the database contains no dirty data and keeps it from growing ever larger.
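The cleanup step can be sketched as follows. The names `delete_stale` and `clean_database` and the `directory_info` table are hypothetical; the `exists` parameter defaults to `os.path.isdir` and is replaceable so the logic can be exercised without touching real storage.

```python
import os
import sqlite3

def delete_stale(conn, path):
    """Remove `path` and every directory stored beneath it from the database."""
    prefix = path.rstrip("/") + "/"
    # Compare a fixed-length prefix rather than using LIKE, so that characters
    # such as % or _ in directory names are not treated as wildcards.
    conn.execute(
        "DELETE FROM directory_info WHERE path = ? OR substr(path, 1, ?) = ?",
        (path, len(prefix), prefix),
    )

def clean_database(conn, exists=os.path.isdir):
    """Delete every stored directory that is no longer present in storage."""
    paths = [row[0] for row in conn.execute("SELECT path FROM directory_info")]
    for path in paths:
        if not exists(path):
            delete_stale(conn, path)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE directory_info (path TEXT PRIMARY KEY, size INTEGER)")
conn.executemany(
    "INSERT INTO directory_info VALUES (?, ?)",
    [("/data", 1), ("/data/keep", 1), ("/data/gone", 1),
     ("/data/gone/sub", 1), ("/data/gone2", 1)],
)
still_there = {"/data", "/data/keep", "/data/gone2"}
clean_database(conn, exists=lambda p: p in still_there)
remaining = sorted(r[0] for r in conn.execute("SELECT path FROM directory_info"))
```

Appending the separator to the prefix keeps a sibling such as `/data/gone2` from being removed together with `/data/gone`.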
Referring to FIG. 7, on the basis of any of the above embodiments, in one embodiment of the present application, the storage file management method includes:
Step S501: When the latest modification time of the first directory taken from the stack has changed, obtain the subdirectories of the first directory; the first directory is stored on the AI platform.
Step S502: Determine whether the subdirectory is empty.
Step S503: If the subdirectory is not empty, push the subdirectory onto the stack.
Step S504: If the subdirectory is empty, put the subdirectory into the queue.
Step S505: Determine whether all subdirectories of the parent directory of the subdirectory have been put into the queue.
Step S506: If all subdirectories of the parent directory of the subdirectory have been put into the queue, put the parent directory into the queue, and continue upward until a parent directory that does not satisfy the preset condition is reached.
Step S507: Take the second directory out of the queue until no directories remain in the stack or the queue, and, according to how the latest modification time of the second directory has changed, perform the corresponding operation on the database used to store directories.
Step S508: If not all subdirectories of the parent directory of the subdirectory have been put into the queue, do not put the parent directory into the queue.
It should be noted that, for steps S501 to S505 and step S507, reference may be made to the foregoing embodiments; the details are not repeated here.
On the basis of any of the above embodiments, in one embodiment of the present application, before the subdirectories of the first directory are obtained when the latest modification time of the first directory taken from the stack has changed, the method includes:
Determining whether the latest modification time of the first directory taken from the stack has changed.
When the latest modification time of the first directory taken from the stack has changed, the step of obtaining the subdirectories of the first directory is performed.
The operations performed when the latest modification time of the first directory taken from the stack has not changed are described in the embodiments below.
Further, in one embodiment of the present application, before determining whether the latest modification time of the first directory taken from the stack has changed, the method further includes:
Pushing the first directory onto the stack;
Popping the first directory from the stack.
Referring to FIG. 8, on the basis of any of the above embodiments, in one embodiment of the present application, the storage file management method includes:
Step S601: When the latest modification time of the first directory taken from the stack has changed, traverse the first directory in a cleanable manner to obtain the subdirectories of the first directory; the first directory is stored on the AI platform.
In the cleanable manner, the first directory can be closed after it has been opened, which gives high performance. Optionally, the cleanable manner may be a stream manner.
As one possible implementation, traversing the first directory in a cleanable manner includes:
Opening the first directory with an open function and reading the first directory with a read function.
The open function may be the opendir function, and the read function may be the readdir function.
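In Python, a comparable open, read, and close pattern is available through `os.scandir`, which on Unix-like systems is built on `opendir` and `readdir`. The sketch below is an analogue rather than the patent's code; the helper names are hypothetical, and symbolic links are deliberately not followed.

```python
import os
import tempfile

def list_subdirectories(path):
    """Return the immediate subdirectories of `path`, closing the handle afterwards."""
    with os.scandir(path) as entries:   # the directory handle is released on exit
        return sorted(
            entry.path for entry in entries if entry.is_dir(follow_symlinks=False)
        )

def has_subdirectories(path):
    """True if `path` is not empty in the sense used here (it contains a directory)."""
    with os.scandir(path) as entries:
        return any(entry.is_dir(follow_symlinks=False) for entry in entries)

# Build a small tree: root/a/a1, root/b, and one regular file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "a1"))
os.mkdir(os.path.join(root, "b"))
with open(os.path.join(root, "notes.txt"), "w") as handle:
    handle.write("not a directory")

found = list_subdirectories(root)
```

Entries are streamed one at a time rather than loaded as a full listing, which matters when a single directory holds a very large number of files.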
Step S602: Determine whether the subdirectory is empty.
Step S603: If the subdirectory is not empty, push the subdirectory onto the stack.
Step S604: If the subdirectory is empty, put the subdirectory, and each parent directory of the subdirectory that satisfies the preset condition, into the queue; the preset condition is that all subdirectories of the parent directory have been put into the queue.
Step S605: Take the second directory out of the queue until no directories remain in the stack or the queue, and, according to how the latest modification time of the second directory has changed, perform the corresponding operation on the database used to store directories.
It should be noted that, for steps S602 to S605, reference may be made to the foregoing embodiments; the details are not repeated here.
Referring to FIG. 9, on the basis of any of the above embodiments, in one embodiment of the present application, the storage file management method includes:
Step S701: Determine whether the latest modification time of the first directory taken from the stack has changed.
Step S702: When the latest modification time of the first directory taken from the stack has not changed, obtain the subdirectories of the first directory from the database.
If a single directory contains tens of thousands of files or more and the directory has not changed, only the directories under it need to be obtained from the database; there is no need to traverse the underlying storage with the opendir and readdir functions. Because the number of directories is far smaller than the number of files (with 1 TB of storage in use on the AI platform there are about 20,000 directories), the database only needs to hold directory information.
On the AI platform, 99% of directories do not change within a given interval (for example, 5 minutes). The previous size is used for unchanged directories, and only the sizes of changed directories need to be counted.
Step S703: When the latest modification time of the first directory taken from the stack has changed, obtain the subdirectories of the first directory; the first directory is stored on the AI platform.
Step S704: Determine whether the subdirectory is empty.
Step S705: If the subdirectory is not empty, push the subdirectory onto the stack.
Step S706: If the subdirectory is empty, put the subdirectory into the queue.
Step S707: Determine whether all subdirectories of the parent directory of the subdirectory have been put into the queue.
Step S708: If all subdirectories of the parent directory of the subdirectory have been put into the queue, put the parent directory into the queue, and continue upward until a parent directory that does not satisfy the preset condition is reached.
Step S709: Take the second directory out of the queue until no directories remain in the stack or the queue.
Step S710: Determine whether the latest modification time of the second directory has changed.
Step S711: When the latest modification time of the second directory has not changed, determine whether the database needs to be updated according to whether the size of the second directory has changed.
Step S712: When the latest modification time of the second directory has changed, perform the corresponding operation on the database according to whether the second directory is stored in the database.
When the latest modification time of a directory has not changed, the subdirectories of the first directory are obtained from the database; that is, the previous directory size is used and only the sizes of changed directories are counted. Each round of storage statistics thus avoids repeated directory traversal and the traditional full storage count, improving the efficiency of storage statistics.
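The choice between the database and the underlying storage in steps S702 and S703 can be sketched as below. The function `subdirectories`, the `directory_info` table with a `parent` column, and the injected `scan` callback are assumptions for illustration; a real `scan` would traverse storage.

```python
import sqlite3

def subdirectories(conn, path, current_mtime, scan):
    """Return (children, source) for `path`, scanning storage only when it has changed."""
    row = conn.execute(
        "SELECT mtime FROM directory_info WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == current_mtime:
        # Unchanged: reuse what the database already knows (S702).
        rows = conn.execute(
            "SELECT path FROM directory_info WHERE parent = ? ORDER BY path", (path,)
        )
        return [r[0] for r in rows], "database"
    return scan(path), "filesystem"  # changed or unknown: traverse storage (S703)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE directory_info ("
    "path TEXT PRIMARY KEY, parent TEXT, mtime REAL, size INTEGER)"
)
conn.executemany(
    "INSERT INTO directory_info VALUES (?, ?, ?, ?)",
    [("/data", None, 100.0, 30), ("/data/a", "/data", 90.0, 10),
     ("/data/b", "/data", 95.0, 20)],
)

scanned = []

def fake_scan(path):
    """Stand-in for a real storage traversal; records which paths were scanned."""
    scanned.append(path)
    return ["/data/a", "/data/b", "/data/c"]

unchanged = subdirectories(conn, "/data", 100.0, fake_scan)
changed = subdirectories(conn, "/data", 250.0, fake_scan)
```

Only the second call reaches `fake_scan`, which mirrors the point of the embodiment: unchanged directories never touch the underlying storage.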
The storage file management method of the present application is described below by way of an example.
Step 1: Define a stack data structure and a queue data structure.
Step 2: Directories are pushed onto the stack one after another and then popped. If a popped directory has no subdirectories (that is, what the stack yields is necessarily a bottom-level directory), the queue data structure is used and the directory popped from the stack is put into the queue.
Step 3: After step 2, determine whether all subdirectories of the parent directory of the subdirectory have been enqueued. If they all have, the parent directory is enqueued, and so on upward until a parent directory that does not satisfy the condition is reached. This step guarantees that every directory can be enqueued. Each check costs O(1) time (that is, essentially no time), and the overall efficiency depends on the depth of the directory hierarchy.
Step 4: Take a directory out of the queue. One of three cases applies:
1) The latest modification time has changed and the directory does not exist in the SQLite3 table (the first statistics traversal), so the directory is inserted;
2) The latest modification time has changed and the directory already exists in the SQLite3 table, so the table fields are updated;
3) The latest modification time has not changed: the SQLite3 table fields are updated if the directory size has changed, and otherwise nothing is updated (most scenarios encountered during statistics fall into this case);
where an update means updating the storage path information.
Dequeuing may be performed by multiple threads at the same time to speed up the upward statistics, provided that the directories dequeued together are not parent and child of one another.
Step 5: Repeat "pop from the stack, push onto the stack, pop and enqueue, dequeue" until both the stack and the queue are empty. Because entries are continually removed from the stack and the queue, the system's memory does not overflow.
Step 6: Delete and clean up directories that no longer exist in the underlying storage but still exist in the SQLite3 database, including the database records of all subdirectories under such a directory, so that the database contains no dirty data and the data tables do not keep growing.
The storage file management method of the present application has the following advantages:
First, the sizes of all directories are counted and stored quickly: directories that have not changed are not recounted, and the stored size of each directory is updated once counting completes. The system is designed around the lightweight SQLite3 database, and the results serve the directory-size needs of each business function of the AI platform, so the storage space used by the AI platform can be obtained quickly and the platform can conveniently manage, display, and limit storage space. The present application reduces the storage and load pressure on storage servers, reduces the resource consumption and long-term occupation of business modules, and enhances the storage performance of the AI platform. It can improve the business performance of the AI platform: SQLite3 is used to index the storage directories and to split them across tables, so requests for storage directory sizes are answered quickly. Building this storage statistics management system improves the efficiency of storage operations in AI services, shortens model training time, improves model training efficiency, reduces operation and maintenance costs, and improves the market competitiveness of the AI platform;
Second, the present application keeps the AI platform running efficiently and stably, effectively shortens the time algorithm engineers spend on model training, improves the storage performance of the AI platform, solves the technical problem of directory statistics for large storage, and removes the performance bottleneck caused by frequent network and IO interaction during statistics. It improves file operation and management performance, reduces the overall resource utilization of the AI platform, makes the AI platform smoother to use, and enhances its competitiveness.
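The storage file management device provided by the embodiments of this application is introduced below. The storage file management device described below and the storage file management method described above may be read with reference to each other.
FIG. 10 is a structural block diagram of the storage file management device provided by an embodiment of this application. Referring to FIG. 10, the storage file management device may include:
a first obtaining module 100, configured to obtain the subdirectories of a first directory when the latest modification time of the first directory taken out of the stack has changed, the first directory being stored on the AI platform;
a first judging module 200, configured to judge whether a subdirectory is empty;
a first storage module 300, configured to put the subdirectory into the stack if the subdirectory is not empty;
a second storage module 400, configured to, if the subdirectory is empty, put into the queue the subdirectory and any parent directory of the subdirectory that satisfies a preset condition, the preset condition being that all subdirectories of the parent directory have been put into the queue;
a removal and processing module 500, configured to take second directories out of the queue until both the stack and the queue are empty, and to perform the corresponding operation on the database used to store directories according to whether the latest modification time of each second directory has changed.
The storage file management device of this embodiment is used to implement the foregoing storage file management method, so its specific implementation can be found in the method embodiments above. For example, the first obtaining module 100, the first judging module 200, the first storage module 300, the second storage module 400, and the removal and processing module 500 implement steps S101, S102, S103, S104, and S105 of the method, respectively. Their specific implementations can therefore be found in the descriptions of the corresponding embodiments and are not repeated here.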
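The interplay of the stack and the queue in modules 100 to 500 can be illustrated with a simplified sketch. Several assumptions are made here that the text does not confirm: a directory is treated as "empty" when it contains no subdirectories, emptiness is checked when a directory is popped rather than when its parent is scanned, and the modification-time check is omitted so that every directory is visited. The names `list_subdirs` and `build_queue` are hypothetical.

```python
import os
from collections import deque


def list_subdirs(path):
    """Return the immediate subdirectories of `path`, sorted for determinism."""
    with os.scandir(path) as entries:
        return sorted(e.path for e in entries if e.is_dir(follow_symlinks=False))


def build_queue(root):
    """Order directories so that every directory appears after all of its subdirectories."""
    stack = [root]
    queue = deque()
    pending = {}    # directory -> number of its subdirectories not yet queued
    parent_of = {}  # subdirectory -> parent directory

    def enqueue(directory):
        queue.append(directory)
        parent = parent_of.get(directory)
        # Promote each ancestor once all of its subdirectories are in the queue.
        while parent is not None:
            pending[parent] -= 1
            if pending[parent] > 0:
                break
            queue.append(parent)
            parent = parent_of.get(parent)

    while stack:
        directory = stack.pop()
        subs = list_subdirs(directory)
        if not subs:
            enqueue(directory)          # leaf: goes straight to the queue
            continue
        pending[directory] = len(subs)  # non-leaf: wait for its subdirectories
        for sub in subs:
            parent_of[sub] = directory
            stack.append(sub)
    return list(queue)
```

Because a parent only enters the queue after its last subdirectory, a consumer that drains the queue in order can compute each directory's size from the already known sizes of its subdirectories.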
Optionally, the removal and processing module 500 includes:
a first operation submodule, configured to, when the latest modification time of the second directory has not changed, determine whether the database needs to be updated according to whether the size of the second directory has changed;
a second operation submodule, configured to, when the latest modification time of the second directory has changed, perform the corresponding operation on the database according to whether the second directory is already stored in the database.
Optionally, the first operation submodule includes:
a first judging unit, configured to judge whether the size of the second directory has changed;
a first update unit, configured to update the storage path information of the second directory in the database if the size of the second directory has changed;
a stop operation unit, configured to leave the database unchanged if the size of the second directory has not changed.
Optionally, the second operation submodule includes:
a second judging unit, configured to judge whether the second directory is stored in the database;
a second update unit, configured to update the storage path information of the second directory in the database if the second directory is stored in the database;
an insertion unit, configured to insert the second directory into the database if the second directory is not stored in the database.
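The branching performed by the two operation submodules can be sketched as a single function. This is an illustrative reading of the text, not the actual implementation: the table `dir_stat`, its columns, and the function `sync_dir` are hypothetical, and a directory with no stored row is treated as having a changed modification time.

```python
import sqlite3


def sync_dir(conn: sqlite3.Connection, path: str, mtime: float, size: int) -> str:
    """Apply the insert / update / skip decision for one directory.

    Returns which action was taken: "insert", "update", or "skip".
    """
    row = conn.execute(
        "SELECT mtime, size FROM dir_stat WHERE path = ?", (path,)
    ).fetchone()
    mtime_changed = row is None or row[0] != mtime

    if mtime_changed:
        if row is None:
            # Modification time changed and the directory is not stored yet: insert.
            conn.execute(
                "INSERT INTO dir_stat (path, mtime, size) VALUES (?, ?, ?)",
                (path, mtime, size),
            )
            return "insert"
        # Modification time changed and the directory is already stored: update.
        conn.execute(
            "UPDATE dir_stat SET mtime = ?, size = ? WHERE path = ?",
            (mtime, size, path),
        )
        return "update"

    if row[1] != size:
        # Modification time unchanged but size differs: update the stored size.
        conn.execute("UPDATE dir_stat SET size = ? WHERE path = ?", (size, path))
        return "update"
    return "skip"  # nothing changed, so the database is left alone
```

The size check in the unchanged branch matters because a directory's own modification time typically reflects only changes to its direct entries, while its total size can still change through deeper descendants.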
Optionally, the storage file management device further includes:
an index building module, configured to build indexes on the tables in the database.
Optionally, the storage file management device further includes:
a storage module, configured to store the second directory in separate tables according to its storage path information.
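One way to combine indexing with table partitioning in SQLite3 is sketched below. The text does not say how a storage path is mapped to a table, so the rule used here (hashing the parent directory into a fixed number of tables) is purely an assumption, as are the names `NUM_SHARDS`, `table_for`, `ensure_table`, and `put`.

```python
import posixpath
import sqlite3
import zlib

NUM_SHARDS = 4  # hypothetical number of tables


def table_for(path: str) -> str:
    """Pick a table from the parent directory, so sibling directories share a table."""
    parent = posixpath.dirname(path.rstrip("/"))
    return f"dir_stat_{zlib.crc32(parent.encode('utf-8')) % NUM_SHARDS}"


def ensure_table(conn: sqlite3.Connection, table: str) -> None:
    """Create the table and a unique index on its path column if they are missing."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(path TEXT NOT NULL, mtime REAL, size INTEGER)"
    )
    conn.execute(
        f"CREATE UNIQUE INDEX IF NOT EXISTS idx_{table}_path ON {table} (path)"
    )


def put(conn: sqlite3.Connection, path: str, mtime: float, size: int) -> str:
    """Insert or overwrite the record for `path` in its table; return the table name."""
    table = table_for(path)
    ensure_table(conn, table)
    conn.execute(
        f"INSERT OR REPLACE INTO {table} (path, mtime, size) VALUES (?, ?, ?)",
        (path, mtime, size),
    )
    return table
```

Splitting records across tables keeps each index smaller, and the unique index on `path` lets a size lookup for one directory avoid a full table scan.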
Optionally, the storage file management device further includes:
a second judging module, configured to judge whether the latest modification time of the second directory has changed.
Optionally, the removal and processing module 500 is specifically configured to take out of the queue, at the same time, second directories that are not parent and child of one another.
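Taking unrelated directories out of the queue at the same time could look roughly like the following sketch. It assumes the queue already lists subdirectories before their parents and uses a thread pool for the concurrent part. The helper names `is_ancestor`, `take_independent_batch`, and `drain` are hypothetical, and the check is slightly stricter than the text requires because it defers a directory when any descendant, not only a direct child, is in the current batch.

```python
from concurrent.futures import ThreadPoolExecutor


def is_ancestor(a: str, b: str) -> bool:
    """True if directory `a` is a proper ancestor of directory `b`."""
    return b.startswith(a.rstrip("/") + "/")


def take_independent_batch(queue):
    """Split `queue` into a batch with no ancestor/descendant pairs and the remainder."""
    batch, rest = [], []
    for d in queue:
        if any(is_ancestor(d, x) or is_ancestor(x, d) for x in batch):
            rest.append(d)  # related to something in the batch: defer it
        else:
            batch.append(d)
    return batch, rest


def drain(queue, worker, threads=4):
    """Process the queue batch by batch; directories within a batch run concurrently."""
    batches = []
    pending = list(queue)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        while pending:
            batch, pending = take_independent_batch(pending)
            list(pool.map(worker, batch))  # wait for the whole batch to finish
            batches.append(batch)
    return batches
```

Deferring a parent until a later batch means its subdirectories have finished before it is processed, so its size can be derived from theirs without two workers touching related records at once.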
Optionally, the storage file management device further includes:
a deletion module, configured to delete from the database any directory that exists in the database but not in the underlying storage, where the deleted directories include the parent directory and all subdirectories under it.
Optionally, if the subdirectory is not empty, the storage file management device further includes:
a recording module, configured to record information about the parent directory of the subdirectory.
Optionally, the second storage module 400 includes:
a first storage submodule, configured to put the subdirectory into the queue if the subdirectory is empty;
a judging submodule, configured to judge whether all subdirectories of the subdirectory's parent directory have been put into the queue;
a second storage submodule, configured to put the parent directory into the queue if all of its subdirectories have been put into the queue, continuing upward until a parent directory that does not satisfy the preset condition is reached.
Optionally, the storage file management device further includes:
a third judging module, configured to judge whether the latest modification time of the first directory taken out of the stack has changed.
Optionally, the storage file management device further includes:
a third storage module, configured to put the first directory into the stack;
a take-out module, configured to take the first directory out of the stack.
Optionally, the first obtaining module 100 is specifically configured to traverse the first directory in a cleanable manner to obtain the subdirectories of the first directory.
Optionally, the first obtaining module 100 includes:
an opening submodule, configured to open the first directory using an open function;
a reading submodule, configured to read the first directory using a read function.
Optionally, the storage file management device further includes:
a second obtaining module, configured to obtain the subdirectories of the first directory from the database when the latest modification time of the first directory taken out of the stack has not changed.
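The choice between reading a directory from the file system and reusing what the database already knows can be sketched as follows. The open and read functions in the text are not named, so Python's `os.scandir` stands in for them here. The table `dir_stat` with `path`, `parent`, and `mtime` columns and the function `subdirs_of` are assumptions for illustration.

```python
import os
import sqlite3


def subdirs_of(conn: sqlite3.Connection, path: str):
    """Return (subdirectories, source), where source is "database" or "filesystem"."""
    current_mtime = os.stat(path).st_mtime
    row = conn.execute(
        "SELECT mtime FROM dir_stat WHERE path = ?", (path,)
    ).fetchone()

    if row is not None and row[0] == current_mtime:
        # Modification time unchanged: trust the subdirectory list in the database.
        rows = conn.execute(
            "SELECT path FROM dir_stat WHERE parent = ? ORDER BY path", (path,)
        ).fetchall()
        return [p for (p,) in rows], "database"

    # Modification time changed (or directory unknown): open and read the directory.
    with os.scandir(path) as entries:
        subs = sorted(e.path for e in entries if e.is_dir(follow_symlinks=False))
    return subs, "filesystem"
```

Skipping the directory read for unchanged directories is what saves network and IO round trips on shared storage, at the cost of trusting the stored modification time.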
The AI platform provided by the embodiments of this application is introduced below. The AI platform described below and the storage file management method described above may be read with reference to each other.
FIG. 11 is a structural block diagram of the AI platform provided by an embodiment of this application. The AI platform includes:
a memory 11, configured to store a computer program;
a processor 12, configured to implement the steps of the storage file management method of any of the above embodiments when executing the computer program.
The framework of the storage statistics management system on the AI platform is shown in FIG. 12. It comprises framework dependencies, framework input parameters, framework methods, and the SQLite3 database. The framework dependencies include the database driver (Sqlite-jdbc), dynamic table loading, database query statements (MyBatis), and dynamic loading of multiple data sources (Dynamic-datasource). The framework input parameters include the storage path to count (Storage path), the number of statistics threads (ThreadNum), and the list of directories to filter out (Filter path). The framework methods include dispatching a storage statistics task (Storage Statistic Service.storage Statistic (StorageParmeter)), getting the size of a storage path (Storage Statistic Service.size By Storage Path(Storage Path)), and getting the total size of all files belonging to a given user in group-shared and globally shared directories (StorageStatistic Service.size By Share Path Owner (owner,path)). When the database file is mapped onto the storage directory, its path may be /mnt/inspurfs/db/storageSqlite.db.
Provided the framework dependencies have been introduced, a business module of the AI platform only needs to call the management methods to perform storage statistics, passing in the statistics parameters: the storage path to count, the number of statistics threads, and the list of directories to filter out. Filtering lets a statistics task skip some directories: in practice, some directories do not need to be counted, for example those outside the user home directories, and excluding them when the task is dispatched keeps the statistics efficient. The size of each directory is then obtained quickly. Beyond enforcing disk quotas, directory sizes can also be used for file management display, obtaining dataset sizes, and so on. The system of this patent builds on the AI platform's storage and integrates with the platform's business to perform efficient storage statistics and manage storage directories. Using this solution for storage size statistics can improve the performance and management efficiency of storage in the AI platform and allows the platform to scale to more nodes and more users.
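The three input parameters and the effect of the filter list can be illustrated with a small sketch. The framework described above is Java-based; this Python analogue is only meant to show the idea, and the names `StorageParams`, `is_filtered`, and `select_dirs`, as well as the choice of `/mnt/inspurfs/db` as a filtered directory, are assumptions for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StorageParams:
    """Hypothetical container for the statistics parameters."""
    storage_path: str                 # root of the storage to count
    thread_num: int = 4               # number of statistics threads
    filter_paths: List[str] = field(default_factory=list)  # directories to skip

    def is_filtered(self, path: str) -> bool:
        """True if `path` is a filtered directory or lies beneath one."""
        norm = path.rstrip("/")
        for filtered in self.filter_paths:
            base = filtered.rstrip("/")
            if norm == base or norm.startswith(base + "/"):
                return True
        return False


def select_dirs(params: StorageParams, candidates: List[str]) -> List[str]:
    """Keep only directories under the storage path that are not filtered out."""
    root = params.storage_path.rstrip("/")
    return [
        c for c in candidates
        if (c.rstrip("/") == root or c.startswith(root + "/"))
        and not params.is_filtered(c)
    ]
```

Matching on a trailing `/` rather than a bare string prefix keeps a filter on `/mnt/inspurfs/db` from accidentally excluding a sibling such as `/mnt/inspurfs/dbx`.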
The computer-readable storage medium provided by the embodiments of this application is introduced below. The computer-readable storage medium described below and the storage file management method described above may be read with reference to each other.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the storage file management method of any of the above embodiments.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be cross-referenced. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; see the description of the method for the relevant details.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each specific application, but such implementations should not be regarded as going beyond the scope of this application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The storage file management method, device, AI platform, and computer-readable storage medium provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help in understanding the method of this application and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to this application without departing from its principles, and such improvements and modifications also fall within the scope of protection of the claims of this application.
Claims (20)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310377465.8A CN116089364B (en) | 2023-04-11 | 2023-04-11 | Storage file management method and device, AI platform and storage medium |
PCT/CN2023/141775 WO2024212594A1 (en) | 2023-04-11 | 2023-12-26 | Storage file management method and apparatus, and ai platform and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310377465.8A CN116089364B (en) | 2023-04-11 | 2023-04-11 | Storage file management method and device, AI platform and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116089364A true CN116089364A (en) | 2023-05-09 |
CN116089364B CN116089364B (en) | 2023-07-14 |
Family
ID=86212382
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310377465.8A Active CN116089364B (en) | 2023-04-11 | 2023-04-11 | Storage file management method and device, AI platform and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116089364B (en) |
WO (1) | WO2024212594A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117235313A (en) * | 2023-11-09 | 2023-12-15 | 苏州元脑智能科技有限公司 | Storage catalog statistics method and device, electronic equipment and storage medium |
WO2024212594A1 (en) * | 2023-04-11 | 2024-10-17 | 山东英信计算机技术有限公司 | Storage file management method and apparatus, and ai platform and storage medium |
CN119376899A (en) * | 2024-12-27 | 2025-01-28 | 苏州元脑智能科技有限公司 | A catalog statistics method, product, device and medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070289016A1 (en) * | 2006-06-13 | 2007-12-13 | Sanjay Pradhan | Bi-modular system and method for detecting and removing harmful files using signature scanning |
CN106874370A (en) * | 2016-12-30 | 2017-06-20 | 厦门天锐科技股份有限公司 | A kind of method for quickly retrieving of catalogue file |
CN111352586A (en) * | 2020-02-23 | 2020-06-30 | 苏州浪潮智能科技有限公司 | Directory aggregation method, device, equipment and medium for accelerating file reading and writing |
CN111769933A (en) * | 2020-06-29 | 2020-10-13 | 北京天融信网络安全技术有限公司 | Method and device for monitoring file change, electronic equipment and storage medium |
CN113010479A (en) * | 2021-03-18 | 2021-06-22 | 山东英信计算机技术有限公司 | File management method, device and medium |
CN113254398A (en) * | 2020-12-29 | 2021-08-13 | 深圳市怡化时代科技有限公司 | Sample file management method, device, equipment and medium |
US20220075830A1 (en) * | 2020-09-10 | 2022-03-10 | EMC IP Holding Company LLC | Resumable ordered recursive traversal of an unordered directory tree |
CN114265818A (en) * | 2021-12-27 | 2022-04-01 | 完美世界(北京)软件科技发展有限公司 | File uploading method, device, equipment and computer readable medium |
CN115168291A (en) * | 2022-06-09 | 2022-10-11 | 北京百度网讯科技有限公司 | Hierarchical directory implementation method, apparatus, electronic device and storage medium |
CN115525603A (en) * | 2022-09-30 | 2022-12-27 | 苏州浪潮智能科技有限公司 | Storage statistics method and device, computer readable storage medium and AI device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100592298C (en) * | 2008-05-13 | 2010-02-24 | 华为技术有限公司 | File synchronization method and device |
US9430331B1 (en) * | 2012-07-16 | 2016-08-30 | Emc Corporation | Rapid incremental backup of changed files in a file system |
CN104317952B (en) * | 2014-11-13 | 2017-10-13 | 北京奇虎科技有限公司 | The scan method and device of memory space in mobile terminal |
CN116089364B (en) * | 2023-04-11 | 2023-07-14 | 山东英信计算机技术有限公司 | Storage file management method and device, AI platform and storage medium |
CN117235313B (en) * | 2023-11-09 | 2024-02-13 | 苏州元脑智能科技有限公司 | Storage catalog statistics method and device, electronic equipment and storage medium |
2023
- 2023-04-11 CN CN202310377465.8A patent/CN116089364B/en active Active
- 2023-12-26 WO PCT/CN2023/141775 patent/WO2024212594A1/en unknown
Non-Patent Citations (3)
Title |
---|
JAE-WOO AHN 等: "Implementation of Packet Queue with Two Dimensional Array on Embedded System", 2019 21ST INTERNATIONAL CONFERENCE ON ADVANCED COMMUNICATION TECHNOLOGY (ICACT) * |
傅凯亮: "面向Office文档的数据提取与模板渲染技术研究", 中国优秀硕士学位论文全文数据库 (信息科技辑), pages 138 - 1515 * |
李恒恒;岳春生;胡泽明;: "基于LIRS的YAFFS2元数据缓存管理机制设计", 信息工程大学学报, no. 02, pages 250 - 256 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024212594A1 (en) * | 2023-04-11 | 2024-10-17 | 山东英信计算机技术有限公司 | Storage file management method and apparatus, and ai platform and storage medium |
CN117235313A (en) * | 2023-11-09 | 2023-12-15 | 苏州元脑智能科技有限公司 | Storage catalog statistics method and device, electronic equipment and storage medium |
CN117235313B (en) * | 2023-11-09 | 2024-02-13 | 苏州元脑智能科技有限公司 | Storage catalog statistics method and device, electronic equipment and storage medium |
CN119376899A (en) * | 2024-12-27 | 2025-01-28 | 苏州元脑智能科技有限公司 | A catalog statistics method, product, device and medium |
Also Published As
Publication number | Publication date |
---|---|
WO2024212594A1 (en) | 2024-10-17 |
CN116089364B (en) | 2023-07-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||