
CN119311638A - Data processing method and device - Google Patents


Info

Publication number
CN119311638A
Authority
CN
China
Prior art keywords
heat
directory
reading
file
latest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311435047.6A
Other languages
Chinese (zh)
Inventor
崔杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd
Priority to PCT/CN2024/086651 (WO2025015980A1)
Publication of CN119311638A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/122 File system administration, e.g. details of archiving or snapshots using management policies
    • G06F16/125 File system administration, e.g. details of archiving or snapshots using management policies characterised by the use of retention policies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method and device, and relates to the field of storage. The method is applied to a server of a distributed file system. The server first receives a data governance request for M directories of the distributed file system. In response to the request, it reads the M latest read time points recorded in the M directories, determines the read heat of each directory from that directory's latest read time point, and finally determines, according to each read heat, whether to perform data governance on the corresponding directory. The access complexity of data governance is thereby reduced from the file level to the directory level, which improves data governance efficiency.

Description

Data processing method and device
Technical Field
The present application relates to the field of storage, and in particular, to a data processing method and apparatus.
Background
With the rapid development of computer technology, user demand for the services provided by servers keeps growing, and a single server can no longer meet it, which gave rise to distributed systems. In a distributed system, a plurality of physically scattered server nodes are connected by a high-speed computer network to form a logically unified cluster. The basic idea is to spread data that was originally stored centrally across a plurality of networked server nodes, so as to obtain larger storage capacity and a higher level of concurrent access. To ensure high availability of the distributed system, data governance needs to be performed on it periodically, so that files with low access heat are deleted or dumped.
When data governance is performed on files stored in a distributed system, the metadata of all files in the whole distributed system must be traversed to obtain the read heat information of each file. This places heavy access pressure on the server, so data governance tasks are inefficient.
Disclosure of Invention
The application provides a data processing method and a data processing device, which reduce the access complexity from a file level to a directory level during data management and improve the data management efficiency.
In order to achieve the above purpose, the application adopts the following technical scheme:
In a first aspect, a data processing method is provided. The method is applied to a server of a distributed file system, and the execution body of the method may be the server, a component or device (such as a processor, a chip, or a chip system) applied to the server, or a logic module or software capable of implementing all or part of the functions of the server. The method comprises: first receiving a data governance request for M directories of the distributed file system; then, in response to the data governance request, reading the M latest read time points recorded in the M directories; determining the read heat of each directory according to the latest read time point of that directory; and finally determining, according to each read heat, whether to perform data governance on the corresponding directory.
In the first aspect, data governance is performed by accessing the latest read time point of each directory. Compared with existing data governance, the method reduces the complexity of data access from the file level to the directory level and greatly improves data governance efficiency.
In one implementation, determining the read heat of each directory based on the latest read time point of each directory includes comparing each read heat with a heat threshold, determining the directory to be a directory with high read heat if the read heat is above the heat threshold, and determining the directory to be a directory with low read heat if the read heat is below or equal to the heat threshold.
In this implementation, comparing the read heat with the heat threshold makes it possible to determine accurately whether a directory has low or high read heat.
In one implementation, determining whether to perform data governance on the directory corresponding to each read heat includes forgoing access to the file heat in directories with low read heat, accessing the file heat in directories with high read heat, and performing data governance on the directories with high read heat according to the file heat.
In this implementation, directories are governed in a targeted way according to their read heat, and the governance process only needs to access file heat in directories with high read heat, which improves data governance efficiency.
In one implementation, data governance includes one or more of deleting low-heat files, moving low-heat files to low-frequency storage, or archiving low-heat files.
In this implementation, a variety of data governance implementations are provided that enable efficient data governance.
In one implementation, the distributed file system further comprises a plurality of home storage nodes, and the method further comprises confirming the H home storage nodes where the M directories are located and recording the M latest read time points in the H home storage nodes, where H is a positive integer.
In this implementation, the latest reading time point of the directory is recorded in the corresponding home storage node, and the reading heat of the directory in the home storage node can be determined.
In a second aspect, a data processing apparatus is provided. The apparatus is applied to a server of a distributed file system, and may be the server or a chip or a system on a chip in the server. The apparatus may implement the functions performed by the server in the first aspect or the possible designs of the first aspect; the functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software comprises one or more modules corresponding to the functions. The apparatus comprises a receiving module, an acquisition module, and a determining module. The receiving module is configured to receive a data governance request for M directories of the distributed file system; the acquisition module is configured to read, in response to the data governance request, the M latest read time points recorded in the M directories; the determining module is configured to determine the read heat of each directory according to the latest read time point of that directory; and the determining module is further configured to determine, according to each read heat, whether to perform data governance on the corresponding directory.
In one implementation, the determining module is specifically configured to compare each read heat with a heat threshold, determine that the directory is a directory with high read heat when the read heat is higher than the heat threshold, and determine that the directory is a directory with low read heat when the read heat is lower than or equal to the heat threshold.
In one implementation, the determining module is specifically configured to forgo access to the file heat in directories with low read heat, access the file heat in directories with high read heat, and perform data governance on the directories with high read heat according to the file heat.
In one implementation, data governance includes one or more of deleting low-heat files, moving low-heat files to low-frequency storage, or archiving low-heat files.
In one implementation, the distributed file system further comprises a plurality of home storage nodes, and the apparatus further comprises a recording module. The determining module is further configured to confirm the H home storage nodes where the M directories are located, and the recording module is configured to record the M latest read time points in the H home storage nodes, where H is a positive integer.
In a third aspect, the present application provides a computing device cluster comprising at least one computing device, each of the at least one computing device comprising at least one processor and at least one memory, the at least one memory having stored therein computer readable instructions, the at least one processor executing the computer readable instructions to cause the computing device cluster to perform the method described in the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, the application provides a computer program product comprising instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method described in the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, the present application provides a computer readable storage medium storing computer program instructions which, when executed by a cluster of computing devices, perform the method described in the first aspect or any one of the possible implementations of the first aspect.
The advantages described in the second to fifth aspects of the present application may correspond to the advantageous effect analysis referred to in the first aspect, and are not described herein.
Drawings
Fig. 1 is a schematic structural diagram of a distributed system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of another distributed system according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating another data processing method according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating another data processing method according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating another data processing method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computing device according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a computing device cluster according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of another computing device cluster according to an embodiment of the present application.
Detailed Description
The network architecture and the service scenario described in the embodiments of the present application are for more clearly describing the technical solution of the embodiments of the present application, and do not constitute a limitation on the technical solution provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of the network architecture and the appearance of the new service scenario, the technical solution provided by the embodiments of the present application is applicable to similar technical problems.
It should be noted that the terms "first" and "second" and the like in the description, the claims and the drawings of the present application are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
It should be understood that, in the embodiments of the present application, "at least one (item)" means one or more, "a plurality" means two or more, and "at least two (items)" means two, three, or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that only A exists, that only B exists, or that both A and B exist, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of" and similar expressions mean any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b or c may represent a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural. It should be understood that, in the embodiments of the present application, "B corresponding to A" means that B is associated with A; for example, B may be determined from A. It should also be appreciated that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information. In addition, "connection" in the embodiments of the present application refers to any connection manner, direct or indirect, that enables communication between devices, and is not limited in any way.
Before describing the data processing method provided by the embodiment of the present application, some terms related to the embodiment of the present application are explained.
Low-frequency storage: storage that is low in price and high in reliability, and that ensures the availability and integrity of the data. However, its read-write speed is relatively slow and prefetching and preheating operations are required, so it is not suitable for frequently accessed data.
Archival storage: storage that is the lowest in price but the slowest in read-write speed. It requires prefetching and preheating operations, and access involves a certain waiting time. Archival storage has high reliability and high security, and ensures the integrity and confidentiality of the data. It is therefore suitable for data that needs to be retained for a long period of time, but not for data that is frequently accessed.
Distributed system: also referred to as a distributed cluster, a group of interconnected computers or servers that cooperate to perform a common task. Computers or servers in a distributed system are typically located in different physical locations and are interconnected by a network. Distributed systems are commonly used to provide high availability, scalability, and fault tolerance for applications and services. In a distributed system, each computer or server may be referred to as a node, and multiple nodes constitute the distributed system. As an example, a distributed system may be as shown in fig. 1. Fig. 1 is a schematic structural diagram of a data access system provided by the present application. The data access system includes a distributed system 100, and the distributed system 100 includes a network device 130 and a plurality of server nodes, namely node 111, node 112, and node 113. A server node processes part of the data in the distributed system 100 and communicates with hosts (clients) via a network or a hardware communication channel. It should be understood that fig. 1 is only an example provided in this embodiment, and that the distributed system 100 may include more or fewer server nodes, which is not limited by the present application.
The distributed system 100 may support multiple hosts (also known as clients) for processing and accessing files, data, etc. in a server node. The plurality of hosts may include host 1, host 2, and host 3 shown in fig. 1, it being noted that the distributed system 100 may also support more or fewer hosts for data processing and access. The host, or client, user, etc. may be a computer running an application, and the computer running the application may be a physical machine or a virtual machine. For example, if the computer running the application is a physical computing device, the physical computing device may be a host or Terminal (Terminal).
In addition to the distributed clusters described above, the distributed clusters in the embodiments of the present application may also be cloud servers, e.g., public cloud servers, private cloud servers, hybrid cloud servers, etc., and while the cloud servers are not traditional distributed clusters (because the cloud servers are not a set of interconnected physical servers), they may be considered as one virtual distributed cluster. This is because cloud servers are typically hosted on a hypervisor, which is a virtualized platform that allows multiple virtual machines to run on a single physical server. Each virtual machine is equivalent to a node, and a plurality of nodes form a distributed cluster.
As shown in fig. 2, fig. 2 is a schematic structural and organizational diagram of a distributed system according to the present application, and the hardware implementation of the distributed system 200 may refer to the relevant content in fig. 1, which is not described herein. As shown in fig. 2, the distributed system 200 includes one or more nodes, with data in one node being processed and accessed by one server node.
A distributed system may also be referred to as a file system, which is a structured data file storage and organization form. As shown in fig. 2, a node includes a plurality of directories, such as directory 1, directory 2, and directory 3 shown in fig. 2. A plurality of files are stored under a directory, each file being operable to store one or more sets of data.
When a distributed system is in use, data governance needs to be performed on it periodically to ensure its high availability, so that files with low access heat are deleted or dumped. When data governance is performed on files stored in a distributed system, the access heat of each file in the whole distributed system needs to be determined so that a deletion or dump operation can be performed on the files with low access heat. Typically, the number of directories is much smaller than the number of files. However, the directory metadata table records only directory-level read heat information, not the read heat of the files in the directory. Currently, the scenarios in which the read heat information atime in the directory metadata table is updated are shown in table 1: atime is updated only when the directory is enumerated. That is, the current read heat information atime cannot indicate the access heat of the files in the directory. As a result, current data governance has to access the metadata of all files in the entire distributed system each time, and its efficiency is low.
TABLE 1
In order to solve the above-mentioned problems, an embodiment of the present application provides a data processing method, which is applied to a server of a distributed system. The method performs data governance by accessing the latest read time point of the directory. Compared with the existing data management, the method reduces the complexity of data access from the file level to the directory level, and greatly improves the data management efficiency.
Fig. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 3, the method may include the steps of:
Step S310, a data governance request for M directories of a distributed file system is received.
The data governance request may be initiated by a user on a client and sent by the client to the server, to request governance of the M directories.
Step S320, the M latest read time points recorded in the M directories are read in response to the data governance request.
Each directory corresponds to one latest read time point, and M is a positive integer greater than or equal to 1. The latest read time point of a directory is the latest of the N time points at which files in the directory were read in response to N file read requests for at least one file in that directory, where N is a positive integer greater than or equal to 1. For example, if N is 3, the time points corresponding to the N file read requests are T1, T2 and T3, and T3 is later than T1 and T2, then the latest read time point of the directory is T3. The files requested by the N file read requests may be the same file or different files. The latest read time point is determined from a plurality of read time points generated by the client. Specifically, a file read request indicates a file and a file storage address, where the file storage address includes the directory to which the file belongs and the server to which the directory belongs; for example, the file storage address may include a server identifier and a directory identifier inode. When the client receives a file read request, it records the corresponding read time point and sends the read time point and the corresponding file storage address to the server. A read request may request one or more files in one directory, or one or more files in different directories, and those directories may be stored on the same server or on different servers, without limitation.
After receiving multiple groups of read time points and corresponding file storage addresses, the server can identify the read time points that belong to the same directory and determine the latest read time point of that directory from among them. For example, suppose read requests received continuously within a preset period request files A, B and C (with read times T1, T2 and T3 respectively), and the file storage addresses of files A, B and C are server P1-directory I, server P1-directory I and server P2-directory U respectively. The read time points of directory I are then T1 and T2, and the read time point of directory U is T3. Since T2 is later than T1, the latest read time point of directory I is T2, and the latest read time point of directory U is T3. On this basis, the M latest read time points recorded in the M directories can be read.
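The per-directory reduction described above can be sketched as follows. This is a minimal illustration rather than the implementation of the application: the function name `latest_read_points`, the `(server, directory)` tuple key, and the use of plain numbers for T1, T2 and T3 are all assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

# A directory is identified here by (server id, directory id).
# Both identifiers are hypothetical stand-ins for the file storage address.
DirKey = Tuple[str, str]


def latest_read_points(reports: Iterable[Tuple[DirKey, float]]) -> Dict[DirKey, float]:
    """Keep only the latest read time point reported for each directory."""
    latest: Dict[DirKey, float] = {}
    for key, read_time in reports:
        if key not in latest or read_time > latest[key]:
            latest[key] = read_time
    return latest


# Files A, B and C are read at T1 < T2 < T3 (illustrative numeric values).
T1, T2, T3 = 1.0, 2.0, 3.0
reports = [
    (("P1", "I"), T1),  # file A in server P1, directory I
    (("P1", "I"), T2),  # file B in server P1, directory I
    (("P2", "U"), T3),  # file C in server P2, directory U
]
result = latest_read_points(reports)
```

With these inputs, directory I on P1 maps to T2 and directory U on P2 maps to T3, matching the example in the text.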
In one embodiment, the client may receive file read requests continuously over a preset period, so as to collect multiple file read requests. In other words, the client may aggregate the read time points of a preset period and then send them to the server. In this implementation, when the client aggregates the read time points of a preset period, it may determine the latest read time point of each directory and send only that latest read time point to the server. The determination follows the same principle the server uses to determine the latest read time point and is not described again. There may be multiple clients; when the server receives latest read time points for the same directory from several clients, it determines the final latest read time point from among them. This reduces the communication overhead that would result from the client and the server transmitting file heat in real time. The preset period can be set flexibly according to the application scenario. For example, if file heat accumulates quickly, the preset period may be set shorter; otherwise it may be set longer. For example, the preset period may be set to 100 ms.
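A rough sketch of this client-side aggregation and server-side merge is given below. The class `ClientAggregator`, its `record` and `flush` methods, and the function `merge_into_server` are hypothetical names introduced for illustration; the timer that would call `flush` once per preset period is left out so that the example stays deterministic.

```python
from typing import Dict


class ClientAggregator:
    """Collects read time points and keeps one latest value per directory."""

    def __init__(self) -> None:
        self._pending: Dict[str, float] = {}

    def record(self, directory: str, read_time: float) -> None:
        # Called for every file read request handled by this client.
        prev = self._pending.get(directory)
        if prev is None or read_time > prev:
            self._pending[directory] = read_time

    def flush(self) -> Dict[str, float]:
        # Called once per preset period; returns the batch to send and resets.
        batch, self._pending = self._pending, {}
        return batch


def merge_into_server(server_state: Dict[str, float], batch: Dict[str, float]) -> None:
    """Server side: keep the final latest read time point across all clients."""
    for directory, read_time in batch.items():
        if read_time > server_state.get(directory, float("-inf")):
            server_state[directory] = read_time


# Two clients read files in the same directory during one period.
client_1, client_2 = ClientAggregator(), ClientAggregator()
client_1.record("dir_I", 1.0)
client_1.record("dir_I", 4.0)
client_2.record("dir_I", 3.0)
client_2.record("dir_U", 2.0)

server_state: Dict[str, float] = {}
merge_into_server(server_state, client_1.flush())
merge_into_server(server_state, client_2.flush())
```

Each client sends at most one value per directory per period, instead of one message per read request, which is where the reduction in communication overhead comes from.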
Step S330, the reading heat of each directory is respectively determined according to the latest reading time point of each directory.
In determining the reading heat of each directory, the latest reading time point of the directory can be directly used as the reading heat. The latest reading time point can also be converted into other data types which can be quantitatively compared, and the method is not limited.
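One possible conversion into a quantitatively comparable type is seconds since the Unix epoch, sketched below. The function name `read_heat_from_time` and the choice of UTC timestamps are assumptions for illustration; the application does not prescribe a particular representation.

```python
from datetime import datetime, timezone


def read_heat_from_time(latest_read: datetime) -> float:
    """Map a latest read time point to seconds since the Unix epoch.

    A later read time gives a larger number, so "hotter" compares as "greater".
    """
    return latest_read.timestamp()


# Timezone-aware values are used so the result does not depend on the local zone.
heat_a = read_heat_from_time(datetime(2022, 8, 30, 12, tzinfo=timezone.utc))
heat_b = read_heat_from_time(datetime(2022, 8, 28, 15, tzinfo=timezone.utc))
```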
Step S340, whether to perform data governance on the directory corresponding to each read heat is determined according to each read heat.
After the read heat of each directory is determined, whether to govern the directory can be determined based on that read heat.
In the embodiment of the application, data governance is performed by accessing the latest read time point of each directory. Compared with existing data governance, the method reduces the complexity of data access from the file level to the directory level and greatly improves data governance efficiency.
In one embodiment, step S330 may include:
Comparing each reading heat with a heat threshold; when the reading heat is higher than the heat threshold, determining the directory to be a directory with high reading heat; and when the reading heat is lower than or equal to the heat threshold, determining the directory to be a directory with low reading heat.
The heat threshold can be set flexibly. For example, a threshold time point obtained by subtracting a preset duration from the current time may be used as the heat threshold. Specifically, assuming that the current time is 15:00 on September 1, 2022 and the preset duration is 72 hours, the heat threshold is 15:00 on August 29, 2022. If the reading heat of directory A is 12:00 on August 30, 2022, which is later than 15:00 on August 29, 2022 (that is, the reading heat is higher than the heat threshold), directory A is a directory with high reading heat. If the reading heat of directory B is 15:00 on August 28, 2022, which is earlier than 15:00 on August 29, 2022 (that is, the reading heat is lower than or equal to the heat threshold), directory B is a directory with low reading heat.
In this embodiment of the application, whether a directory has low or high reading heat can be determined accurately by comparing its reading heat with the heat threshold.
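The threshold comparison described above can be illustrated with a minimal Python sketch. It assumes that the latest reading time point is used directly as the reading heat; the function names heat_threshold and classify are hypothetical and do not appear in the application.

```python
from datetime import datetime, timedelta


def heat_threshold(now: datetime, preset: timedelta) -> datetime:
    """Threshold time point: the current time minus a preset duration."""
    return now - preset


def classify(read_heat: datetime, threshold: datetime) -> str:
    """Return 'high' if the reading heat is later than the threshold, else 'low'."""
    return "high" if read_heat > threshold else "low"


# Values taken from the example in the text.
now = datetime(2022, 9, 1, 15)
threshold = heat_threshold(now, timedelta(hours=72))

# Reading heat of each directory (its latest reading time point).
dir_heat = {
    "A": datetime(2022, 8, 30, 12),
    "B": datetime(2022, 8, 28, 15),
}
labels = {name: classify(heat, threshold) for name, heat in dir_heat.items()}
```

With these values, directory A is classified as high and directory B as low. A reading heat exactly equal to the threshold is classified as low, matching the "lower than or equal to" condition.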
In one embodiment, as shown in fig. 4, step S340 may include:
Step S410, forgoing access to the file heat in the directory with low reading heat.
Low reading heat of a directory indicates that the files in the directory have not been used recently. In this case, the file heat of the files in the directory need not be accessed. Instead, the directory is taken as the processing granularity, and the directory with low reading heat is governed by deleting, moving to low-frequency storage, or archiving all files in it.
Step S420, accessing the file heat in the directory with high reading heat.
High reading heat of a directory indicates that some files in the directory have been used recently. In this case, the file heat in the directory with high reading heat can be accessed so that the files in the directory are managed differentially. The file heat of a file is recorded in the attribute information of the file. Like the reading heat of a directory, the file heat is determined according to the latest reading time point of the file; for the specific determination process, refer to the process of determining the reading heat of a directory, which is not repeated here.
Step S430, performing data governance on the directory with high reading heat according to the file heat.
According to the file heat, it can be determined which files in the directory with high reading heat have high file heat and which have low file heat, and data governance is performed accordingly. Specifically, files with low file heat can be governed by one or more of deleting them, moving them to low-frequency storage, or archiving them. Files with high file heat can continue to be stored in the server.
In this embodiment of the application, directories are governed in a targeted manner according to their different reading heat, and the data governance process only needs to access the file heat in directories with high reading heat, which improves data governance efficiency.
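Steps S410 to S430 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the application: the classes FileMeta and Directory, the function govern, and the choice of "archive" as the governance action are all hypothetical, and the counter only serves to show that file heat is accessed solely in directories with high reading heat.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class FileMeta:
    name: str
    atime: datetime  # latest reading time point of the file (file heat)


@dataclass
class Directory:
    name: str
    atime: datetime  # latest reading time point of the directory (reading heat)
    files: list = field(default_factory=list)


def govern(directories, threshold, stats):
    """Return {path: action}; stats['file_heat_reads'] counts file-level accesses."""
    plan = {}
    for d in directories:
        if d.atime <= threshold:
            # S410: low reading heat, so the directory is handled as a whole
            # and the file heat inside it is never accessed.
            plan[d.name] = "archive"
            continue
        # S420: high reading heat, so the heat of each file is accessed.
        for f in d.files:
            stats["file_heat_reads"] += 1
            # S430: govern each file according to its own heat.
            action = "keep" if f.atime > threshold else "archive"
            plan[f"{d.name}/{f.name}"] = action
    return plan


threshold = datetime(2022, 8, 29, 15)
dirs = [
    Directory("A", datetime(2022, 8, 30, 12), [
        FileMeta("a1", datetime(2022, 8, 30, 12)),
        FileMeta("a2", datetime(2022, 8, 20, 9)),
    ]),
    Directory("B", datetime(2022, 8, 28, 15), [
        FileMeta("b1", datetime(2022, 8, 28, 15)),
        FileMeta("b2", datetime(2022, 8, 1, 8)),
    ]),
]
stats = {"file_heat_reads": 0}
plan = govern(dirs, threshold, stats)
```

In this sketch only the two files of directory A are inspected; directory B is archived as a unit without any file-level access.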
In one embodiment, as shown in fig. 5, the distributed file system further comprises a plurality of home storage nodes, the method further comprising:
Step S510, confirming the H home storage nodes where the M directories are respectively located.
H is a positive integer greater than or equal to 1, and a server node that stores a directory is referred to as the home storage node of that directory. A server node may store one or more directories, so the M directories may be stored in the same home storage node or in different home storage nodes. The server performing steps S510-S520 may be the same node as the home storage node of a directory, or may be a different node.
A read request may request one file or a plurality of files, and the requested files may be stored in the same home node or in different home nodes. When the files requested by a read request are stored in different home nodes, the access node that receives the read request sends it to each of those home nodes so as to obtain all the requested files. Illustratively, as shown in fig. 4, the files requested by the read request are stored in home node 41 and home node 42, respectively, so access node 40 sends the read request to home node 41 and home node 42, respectively, to obtain all the requested files.
Step S520, the M latest reading time points are recorded in the H home storage nodes, respectively.
Assume that there are three directories A, B and C, so that M is 3, and that the latest reading time points corresponding to directories A, B and C are T1, T2 and T3, respectively. As shown in fig. 6, the home storage node of directory A and directory B is node 1, the home storage node of directory C is node 2, and neither node 1 nor node 2 is the server that executes steps S510-S520. T1 and T2 are recorded in the corresponding directory attribute information in node 1, and T3 is recorded in the corresponding directory attribute information in node 2. For example, the latest reading time point may be recorded in the metadata table of the directory as ext_dir_atime. In other words, whenever a file in the directory is read, the ext_dir_atime of the directory is updated. The update scenarios of ext_dir_atime in the metadata table are shown in table 2.
TABLE 2
In the embodiment of the application, the latest reading time point of the catalogue is recorded in the corresponding home storage node and can be used for determining the reading heat of the catalogue in the home storage node.
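The recording of ext_dir_atime in home storage nodes can be sketched as follows. Only the field name ext_dir_atime comes from the text; the class HomeStorageNode, its in-memory metadata dictionary, and the placement mapping are hypothetical. Keeping the later of the stored and incoming time points is an assumption made here so that reads reported out of order do not move the value backwards.

```python
from datetime import datetime


class HomeStorageNode:
    """Hypothetical in-memory stand-in for a home storage node."""

    def __init__(self, name):
        self.name = name
        self.dir_metadata = {}  # directory name -> {"ext_dir_atime": datetime}

    def record_read(self, directory, read_time):
        """Update ext_dir_atime when a file in the directory is read."""
        meta = self.dir_metadata.setdefault(directory, {})
        previous = meta.get("ext_dir_atime")
        # Assumption: keep the latest of the N reading time points.
        if previous is None or read_time > previous:
            meta["ext_dir_atime"] = read_time


node1 = HomeStorageNode("node 1")
node2 = HomeStorageNode("node 2")

# S510: confirm the home storage node of each directory (as in the example:
# directories A and B on node 1, directory C on node 2).
placement = {"A": node1, "B": node1, "C": node2}

# S520: record the latest reading time point in the corresponding home node.
reads = [
    ("A", datetime(2022, 8, 30, 12)),
    ("B", datetime(2022, 8, 28, 15)),
    ("C", datetime(2022, 8, 31, 9)),
    ("A", datetime(2022, 8, 29, 7)),  # an older read must not overwrite T1
]
for directory, read_time in reads:
    placement[directory].record_read(directory, read_time)
```

After these reads, node 1 holds the latest reading time points of directories A and B, and node 2 holds that of directory C.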
In summary, the embodiment of the application performs data management by accessing the latest reading time point of the directory. Compared with the existing data management, the method reduces the complexity of data access from the file level to the directory level, and greatly improves the data management efficiency.
The above description has been presented with respect to the solution provided by the embodiment of the present application, mainly from the viewpoint of logic execution of each step. It will be appreciated that, in order to implement the above-mentioned functions, the server side includes corresponding hardware structures and/or software modules for executing the respective functions. Those of skill in the art will readily appreciate that the algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware, software, or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Fig. 7 shows a block diagram of a data processing apparatus 700, which is applied to a server side. The modules of the apparatus shown in fig. 7 perform functions corresponding to the steps in the method embodiment and achieve the corresponding technical effects. For the beneficial effects of the steps executed by each module, refer to the description of the corresponding steps in the method embodiment, which is not repeated here. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions. The data processing apparatus includes a receiving module 710, an acquisition module 720 and a determining module 730. The receiving module 710 is configured to receive a data governance request for M directories of the distributed file system. The acquisition module 720 is configured to read, in response to the data governance request, the M latest reading time points recorded in the M directories, respectively. The determining module 730 is configured to determine the reading heat of each directory according to the latest reading time point of that directory, and is further configured to determine, according to each reading heat, whether to perform data governance on the directory corresponding to that reading heat.
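As an illustration of how the three modules could cooperate when implemented in software, the following Python sketch maps each module to a small class. The class and method names, the dictionary form of the governance request, and the use of "high"/"low" labels as the decision output are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


class ReceivingModule:
    """Receives a data governance request for M directories."""

    def receive(self, request: dict) -> list:
        return list(request["directories"])


class AcquisitionModule:
    """Reads the latest reading time point recorded for each directory."""

    def __init__(self, dir_atime: dict):
        self.dir_atime = dir_atime  # directory name -> latest reading time point

    def acquire(self, directories: list) -> dict:
        return {d: self.dir_atime[d] for d in directories}


class DeterminingModule:
    """Determines the reading heat and the governance decision."""

    def __init__(self, threshold: datetime):
        self.threshold = threshold

    def read_heat(self, latest: dict) -> dict:
        # Assumption: the latest reading time point is used directly as the heat.
        return dict(latest)

    def decide(self, heat: dict) -> dict:
        return {d: ("high" if t > self.threshold else "low") for d, t in heat.items()}


recorded = {
    "A": datetime(2022, 8, 30, 12),
    "B": datetime(2022, 8, 28, 15),
}
request = {"directories": ["A", "B"]}
threshold = datetime(2022, 9, 1, 15) - timedelta(hours=72)

directories = ReceivingModule().receive(request)
latest = AcquisitionModule(recorded).acquire(directories)
determiner = DeterminingModule(threshold)
decision = determiner.decide(determiner.read_heat(latest))
```

The split into three classes mirrors the module division of fig. 7; as noted in the description, the assignment of steps to modules may be changed as needed.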
The modules described above may be implemented by software, or may be implemented by hardware. Illustratively, the implementation of the receiving module 710 is described next, taking the receiving module 710 as an example. Similarly, for the implementation of the acquisition module 720 and the determining module 730, refer to the implementation of the receiving module 710.
As an example of a software functional unit, the receiving module 710 may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Further, there may be one or more such computing instances. For example, the receiving module 710 may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code may be distributed in the same region or in different regions. Further, the multiple hosts/virtual machines/containers used to run the code may be distributed in the same availability zone (AZ) or in different AZs, each AZ comprising one data center or multiple geographically close data centers. Typically, a region may comprise a plurality of AZs.
Also, multiple hosts/virtual machines/containers for running the code may be distributed in the same virtual private cloud (virtual private cloud, VPC) or may be distributed in multiple VPCs. In general, one VPC is disposed in one region, and a communication gateway is disposed in each VPC for implementing inter-connection between VPCs in the same region and between VPCs in different regions.
As an example of a hardware functional unit, the receiving module 710 may include at least one computing device, such as a server. Alternatively, the receiving module 710 may be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The multiple computing devices included in the receiving module 710 may be distributed in the same region or in different regions. The multiple computing devices included in the receiving module 710 may be distributed in the same AZ or in different AZs. Likewise, the multiple computing devices included in the receiving module 710 may be distributed in the same VPC or in multiple VPCs. The multiple computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
It should be noted that, in other embodiments, the receiving module 710 may be configured to perform any step in the data processing method, the acquiring module 720 and the determining module 730 may be configured to perform any step in the data processing method, the steps that the receiving module 710, the acquiring module 720 and the determining module 730 are responsible for implementing may be specified as needed, and the receiving module 710, the acquiring module 720 and the determining module 730 implement different steps in the data processing method to implement all functions of the data processing apparatus.
The present application also provides a computing device 100. As shown in fig. 8, computing device 100 includes a bus 102, a processor 104, a memory 106, and a communication interface 108. Communication between the processor 104, the memory 106, and the communication interface 108 is via the bus 102. Computing device 100 may be a server or a terminal device. It should be understood that the present application is not limited to the number of processors, memories in computing device 100.
Bus 102 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. Buses may be divided into address buses, data buses, control buses, and so on. For ease of illustration, only one line is shown in fig. 8, but this does not mean that there is only one bus or one type of bus. Bus 102 may include a path for transferring information between the components of computing device 100 (e.g., memory 106, processor 104, communication interface 108).
The processor 104 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
The memory 106 may include volatile memory, such as random access memory (RAM). The memory 106 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid-state drive (SSD).
The memory 106 stores executable program codes, and the processor 104 executes the executable program codes to implement the functions of the foregoing receiving module 710, obtaining module 720 and determining module 730, respectively, so as to implement a data processing method. That is, the memory 106 has stored thereon instructions for performing the data processing method.
Or the memory 106 has stored therein executable code that the processor 104 executes to implement the functions of the aforementioned data processing apparatus, respectively, to thereby implement the data processing method. That is, the memory 106 has stored thereon instructions for performing the data processing method.
Communication interface 108 uses a transceiver module, such as but not limited to a network interface card or a transceiver, to enable communication between computing device 100 and other devices or communication networks.
The embodiment of the application also provides a computing device cluster. The cluster of computing devices includes at least one computing device. The computing device may be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device may also be a terminal device such as a desktop, notebook, or smart phone.
As shown in fig. 9, the cluster of computing devices includes at least one computing device 100. The same instructions for performing the data processing method may be stored in memory 106 in one or more computing devices 100 in the cluster of computing devices.
In some possible implementations, portions of the instructions for performing the data processing method may also be stored separately in the memory 106 of one or more computing devices 100 in the cluster of computing devices. In other words, a combination of one or more computing devices 100 may collectively execute instructions for performing the data processing method.
It should be noted that the memories 106 in different computing devices 100 in the computing device cluster may store different instructions for performing part of the functions of the data processing apparatus. That is, the instructions stored by the memory 106 in the different computing devices 100 may implement the functionality of one or more of the receiving module 710 and the obtaining module 720 and the determining module 730.
In some possible implementations, one or more computing devices in a cluster of computing devices may be connected through a network. Wherein the network may be a wide area network or a local area network, etc. Fig. 10 shows one possible implementation. As shown in fig. 10, two computing devices 100A and 100B are connected by a network. Specifically, the connection to the network is made through a communication interface in each computing device. In this type of possible implementation, instructions to perform the functions of the receiving module 710 are stored in the memory 106 in the computing device 100A. Meanwhile, instructions for performing the functions of the acquisition module 720 and the determination module 730 are stored in the memory 106 in the computing device 100B.
The manner of connection between the computing devices shown in fig. 10 may take into account the requirements of the data processing method provided by the present application; accordingly, the functions implemented by the acquisition module 720 and the determining module 730 are assigned to computing device 100B.
It should be appreciated that the functionality of computing device 100A shown in fig. 10 may also be performed by multiple computing devices 100. Likewise, the functionality of computing device 100B may also be performed by multiple computing devices 100.
The embodiment of the application also provides another computing device cluster. The connection between computing devices in the computing device cluster may be similar to the connection of the computing device cluster described with reference to fig. 9 and 10. In contrast, the same instructions for performing the data processing method may be stored in memory 106 in one or more computing devices 100 in the cluster of computing devices.
In some possible implementations, portions of the instructions for performing the data processing method may also be stored separately in the memory 106 of one or more computing devices 100 in the cluster of computing devices. In other words, a combination of one or more computing devices 100 may collectively execute instructions for performing the data processing method.
It should be noted that the memories 106 in different computing devices 100 in the computing device cluster may store different instructions for performing part of the functions of the data processing apparatus. That is, the instructions stored in the memories 106 of the different computing devices 100 may together implement the functions of the data processing apparatus.
Embodiments of the present application also provide a computer program product comprising instructions. The computer program product may be software or a program product containing instructions that is capable of running on a computing device or being stored in any usable medium. The computer program product, when run on at least one computing device, causes the at least one computing device to perform the data processing method.
The embodiment of the application also provides a computer readable storage medium. The computer readable storage medium may be any usable medium that a computing device can access, or a data storage device, such as a data center, that contains one or more usable media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc. The computer readable storage medium includes instructions that instruct a computing device to perform the data processing method.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the protection scope of the technical solution of the embodiments of the present invention.

Claims (13)

1. A data processing method, wherein the method is applied to a server side of a distributed file system, the method comprising:
Receiving data governance requests for M directories of the distributed file system;
Respectively reading M latest reading time points recorded in the M catalogues in response to the data management request, wherein each catalogue is recorded with one latest reading time point, and M is a positive integer greater than or equal to 1;
Determining the reading heat of each directory according to the latest reading time point of each directory, wherein the latest reading time point of each directory is the latest of N time points at which the server side respectively reads files in the directory based on N file reading requests for at least one file in the directory, and N is a positive integer greater than or equal to 1;
and determining whether to carry out data management on the catalogue corresponding to each reading heat according to each reading heat.
2. The method of claim 1, wherein the determining the read heat of each directory based on the latest read time point of each directory, respectively, comprises:
And comparing each reading heat with a heat threshold, determining the directory as a directory with high reading heat when the reading heat is higher than the heat threshold, and determining the directory as a directory with low reading heat when the reading heat is lower than or equal to the heat threshold.
3. The method of claim 2, wherein determining whether to perform data governance on the directory corresponding to each read heat according to each read heat comprises:
giving up accessing file hotness in the low read-hotness directory, and
Accessing the file heat in the directory with high reading heat, and carrying out data management on the directory with high reading heat according to the file heat.
4. A method according to any one of claims 1 to 3, wherein the data governance comprises one or more of:
delete low-heat files, low-frequency store low-heat files, or archive low-heat files.
5. The method of any of claims 1 to 4, wherein the distributed file system further comprises a plurality of home storage nodes, the method further comprising:
Respectively confirming H attribution storage nodes where the M catalogues are located;
and respectively recording the M latest reading time points in the H home storage nodes, wherein H is a positive integer greater than or equal to 1.
6. A data processing apparatus for application to a server of a distributed file system, the apparatus comprising:
The receiving module is configured to receive data governance requests for M directories of the distributed file system;
The acquisition module is configured to respectively determine, in response to the data governance request, M latest reading time points corresponding to the M directories, wherein each directory corresponds to one latest reading time point, and M is a positive integer greater than or equal to 1;
The determining module is configured to determine the reading heat of each directory according to the latest reading time point of each directory, wherein the latest reading time point of each directory is the latest of N time points at which the server side respectively reads files in the directory based on N file reading requests for at least one file in the directory, and N is a positive integer greater than or equal to 1;
the determining module is further configured to determine, according to each reading heat, whether to perform data governance on the directory corresponding to each reading heat.
7. The apparatus according to claim 6, wherein the determining module is specifically configured to:
And comparing each reading heat with a heat threshold, determining the directory as a directory with high reading heat when the reading heat is higher than the heat threshold, and determining the directory as a directory with low reading heat when the reading heat is lower than or equal to the heat threshold.
8. The apparatus of claim 7, wherein the determining module is specifically configured to:
And accessing the file heat in the directory with high reading heat, and performing data management on the directory with high reading heat according to the file heat.
9. The apparatus of any one of claims 6 to 8, wherein the data governance comprises one or more of:
delete low-heat files, low-frequency store low-heat files, or archive low-heat files.
10. The apparatus according to any one of claims 6 to 9, wherein the distributed file system further comprises a plurality of home storage nodes, the apparatus further comprises a recording module, and the determining module is further configured to respectively confirm the H home storage nodes where the M directories are located;
the recording module is configured to record the M latest reading time points in the H home storage nodes, respectively, where H is a positive integer greater than or equal to 1.
11. A cluster of computing devices, comprising at least one computing device, each computing device comprising a processor and a memory;
the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device to cause the cluster of computing devices to perform the method of any of claims 1-5.
12. A computer program product containing instructions which, when executed by a cluster of computing devices, cause the cluster of computing devices to perform the method of any of claims 1-5.
13. A computer readable storage medium comprising computer program instructions which, when executed by a cluster of computing devices, perform the method of any of claims 1-5.
CN202311435047.6A 2023-07-14 2023-10-30 Data processing method and device Pending CN119311638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2024/086651 WO2025015980A1 (en) 2023-07-14 2024-04-08 Data processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202310869573 2023-07-14
CN2023108695737 2023-07-14

Publications (1)

Publication Number Publication Date
CN119311638A true CN119311638A (en) 2025-01-14

Family

ID=94191271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311435047.6A Pending CN119311638A (en) 2023-07-14 2023-10-30 Data processing method and device

Country Status (2)

Country Link
CN (1) CN119311638A (en)
WO (1) WO2025015980A1 (en)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9298719B2 (en) * 2012-09-04 2016-03-29 International Business Machines Corporation On-demand caching in a WAN separated distributed file system or clustered file system cache
CN107679193A (en) * 2017-10-09 2018-02-09 郑州云海信息技术有限公司 A kind of hot statistics method and system for distributed file system
CN108846114A (en) * 2018-06-26 2018-11-20 郑州云海信息技术有限公司 Distributed system control method, device, equipment and readable storage medium storing program for executing
CN111158613B (en) * 2020-04-07 2020-07-31 上海飞旗网络技术股份有限公司 Data block storage method and device based on access heat and storage equipment

Also Published As

Publication number Publication date
WO2025015980A1 (en) 2025-01-23

Legal Events

Date Code Title Description
PB01 Publication