CN118426713B

CN118426713B - Cluster file distributed management method and system

Info

Publication number: CN118426713B
Application number: CN202410895874.1A
Authority: CN
Inventors: 张涛; 宋小波; 吴晓成
Original assignee: Beijing Tianhong Ruizhi Technology Co ltd
Current assignee: Tianhong Ruizhi Shanghai Software Technology Co ltd
Priority date: 2024-07-05
Filing date: 2024-07-05
Publication date: 2024-09-20
Anticipated expiration: 2044-07-05
Also published as: CN118426713A

Abstract

The present invention provides a cluster file distributed management method and system, which relates to the field of file management technology, including deploying a metadata server cluster; dividing files into objects of preset sizes, and storing multiple objects respectively on multiple storage nodes in a file system based on a consistent hashing algorithm, wherein an object storage service is respectively run on each storage node; in response to a master metadata server receiving a file request sent by a client, the file request is implemented on a corresponding storage node based on a proxy service between the client and the storage node; wherein in the case of failure of the master metadata server, the master metadata server switches to a slave metadata server to receive the file request sent by the client.

Description

Cluster file distributed management method and system

Technical Field

The present invention relates to file management technologies, and in particular, to a method and system for distributed management of cluster files.

Background

With the rapid development of information technology and the rapid increase of data volume, the traditional centralized file storage and management mode cannot meet the storage and access requirements of mass data. The centralized file system faces the problems of poor expansibility, single point failure, performance bottleneck and the like, and is difficult to cope with the continuous increase of the data scale and the high concurrent access of the service.

The cluster file distributed management technology has been developed, and by storing file data on a plurality of server nodes in a scattered manner, the distributed processing of file storage and access is realized, and the limitation of the traditional centralized file system is effectively solved.

Although the existing cluster file distributed management technology solves the problems of mass data storage and access to a certain extent, some disadvantages still exist:

Metadata management performance bottleneck-when the number of files and the concurrency of access are high, a single metadata server may not be able to efficiently process all requests, becoming a performance bottleneck for the system.

And the cost of data replication is that in order to ensure the reliability of the data, multiple copies of each data block are required to be replicated, a large amount of storage space and network bandwidth are occupied, and the storage and management cost of the system is increased.

The granularity of the data blocks is that the data blocks with fixed sizes can not adapt to the characteristics of different types of files, so that the waste of storage space and the reduction of read-write performance are caused.

The difficulty of load balancing is that various factors such as storage capacity, access heat, network topology and the like of the nodes need to be considered in dynamically adjusting the distribution and access strategy of the data blocks, so that the implementation complexity is high.

Therefore, how to further optimize the cluster file distributed management technology, improve the performance of metadata management, reduce the cost of data replication, adapt to the characteristics of different files, and realize efficient load balancing becomes a key problem to be solved urgently.

Disclosure of Invention

The embodiment of the invention provides a distributed management method and a distributed management system for cluster files, which at least can solve part of problems in the prior art.

In a first aspect of an embodiment of the present invention,

The distributed management method for cluster files comprises the following steps:

Deploying a metadata server cluster, wherein the metadata server cluster comprises a master metadata server and at least one slave metadata server, the master metadata server is used for managing a file system naming space and maintaining metadata of a directory tree structure and files, and the slave metadata server is used for backing up the master metadata server;

dividing a file into objects with preset sizes, and respectively storing a plurality of objects on a plurality of storage nodes in a file system based on a consistent hash algorithm, wherein an object storage service is respectively operated on each storage node, and is used for the persistent storage and access of the objects;

And responding to the file request sent by the client terminal received by the main metadata server, and realizing the file request on the corresponding storage node based on a proxy service between the client terminal and the storage node, wherein the file request comprises any one or more of a file creation request, a file writing request and a file reading request, and the main metadata server is switched to the slave metadata server to receive the file request sent by the client terminal under the condition that the main metadata server fails.

In an alternative embodiment of the present invention,

Storing a plurality of objects on a plurality of storage nodes in a file system, respectively, based on a consistent hashing algorithm, comprising:

Mapping the object name of the object to an integer hash value in a hash value space with a fixed size based on a hash function, dividing the hash value space into a plurality of subintervals respectively corresponding to storage nodes, and storing the object corresponding to the integer hash value in the subinterval on the storage node corresponding to the subinterval;

The method further comprises the steps of:

The method comprises the steps of collecting load index data of each storage node at preset time intervals, wherein the load index data comprise any one or more of read-write request quality of object storage service processing on the storage node in unit time, response time of object storage service on the storage node, CPU (central processing unit) utilization rate occupied by an object storage service process on the storage node and memory resource proportion occupied by the object storage service process on the storage node;

and sending the preset load index data to a main metadata server, enabling the main metadata server to generate a global load view based on the load index data on each storage node, judging the load state of each storage node, and selecting objects on the storage nodes to migrate based on the load state of each storage node.

In an alternative embodiment of the present invention,

The metadata server generates a global load view based on load index data on each storage node, judges the load state of each storage node, and selects to migrate objects on the storage nodes based on the load state of each storage node, including:

setting an alert threshold and a critical threshold for each load index data, wherein the alert threshold is less than the critical threshold;

responding to any load index data on a storage node in a preset time period to be larger than an alarm threshold value and smaller than a critical threshold value, and reducing the critical threshold value according to a preset proportion;

And in response to any load index data on a storage node not being smaller than the critical threshold, determining the storage node as a high load node, and selecting a target node from other storage nodes based on a global load view for receiving an object on the high load node, wherein the target node meets the following conditions: the load index data of the target node is smaller than a critical threshold, the storage space and the input/output bandwidth of the target node meet preset conditions, and the network distance between the target node and the high-load node is smaller than a preset threshold;

Determining an object with the access frequency larger than a preset threshold value in the high-load node as an object to be migrated based on the data access mode of the high-load node, and generating a migration task list based on the distribution of the object to be migrated in the high-load node and the position of a target node;

The metadata server sends the migration task list to a high-load node, so that the high-load node sends an object to be migrated to a target node by using a streaming or object-based transmission mode, and the high-load node pauses the modification operation of the object to be migrated in the transmission process;

and the target node synchronizes a confirmation message with the high-load node and the metadata server under the condition that the object to be migrated is received, so that the high-load node deletes local data to be migrated based on the confirmation message, and the metadata server updates global load information based on the confirmation message.

In an alternative embodiment of the present invention,

And responding to the file request sent by the client side received by the main metadata server, and realizing the file request on the corresponding storage node based on the proxy service between the client side and the storage node, wherein the method comprises the following steps:

responding to the file request sent by the client, determining metadata information of a file according to a file path in the file request, wherein the metadata information comprises an object list and an object identifier of the file, and calculating an object range corresponding to the file request by using a proxy service according to the metadata information, the offset and the length of the file request so as to convert the file request into an object request, wherein the object request is used for accessing an object on a storage node;

searching an object to be accessed in a cache of the proxy service based on the object request, responding to the hit of the object to be accessed in the cache of the proxy service, returning the object to be accessed to the client, and storing the object accessed by the client in the cache of the proxy service within a preset duration;

And responding to the object to be accessed in the cache of the proxy service, selecting one storage node based on a data distribution strategy, load index data of the storage node and network topology, and sending the object request to the storage node to return the object to be accessed based on the object request, wherein the data distribution strategy is realized based on a consistent hash algorithm.

In an alternative embodiment of the present invention,

The method further comprises the steps of:

When the free space in the cache of the proxy service reaches a preset threshold value, analyzing the historical access record of the client to determine the mode of the client for accessing data, sequentially reading the follow-up files of the files into the cache of the proxy service when the client accesses the files in response to the mode of the client for accessing the data, and reading the files with the frequency higher than the threshold value in the preset duration into the cache of the proxy service based on the frequency of the client for accessing the files in response to the mode of the client for accessing the data;

Recording the time stamp of the file when the buffer memory of the proxy service reads the file, comparing the time stamp of the object to be accessed corresponding to the buffer memory of the proxy service with the time stamp of the object to be accessed corresponding to the storage node when the object request exists, and responding to the fact that the time stamp of the object to be accessed corresponding to the buffer memory of the proxy service is smaller than the time stamp of the object to be accessed corresponding to the storage node, reading the object to be accessed corresponding to the object request from the storage node.

In an alternative embodiment of the present invention,

Deploying a metadata server cluster, comprising:

determining an odd number of server nodes in a metadata server cluster, wherein each server node runs the same metadata service program, and determining configuration information of each server node in the metadata server cluster in a static configuration or dynamic discovery mode, wherein the configuration information comprises a network address and a port number, and the server nodes are connected with each other through a network based on the configuration information;

Each server node is used as a follower node, and the follower nodes are changed into candidate nodes and initiate a new round of election in response to each follower node not receiving heartbeat messages sent by the leader nodes within a preset election duration, wherein the election is used for determining a new leader node from the candidate nodes;

Updating the period number by each candidate node in the new round of election and sending voting requests to other candidate nodes, and voting to the candidate node sending the voting request and updating the period number of the candidate node when the other candidate nodes do not vote in the new round of election and the period number included in the voting request is larger than the period number of the candidate node;

the candidate node with the highest votes received is taken as a new leader node, and other candidate nodes are taken as new follower nodes, wherein the leader node is a master metadata server in a metadata server cluster, and the follower nodes are slave metadata servers in the metadata server cluster.

In an alternative embodiment of the present invention,

In the event that the master metadata server fails, ceasing to send heartbeat messages to other slave metadata servers, the method further comprising:

encapsulating a file request of a client into a log entry by a main metadata server and adding the log entry to a log of the main metadata server, wherein the log entry comprises an index identifier and a current optional number on the main metadata server;

The master metadata server sends an addition log request to each slave metadata server, wherein the addition log request comprises the log entries;

each slave metadata server responds to the received addition log request, and the addition log request is refused when the tenure number in the addition log request is smaller than the current tenure number of the slave metadata server or log entries with the same index identification but different current tenure numbers exist on the slave metadata server;

Otherwise, receiving the log adding request and adding the log entry into the log of the slave metadata server, and returning a response log adding message to the master metadata server;

When the primary metadata server receives a predetermined number of response-add log messages, the log entries are marked as all metadata servers apply the log entries at the time of commit.

In a second aspect of an embodiment of the present application,

There is provided a cluster file distributed management system comprising:

A first unit configured to deploy a metadata server cluster, where the metadata server cluster includes a master metadata server and at least one slave metadata server, where the master metadata server is configured to manage a file system namespace and maintain metadata of a directory tree structure and files, and the slave metadata server is a backup of the master metadata server;

A second unit, configured to divide a file into objects of a preset size, and store the objects on a plurality of storage nodes in a file system respectively based on a consistent hash algorithm, where an object storage service is executed on each storage node respectively, and the object storage service is used for persistent storage and access of the objects;

And a third unit, configured to, in response to the master metadata server receiving a file request sent by the client, implement the file request on a corresponding storage node based on a proxy service between the client and the storage node, where the file request includes any one or more of a file creation request, a file writing request, and a file reading request, and in case of failure of the master metadata server, switch from the master metadata server to the slave metadata server to receive the file request sent by the client.

In a third aspect of an embodiment of the present invention,

There is provided an electronic device including:

A processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.

In a fourth aspect of an embodiment of the present invention,

There is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.

According to the application, the high reliability of metadata management is realized by deploying the master-slave metadata server cluster. The main metadata server is responsible for managing file system namespaces, maintaining directory tree structures and file metadata, and the slave metadata server is used as a backup of the main metadata server, automatically takes over the functions of the main metadata server when the main metadata server fails, and ensures the continuous availability of metadata services.

The load balancing of file storage and access is realized by dividing the file into objects with preset sizes and adopting a consistent hash algorithm to store the objects on a plurality of storage nodes in a distributed manner. Each storage node operates independent object storage service, and the persistent storage and access of the objects are completed on the local nodes, so that the overhead of network transmission is avoided, and the read-write performance of files is improved.

By introducing proxy service between the client and the storage node, the file request of the client is forwarded to the corresponding storage node for processing, and distributed concurrency of file access is realized. Multiple clients can initiate file requests to different storage nodes at the same time, and each storage node processes the requests in parallel, so that the concurrency and throughput of file access are obviously improved.

The horizontal expandability of file storage is realized by adopting the object storage service and the consistent hash algorithm. When the storage capacity is insufficient, new storage nodes can be conveniently added, the consistency hash algorithm can automatically adjust the distribution of objects among the nodes, data migration is not needed, and the expandability and the flexibility of the system are improved. Interaction details between the client and the storage nodes are shielded through the proxy service, the client does not need to be concerned about which node the file is specifically stored on, does not need to know the increase, decrease and change of the storage nodes, and achieves transparency of file access. The client only needs to interact with the proxy service, and the proxy service automatically routes to the corresponding storage node for processing according to the type and the parameters of the file request.

Drawings

FIG. 1 is a flow chart of a distributed cluster file management method according to an embodiment of the present invention;

fig. 2 is a schematic structural diagram of a cluster file distributed management system according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.

Fig. 1 is a flow chart of a cluster file distributed management method according to an embodiment of the present invention, as shown in fig. 1, where the method includes:

S101, deploying a metadata server cluster, wherein the metadata server cluster comprises a master metadata server and at least one slave metadata server, the master metadata server is used for managing a file system naming space and maintaining metadata of a directory tree structure and files, and the slave metadata server is used for backing up the master metadata server;

In an alternative embodiment of the present invention,

Deploying a metadata server cluster, comprising:

Illustratively, deploying the primary metadata server comprises;

And selecting a server node with higher performance and better stability from the cluster file system as a main metadata server. The primary metadata server needs to configure enough memory and disk space for storing and managing metadata information for the file system. Metadata management software, such as APACHE HDFS's NameNode or Ceph's MDS (Metadata Server), is deployed on the primary metadata server.

The main responsibilities of the primary metadata server include:

The namespaces of the file system are managed, and directory tree structures and file metadata, such as file names, file sizes, permissions, etc., are maintained. And processing file metadata requests of the client, such as file creation, deletion, renaming, inquiry and the like. And managing state information of the storage nodes, such as storage capacity, heartbeat and the like, and distributing the storage nodes to the clients according to the load balancing policy. And recording the mapping relation between the file data block and the storage node for reading, writing and recovering the file.

The deployment slave metadata server includes;

to improve the reliability and availability of metadata services, one or more secondary metadata servers are deployed in a clustered file system as backups of the primary metadata servers. The slave metadata server has the same hardware configuration and software environment as the master metadata server.

The main responsibilities from the metadata server include:

And maintaining real-time synchronization with the main metadata server, and periodically acquiring the latest metadata information from the main metadata server, wherein the latest metadata information comprises a directory tree structure, file metadata, storage node states and the like. When the main metadata server fails, the work of the main metadata server is automatically taken over, metadata service is continuously provided for the client, and the continuity of the service is ensured. And the main metadata server is assisted to share the load of metadata management, so that the performance and throughput of the metadata service are improved.

Further:

Determining a server node in a metadata server cluster:

In building a metadata server cluster, an odd number of server nodes, typically 3, 5 or 7 nodes, are selected. The odd nodes can avoid the condition that the number of tickets is the same in the election process, and ensure that the election can be smoothly carried out. Each server node needs to be configured with the same hardware resources and software environment, and consistency and reliability of metadata service are guaranteed.

The same metadata service, such as the MDS of the NameNode or Ceph of APACHE HDFS, runs on each server node. The metadata service program is responsible for managing metadata information of the file system, such as directory tree structures, file attributes, etc., and providing metadata access services for clients.

Configuring connection information of each server node in the metadata server cluster:

In order for the server nodes in the metadata server cluster to be able to communicate and cooperate with each other, it is necessary to configure connection information between them. The connection information includes a network address (e.g., IP address) and a communication port number of each server node.

There are two ways to configure the connection information:

static configuration, namely manually configuring network addresses and port numbers of other nodes on each server node in advance. Static configuration is suitable for the situation that the number of nodes is small and the network environment is stable.

Dynamic discovery-dynamic discovery and connection between server nodes is achieved using a distributed coordination service, such as ZooKeeper or etcd. Each server node registers own connection information with the coordination service when starting, and acquires connection information of other nodes from the coordination service. Dynamic discovery is suitable for the situation that the number of nodes is large and the network environment dynamically changes.

Based on the configured connection information, the server nodes in the metadata server cluster are mutually connected through a network to form a completely connected network topology.

Initializing node roles in the metadata server cluster:

The metadata server cluster employs Raft protocols to achieve distributed consistency and fault tolerance. Three node roles are defined in the Raft protocol, leader, follower (Follower) and Candidate (CANDIDATE). Initially, all server nodes are set to follower roles. The follower node passively receives heartbeat messages and log replication requests from the leader node and participates in the voting process.

The election mechanism of the leader node is realized:

the Raft protocol selects a leader node from the follower nodes through an election mechanism, the leader node being responsible for coordinating and managing the entire metadata server cluster.

The election process is as follows:

When the follower node does not receive the heartbeat message from the leader node within the preset election duration, the current leader node is considered to be invalid, the role of the follower node is changed into a candidate node, and a new round of election is initiated. The candidate node first increases its own tenure number (Term) which is used to identify the round of election. The candidate node then sends a voting request (RequestVote) to all other nodes requesting that the other nodes vote on itself.

When other nodes receive the voting request of the candidate node, if the period number contained in the request is larger than the period number of the node, and no vote is cast in the current period, voting the candidate node, and updating the period number of the node. The candidate node gathers voting responses from other nodes and becomes the new leader node if more than half of the nodes' votes are obtained. The leader node sends heartbeat messages to other nodes informing themselves of the winning and beginning to receive and process client requests. After receiving the heartbeat message of the new leader node, the unselected candidate node changes the role of the candidate node into the follower node, and starts to accept log replication and heartbeat message of the leader node.

Through the above election mechanism, the metadata server cluster can dynamically elect one leader node, i.e., the master metadata server, and the other nodes act as follower nodes, i.e., the slave metadata servers.

In order to ensure that the metadata information of each node in the metadata server cluster is consistent, the leader node needs to synchronize the update operation of the metadata to all follower nodes. The synchronization process is as follows:

when the leader node receives a write request from a client, the request is converted into a Log Entry (Log Entry) and added to its own Log (Log). The leader node sends a log replication request (APPENDENTRIES) to all follower nodes, the request containing a new log entry.

After receiving the log replication request, the follower node checks any period number and log index in the request, and if the random period number and the log index are matched with each other, the follower node accepts a new log entry and adds the new log entry to the log of the follower node. After the follower node completes the log replication, an acknowledgement response is sent to the leader node (AppendEntriesResponse) indicating that the log replication was successful.

The leader node gathers acknowledgement responses from the follower nodes, considers that the log entry has been securely replicated on most nodes when the number of acknowledgement responses reaches most (more than half), applies the log entry to the metadata state machine, and returns a response to the write request to the client. The follower node asynchronously applies the log entries that have been replicated to its own metadata state machine, maintaining metadata consistency with the leader node.

Through the synchronization mechanism, the real-time synchronization of the metadata information between the leader node and the follower node can be realized, and the consistency and reliability of the metadata service are ensured.

In order to improve the performance and throughput of the metadata server cluster, optimization measures may be taken such as batch replication optimization in which a leader node may merge multiple write requests into one batch log replication request, reducing the number of network communications and latency. And (3) optimizing pipeline replication, namely, the leader node can send log replication requests to a plurality of follower nodes in parallel, so that the efficiency of log replication is improved. Snapshot optimization, namely the leader node can periodically snapshot the metadata state machine and send the snapshot to the follower node, so that the data volume and recovery time of log replication are reduced.

In an alternative embodiment of the present invention,

Illustratively, when the primary metadata server receives a file request from a client, the request is encapsulated into a Log Entry (Log Entry). The log entry contains the following information:

Index identification (Index) is used to identify the location of a log entry in the log, typically an increasing integer. The expiration number (Term) indicates the expiration number in which the current primary metadata server is located and is used to distinguish log entries between different primary metadata servers. The requested Data (Data) contains specific contents requested by the client, such as file path, operation type (creation, writing, reading, etc.), data contents, etc.

The master metadata server adds the generated journal entries to its own journal (Log), which is an ordered, persistent data structure that records the state changes of the metadata server.

The master metadata server synchronizes the newly added log entries to all the slave metadata servers to ensure consistency of metadata. The method comprises the following specific steps:

the master metadata server sends an add log request to each slave metadata server in parallel (APPENDENTRIES). The added log request contains new log entries, and the current tenure number and log index information of the master metadata server.

After each slave metadata server receives the request of adding the log of the master metadata server, the following processing is performed:

It is checked whether the tenn (Term) in the request is smaller than its current tenn. If the period number of the request is small, indicating that the period of the master metadata server has expired, the request is denied from the metadata server. It is checked whether an Index identification (Index) of the log entry in the request is already present in the log from the metadata server. If there are log entries of the same index already, but with different arbitrary numbers, this indicates that the log is in conflict and the request is denied from the metadata server.

If the above checks are passed, the metadata server receives a request to add a log, and adds a log entry to its own log, thereby keeping the log of the metadata server consistent. A response add log message (AppendEntriesResponse) is returned from the metadata server to the master metadata server indicating that the log addition was successful.

The master metadata server submits log entries:

The master metadata server collects response addition log messages from the slave metadata servers and counts the number of slave metadata servers that successfully add logs. If the number of secondary metadata servers that successfully add the log reaches a predetermined number (typically more than half), the primary metadata server considers that the log entry has been securely replicated to most nodes. The primary metadata server marks the journal entry as Committed (Committed), indicating that the entry has been persisted and validated. The master metadata server applies the committed journal entries to its own metadata state machine (METADATA STATE MACHINE) to update the state of the metadata. The master metadata server sends a notification commit message (CommitNotification) to the slave metadata server, notifying the slave metadata server to mark all log entries prior to the specified index as committed and applied to the metadata state machine of the slave metadata server.

After receiving the notification submission message of the master metadata server from the metadata server, the following operations are performed:

all log entries in the notification message prior to the specified index are marked as committed. The submitted journal entries are sequentially applied to the metadata state machines of the slave metadata servers in keeping with the metadata state of the master metadata server. After the slave metadata server completes the log application, a confirm application message (Apply Confirmation) is sent to the master metadata server indicating that the log entry has been successfully applied.

Processing of primary metadata server failure:

in the event of failure of the master metadata server, it is necessary to stop sending heartbeat messages to the slave metadata server, triggering a new round of leader election. The specific treatment is as follows:

the master metadata server stops sending heartbeat messages to the slave metadata server indicating that the master metadata server has failed. The slave metadata server does not receive the heartbeat message of the master metadata server within a certain time, considers that the master metadata server has failed, and starts a new round of leader election.

The election process refers to the leader election scheme before and elects a new master metadata server from among the metadata servers through Raft protocol. The newly selected primary metadata server begins receiving client requests and restarting the log replication and synchronization process.

Through the above detailed technical scheme, log copying and synchronization between the master metadata server and the slave metadata server are realized, and consistency and reliability of metadata are ensured. Meanwhile, by adopting a leader election mechanism of Raft protocol, the problem of failover when the main metadata server fails is solved, and the high availability of metadata service is improved.

S102, dividing a file into objects with preset sizes, and respectively storing a plurality of objects on a plurality of storage nodes in a file system based on a consistent hash algorithm, wherein an object storage service is respectively operated on each storage node and is used for the persistent storage and access of the objects;

In an alternative embodiment of the present invention,

The method further comprises the steps of:

By way of example only, and in an illustrative,

Defining a hash function: an appropriate hash function, such as SHA-256 or MD5, is selected for mapping the object name of the object to a hash value of a fixed size. The hash function should have the characteristics of uniform distribution and low collisions. A hash value space of a fixed size is defined, typically using an integer range representation, such as 0 to 2-32-1. The hash value space represents a set of all possible hash values.

Storage nodes in the file system are mapped to different locations on the hash value space. The location of the storage node on the hash value space may be calculated by a hash function using as input a unique identifier of the storage node, such as an IP address or node ID. The hash value space is divided into a plurality of subintervals, and each subinterval corresponds to one storage node. Simple interval partitioning methods, such as average partitioning or node weight partitioning, may be used. Each subinterval represents a range of hash values that the storage node is responsible for storing.

When an object needs to be stored, the following steps are performed:

And applying a hash function according to the object name of the object, and calculating the hash value of the object. And determining the subinterval to which the object belongs according to the hash value of the object, namely determining the storage node responsible for storing the object. The object is stored on the corresponding storage node.

Through the steps, object storage based on a consistent hash algorithm is realized, and the objects are stored on a plurality of storage nodes in a distributed mode.

The load index data to be collected is determined, which may include the following indexes:

Storing the number of read-write requests processed by the object storage service on the node in unit time; response time of the object storage service on the storage node; the CPU utilization rate occupied by the object storage service process on the storage node; objects on the storage nodes store the proportion of memory resources occupied by the service process.

The time interval for collecting load index data is determined, such as once every 5 minutes. The time interval should be adjusted according to the actual requirements and performance requirements of the system. And deploying a load acquisition component on each storage node, and periodically acquiring load index data of the storage nodes. The acquisition component can be implemented using a system monitoring tool or custom script. And sending the collected load index data to a main metadata server. The data may be transmitted to the primary metadata server using a message queue, RPC, or other communication mechanism.

The master metadata server receives load index data from each storage node and performs a summary process. And the main metadata server generates a global load view according to the received load index data. The global load view reflects the load status of each storage node in the entire file system.

The master metadata server analyzes the global load view and judges the load state of each storage node. A load threshold may be set and a storage node is considered to be in a high load state when its load indicator exceeds the threshold. The main metadata server selects an object to be migrated according to the load state of the storage node. Some objects on high load nodes are typically selected for migration to achieve load balancing. And the main metadata server determines a target storage node of the migration object according to the consistent hash algorithm. The node with lower load can be selected as a migration target to realize load balancing.

And the main metadata server sends migration instructions to the source storage node and the target storage node, and coordinates the migration process of the object. The source storage node transmits the object data to the target storage node and updates metadata information of the object. After the migration of the object is completed, the main metadata server updates metadata information of the object and records a new storage position of the object. And simultaneously, informing the related storage nodes and clients to enable the storage nodes and clients to perceive the migration of the object.

Through the above detailed technical scheme, object storage based on the consistent hash algorithm is realized, and a load balancing mechanism is introduced. By collecting load index data of the storage nodes, generating a global load view, judging the load state of the nodes, selecting objects to be migrated, and executing object migration, load balancing among the storage nodes is realized, and the performance and expandability of the file system are improved.

In an alternative embodiment of the present invention,

Illustratively, in order to further improve the load balancing of the system, a dynamic load balancing policy may be introduced. Specifically, the system monitors the load condition of each OSD (object storage service, object Storage Device, OSD) in real time, including I/O throughput, response time, CPU and memory usage, etc. When the load of a certain OSD exceeds a preset threshold, the system triggers a data redistribution operation to shift part of data from a high-load OSD to a low-load OSD so as to realize load balancing.

In order to further improve the load balance of the distributed storage system, the application provides a dynamic data redistribution strategy based on real-time load monitoring. The strategy dynamically adjusts the distribution of data among the OSD by monitoring the load state of each OSD in real time, and transfers the load from a high load node to a low load node, thereby realizing the balance of the whole load of the system. The key techniques and implementation steps of this strategy will be described in detail below.

Firstly, in order to comprehensively evaluate the load state of the OSD node, an appropriate load monitoring index needs to be selected. The application selects the following key indexes:

I/O throughput, which represents the number of read/write requests processed by OSD in unit time, reflects the I/O load of the OSD. Specifically including two indicators of IOPS (I/O operands per second) and throughput (MB/s).

Response time, representing the average time taken for the OSD to process an I/O request, reflects the quality of service of the OSD. Specifically, the method comprises two indexes of average response time and tail response time (such as 99-minute response time).

CPU utilization rate, which represents the proportion of CPU resources occupied by OSD process and reflects the calculation load of OSD.

Memory utilization rate, which represents the memory resource proportion occupied by OSD process and reflects the memory load of OSD. The indexes describe the load condition of the OSD nodes from different angles, and comprehensively consider factors such as I/O, calculation, memory and the like.

In order to collect these load metrics, the system deploys a lightweight monitoring module on each OSD node. The module periodically collects various index data at regular time intervals (e.g., 1 second) and sends the data to a central load manager. The load manager gathers the load data of all OSD nodes to form a global load view, and provides basis for subsequent data redistribution decisions. Meanwhile, the load manager can continuously update the load data, so that the real-time performance of the data is ensured.

After the load data of each OSD node is obtained, the system needs to determine which nodes are in a high load state, and needs to redistribute the data. To this end, the present application introduces the concept of a load threshold.

For each load monitoring indicator, the system sets two thresholds, an alert threshold and a critical threshold. When the load index of a certain OSD node exceeds the warning threshold, the system considers the node to be in a potential high-load state, and when the load index further exceeds the critical threshold, the system confirms that the node is a high-load node and needs to trigger the data redistribution operation.

The threshold value needs to be set by comprehensively considering the factors such as the performance requirement of the system, hardware configuration and the like. In general, the guard threshold may be set to 70% -80% of the maximum load capacity of the node, and the critical threshold may be set to 90% -95% of the maximum load capacity. Thus, the high-load node can be found in time, and frequent data migration operation can be avoided. In addition, the threshold value can be dynamically adjusted according to the actual running condition of the system. For example, if the load of a certain OSD node continues to be above the alert threshold but below the critical threshold, and the state continues for a longer period of time (e.g., 30 minutes), the system may appropriately lower the threshold of that node to trigger the data redistribution earlier. Conversely, if the load of an OSD node frequently exceeds a critical threshold, but the load quickly returns to normal after data redistribution, the system may raise the threshold of that node appropriately to reduce unnecessary data migration.

And when the load manager detects that the load of a certain OSD node exceeds a critical threshold value, triggering a data redistribution flow. The specific operation steps are as follows:

Target node selection the load manager first needs to select the appropriate target node from the global load view for receiving the data of the high load node. The candidate target node should meet the condition that the load level is below the alert threshold; the network distance between the high-load node and the high-load node is as close as possible, so that the cost of data migration is reduced. The load manager may combine these factors to select one or more optimal target nodes.

Migration data selection-after determining the target node, the load manager needs to decide which data to migrate from the high load node. Ideally, higher load hotspot data should be preferentially migrated. For this purpose, the system analyzes the data access pattern of the high load node, and identifies the data object or data area with higher access frequency. At the same time, to avoid data inconsistencies during migration, the system may prefer those independent data blocks that are less associated with other data.

And generating a migration task list by the load manager according to the distribution of the migration data and the position of the target node. Each migration task contains information such as a source OSD node, a target OSD node, a starting position and a size of migration data, and the like. These tasks are issued to the corresponding OSD nodes, which are responsible for specific data transmission and synchronization.

And data transmission and synchronization, namely, the OSD nodes participating in data migration execute actual data transmission operation according to the migration task issued by the load manager. Data transmission is accomplished through network connection between OSD nodes, either streaming or object-based. In order to ensure data consistency, during the transmission process, the source OSD node may suspend the modification operation on the migration data until the target OSD node completes data synchronization. At the same time, the metadata server updates the location information of the data object, ensuring that subsequent I/O requests can be properly routed to the new location.

And (3) confirming migration completion, namely after the target OSD node completes data synchronization, sending a confirmation message to the source OSD node. After the source OSD node receives the confirmation, the local migration data can be safely deleted, and the storage space is released. Meanwhile, the source OSD node and the target OSD node report the completion condition of the migration task to the load manager, and the load manager updates the global load information and the data distribution state according to the completion condition.

And S103, responding to the file request sent by the client side received by the main metadata server, and realizing the file request on the corresponding storage node based on the proxy service between the client side and the storage node, wherein the file request comprises any one or more of a file creation request, a file writing request and a file reading request, and the main metadata server is switched to the slave metadata server to receive the file request sent by the client side under the condition that the main metadata server fails.

In an alternative embodiment of the present invention,

Illustratively, in order to improve the data access efficiency and user experience of the distributed storage system, a layer of Proxy Service (OSD) module design is introduced between the client and the Object Storage Device (OSD). The proxy service is used as a middle layer between the client and the OSD, and the access delay of the client is effectively reduced and the overall performance of the system is improved through key technologies such as request conversion, request routing, data caching and prefetching. The design and implementation steps of the proxy service module will be described in detail below.

The primary task of the proxy service is to convert a client's file request into an object request. In conventional file systems, clients access data through file paths and offsets. Whereas in an object storage system, data is organized as individual objects, each object having a unique Object ID (OID). In order to achieve unification of the two access modes, the proxy service needs to maintain a mapping relation of files to objects.

Specifically, after receiving a file request from a client, the proxy service first queries metadata information of a file according to a file path. The metadata records the attribute of the file and the object list and OID contained in the file. The proxy service extracts an object list corresponding to the request file from the metadata, and then calculates an object range related to the request according to the offset and the length of the request.

For example, the client initiates a request to read file/foo/bar.txt, offset of 1000 and length of 2000. The proxy service queries the metadata and finds that the file contains 3 objects obj1 (0-1023), obj2 (1024-2047) and obj3 (2048-3071). Based on the offset and length, the proxy service determines that the request involves two objects obj2 and obj3, with obj2 offset of 1000-1024=976, obj3 offset of 0, and lengths 1024 and 976, respectively.

Through the calculation, the proxy service converts the original file request into two object requests, namely 976-2047 bytes of obj2 are read, and 0-975 bytes of obj3 are read. In this way, the file request is mapped to an object request in the object storage system, and the data object on the OSD is directly accessible.

After converting the request, the proxy service needs to route the request to the appropriate OSD node for execution. In large-scale distributed storage systems, data objects are typically distributed across different OSD nodes in multiple copies to provide high availability and load balancing. Therefore, the proxy service needs to select an optimal target OSD to process the request according to the data distribution policy and OSD state.

One common data distribution strategy is a data placement algorithm based on consistent hashing (Consistent Hashing). The algorithm maps the object ID and the OSD nodes into the same hash space, and each object is distributed to the nearest OSD node in the hash space according to the ID hash value. The method can realize the balanced distribution of the objects among the OSD nodes and minimize the cost of data migration when the OSD nodes are increased or decreased. The proxy service may employ a consistent hashing algorithm to determine the target OSD node for each object request.

In addition to the data distribution policy, the proxy service needs to consider the real-time status and load condition of OSD nodes when requesting routes. The proxy service may periodically obtain its status information from the OSD nodes, including storage capacity, I/O load, response delay, etc. When the request is routed, the proxy service preferentially selects the OSD nodes with lower load and faster response so as to achieve the aim of load balancing. Meanwhile, the proxy service can also consider the network topology structure to send the request to the OSD node closest to the client as much as possible, thereby reducing the network transmission overhead.

By comprehensively considering factors such as data distribution, OSD state, network topology and the like, the proxy service can reasonably route the request to the optimal OSD node, thereby improving the efficiency and performance of data access.

To further reduce access latency for clients, proxy services may implement a data caching mechanism internally. Since the proxy service is located between the client and the OSD, it can act as an intermediate caching layer, storing recently accessed hotspot data. When the data requested by the client hits in the cache of the proxy service, the data can be directly returned from the cache, thereby avoiding remote communication with the OSD and greatly reducing the response time of the request.

The proxy service cache may adopt a multi-level cache architecture, including two levels of memory cache and disk cache. The memory has smaller cache capacity but high speed and is used for storing the most frequently accessed data. The disk has larger cache capacity but slow speed and is used for storing secondary hot spot data. When the requested data is not hit in the memory cache, the proxy service searches in the disk cache, and if the requested data is not hit in the disk cache, the proxy service acquires the data from the OSD node. The acquired data is cached in the proxy service for use by subsequent requests.

The management policy of the cache has an important impact on the performance of the proxy service. The proxy service may employ various cache replacement algorithms, such as LRU (LEAST RECENTLY Used), LFU (Least Frequently Used), etc., to dynamically adjust the cache content according to the frequency and time of access of the data. When the cache space is insufficient, the replacement algorithm can select to eliminate the data which is not used for the longest time or has the lowest access frequency, so that the effectiveness of the cache is ensured. In addition, the proxy service can realize an expiration mechanism of the cache, and periodically clean out expired cache data, so that the problem of inconsistent data is avoided.

In an alternative embodiment of the present invention,

The method further comprises the steps of:

Illustratively, the proxy service periodically checks its cached free space to determine if pre-read data is needed. And comparing the cached free space with a preset threshold value. The preset threshold value represents the size of the free space triggering the pre-reading operation, and can be set according to the actual situation of the system.

The proxy service records the history access record of the client, including the information of the file accessed by the client, the access time, the access sequence and the like. The mode of the client accessing the data is determined by analyzing the historical access record of the client. Common access modes include sequential access modes and random access modes.

If the access mode of the client is a sequential access mode, that is, the client accesses the files sequentially according to the sequence of the files, the proxy service reads the subsequent files of the files into the cache sequentially when the client accesses the files. This can improve the access efficiency of the subsequent files.

If the access pattern of the client is a random access pattern, i.e. the client randomly accesses different files, the proxy service pre-reads based on the frequency with which the client accesses the files. And reading the files with the access frequency higher than the threshold value in the preset duration into a cache so as to improve the access efficiency of the files.

When the cache of the proxy service reads a file, the timestamp of the file is recorded. The timestamp represents the time at which the file was read into the cache. If the file is already present in the cache, its timestamp is updated to the latest read time.

When the proxy service receives the object request, the corresponding object to be accessed is searched in the cache first. If the corresponding object to be accessed exists in the cache, the timestamp of the object in the cache is compared with the timestamp of the corresponding object in the storage node. If the timestamp of the object in the cache is less than the timestamp of the corresponding object in the storage node, it indicates that the object in the cache has expired, and the latest object data needs to be read again from the storage node.

If the object in the cache has expired, the proxy service sends an object request to the storage node requesting to read the latest object data. After receiving the object request, the storage node reads the requested object data from the local storage and returns the request to the proxy service. After receiving the object data returned by the storage node, the proxy service updates the object data into the cache, and updates the timestamp of the corresponding object.

Through the above detailed technical scheme, the intelligent caching mechanism of the proxy service is realized. When the cache space reaches a preset threshold, the proxy service determines the mode of accessing data by the client by analyzing the historical access record of the client, and pre-reads files which are likely to be accessed frequently in advance into the cache according to the access mode. Meanwhile, the proxy service records the time stamp of the file when the file is read by the cache, and compares the time stamp of the object in the cache with the time stamp of the object in the storage node when the object request is processed, so that the object in the cache is always the latest data. If the object in the cache has expired, the latest object data is read again from the storage node and the cache is updated. The intelligent caching mechanism can improve the efficiency of data access, reduce the access times to the storage nodes, and ensure the consistency and the real-time performance of the data.

Fig. 2 is a schematic structural diagram of a cluster file distributed management system according to an embodiment of the present invention, as shown in fig. 2, where the system includes:

In a third aspect of an embodiment of the present invention,

There is provided an electronic device including:

A processor;

a memory for storing processor-executable instructions;

In a fourth aspect of an embodiment of the present invention,

The present invention may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A method for distributed management of cluster files, comprising:

In response to the master metadata server receiving a file request sent by a client, implementing the file request on a corresponding storage node based on a proxy service between the client and the storage node, wherein the file request comprises any one or more of a file creation request, a file writing request and a file reading request, and the master metadata server is switched to the slave metadata server to receive the file request sent by the client under the condition that the master metadata server fails;

Deploying a metadata server cluster, comprising:

Taking the candidate node with the most votes as a new leader node and taking other candidate nodes as new follower nodes, wherein the leader node is a master metadata server in a metadata server cluster, and the follower nodes are slave metadata servers in the metadata server cluster;

2. The method of claim 1, wherein storing the plurality of objects on the plurality of storage nodes in the file system, respectively, based on the consistent hashing algorithm comprises:

The method further comprises the steps of:

3. The method of claim 2, wherein the metadata server generating a global load view based on the load index data on each storage node and determining a load status of each storage node, and selecting to migrate the object on the storage node based on the load status of each storage node, comprises:

4. The method of claim 1, wherein in response to the primary metadata server receiving a file request sent by a client, implementing the file request on a corresponding storage node based on a proxy service between the client and the storage node, comprising:

5. The method according to claim 4, wherein the method further comprises:

6. A cluster file distributed management system for implementing the method of any of the preceding claims 1-5, comprising:

7. An electronic device, comprising:

A processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 5.

8. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 5.