
CN109586989B - State checking method, device and cluster system - Google Patents


Info

Publication number
CN109586989B
Authority
CN
China
Prior art keywords
state
node
status
thread
checking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710901666.8A
Other languages
Chinese (zh)
Other versions
CN109586989A (en
Inventor
鲁振华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201710901666.8A
Publication of CN109586989A
Application granted
Publication of CN109586989B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Environmental & Geological Engineering (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a state checking method, a state checking device, and a cluster system. The state checking method comprises: creating a state check thread after a main thread is started; and, after the node executing the main thread receives a message for state detection, having the state check thread respond to that message. At least one embodiment of the application can prevent the service load from affecting the availability detection result.

Description

State checking method, device and cluster system
Technical Field
The present invention relates to the field of computers, and in particular, to a method, an apparatus, and a cluster system for status check.
Background
The availability of a computer system is measured by its reliability and maintainability. The mean time between failures (MTBF) is a measure of the system's reliability, and the mean time to repair (MTTR) is a measure of its maintainability.
High availability of services is currently usually achieved by building a high availability (HA) cluster. High availability refers to the ability of a program, a service, or a system to perform its functions without interruption.
A high availability cluster (also referred to as an HA cluster) includes a plurality of service nodes, some of which are active and carry traffic (master nodes) and some of which are on standby (standby nodes). When a master node fails, the system activates a standby node, which automatically takes over from the master node and provides service; the original master node is demoted to a standby node, or a new standby node is set up.
Availability detection for the master node is one of the key steps in this process. It may also be referred to as a health check, that is, determining the health status of the master node, such as whether a program, a service, or the system on the master node is currently faulty or operating normally. Misjudging the health status of the master node, or failing to judge it in time, greatly reduces the availability of the service.
Redis is an open-source, high-performance key-value cache database system, and a Redis cluster comprises a plurality of service nodes. When a service node in a Redis cluster receives a heartbeat packet, it replies with specified information: for example, it replies "pong" to a ping command, and it replies to an info command with the configuration parameters and statistics of the server, such as the server version, the operating system, and the uptime.
By utilizing the above characteristics of the service node, the operation and maintenance system in the Redis cluster can perform Redis availability detection, which generally includes the following technical solutions:
Heartbeat detection: the health state of a service node is detected by periodically sending heartbeat packets to it. If no reply is received to any of the heartbeat packets, the service node is considered to have failed.
Redis Sentinel: the sentinel sends a heartbeat packet every second to obtain the health state of the current service node, and sends an info command every N seconds (N is a positive integer) to obtain the Redis node configuration state for subsequent fault handling.
In these technical solutions, the health state of a Redis service node is judged from the information it returns in response to a ping command (or other heartbeat packet) or an info command. These solutions have the following defect:
availability detection may be affected by the current load of the Redis service. For example, when the Redis service is busy (such as when it is executing a long-running task), the Redis service node cannot respond to the detection command or heartbeat message in time, so it is easily misjudged as faulty, or a real fault cannot be identified in time.
Disclosure of Invention
The application provides a state checking method, a state checking device and a cluster system, which can avoid the influence of service conditions on an availability detection result.
The technical scheme is as follows.
A status checking method, comprising:
after a main thread is started, a state check thread is created;
and after the node executing the main thread receives a message for state detection, the state check thread responds to the message for state detection.
The node executing the main thread may be a main node of the cluster.
Wherein, after creating the state check thread, the method may further include:
the state check thread monitors a state check port of a node executing the main thread; wherein the status check port is configured to receive the message for status detection.
Wherein the state check thread responding to the message for state detection may include:
the state check thread acquires state information of the node;
and responds to the message for state detection with the acquired state information.
Wherein, after creating the state check thread, the method may further include:
the state check thread periodically checks the state of the node and generates the state information of the node according to the check result.
Wherein the periodically checking the state of the node by the state checking thread may include:
the status check thread performs one or more of the following operations:
performing a disk read/write test on the node at intervals of a first time length;
checking, at intervals of a second time length, whether the permissions, size, and integrity of the node's directories and files are normal;
checking, at intervals of a third time length, whether a predetermined auxiliary process exists on the node.
Wherein the events created by the state check thread may include a port event and a timer event;
the port event may be set to acquire and feed back the state information of the node after the node receives the message for state detection;
the timer event may be set to periodically check the state of the node and generate the state information of the node according to the check result.
Wherein the state information may be an identifier indicating a state of the node.
A status checking device comprising: a processor and a memory;
the memory is used for storing programs for providing services; the program for providing a service, when read and executed by the processor, performs the following operations:
starting a main thread;
the main thread, when executed by the processor, performs the following operation:
creating a state check thread;
the state check thread, when executed by the processor, performs the following operation:
after the node where the processor is located receives a message for state detection, responding to the message for state detection.
A status checking device comprising:
the main service module is used for establishing a state checking module after being started;
the state checking module is used for feeding back the message for state detection after the node where the state checking device is located receives the message for state detection.
A status checking method, comprising:
after a main node in the cluster is started, executing a state checking process;
and after the main node receives the message for carrying out the state detection, the state checking process feeds back the message for carrying out the state detection.
A cluster system, comprising: one or more nodes; wherein at least one master node exists in the one or more nodes;
a state checking device;
after the master node is started, starting the state checking device;
the state checking device is used for feeding back the message for state detection after the master node receives the message for state detection.
In at least one embodiment of the present application, a dedicated state check thread is set up in the node being probed to handle state detection, so that state detection is not interfered with by the current service load and can be answered in time, which avoids delays and false alarms.
In an implementation manner of the embodiment of the present application, a dedicated port is listened on for state detection, so that interference from the service load can be further avoided.
In an implementation manner of the embodiment of the application, when the state detection is fed back, the state information is fed back, which is more helpful for knowing the real state of the node.
In an implementation manner of the embodiment of the application, the dedicated state check thread is used for periodically generating the state information, so that normal processing of the service is not affected, and when the message for performing state detection is received, the state information of the generated node can be directly returned, thereby avoiding delay caused by temporarily collecting the state information; when the state information is generated, various states of the nodes can be comprehensively acquired according to a plurality of examination results.
In an implementation manner of the embodiment of the application, different states of the node are represented by different identifiers in the state information, so that the overhead of transmitting the state information can be reduced, and the state of the node can be refined into various different situations.
Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.
Drawings
FIG. 1 is a flow chart of a status checking method according to the first embodiment;
FIG. 2 is a diagram illustrating a main thread and a health check thread in an example of one embodiment;
FIG. 3 is a diagram illustrating operation of a health check thread in an example of one embodiment;
FIG. 4 is a schematic view of a state check apparatus according to a third embodiment;
FIG. 5 is a schematic diagram of a cluster system according to a fifth embodiment.
Detailed Description
The technical solutions of the present application will be described in more detail below with reference to the accompanying drawings and embodiments.
It should be noted that, if not conflicting, different features in the embodiments and implementations of the present application may be combined with each other and are within the scope of protection of the present application. Additionally, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
In one configuration, a computing device performing status checking may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory (memory).
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. The memory may include one or more modules.
Computer-readable media include both permanent and non-permanent, removable and non-removable storage media that can implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
In an embodiment, as shown in fig. 1, a status checking method includes steps S110 to S120:
S110: after the main thread is started, a state check thread is created;
S120: after the node executing the main thread receives a message for state detection, the state check thread responds to the message for state detection.
In this embodiment, the node includes a main thread for processing service requests and a state check thread for responding to state detection. The state check thread is independent of the main thread that performs service processing, so it is not affected by the service load.
In the conventional scheme, the main thread handles both service processing and availability detection: service requests and availability detection requests are placed in the same queue and wait for the main thread to process them. Therefore, when the service is busy, if many service requests are queued ahead of an availability detection request, or if one service request takes a long time to process, the availability detection request cannot be processed in time, which may lead to misjudgment or late fault discovery. One workaround is to lengthen the detection interval, combine the results of several detections, or detect multiple nodes simultaneously so as to avoid misjudgment, but this easily delays fault discovery.
In this embodiment, because a separate state check thread responds to state detection, availability detection requests and other messages for state detection do not need to queue together with service requests. Therefore, even when the service is busy, only the processing efficiency of the main thread is affected and the response speed of state detection is not, so that misjudgment and late fault discovery can be avoided.
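The difference between the two designs can be illustrated with a toy latency model. The following Python sketch is an illustration only: the function names, the backlog values, and the 5 ms probe-handling cost are assumptions and do not come from the application.

```python
from typing import Sequence


def probe_latency_shared_queue(pending_service_ms: Sequence[float],
                               probe_cost_ms: float = 5.0) -> float:
    """Conventional scheme: the probe waits behind every queued service request."""
    return sum(pending_service_ms) + probe_cost_ms


def probe_latency_dedicated_thread(pending_service_ms: Sequence[float],
                                   probe_cost_ms: float = 5.0) -> float:
    """Embodiment: the state check thread answers without queuing behind service work."""
    return probe_cost_ms


# Hypothetical backlog: one long-running command (2 s) followed by two short ones.
backlog = [2000.0, 3.0, 3.0]
timeout_ms = 1000.0  # hypothetical probe timeout used by the detecting party

shared = probe_latency_shared_queue(backlog)
dedicated = probe_latency_dedicated_thread(backlog)
shared_misjudged = shared > timeout_ms        # probe would time out: false alarm
dedicated_misjudged = dedicated > timeout_ms  # probe answered within the timeout
```

In this model the shared queue turns a single long-running request into a missed probe, while the dedicated thread's latency does not depend on the backlog at all.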
In this embodiment, the message for state detection may come from, but is not limited to, an operation and maintenance system, for example from a client of the operation and maintenance system.
The message for status detection may be a predetermined message, a predetermined type of message, or a message containing a predetermined keyword.
The message for performing the status detection may include, but is not limited to, an availability detection request, such as a heartbeat message, an info command, and the like for performing the availability detection, and may further include other request messages or command messages configured to obtain the status of the node.
In this embodiment, the main thread may create the state check thread after initialization; in addition, the main thread may recreate the state check thread if the state check thread terminates abnormally.
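A minimal sketch of this create-and-recreate behaviour, using Python's standard threading module, might look as follows. The names start_state_check_thread, supervise, and flaky_state_check are hypothetical, and a real main thread would interleave the liveness check with its service work rather than sleep in a loop.

```python
import threading
import time


def start_state_check_thread(target) -> threading.Thread:
    """Create and start the state check thread (daemon, so it ends with the program)."""
    thread = threading.Thread(target=target, name="state-check", daemon=True)
    thread.start()
    return thread


def supervise(target, rounds: int, poll_interval: float = 0.05) -> int:
    """Run by the main thread: recreate the state check thread whenever it has died.

    Returns how many times the thread was created in total.
    """
    created = 1
    thread = start_state_check_thread(target)
    for _ in range(rounds):
        time.sleep(poll_interval)          # stand-in for the main thread's service work
        if not thread.is_alive():          # abnormal termination detected
            thread = start_state_check_thread(target)
            created += 1
    return created


attempts = []
stop = threading.Event()


def flaky_state_check() -> None:
    attempts.append(time.monotonic())
    if len(attempts) == 1:
        return        # simulate an abnormal termination on the first run
    stop.wait()       # later runs stay alive until told to stop


created = supervise(flaky_state_check, rounds=3)
stop.set()
```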
Generally, when a program is started, the operating system creates a process and a thread starts running immediately, which is generally called the main thread of the program. For example, when a service node of a Redis cluster is started or activated, the program that provides the Redis service is run, which starts the main thread of that program. The code of the main thread may include code for creating the state check thread; executing that code creates the state check thread.
In this embodiment, if the program to which the main thread belongs crashes while running, the state check thread stops as well. In that case the message for state detection receives no response, and after one or more unanswered attempts, the party performing the state detection may determine that the node is unavailable.
In this embodiment, the node may be a master node in a cluster; a standby node in the cluster may start a main thread when it is activated and create the above-described state check thread.
In this embodiment, the node may be a hardware device or a virtual device.
The method of the present embodiment may be, but is not limited to, applied to a master node of a Redis service.
In one implementation, the node executing the main thread is a master node of a cluster.
In conventional solutions, the master node of the cluster does not specifically handle the availability probe request or other messages for status probing, but rather handles it as if it were a service request.
In this implementation, a processing thread is specifically set up for the message for performing the status detection in the master node of the cluster, so as to respond to the status detection.
In an embodiment of this implementation, the cluster may be, but is not limited to, a Redis cluster.
In an implementation of this implementation, when feeding back a message for performing status detection, the status information of the node may be obtained for feedback.
In a conventional scheme, service nodes of a cluster do not provide state information and only give a conventional response to an availability detection request, for example replying pong to ping and returning the corresponding data to an info command; neither reply describes the state of the node. Accordingly, the operation and maintenance system can only guess whether the node is working normally from whether a normal response arrives, and cannot really know the state of the node.
In this embodiment, state information is added for the cluster master node and is used to respond to the message for state detection, which helps reveal the real condition of the node.
In this embodiment, the state information may be generated by the state check thread itself, or may be generated by another thread or process and provided to the state check thread.
In one implementation, the state check thread may learn that the node has received a message for state detection by listening on the state check port.
That is, after creating the state check thread, the method may further include:
the state check thread monitors a state check port of a node executing the main thread; wherein the status check port is configured to receive the message for status detection.
In this implementation, in the cluster where the node executing the main thread is located, the component that forwards requests (which may be software, hardware, or a combination of both) may split the traffic: service requests are sent to the main port, and messages for state detection are sent to the state check port. The main thread listens on the main port. This process may be transparent to the party initiating the state detection; that is, the party may send the state detection message in the conventional manner without distinguishing between service requests and messages for state detection.
In this implementation, a dedicated port is set up to handle the message for state detection, so that mutual interference between the service and state detection can be further avoided.
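As a rough sketch of a state check thread listening on its own port, the following Python code uses only the standard socket and threading modules. The wire format (a "PING" line answered by a one-line state string), the function names, and the use of an OS-assigned port are assumptions made for illustration; the application does not specify them.

```python
import socket
import threading


def state_check_loop(listener: socket.socket, get_state) -> None:
    """Runs in the state check thread: answer every probe on the dedicated port."""
    while True:
        try:
            conn, _ = listener.accept()
        except OSError:
            return  # accept fails once the listener has been closed
        with conn:
            conn.recv(64)  # read the probe, e.g. b"PING\n"; content is not inspected here
            conn.sendall(get_state().encode() + b"\n")


def start_state_check_port(get_state, host: str = "127.0.0.1", port: int = 0):
    """Open the state check port and start the thread that listens on it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, port))  # port 0 asks the OS for any free port
    listener.listen()
    thread = threading.Thread(target=state_check_loop,
                              args=(listener, get_state), daemon=True)
    thread.start()
    return listener, thread


def probe(host: str, port: int, timeout: float = 1.0) -> str:
    """What a detecting party might do: send a probe and read the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"PING\n")
        return sock.recv(64).decode().strip()


# The lambda stands in for "read the most recently generated state information".
listener, _thread = start_state_check_port(lambda: "0")
host, port = listener.getsockname()
reply = probe(host, port)
```

Because the state check thread blocks only on its own listener, a long-running request on the main port does not delay the reply.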
In other implementations, both service requests and messages for state detection may be sent to one port, in which case the main thread and the state check thread may listen on the same port. When the state check thread detects a message for state detection, it acquires and feeds back the state information, and it discards any other request it detects; the main thread discards any message for state detection it detects.
In one implementation, the feedback of the message for performing the status detection by the status checking thread may include:
the state check thread acquires the state information of the node;
and feeding back the message for state detection by using the acquired state information.
In this implementation, the state check thread acquiring the state information may mean reading the state information from a predetermined location, or may mean that the state check thread itself checks the node state, or triggers another thread or process to do so, and generates the state information.
In this implementation, the state information may be generated periodically or when a preset trigger condition is met. The state information may already have been generated before the message for state detection is received; if not, it may be generated on the spot when the message is received. When the state information is generated in advance, there is no response deadline to meet, so a more comprehensive state check can be performed.
In this implementation, one or more pieces of state information may be kept. For example, newly generated state information may directly overwrite the existing state information, so that only a single, latest piece exists. Alternatively, newly generated state information may be retained together with the existing state information, so that several pieces are available; the piece whose generation time is closest to the current time (that is, the latest state information) can then be fed back, so that late fault discovery is avoided as far as possible.
In this implementation, one or more pieces of state information may be fed back. Whether to feed back one or several pieces may also be decided case by case: for example, the latest state information is fed back when a first type of message for state detection is received, and several pieces of state information are fed back when a second type of message for state detection is received.
In other implementations, for the message for performing the status detection, the status information may not be fed back, and only the regular response is fed back.
In one implementation, after the creating the status checking thread, the method may further include:
and the state checking thread periodically checks the state of the node and generates the state information of the node according to the checking result.
In the implementation mode, the state information of the node is generated by the state check thread at regular time, and the main thread or other threads or processes do not need to participate, so that the influence of the processing of the detection event on the service processing can be avoided.
In this implementation, the generated state information may be stored in a predetermined location in the node or other device for reading when receiving a message for performing state detection; only the newly generated state information may be retained during the storage, or a predetermined number of pieces of state information or a batch of state information within a predetermined time period may be retained.
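One possible way to hold the generated state information is a small bounded, thread-safe store, sketched below in Python. The class name StateStore, its methods, and the use of an in-memory deque as the "predetermined location" are assumptions for illustration; keep=1 corresponds to overwriting with the latest value, and keep=N to retaining a predetermined number of entries.

```python
import threading
import time
from collections import deque
from typing import Deque, List, Optional, Tuple


class StateStore:
    """Keeps the last `keep` pieces of state information with their generation times."""

    def __init__(self, keep: int = 1) -> None:
        self._records: Deque[Tuple[float, str]] = deque(maxlen=keep)
        self._lock = threading.Lock()

    def put(self, state: str, timestamp: Optional[float] = None) -> None:
        """Called by the periodic check after it has generated new state information."""
        ts = time.time() if timestamp is None else timestamp
        with self._lock:
            self._records.append((ts, state))  # the oldest entry is dropped when full

    def latest(self) -> Optional[str]:
        """Called when a message for state detection arrives."""
        with self._lock:
            return self._records[-1][1] if self._records else None

    def history(self) -> List[Tuple[float, str]]:
        """All retained entries, oldest first."""
        with self._lock:
            return list(self._records)


store = StateStore(keep=3)
for ts, state in [(1.0, "0"), (2.0, "0"), (3.0, "1"), (4.0, "0")]:
    store.put(state, timestamp=ts)

latest = store.latest()
kept = store.history()
```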
In this implementation, the periodic state check and the feedback of the message for state detection may be two parallel, staggered, and non-interfering processes; maintaining periodic status checks regardless of whether a message for status detection is received; after receiving the message for status detection, the generated status information can be fed back without interaction with the status checking process.
In other implementations, the state information of the node may also be periodically generated by the main thread, or other process, or other thread.
In other implementation manners, the state detection of the node may be triggered after the message for performing the state detection is received, and the state information may be generated.
In this implementation, the periodically checking the state of the node may include:
the state check thread performs one or more of the following operations:
performing a disk read/write test on the node at intervals of a first time length;
checking, at intervals of a second time length, whether the permissions, size, and integrity of the node's directories and files are normal;
checking, at intervals of a third time length, whether a predetermined auxiliary process exists on the node.
In a conventional scheme, when a main thread processes a message for performing state detection, in order to respond in time, generally only the simplest information is fed back, for example, in a Redis cluster, a service node only feeds back "Pong" after receiving a Ping command to indicate that the service node is available; in the present implementation, the state information may include a disk state, a file state, an auxiliary process state, and the like of the service node, so that more comprehensive state information can be provided.
The first, second, and third time lengths may be all the same, partly the same, or all different.
The disk reading and writing can be performed according to a preset rule, for example, several bytes of predetermined information, such as "FFFF", are written, several bytes of content are read from a random position of the hard disk, and the like; the main purpose of disk reading and writing is to determine whether the current read-write state of the disk is normal.
When the permissions of directories and files are checked, the check may be made against the configuration file that the user supplied when configuring file permissions on the node; any file permission that is inconsistent with the configuration file may be considered abnormal.
When the size of the directory file is checked, whether the size of the directory file is in a reasonable range or a preset range can be judged, and if not, the size of the directory file is considered to be abnormal.
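The disk and file checks described above could be sketched as follows. This is a minimal illustration assuming a POSIX file system: the function names, the write-then-read-back test, the expected mode 0o600, the size bounds, and the file name dump.rdb are all hypothetical choices rather than details from the application.

```python
import os
import stat
import tempfile


def check_disk_rw(directory: str, payload: bytes = b"FFFF") -> bool:
    """Write a few bytes to a scratch file in `directory` and read them back."""
    try:
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # push the data to the device, not just the cache
            with open(path, "rb") as f:
                return f.read() == payload
        finally:
            os.remove(path)
    except OSError:
        return False


def check_permissions(path: str, expected_mode: int) -> bool:
    """Compare the file's permission bits with the configured value."""
    try:
        return stat.S_IMODE(os.stat(path).st_mode) == expected_mode
    except OSError:
        return False


def check_size(path: str, min_bytes: int, max_bytes: int) -> bool:
    """Check that the file size lies within a preset range."""
    try:
        return min_bytes <= os.path.getsize(path) <= max_bytes
    except OSError:
        return False


with tempfile.TemporaryDirectory() as workdir:
    data_file = os.path.join(workdir, "dump.rdb")  # hypothetical data file
    with open(data_file, "wb") as f:
        f.write(b"x" * 128)
    os.chmod(data_file, 0o600)

    disk_ok = check_disk_rw(workdir)
    perm_ok = check_permissions(data_file, 0o600)
    size_ok = check_size(data_file, 1, 1024)
    size_too_big = not check_size(data_file, 1, 64)
```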
Wherein the auxiliary process may include one or more of: a monitoring data collection process; a daemon process; and the like.
The periodically checking the state of the node may further include periodically checking other predetermined conditions, such as, but not limited to, one or more of: whether there is a sudden change in data volume; whether the node has restarted, judged from a change of the process identifier; and whether the main thread has entered an infinite loop, judged from the count of commands processed by the main thread.
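The last two conditions can be sketched with simple comparisons between consecutive checks. In the Python sketch below, the class MainThreadWatch, the function restarted, and in particular the rule "counter unchanged while commands are pending" are assumptions; the application only states that the judgment is based on the main thread's command count.

```python
from typing import Optional


class MainThreadWatch:
    """Flags a possible infinite loop when commands are pending but the
    processed-command counter has not advanced since the previous check."""

    def __init__(self) -> None:
        self._last_count: Optional[int] = None

    def check(self, processed_count: int, pending: int) -> bool:
        """Return True if the main thread looks healthy at this check."""
        stalled = (self._last_count is not None
                   and processed_count == self._last_count
                   and pending > 0)
        self._last_count = processed_count
        return not stalled


def restarted(previous_pid: Optional[int], current_pid: int) -> bool:
    """A changed process identifier between two checks suggests a restart."""
    return previous_pid is not None and previous_pid != current_pid


watch = MainThreadWatch()
results = [
    watch.check(processed_count=100, pending=0),  # first check: nothing to compare
    watch.check(processed_count=150, pending=2),  # counter advanced
    watch.check(processed_count=150, pending=3),  # no progress with work pending
    watch.check(processed_count=150, pending=0),  # idle, so no progress is expected
]
same_process = restarted(4321, 4321)
new_process = restarted(4321, 5000)
```

A single long-running command would also stall the counter, so in practice the check interval or the number of consecutive stalled checks would need tuning before treating a stall as an infinite loop.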
In practice, the means used to check the disk state and the file state can be designed as needed. The state items to be checked, such as the state of hardware other than the disk, can likewise be designed as needed; that is, the state check thread can customize which state items are checked, which items are included in the state information, which parameters are used to check each item, and so on.
Wherein, the generating of the state information of the node according to the checking result may include: state information is generated based on the results of the examination of the various items, including but not limited to one or more of disk state, file state, auxiliary process state, etc. The generated state information may directly include the inspection results of each item, or may be a total result of the inspection results of each item summarized according to a predetermined rule, for example, assuming that the predetermined rule is:
if the auxiliary process state is abnormal and the other items are normal, the node is judged to be available; if the disk state or the file state is abnormal, the node is judged to be unavailable.
Status information indicating whether the node is available or unavailable may be obtained according to the result of the examination of the items and a predetermined rule.
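The example rule above can be written as a small function. In this sketch the dictionary keys "disk", "file", and "aux_process" and the return strings are hypothetical names chosen for illustration.

```python
from typing import Mapping


def summarize(results: Mapping[str, bool]) -> str:
    """Apply the example rule: an abnormal disk or file state makes the node
    unavailable; an abnormal auxiliary process alone does not."""
    if not results["disk"] or not results["file"]:
        return "unavailable"
    return "available"


all_normal = summarize({"disk": True, "file": True, "aux_process": True})
aux_abnormal = summarize({"disk": True, "file": True, "aux_process": False})
disk_abnormal = summarize({"disk": False, "file": True, "aux_process": True})
file_abnormal = summarize({"disk": True, "file": False, "aux_process": True})
```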
The check result of each item need not be limited to "normal" and "abnormal"; "abnormal" may be broken down further. For example, besides "read and write both normal", the check result of the disk state may include three abnormal cases, "read and write both abnormal", "read abnormal", and "write abnormal", giving four types in total. Likewise, besides "normal", the check result of the file state may include cases where one or more parameters are abnormal, and the check result of the auxiliary process state may include cases where one or more auxiliary processes are abnormal.
Accordingly, when the check results are more varied, the predetermined rule and the node states can be enriched correspondingly. For example, the state of the node need no longer be limited to "available" and "unavailable", but can be set to various states according to the different check results of the items, provided that exactly one state of the node can be determined from the check results and the predetermined rule.
In this implementation, the events created by the state check thread may include a port event and a timer event;
the port event is set to obtain and feed back the state information of the node after the node receives the message for state detection;
the timer event is set to periodically check the state of the node and to generate the state information of the node according to the check result.
In one implementation, the state information may be an identifier that represents the state of the node.
In this implementation, the identifier may be a number, a letter, a character string, or the like; for example, "0" indicates that the node status is healthy or available, and "1" indicates that the node status is unhealthy or unavailable.
In this implementation, the node states may be two or more, and when there are only two, the node states may be "healthy" (or "available") and "unhealthy" (or "unavailable"), respectively.
In this implementation, when there are many inspection items, there may be more than two node states; in this case, more flags may be set to indicate various possible situations, such as "0" indicating that the node is healthy or available, "1" indicating that the node is available but the read-write status is abnormal, "2" indicating that the node is available but the directory status is abnormal, and "3" indicating that the node is available but the auxiliary process is abnormal; "4" means that the node is unavailable or unhealthy, etc.
In this implementation, the corresponding relationship between the identifier and the node state may be pre-stored in the state check thread or at other predetermined locations, and the node state may be converted into a corresponding identifier as state information according to the corresponding relationship when generating the state information.
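A minimal sketch of such a correspondence, using the five example identifiers given above. The names `NodeState` and `to_identifier` are hypothetical, and the choice to report "4" when more than one item is abnormal is an assumption made here so that every combination of results maps to exactly one state.

```python
from enum import IntEnum


class NodeState(IntEnum):
    """Identifier values follow the example: "0" through "4"."""
    HEALTHY = 0
    READ_WRITE_ABNORMAL = 1
    DIRECTORY_ABNORMAL = 2
    AUX_PROCESS_ABNORMAL = 3
    UNAVAILABLE = 4


def to_identifier(disk_ok: bool, directory_ok: bool, aux_ok: bool) -> str:
    """Convert per-item results into the identifier used as state information."""
    failures = [
        state
        for ok, state in (
            (disk_ok, NodeState.READ_WRITE_ABNORMAL),
            (directory_ok, NodeState.DIRECTORY_ABNORMAL),
            (aux_ok, NodeState.AUX_PROCESS_ABNORMAL),
        )
        if not ok
    ]
    if not failures:
        state = NodeState.HEALTHY
    elif len(failures) == 1:
        state = failures[0]
    else:
        # Assumption: several abnormal items together mean "unavailable".
        state = NodeState.UNAVAILABLE
    return str(int(state))
```

Keeping the table in one enumeration means the state check thread and any client that interprets the identifier can share the same definition.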
In other implementations, the state information may also be text information or other forms of information for describing the state of the node.
The above embodiment is described below by way of an example. The example is applied to a Redis cluster, where the message for state detection is an availability detection request, and the node executing the main thread is a service node in the Redis cluster, generally a master node.
As shown in fig. 2, in the present example, when the Redis service on the current service node is started (i.e. the main thread is started), the main thread, after initialization, creates a status check thread, referred to as a health check thread in this example, for Redis availability detection.
In this example, the health check thread listens on the designated health check port and responds to state detection events; that is, it feeds back the state information of the node in response to an availability detection request sent by a client. The health check thread may create an event loop associated with the health check, register the health check events, and start the event loop.
In this example, there are two possibilities for starting the Redis service on the current service node, one of which may be that the current service node is a master node of the Redis service, and the Redis service is started when the current service node is started or restarted; another possibility is that the current service node is a standby node of the Redis service, and the Redis service is started after being activated.
In this example, the main thread is further configured to listen on the main port, create an event loop, register service-related events, and start the event loop; the event loop of the main thread need not include any health check content.
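The separation between the main thread and the health check thread might be sketched as follows. This is an illustration under stated assumptions rather than the Redis implementation: the function name `start_health_check_thread`, the `get_state` callback, the loopback address, and the one-request-one-reply wire format are all hypothetical.

```python
import socket
import threading


def start_health_check_thread(get_state, host="127.0.0.1", port=0):
    """Bind a dedicated health check port and serve it from its own thread.

    Called by the main thread after initialization. Returns the thread and
    the bound port; port=0 lets the operating system pick a free port.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()

    def serve():
        # Runs only in the health check thread; the main thread's event
        # loop never sees these connections.
        with server:
            while True:
                try:
                    conn, _ = server.accept()
                    with conn:
                        conn.recv(1024)  # the probe; its content is ignored
                        conn.sendall(get_state().encode())
                except OSError:
                    continue  # a failed probe must not stop the thread

    thread = threading.Thread(target=serve, name="health-check", daemon=True)
    thread.start()
    return thread, server.getsockname()[1]
```

Because the probe is answered on its own port by its own thread, a busy or blocked main thread does not delay the reply, and probes add no work to the service event loop.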
In this example, after the Redis cluster receives a client request, it sends the request to the health check port if the request is an availability detection request; otherwise, for example when the request is a service request, it sends the request to the main port. As shown in FIG. 3, the health check events registered by the health check thread in this example may include two types of events: port events and timer events.
The port event is used for responding to the availability detection request of the client and returning the latest state information of the Redis current service node collected by the timer event;
the timer event is used for checking the overall health state of the current service node and generating state information according to the checking result.
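The interplay of the two event types can be sketched with Python's `asyncio`, which here stands in for the event loop of the health check thread. The class name `HealthCheckLoop`, the line-based probe format, and the initial value `"unknown"` are assumptions for illustration.

```python
import asyncio


class HealthCheckLoop:
    """One event loop carrying a timer event and a port event."""

    def __init__(self, check, interval=1.0):
        self._check = check        # callable returning the state string
        self._interval = interval  # seconds between two checks
        self.latest = "unknown"    # hypothetical value before the first check

    async def _timer_event(self):
        # Timer event: check the node periodically and keep the result.
        while True:
            self.latest = self._check()
            await asyncio.sleep(self._interval)

    async def _port_event(self, reader, writer):
        # Port event: answer a probe with the most recent stored result,
        # without running any check on the request path.
        await reader.readline()
        writer.write(self.latest.encode() + b"\n")
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def start(self, host="127.0.0.1", port=0):
        server = await asyncio.start_server(self._port_event, host, port)
        timer = asyncio.create_task(self._timer_event())
        return server, timer
```

The port event only reads what the timer event last wrote, so the cost of answering a probe stays constant however expensive the checks themselves are.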
In this example, checking the overall health state of the current service node may specifically include:
disk check: performing a disk read and write once every predetermined length of time, and judging whether the disk read-write state of the current service node is normal;
file check: checking, every predetermined length of time, whether states such as the permissions, size and integrity of the directory files in the current service node are normal;
process check: checking, every predetermined length of time, whether a predetermined auxiliary process exists;
other checks: checking other specified state items every predetermined length of time.
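The checks listed above might look like the following sketch. All three function names and their parameters are hypothetical; the file check covers only permissions and size (no integrity check), and the process check relies on signal 0, which is POSIX-only.

```python
import os
import tempfile


def check_disk(directory: str) -> bool:
    """Write a small probe file, read it back, and compare the contents."""
    payload = b"health-check-probe"
    try:
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # force the write through to the disk
            with open(path, "rb") as f:
                return f.read() == payload
        finally:
            os.remove(path)
    except OSError:
        return False


def check_file(path: str, max_size: int) -> bool:
    """Check that a file exists, is readable and writable, and is not too large."""
    try:
        size = os.path.getsize(path)
    except OSError:
        return False
    return os.access(path, os.R_OK | os.W_OK) and size <= max_size


def check_process(pid: int) -> bool:
    """POSIX only: signal 0 tests for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Each function returns a plain boolean so that the results can be fed into whatever predetermined rule produces the overall node state.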
In this example, the state information generated by the timer event may be stored in a predetermined location in the current service node or in other devices, and only the latest state information may be stored during storage, that is, the state information is generated and then replaces the original state information; when storing, it is also possible to store a plurality of pieces of state information and record the generation time of each piece of state information.
In this example, after receiving an availability detection request from a client, a port event may read the latest status information from the predetermined location for feedback; when only the latest state information is stored, reading the state information for feedback; when multiple copies of state information are stored, the latest copy can be selected for feedback, and multiple copies of state information can also be fed back according to the requirement in the availability detection request.
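Both storage options, keeping only the latest state information or keeping several pieces with their generation times, can be sketched with one bounded buffer. The class name `StateStore` and its methods are hypothetical; the lock is there because the sketch assumes the timer event and the port event may run in different threads.

```python
import threading
import time
from collections import deque


class StateStore:
    """Holds the last `history` pieces of state information with timestamps."""

    def __init__(self, history=1):
        # history=1 reproduces "only the latest state information is stored":
        # each new record replaces the previous one.
        self._lock = threading.Lock()
        self._records = deque(maxlen=history)

    def put(self, state, now=None):
        stamp = time.time() if now is None else now
        with self._lock:
            self._records.append((stamp, state))

    def latest(self):
        """Return the newest state, or None if nothing has been stored."""
        with self._lock:
            return self._records[-1][1] if self._records else None

    def recent(self, count):
        """Return up to `count` newest (generation time, state) pairs, oldest first."""
        if count <= 0:
            return []
        with self._lock:
            return list(self._records)[-count:]
```

A port event would call `latest()` for an ordinary probe and `recent(n)` when the availability detection request asks for several pieces of state information.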
In this example, the Redis service enables a special port and a thread for status detection (i.e. availability detection, or health check), which can avoid the interference of the status detection process with the current service status and can also avoid the influence of the status detection process on the normal processing of the service.
In this example, the Redis service enables a special timer event for status check, and can provide comprehensive status information including information such as the hardware status of the current service node.
In a second embodiment, a status checking apparatus includes: a processor and a memory;
the memory is used for storing a program for providing service; the program for providing a service, when read and executed by the processor, performs the following operations:
starting a main thread;
the main thread, when executed by the processor, performs the following operation:
creating a state check thread;
the state check thread, when executed by the processor, performs the following:
after the node where the processor is located receives the message for state detection, feeding back the message for state detection.
In one implementation, the node where the status checking device is located, or the node where the processor is located, may be a master node of the cluster.
In this implementation, the cluster may be, but is not limited to, a Redis cluster.
In one implementation, the state checking thread, when executed by the processor, may further perform the following: listening on a state check port of the node where the processor is located; wherein the state check port is configured to receive the message for state detection.
In one implementation, feeding back the message for performing the status detection may include:
acquiring state information of a node;
and feeding back the message for state detection by using the acquired state information.
In one implementation, the state checking thread, when executed by the processor, may further perform the following:
and periodically checking the state of the node, and generating the state information of the node according to the checking result.
In this implementation, periodically checking the state of the node may include one or more of the following operations:
performing disk reading and writing on the node every a first time length;
checking whether the authority, the size and the integrity of the directory file of the node are normal or not every second time length;
checking whether a predetermined secondary process exists in the node every third length of time.
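Since each operation may have its own period (the first, second and third lengths of time), a single timer event can drive all of them by tracking when each one is next due. The following is a sketch under assumptions: the class `PeriodicChecks`, its methods, and the interval values in the usage note are hypothetical.

```python
import time


class PeriodicChecks:
    """Runs each registered check at its own interval."""

    def __init__(self):
        self._checks = []  # entries: [name, interval, func, next_due]
        self.results = {}  # latest result per check name

    def register(self, name, interval, func):
        # next_due starts at minus infinity so the first call runs the check.
        self._checks.append([name, interval, func, float("-inf")])

    def run_due(self, now=None):
        """Run every check whose interval has elapsed; return their names."""
        now = time.monotonic() if now is None else now
        ran = []
        for entry in self._checks:
            name, interval, func, next_due = entry
            if now >= next_due:
                self.results[name] = func()
                entry[3] = now + interval
                ran.append(name)
        return ran
```

With a disk check registered every 5 seconds, a file check every 30 seconds and a process check every 10 seconds, calling `run_due()` from a timer that fires once per second runs each check only when its own period has elapsed.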
In this implementation, the event created by the state check thread may include: port events and timer events;
the port event is set to acquire and feed back the state information of the node after the node receives the message for state detection;
the timer event is set to periodically check the state of the node and to generate the state information of the node according to the check result.
In one implementation, the state information may be an identifier indicating a state of the node.
In this embodiment, the operations performed by the main thread and the status check thread when read and executed by the processor correspond to steps S110 and S120 in the first embodiment, respectively; further details of the operations performed by the program can be found in the first embodiment.
In a third embodiment, a status checking apparatus, as shown in fig. 4, includes:
a main service module 31 for creating a status check module 32 after startup;
a status checking module 32 configured to feed back the message for state detection after the node where the status checking device is located receives the message for state detection.
In one implementation, the node where the status checking device is located may be a master node of a cluster.
In this implementation, the cluster may be, but is not limited to, a Redis cluster.
In one implementation, the status check module may be further configured to listen to a status check port of the executing node; wherein the status check port is configured to receive the message for status detection.
In one implementation, the feeding back the message for performing the status detection by the status checking module may include:
the state checking module acquires the state information of the node; and feeding back the message for state detection by using the acquired state information.
In one implementation, the state checking module may be further configured to periodically check the state of the node, and generate the state information of the node according to a check result.
In this implementation, the periodically checking the state of the node by the state checking module may include one or more of the following operations:
performing disk reading and writing on the node every a first time length;
checking whether the authority, the size and the integrity of the directory file of the node are normal or not every second time length;
checking whether a predetermined secondary process exists in the node every third length of time.
In one implementation, the state information may be an identifier indicating a state of the node.
In this embodiment, the operations performed by the main service module and the status check module correspond to steps S110 and S120 in the first embodiment, respectively, and other implementation details can be seen in the first embodiment.
In a fourth embodiment, a status checking method includes:
after a master node in the cluster is started, executing a state checking process;
after the master node receives the message for state detection, the state checking process feeds back the message for state detection.
In this embodiment, the cluster may be, but is not limited to, a Redis cluster.
In this embodiment, the details of the operation performed by the status check process may refer to the details of the operation of the status check thread in the first embodiment.
In a fifth embodiment, a cluster system, as shown in fig. 5, includes: one or more nodes 51, wherein at least one master node 511 is present in the one or more nodes;
a status check device 52;
after the master node 511 is started, the state checking device 52 is started;
the status checking device 52 is configured to feed back the message for state detection after the master node 511 receives the message for state detection.
In this embodiment, the cluster may be, but is not limited to, a Redis cluster.
In this embodiment, the status checking device may be disposed in the master node.
In this embodiment, details of the operation performed by the master node may refer to details of the operation of the master thread in the first embodiment; the details of the operation performed by the status check device can refer to the details of the operation of the status check thread in the first embodiment.
It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by instructing the relevant hardware through a program, and the program may be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware, and may also be implemented in the form of a software functional module. The present application is not limited to any specific form of hardware or software combination.
There are, of course, many other embodiments of the invention that can be devised without departing from the spirit and scope thereof, and it will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the invention.

Claims (11)

1. A status checking method, comprising:
after the main thread is started, creating a state checking thread;
after a node executing the main thread receives a message for state detection, the state checking thread feeds back the message for state detection;
wherein the node executing the main thread is a master node of the cluster.
2. The status checking method according to claim 1, wherein the creating of the status checking thread further comprises:
the state check thread monitors a state check port of a node executing the main thread; wherein the status check port is configured to receive the message for status detection.
3. The status checking method according to claim 1, wherein the status checking thread feeding back the message for status detection includes:
the state inspection thread acquires state information of the node;
and feeding back the message for state detection by using the acquired state information.
4. The status checking method according to claim 1, wherein the creating of the status checking thread further comprises:
and the state checking thread periodically checks the state of the node and generates the state information of the node according to the checking result.
5. The status checking method of claim 4, wherein the periodic checking of the status of the node by the status checking thread comprises:
the status check thread performs one or more of the following operations:
performing disk reading and writing on the node every a first time length;
checking whether the authority, the size and the integrity of the directory file of the node are normal or not every second time length;
checking whether a predetermined secondary process exists in the node every third length of time.
6. The status checking method of claim 4, wherein the events created by the status checking thread include: port events and timer events;
the port event is set to acquire and feed back the state information of the node after the node receives the message for state detection;
the timer event is set to periodically check the state of the node and to generate the state information of the node according to the check result.
7. A status checking method according to claim 3 or 4, characterized in that:
the state information is an identifier for indicating a state of the node.
8. A status checking device comprising: a processor and a memory;
characterized in that:
the memory is used for storing a program for providing service; the program for providing a service, when read and executed by the processor, performs the following operations:
starting a main thread;
the main thread, when executed by the processor, performs the following operation:
creating a state check thread;
the state check thread, when executed by the processor, performs the following:
after the node where the processor is located receives the message for state detection, feeding back the message for state detection;
wherein the node where the processor is located is a master node of the cluster.
9. A status check device, comprising:
the main service module is used for establishing a state checking module after being started;
the state checking module is used for feeding back the message for state detection after the node where the state checking device is located receives the message for state detection;
wherein, the node where the state checking device is located is a master node of the cluster.
10. A status checking method, comprising:
after a master node in the cluster is started, executing a state checking process;
after the master node receives the message for state detection, the state checking process feeds back the message for state detection.
11. A cluster system, comprising: one or more nodes; wherein at least one master node exists in the one or more nodes;
characterized by further comprising:
state checking means provided in the master node;
after the master node is started, starting the state checking device;
the state checking device is used for feeding back the message for state detection after the master node receives the message for state detection.
CN201710901666.8A 2017-09-28 2017-09-28 State checking method, device and cluster system Active CN109586989B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710901666.8A CN109586989B (en) 2017-09-28 2017-09-28 State checking method, device and cluster system

Publications (2)

Publication Number Publication Date
CN109586989A CN109586989A (en) 2019-04-05
CN109586989B true CN109586989B (en) 2022-09-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant