CN112882901B - Intelligent health state monitor of distributed processing system - Google Patents
Intelligent health state monitor of distributed processing system
- Publication number
- CN112882901B (application CN202110243326.7A)
- Authority
- CN
- China
- Prior art keywords
- health monitoring
- health
- node
- functional module
- network switch
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3006—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computing Systems (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses an intelligent health state monitor for a distributed processing system, comprising monitoring management nodes, a health monitoring data network switch and a health monitoring server. The number of monitoring management nodes corresponds to the number of monitored processors. Each monitoring management node collects health state information from the functional modules in its corresponding processor and transmits it over the data communication network, via the health monitoring data network switch, to the health monitoring server. The health monitoring server analyzes the health monitoring data and makes decisions, diagnoses the cause of system faults, and allows operation to be restored in the shortest possible time. The monitor tracks in real time the working state of components such as the power supply, CPU, memory and solid-state storage of the distributed processing system, helps the system manager diagnose the cause of system faults quickly, effectively improves the testability, maintainability and supportability of the system, and greatly improves its task processing capacity.
Description
Technical Field
The invention belongs to the technical field of embedded computer system design, and particularly relates to an intelligent monitor for health status of a distributed processing system.
Background
Airborne embedded systems contain ever more types and numbers of devices, their processing systems are increasingly complex, and the health of the system is increasingly difficult to monitor. The traditional approach relies solely on the built-in test (BIT) of the main processor. It cannot locate problems accurately, it directly affects the task processing function of the main processor, and it reduces the operating efficiency of the system's processing resources.
Disclosure of Invention
The invention aims to provide a distributed processing system health state intelligent monitor which is used for meeting the requirements of a high-performance aircraft system on the testability, maintainability and reliability of processing equipment.
To achieve this aim, the invention adopts the following technical solution:
An intelligent health state monitor for a distributed processing system comprises monitoring management nodes, a health monitoring data network switch and a health monitoring server. The number of monitoring management nodes corresponds to the number of monitored processors. Each monitoring management node collects health state information from the functional modules in its corresponding processor and transmits it over the data communication network, via the health monitoring data network switch, to the health monitoring server. The health monitoring server analyzes the health monitoring data and makes decisions, diagnoses the cause of system faults, and allows operation to be restored in the shortest possible time.
Further, the health monitoring data network switch realizes monitoring data exchange, the data network is FC, AFDX or Ethernet, and the network communication rate is not lower than 1Gbps.
Further, the processor comprises child nodes and root nodes. There are two root nodes, which are two independent circuits physically located in functional modules and which back each other up. A communication link is always maintained between the two root nodes; one is the active root node and the other is the backup root node. The active root node monitors the health state of all functional modules, including the functional modules in which the active root node and the backup root node are located, detects power-down and removal of functional modules, reports events to the corresponding monitoring management node, receives control instructions from the monitoring management node, and performs the appropriate operations to schedule tasks among the functional modules and prevent system faults.
Further, the physical bearers of the two root node data links are two I2C buses.
Further, the child node is an independent circuit unit located in a functional module. It collects and uploads the sensor data, CPU state data and self-test data of that functional module, reads the module slot number and device number, and controls power-on, power-off and reset of the functional module.
Further, the microcontroller in the child node runs the module function software and is responsible for receiving commands from the external root node and uploading sensor data, CPU state data and software state data. The monitoring process and content of the child node comprise: checking the slot position before power-up, where the functional module is powered up normally if the check passes, and the slot position is reported and the functional module is not powered up if the check fails; powering up and resetting the functional module; monitoring voltage, key circuit current and temperature; monitoring the running state of core devices, including the CPU, the switching chip and the memory; monitoring the running state of key applications; and monitoring the link up/down state of the switching chip ports.
Further, the child node monitors the voltage and temperature of the functional module and obtains the working state of the module from the CPU through the universal serial bus.
When the voltage, temperature or working state of the functional module is abnormal, the child node sends an alarm to the root node. The child node also reports the voltage, temperature, working state and similar information of the functional module to the root node in response to query commands from the root node. The root node receives query requests from the monitoring management node over the network, issues the temperature, voltage, working state and similar queries to the child node of each functional module, automatically reports system alarm information to the monitoring management node, and records the system working log.
Further, the monitoring management node, the root node and the child nodes are respectively powered by independent power supplies and are powered on before the functional circuits of the distributed processing system.
Compared with the prior art, the invention has the following technical characteristics:
To meet the stronger demands that complex environments place on embedded systems, the invention provides an intelligent health state monitor for a distributed processing system. It monitors in real time the working state of components such as the power supply, CPU, memory and solid-state storage of the distributed processing system, helps the system manager diagnose the cause of system faults quickly and restore operation in the shortest possible time, effectively improves the testability, maintainability and supportability of the system, and greatly improves its task processing capacity.
Drawings
FIG. 1 shows the intelligent health state monitor of the distributed processing system;
FIG. 2 shows the functional architecture of the monitor within a processor.
Detailed Description
Referring to FIG. 1, the intelligent health state monitor for a distributed processing system provided by the invention comprises monitoring management nodes, a health monitoring data network switch and a health monitoring server. A plurality of monitoring management nodes can be provided according to system requirements, corresponding to the monitored processors. Each monitoring management node collects health state information from the functional modules in its corresponding processor and transmits it over the data communication network, via the health monitoring data network switch, to the health monitoring server. The health monitoring server analyzes the health monitoring data and makes decisions, quickly diagnoses the cause of system faults, and allows operation to be restored in the shortest possible time. The health monitoring data network switch carries out the exchange of monitoring data; the data network can be FC, AFDX, Ethernet or the like, and the network communication rate is not lower than 1 Gbps.
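The reporting path described above can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the class names, the `state` labels and the direct method calls standing in for the FC/AFDX/Ethernet network are hypothetical and are not specified by the patent.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthRecord:
    """One health state report for one functional module (hypothetical format)."""
    processor_id: str
    module_id: str
    state: str  # "normal" or a fault label such as "over_temperature"


class HealthMonitoringServer:
    """Collects records and groups non-normal modules by processor."""

    def __init__(self):
        self.records = []

    def receive(self, record):
        self.records.append(record)

    def diagnose(self):
        # Return {processor_id: [(module_id, state), ...]} for faulty modules only.
        faults = defaultdict(list)
        for record in self.records:
            if record.state != "normal":
                faults[record.processor_id].append((record.module_id, record.state))
        return dict(faults)


class Switch:
    """Stand-in for the health monitoring data network switch."""

    def __init__(self, server):
        self.server = server

    def forward(self, record):
        self.server.receive(record)


class MonitoringManagementNode:
    """One node per monitored processor; forwards module health via the switch."""

    def __init__(self, processor_id, switch):
        self.processor_id = processor_id
        self.switch = switch

    def report(self, module_id, state):
        self.switch.forward(HealthRecord(self.processor_id, module_id, state))
```

In this sketch each monitored processor gets its own `MonitoringManagementNode`, and the server never talks to a processor directly, which mirrors the separation between the monitoring path and the functional circuits.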
As shown in FIG. 2, the processor includes child nodes and root nodes. There are two root nodes, which are two independent circuits physically located in functional modules and which back each other up. A communication link is always maintained between the two root nodes; one is the active root node and the other is the backup root node. The active root node monitors the health state of all functional modules, including the functional modules in which the active root node and the backup root node are located, detects power-down and removal of functional modules, reports events to the corresponding monitoring management node, receives control instructions from the monitoring management node, and performs the appropriate operations to schedule tasks among the functional modules and prevent system faults. The data links of the two root nodes are physically carried on two I2C buses. A functional module is a module that implements a particular function in the processor, such as a computing module or an output module.
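The patent states that the two root nodes back each other up but does not spell out the switchover rule. The following Python sketch assumes, purely for illustration, that the backup takes over after a fixed number of consecutive missed heartbeats on the inter-root link; the `RootPair` class, the `tick` method and the `max_missed` threshold are hypothetical.

```python
class RootPair:
    """Active/backup root node pair with heartbeat-based role swap (assumed rule)."""

    def __init__(self, active, backup, max_missed=3):
        self.active = active          # name of the current active root node
        self.backup = backup          # name of the current backup root node
        self.max_missed = max_missed  # hypothetical threshold, not from the patent
        self.missed = 0

    def tick(self, heartbeat_received):
        """Process one heartbeat period; return True if the roles were swapped."""
        if heartbeat_received:
            self.missed = 0
            return False
        self.missed += 1
        if self.missed >= self.max_missed:
            # The backup root node becomes active and the old active becomes backup.
            self.active, self.backup = self.backup, self.active
            self.missed = 0
            return True
        return False
```

Counting consecutive misses rather than reacting to a single lost heartbeat is one way to avoid swapping roles on a transient bus error; other policies would fit the description equally well.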
The child node is likewise an independent circuit unit located in a functional module and is powered by an independent power supply. It is mainly responsible for collecting and uploading the sensor data, CPU state data and self-test data of the functional module in which it is located, reading the module slot number and device number, and controlling power-on, power-off and reset of the functional module. The microcontroller in the child node runs the module function software and is responsible for receiving commands from the external root node and uploading sensor data, CPU state data and software state data. The monitoring process and content of the child node comprise: checking the slot position before power-up, where the functional module is powered up normally if the check passes, and the slot position is reported and the functional module is not powered up if the check fails; powering up and resetting the functional module; monitoring voltage, key circuit current and temperature; monitoring the running state of core devices, including the CPU, the switching chip, the memory and the like; monitoring the running state of key applications; and monitoring the link up/down state of the switching chip ports.
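The slot check that gates power-up can be sketched as below. This is a minimal Python illustration, not firmware from the patent: the function name and the callback parameters (`read_slot`, `power_on`, `reset`, `report`) are hypothetical stand-ins for the hardware operations of the child node.

```python
def power_up_module(read_slot, expected_slot, power_on, reset, report):
    """Check the slot position before power-up; power up and reset only if it matches.

    All callables are hypothetical hooks supplied by the caller.
    Returns True if the functional module was powered up.
    """
    slot = read_slot()
    if slot != expected_slot:
        # Check failed: report the slot position and leave the module unpowered.
        report(f"slot mismatch: read {slot}, expected {expected_slot}")
        return False
    power_on()
    reset()
    return True
```

Passing the hardware operations in as callables keeps the sketch testable without real hardware; after a successful power-up the child node would go on to its periodic voltage, current, temperature and device state checks.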
The monitoring management node, the root node and the child nodes are each powered by independent power supplies and are powered on before the functional circuits of the distributed processing system. The child node monitors the voltage and temperature of the functional module and obtains the working state of the module from the CPU through the universal serial bus. When the voltage, temperature or working state of the functional module is abnormal, the child node sends an alarm to the root node. The child node also reports the voltage, temperature, working state and similar information of the module to the root node in response to query commands from the root node. The root node can receive query requests from the monitoring management node over the network, issue the temperature, voltage, working state and similar queries to the child node of each functional module, automatically report system alarm information to the monitoring management node, and record the system working log.
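The two reporting paths described above, unsolicited alarms and answers to root node queries, can be sketched in Python as follows. The voltage and temperature limits, the `"normal"` state label and all names are assumptions made for illustration; the patent does not give threshold values or a message format.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical alarm thresholds; the patent does not specify values."""
    min_voltage_v: float = 4.75
    max_voltage_v: float = 5.25
    max_temperature_c: float = 85.0


class ChildNode:
    """Keeps the latest sample, raises alarms, and answers root node queries."""

    def __init__(self, module_id, send_alarm, limits=None):
        self.module_id = module_id
        self.send_alarm = send_alarm  # callable(module_id, reasons) toward the root node
        self.limits = limits or Limits()
        self.latest = {}

    def sample(self, voltage_v, temperature_c, working_state):
        """Store one measurement and send an alarm if anything is abnormal."""
        self.latest = {
            "voltage_v": voltage_v,
            "temperature_c": temperature_c,
            "working_state": working_state,
        }
        reasons = []
        if not (self.limits.min_voltage_v <= voltage_v <= self.limits.max_voltage_v):
            reasons.append("voltage")
        if temperature_c > self.limits.max_temperature_c:
            reasons.append("temperature")
        if working_state != "normal":
            reasons.append("working_state")
        if reasons:
            self.send_alarm(self.module_id, reasons)

    def handle_query(self):
        """Answer a root node query with a copy of the latest sample."""
        return dict(self.latest)
```

Keeping the alarm path and the query path separate means the root node learns about a fault immediately, while routine polling only reads the cached sample.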
The intelligent monitor is independent of the functional components and operates automatically, which saves processing resources and improves the task processing capacity of the system. It addresses the problems of insufficient health monitoring information and inaccurate fault localization in complex processing systems, helps the system manager diagnose system faults quickly and restore operation in the shortest possible time, and thereby effectively improves the testability, maintainability and supportability of the system.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced equally; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.
Claims (4)
1. An intelligent health state monitor for a distributed processing system, characterized by comprising monitoring management nodes, a health monitoring data network switch and a health monitoring server; the number of monitoring management nodes corresponds to the number of monitored processors; each monitoring management node collects health state information from the functional modules in its corresponding processor and transmits it over the data communication network, via the health monitoring data network switch, to the health monitoring server; and the health monitoring server analyzes the health monitoring data, makes decisions, and diagnoses the cause of system faults;
The processor comprises a child node and a root node; the number of the root nodes is two, the root nodes are two independent circuits physically positioned in the functional module, and the root nodes form double backups with each other; one communication connection is always kept between two root nodes, one is an active root node, and the other is a backup root node; the active root node monitors the health status of all the functional modules, wherein the monitored functional modules comprise the active root node and the functional module where the corresponding backup root node is located; detecting power failure and extraction of the functional module, reporting an event to a corresponding monitoring management node, and receiving a control instruction of the monitoring management node to execute corresponding operation to perform task scheduling of the functional module;
The child node is an independent circuit unit located in a functional module, and is used for collecting and uploading the sensor data, CPU state data and self-test data of the functional module, reading the slot number and device number of the functional module, and controlling power-on, power-off and reset of the functional module;
The microcontroller in the child node runs the module function software and is responsible for receiving commands from the external root node and uploading sensor data, CPU state data and software state data; the monitoring process and content of the child node comprise: checking the slot position before power-up, where the functional module is powered up normally if the check passes, and the slot position is reported and the functional module is not powered up if the check fails; powering up and resetting the functional module; monitoring voltage, key circuit current and temperature; monitoring the running state of core devices, including the CPU, the switching chip and the memory; monitoring the running state of key applications; and monitoring the link up/down state of the switching chip ports;
The child node monitors the voltage and temperature of the functional module and obtains the working state of the module from the CPU through the universal serial bus;
When the voltage, temperature or working state of the functional module is abnormal, the child node sends an alarm to the root node and, in response to query commands from the root node, reports the voltage, temperature and working state information of the functional module to the root node; the root node receives query requests from the monitoring management node over the network, issues temperature, voltage and working state query requests to the child node of each functional module, automatically reports system alarm information to the monitoring management node, and records the system working log.
2. The intelligent health state monitor for a distributed processing system according to claim 1, wherein the health monitoring data network switch carries out the exchange of monitoring data, the data network is FC, AFDX or Ethernet, and the network communication rate is not lower than 1 Gbps.
3. The intelligent health state monitor for a distributed processing system according to claim 1, wherein the data links of the two root nodes are physically carried on two I2C buses.
4. The intelligent health state monitor for a distributed processing system according to claim 1, wherein the monitoring management node, the root node and the child nodes are each powered by independent power supplies and are powered on before the functional circuits of the distributed processing system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110243326.7A CN112882901B (en) | 2021-03-04 | 2021-03-04 | Intelligent health state monitor of distributed processing system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110243326.7A CN112882901B (en) | 2021-03-04 | 2021-03-04 | Intelligent health state monitor of distributed processing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112882901A CN112882901A (en) | 2021-06-01 |
CN112882901B CN112882901B (en) | 2024-06-18 |
Family
ID=76055397
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110243326.7A Active CN112882901B (en) | 2021-03-04 | 2021-03-04 | Intelligent health state monitor of distributed processing system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112882901B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113010379B (en) * | 2021-03-09 | 2024-03-15 | 爱瑟福信息科技(上海)有限公司 | Electronic equipment monitoring system |
CN113722012A (en) * | 2021-09-07 | 2021-11-30 | 超越科技股份有限公司 | Domestic system-level management system |
CN114172829B (en) * | 2022-02-10 | 2022-08-12 | 统信软件技术有限公司 | Server health monitoring method and system and computing equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109698775A (en) * | 2018-11-21 | 2019-04-30 | 中国航空工业集团公司洛阳电光设备研究所 | A kind of dual-machine redundancy backup system based on real-time status detection |
CN111880997A (en) * | 2020-07-29 | 2020-11-03 | 曙光信息产业(北京)有限公司 | Distributed monitoring system, monitoring method and device |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2750517B1 (en) * | 1996-06-27 | 1998-08-14 | Bull Sa | METHOD FOR MONITORING A PLURALITY OF OBJECT TYPES OF A PLURALITY OF NODES FROM A ADMINISTRATION NODE IN A COMPUTER SYSTEM |
US9063966B2 (en) * | 2013-02-01 | 2015-06-23 | International Business Machines Corporation | Selective monitoring of archive and backup storage |
GB2514833A (en) * | 2013-06-07 | 2014-12-10 | Ibm | Portable computer monitoring |
US9348573B2 (en) * | 2013-12-02 | 2016-05-24 | Qbase, LLC | Installation and fault handling in a distributed system utilizing supervisor and dependency manager nodes |
CN106126407B (en) * | 2016-06-22 | 2018-07-17 | 西安交通大学 | A kind of performance monitoring Operation Optimization Systerm and method for distributed memory system |
CN109144802A (en) * | 2018-09-12 | 2019-01-04 | 杭州智享新电科技有限公司 | Internet of Things module health control diagnostic method |
US10880434B2 (en) * | 2018-11-05 | 2020-12-29 | Nice Ltd | Method and system for creating a fragmented video recording of events on a screen using serverless computing |
CN110011829B (en) * | 2019-02-28 | 2021-11-19 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Comprehensive airborne task system health management subsystem |
Also Published As
Publication number | Publication date |
---|---|
CN112882901A (en) | 2021-06-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||