CN119025030A - A log storage method, BMC and computing device - Google Patents

Info

Publication number
CN119025030A
Authority
CN
China
Prior art keywords
core
log
memory
service core
storage medium
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411045056.9A
Other languages
Chinese (zh)
Inventor
李剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
XFusion Digital Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XFusion Digital Technologies Co Ltd
Priority to CN202411045056.9A
Publication of CN119025030A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766 Error or fault reporting or storing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Security & Cryptography (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a log storage method, a BMC and a computing device. The method is applied to a baseboard management controller (BMC) that comprises a service core, a security core, a memory and a nonvolatile storage medium. The security core acquires the running state of the service core in real time. When the service core is in an abnormal running state, the security core writes a first log in the memory into the nonvolatile storage medium. When the service core is restored to a normal running state, the service core writes the first log in the nonvolatile storage medium back into the memory. The first log is generated by the service core in the normal running state and written into the memory. Because the first log has been persisted, it is not lost even if the running state of the service core is reset, and once the service core returns to the normal running state it can restore the first log from the nonvolatile storage medium to the memory. The cause of the abnormality can then be analyzed from the recovered first log to locate the problem.

Description

Log storage method, BMC and computing device
Technical Field
The present application relates to the field of server technologies, and in particular, to a log storage method, a BMC, and a computing device.
Background
A baseboard management controller (BMC) is a dedicated microcontroller embedded on the motherboard of a computer system for monitoring and managing the operating status of servers, data centers, and other complex computer systems. During operation of a computer system, the BMC may encounter various exceptions, which are typically handled by the corresponding CPU core, since the CPU core is the critical component of the server responsible for executing instructions and processing data. For example, log data generated by the service core is continuously written into the memory of the BMC, but when the service core itself becomes abnormal, it may be unable to save the log that captures the state at the time of the fault. Therefore, to prevent the loss of log data in the event of a BMC exception or crash, the service core periodically dumps the log data in the memory to a nonvolatile storage medium (e.g., flash memory or a hard disk). A nonvolatile storage medium retains data after a system power-down or reboot, so log data that has already been dumped can still be retained and accessed even if the system crashes or reboots.
Because the log data in the memory is dumped to the nonvolatile storage medium only periodically, a BMC abnormality that occurs before the next dump cycle leaves the most recent log entries only in the memory, where they may be lost because memory is volatile. For example, when the abnormality is severe enough that the BMC cannot start, it may be necessary to try to restore the BMC by powering down (i.e., completely cutting off power) and then powering up again. Memory is volatile storage that relies on a continuous power supply to hold its data. When the power is suddenly cut off, the charge in the memory is quickly exhausted and the stored log data is lost immediately. Logs typically contain information such as error codes, stack traces, and system status, which is critical to analyzing and resolving system anomalies when they occur. If the log data is lost, the log from the moment the BMC abnormality occurred cannot be obtained, so the cause of the abnormality cannot be determined and the system abnormality cannot be resolved.
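The loss window described above can be illustrated with a small calculation. This is only a sketch under stated assumptions: the function name `lost_entries`, the 60-second dump period, the 10-second logging interval, and the crash time are all hypothetical values chosen for illustration, and a dump is assumed to capture every entry written up to and including the dump instant.

```python
def lost_entries(log_times, dump_period, crash_time):
    """Return the timestamps of log entries that were only in memory at the crash."""
    # The last completed periodic dump is the largest multiple of
    # dump_period that is not later than the crash.
    last_dump = (crash_time // dump_period) * dump_period
    # Entries written after that dump and before the crash exist only in
    # volatile memory and disappear when power is cut.
    return [t for t in log_times if last_dump < t <= crash_time]


# Hypothetical numbers: one entry every 10 s, a dump every 60 s, crash at t = 95 s.
entries = list(range(0, 100, 10))
lost = lost_entries(entries, dump_period=60, crash_time=95)
print(lost)
```

With these numbers, the entries closest to the fault are exactly the ones missing, which is why periodic dumping alone does not preserve the log needed for diagnosis.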
Disclosure of Invention
In view of the above, the embodiment of the application provides a log storage method, a BMC and a computing device, which aim to solve the problem that log data in a memory is easy to lose when the BMC is abnormal.
The technical solutions provided by the embodiments of the present application are as follows.
In a first aspect, an embodiment of the application provides a log storage method applied to a baseboard management controller (BMC), wherein the BMC comprises a service core, a security core, a memory and a nonvolatile storage medium, and the method comprises the following steps:
the security core acquires the running state of the service core in real time;
when the service core is in an abnormal running state, the security core writes a first log in the memory into the nonvolatile storage medium;
when the service core is restored to a normal running state, the service core writes the first log in the nonvolatile storage medium into the memory, the first log having been generated by the service core in the normal running state and written into the memory.
According to the method provided by the application, when the service core is in an abnormal state, the security core writes the first log in the memory into the nonvolatile storage medium. After the first log has been written, the running state of the service core is reset, so that when the service core is restored to the normal running state it restores the first log from the nonvolatile storage medium to the memory. Because the security core operates independently of the service core, the first log can still be written to the nonvolatile storage medium while the service core is in an abnormal state, and because the nonvolatile storage medium persists data, the first log is not lost even if the running state of the service core is reset. Once restored to the normal running state, the service core can restore the first log from the nonvolatile storage medium to the memory. The cause of the abnormal interrupt event can then be analyzed from the recovered first log, and the problem can be located and handled accordingly. This solves the problem that, when log data is lost because of a service core fault, the log from the moment of the BMC abnormality cannot be obtained and the cause of the fault cannot be located.
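The three steps of the method can be sketched as a toy model. This is a minimal illustration, not the patented implementation: the class `BmcModel`, its method names, and the use of Python lists for the memory and the nonvolatile storage medium are all assumptions made for the example, and the reset is modelled as wiping the volatile log, which is the worst case.

```python
from enum import Enum


class CoreState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"


class BmcModel:
    """Toy model of a BMC: two cores, volatile memory, nonvolatile medium."""

    def __init__(self):
        self.memory_log = []        # volatile: lost on reset or power cycle
        self.nonvolatile_log = []   # persistent: survives reset or power cycle
        self.service_state = CoreState.NORMAL

    def service_core_write(self, entry):
        # The first log is produced only while the service core runs normally.
        if self.service_state is CoreState.NORMAL:
            self.memory_log.append(entry)

    def security_core_check(self):
        # Steps 1 and 2: observe the service core; on a fault, persist the log.
        if self.service_state is CoreState.ABNORMAL:
            self.nonvolatile_log = list(self.memory_log)

    def reset_service_core(self):
        # Modelled here as wiping volatile memory (worst case).
        self.memory_log = []
        self.service_state = CoreState.NORMAL

    def service_core_restore(self):
        # Step 3: once normal again, copy the persisted log back into memory.
        if self.service_state is CoreState.NORMAL:
            self.memory_log = list(self.nonvolatile_log)


bmc = BmcModel()
bmc.service_core_write("boot ok")
bmc.service_core_write("sensor read failed")
bmc.service_state = CoreState.ABNORMAL   # fault injected for the demo
bmc.security_core_check()
bmc.reset_service_core()
bmc.service_core_restore()
print(bmc.memory_log)
```

The point of the sketch is the ordering: the security core persists the log before the reset, so the reset cannot destroy the only copy.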
In a possible implementation manner, the BMC further includes an abnormal interrupt module, and the security core acquiring the running state of the service core in real time includes:
the security core acquires, in real time, signals sent by the abnormal interrupt module, where the abnormal interrupt module is used for judging whether the service core is in an abnormal running state and for sending an interrupt signal to the security core when it judges that the service core is in the abnormal running state;
the security core writing the first log in the memory into the nonvolatile storage medium when the service core is in an abnormal running state includes:
when the security core receives the interrupt signal sent by the abnormal interrupt module, the security core writes the first log in the memory into the nonvolatile storage medium.
In one possible implementation manner, the restoration of the service core to the normal running state includes:
the service core is restored to the normal running state by powering down and powering up again.
In one possible implementation manner, the restoration of the service core to the normal running state includes:
the security core performs a reset operation on the service core, where the reset operation is used for resetting the running state of the service core.
In one possible implementation, the method further includes:
when the reset operation performed by the security core does not restore the service core to the normal running state, the service core is restored to the normal running state by powering down and powering up again.
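The two recovery paths and their order can be sketched as follows. The function `recover_service_core` and both callables passed to it are hypothetical stand-ins for hardware operations; the first is assumed to return True when the service core is running normally after the reset.

```python
def recover_service_core(reset_by_security_core, power_cycle):
    """Return the name of the recovery path that was used.

    Both arguments are hypothetical callables. reset_by_security_core
    returns True when the service core runs normally after the reset.
    """
    if reset_by_security_core():
        return "reset"
    # The reset did not help, so fall back to power-down and power-up.
    power_cycle()
    return "power_cycle"


calls = []
path = recover_service_core(lambda: False, lambda: calls.append("power_cycle"))
print(path, calls)
```

Trying the reset first keeps the less disruptive option ahead of a full power cycle, which only runs when the reset fails.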
In one possible implementation manner, when the security core performs a reset operation on the service core and the service core returns to the normal running state, the service core writing the first log in the nonvolatile storage medium into the memory includes:
the service core compares the first log in the nonvolatile storage medium with the log in the memory;
when the comparison shows that the first log in the nonvolatile storage medium is inconsistent with the log in the memory, the service core clears the log in the memory and writes the first log in the nonvolatile storage medium into the memory.
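The compare-then-overwrite step can be sketched as below. The function name `restore_after_reset` and the list-based logs are assumptions for illustration; a real implementation would compare memory regions rather than Python lists.

```python
def restore_after_reset(memory_log, nonvolatile_log):
    """Overwrite memory_log with nonvolatile_log when the two differ.

    Returns True if memory was rewritten, False if it was already consistent.
    """
    if memory_log == nonvolatile_log:
        return False
    # Inconsistent: clear the in-memory log, then copy the persisted one in.
    memory_log.clear()
    memory_log.extend(nonvolatile_log)
    return True


memory = ["boot ok"]                            # stale or partial after the reset
persisted = ["boot ok", "sensor read failed"]   # copy saved by the security core
changed = restore_after_reset(memory, persisted)
print(changed, memory)
```

Comparing first avoids an unnecessary rewrite when a reset left the memory contents intact.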
In one possible implementation, the service core writing the first log in the nonvolatile storage medium into the memory includes:
the service core detects whether the first log in the nonvolatile storage medium meets a preset validity condition and/or a preset integrity condition;
when the first log in the nonvolatile storage medium meets the preset validity condition and/or the preset integrity condition, the service core writes the first log in the nonvolatile storage medium into the memory.
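The source does not specify what the validity and integrity conditions are. One possible reading, sketched here purely as an assumption, is a stored record with a marker field (validity) plus a length and CRC32 checksum (integrity). The `BLOG` marker, the header layout, and the function names are hypothetical.

```python
import struct
import zlib

MAGIC = b"BLOG"                    # hypothetical marker identifying a log record
HEADER = struct.Struct("<4sII")    # magic, payload length, CRC32 of payload


def pack_log(payload: bytes) -> bytes:
    """Build the record the security core would write to nonvolatile storage."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def unpack_log(blob: bytes):
    """Return the payload if the record passes both checks, else None."""
    if len(blob) < HEADER.size:
        return None
    magic, length, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if magic != MAGIC:                                   # validity condition
        return None
    if len(payload) != length or zlib.crc32(payload) != crc:   # integrity condition
        return None
    return payload


record = pack_log(b"sensor read failed")
corrupted = record[:-1] + bytes([record[-1] ^ 0xFF])   # flip bits in the last byte
print(unpack_log(record), unpack_log(corrupted))
```

Only a record that passes the checks would be copied into the memory; a failed check leaves the memory untouched.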
In one possible implementation manner, after the service core writes the first log in the nonvolatile storage medium into the memory, the method further includes:
once the first log in the nonvolatile storage medium has been completely written into the memory, the service core clears the first log in the nonvolatile storage medium.
In a second aspect, an embodiment of the present application provides a BMC, where the BMC includes a processor and a memory, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to implement the log storage method according to the foregoing first aspect.
In a third aspect, an embodiment of the present application provides a computing device, where the computing device includes a processor, a memory, and a BMC, where the BMC is configured to execute the log storage method according to the foregoing first aspect.
Drawings
To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a BMC structure according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a server according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a BMC multi-core heterogeneous system according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an interrupt signal according to an embodiment of the present application;
FIG. 5 is a flowchart of a first log storage method according to an embodiment of the present application;
FIG. 6 is a flowchart of a second log storage method according to an embodiment of the present application;
FIG. 7 is a flowchart of a third log storage method according to an embodiment of the present application;
FIG. 8 is a flowchart of a fourth log storage method according to an embodiment of the present application;
FIG. 9 is a flowchart of a fifth log storage method according to an embodiment of the present application;
FIG. 10 is a flowchart of a sixth log storage method according to an embodiment of the present application;
FIG. 11 is a flowchart of a seventh log storage method according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
It should be noted that the described embodiments of the present application are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to make the following embodiments clear, technical terms related to the present application will be described first.
A watchdog timer (WDT) is a mechanism for monitoring the operation of a system. It was first introduced in single-chip microcomputer systems to prevent a program from falling into an endless loop due to interference and leaving the system unable to work normally. The basic principle is to set up a timer with a preset value. The watchdog starts counting once the program starts running. If the program runs normally, the CPU periodically issues an instruction that resets the watchdog count to zero. If the count instead reaches the preset value, the program is considered not to be working normally and the system is restarted, which keeps the core processes of the system in a normal running state most of the time.
A BMC (Baseboard Management Controller) is a core component of server out-of-band management. It is a dedicated microcontroller that is independent of the server operating system and host processor and is typically built into the motherboard of servers, network devices, and other complex computer systems. It is mainly responsible for monitoring and managing the state of system hardware, for example key parameters such as system temperature, voltage, fan speed, and power state, and for providing remote management and control functions such as remote monitoring and remote restarting.
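The counting behaviour described above can be sketched in a few lines. The class `Watchdog`, its `limit` parameter, and the `restarts` counter that stands in for an actual system restart are all assumptions made for illustration.

```python
class Watchdog:
    """Counter that triggers a restart when it reaches a preset limit."""

    def __init__(self, limit):
        self.limit = limit      # preset value at which the system is restarted
        self.count = 0
        self.restarts = 0       # stands in for an actual system restart

    def feed(self):
        # A healthy program periodically resets the count to zero.
        self.count = 0

    def tick(self):
        # Called once per timer period.
        self.count += 1
        if self.count >= self.limit:
            self.restarts += 1
            self.count = 0


wdt = Watchdog(limit=3)
wdt.tick()
wdt.tick()
wdt.feed()            # program is alive, so the count never reaches the limit
restarts_while_healthy = wdt.restarts
for _ in range(3):    # program hangs and stops feeding the watchdog
    wdt.tick()
print(restarts_while_healthy, wdt.restarts)
```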
Computer systems that employ BMCs for monitoring and management typically include, but are not limited to, servers, rack-mounted servers for data centers, high performance computing clusters, and some complex embedded systems. Such systems are characterized by relatively complex hardware configurations and highly integrated management requirements that require careful operational state monitoring, fault early warning, remote management and maintenance capabilities.
Current BMC designs employ a multi-core architecture to enhance functionality and security. Typically, a BMC is configured with 4 service cores and 1 security core, which are hardware components that each take on a particular role and set of tasks. The service core is mainly responsible for the management functions of the BMC and for interaction with the server hardware devices, such as monitoring server states and executing management commands. The security core is mainly responsible for providing security functions such as authentication, access control, and encrypted communication, so as to ensure the security and integrity of the BMC and the server hardware it manages.
The BMC has its own power supply and I/O interfaces and can run before the server's operating system, BIOS, etc. are started, thereby providing management and control functions for the server. When a computer system monitored and managed by a BMC experiences an abnormal interrupt, the interrupt is typically handled by the corresponding CPU core, which takes appropriate action. However, in some cases, particularly when the abnormality occurs in the CPU core itself, the core cannot store log data normally. This is because an exception in the CPU core may cause its internal registers and cache data to be lost or corrupted, which affects the log storage function. Furthermore, if the exception is severe enough to cause the server's operating system to crash or fail to boot, the logs may not be accessible even if a log storage mechanism exists, because the operating system is not functioning properly. When an administrator or technician cannot view the log captured at the time the problem occurred, locating the root cause of the system anomaly becomes particularly difficult. Without this critical log information, it is difficult to determine which component has failed, and effective measures cannot be taken to solve the problem.
The embodiment of the application provides a heterogeneous design of the security core and the service core of a BMC (baseboard management controller). A heterogeneous multi-core system integrates multiple processing cores of different types or functions on one physical chip. Each core can serve as the execution engine of its corresponding software system, which enables parallel processing and functional separation.
According to the method provided by the embodiment of the application, when the service core is in an abnormal state, the security core writes the first log in the memory into the nonvolatile storage medium. After the first log has been written, the service core is reset, so that the first log in the nonvolatile storage medium can be restored to the memory when the service core returns to the normal running state. Because the first log can still be written to the nonvolatile storage medium while the service core is in an abnormal state, and the nonvolatile storage medium persists data, the first log is not lost even if the running state of the service core is reset. Once restored to the normal running state, the service core can restore the first log from the nonvolatile storage medium to the memory. The cause of the abnormal interrupt event can then be analyzed from the recovered first log, and the problem can be located and handled accordingly, which avoids the loss of log data caused by service core faults.
FIG. 1 is a schematic structural diagram of a BMC according to an embodiment of the present application. As shown in FIG. 1, the BMC may include a processor 210 and a memory 220. The processor 210 and the memory 220 may be connected by a bus or in another manner. The processor may include a plurality of cores, including a service core for performing service-related tasks and a security core for performing secure boot tasks.
In the embodiment of the present application, the processor 210 is the computing core and control core of the BMC. In one example, the processor includes a security core and a service core, where the security core performs tasks related to BMC security, such as secure boot and security checks, and the service core performs service-related tasks in the BMC. Since each core has its own independent memory space and running context, the cores can run different systems independently.
The memory 220 is a storage device of the server for storing programs and data. It is understood that the memory 220 here may include both memory and nonvolatile storage media. The memory 220 provides storage space that stores an operating system and executable program code, which may include, but is not limited to, a Windows system, a Linux system, a bare-metal system, etc.
It will be appreciated that the architecture illustrated in FIG. 1 of the present application does not constitute a specific limitation on the BMC. In other embodiments of the application, the BMC may include more or fewer components than shown, may combine certain components, may split certain components, or may have a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Based on the foregoing BMC structure, the embodiment of the present application further provides a server. FIG. 2 is a schematic structural diagram of the server provided by the embodiment of the present application. As shown in FIG. 2, the server includes a processor, a BMC and a power module. The BMC deployed on the server may be the BMC shown in FIG. 1. The power module is responsible for providing a stable power supply for the components within the server.
In an alternative implementation, the security core and the service core may run different software systems. The service core and the security core are the foundation on which their respective software systems run, and the functions and tasks of those software systems are executed by the cores. That is, the security core and the service core serve as the execution engines of their corresponding software systems and may execute the program code of different software systems.
Illustratively, the service core runs an embedded operating system to perform service-related tasks, while the security core runs a bare-metal system to perform security-related tasks. A bare-metal system runs directly on the hardware without the intermediate layer of a traditional operating system, and can therefore offer higher security and reliability.
In an actual application scenario, for an operating system or a bare metal system to be run, relevant codes and data need to be loaded into a memory first. Each core may have its own memory space or cache, or may share a common memory space. Even if the memory is physically shared, the two cores can remain logically independent, running different systems and programs without interfering with each other.
For example, for an embedded operating system, drivers, related code, etc. may be loaded into the memory allocated to the service core and an execution start point may be set. For a bare-metal system, the relevant code and data may likewise be loaded into the memory allocated to the security core and the corresponding execution start point set. After memory is allocated for the service core and the embedded operating system is loaded, the service core may begin to process service-related tasks using the embedded operating system. Likewise, after memory is allocated for the security core and the bare-metal system is loaded, the security core may begin to process security-related tasks using the bare-metal system.
FIG. 3 is a schematic diagram of a BMC multi-core heterogeneous system according to an embodiment of the present application. In FIG. 3, the storage is a block of storage space partitioned in the BMC that can be shared by the security core and the service core, and it may include a memory and a nonvolatile storage medium.
The service core, the memory and the nonvolatile storage medium operate in the same software system environment. This software system provides the support they need to work together on a variety of complex computing tasks. The service core, as the execution engine of this software system, executes the software system's instructions in accordance with the instruction set architecture (ISA). That is, the service core is the hardware that executes the software instructions, while the memory and the nonvolatile storage medium provide data storage. These components may be connected by a system bus, but it is the software system that actually enables them to work in concert. Logically, they are integrated and managed by a software system (e.g., an operating system) that coordinates these hardware resources through mechanisms such as process management and memory management. In other words, the service core executes instructions provided by the software system, and the software system determines how the service core is used and how hardware resources such as the memory and the nonvolatile storage medium are managed.
The security core typically operates in a separate software system and may interact with the software system in which the service core resides through a particular interface or protocol. As shown in FIG. 3, the security core may have access to the memory and the nonvolatile storage medium. The security core is responsible for monitoring the running state of the service core and intervenes when necessary to protect critical data and perform the required security operations. Its functions may include hardware state monitoring, security policy enforcement, fault early warning, recovery and the like. The security core is not in the same software system as the service core and can run independently of it, so the security management function of the BMC multi-core heterogeneous system remains available when the service core is in an abnormal running state. The service core is responsible for executing service logic and data processing tasks. It has memory resources and can run an operating system sufficient to support complex service applications and data processing requirements. The service core and the security core can exchange information and cooperate through an internal communication mechanism, which ensures the secure execution of services and the protection of data integrity.
The abnormal interrupt triggering module is mainly responsible for detecting and responding to abnormal conditions that arise while the service core sequentially executes the program instruction stream. These exceptions may originate from various events during execution by processors within the BMC, including but not limited to instruction exceptions and hardware faults. When the abnormal interrupt triggering module detects an abnormal condition, it can immediately interrupt the currently executing program and send an interrupt signal to both the security core and the service core.
In an alternative implementation, the triggering may be edge triggering, which refers to triggering an interrupt when the interrupt signal changes from one state to another (e.g., from low to high or vice versa). As shown in fig. 4, fig. 4 is a schematic diagram of an interrupt signal provided in an embodiment of the present application, including a rising edge and a falling edge. Wherein the rising edge corresponds to a change from a low level to a high level and the falling edge corresponds to a change from a high level to a low level.
The level triggering mechanism continuously triggers an interrupt for as long as the signal remains at a particular level (e.g., high or low). However, this mechanism may run into problems when handling variable frequency triggers. For example, when the service core performs an interrupt clearing operation after the interrupt is triggered, the interrupt signal may disappear, which may prevent the security core from receiving the interrupt signal. Because level triggering depends on the specific level of the signal, once the signal disappears, the interrupt cannot be triggered again until the signal reaches the trigger level again.
In contrast, the edge triggering method is more suitable for interrupt triggering in the embodiment of the application. Edge triggering relies on a change in the state of the signal to trigger an interrupt, rather than a specific level of the signal. Therefore, even after the interrupt is cleared, the interrupt is re-triggered whenever the interrupt signal changes state again (e.g., changes from low to high, or from high to low). This means that, whether it is a service core or a security core, after processing its own interrupt, its own interrupt state can be cleared independently without affecting interrupt processing of other cores.
This mechanism helps to reduce collisions and misoperations in interrupt handling. In a heterogeneous multi-core system, multiple processor cores may handle different interrupt sources simultaneously. If the level triggering mode is adopted, interruption loss or repeated triggering can be caused by signal disappearance or level change. The edge triggering mode can ensure that the interrupt can be accurately triggered every time the signal state changes, thereby improving the concurrency performance and the processing efficiency of the system.
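The difference between the two triggering modes can be illustrated with a small simulation. This is only a sketch under stated assumptions: the sampled trace, the function names `rising_edges` and `level_visible`, and the polling index are hypothetical and do not come from the embodiment.

```python
def rising_edges(samples):
    """Return the indices at which the signal changes from low (0) to high (1)."""
    return [i for i in range(1, len(samples)) if samples[i - 1] == 0 and samples[i] == 1]


def level_visible(samples, poll_index):
    """A level-triggered receiver sees the interrupt only if the line is still high when it looks."""
    return samples[poll_index] == 1


# Hypothetical trace: the line goes high at index 2 and the service core clears it at index 4.
signal = [0, 0, 1, 1, 0, 0]

edge_events = rising_edges(signal)            # one event per low-to-high transition
late_level_check = level_visible(signal, 5)   # security core looks after the clear
```

In this trace the edge detector records the transition once, while a level-based check made after the service core has cleared the line sees nothing, which is the loss scenario described above.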
When the security core receives the interrupt signal, it learns that the service core has encountered an abnormal condition during operation, and it then writes the first log in the memory into the nonvolatile storage medium. The nonvolatile storage medium persists data, so the first log is not lost even if the system loses power or reboots. After the first log has been written, the security core may restart the service core and attempt to reset it, in order to restore normal operation of the service core and avoid a system crash or instability caused by the watchdog timeout event. Since the first log has already been written to the nonvolatile storage medium by the security core, the log can be recovered from the nonvolatile storage medium even after a power loss. The cause of the watchdog timeout event can therefore be analyzed from the first log restored to the memory, so that hard-to-diagnose problems can be located and handled accordingly.
The BMC multi-core heterogeneous system integrating a plurality of processing cores of different types has the following beneficial effects:
By separating the security core from the service core, the security core can still operate normally and take corresponding measures even if the service core is abnormal or under attack. The security core can monitor the state of the service core and dump the first log in response to the abnormal interrupt, ensuring the integrity and accessibility of critical data and logs. This design improves the security and reliability of the BMC multi-core heterogeneous system.
The abnormal interrupt triggering module and the service core operate in the same software system, so the running state of the service core can be monitored in real time and the interaction between the module and the service core is tighter and more efficient. This real-time monitoring mechanism helps shorten fault handling time: the module can respond quickly when the service core exhibits errors or abnormal behavior, which prevents faults from spreading and improves the reliability and stability of the BMC multi-core heterogeneous system.
After the service core is powered down and powered up again and returns to the normal running state, the service core accesses the first log backed up in the nonvolatile storage medium and restores it to the memory. The service core, the memory, and the nonvolatile storage medium run in the same software system. First, they may share the same memory address space, data format, and communication protocol, which reduces the conversion and formatting operations required for data transmission and thereby reduces the complexity of communication. Second, data may be passed directly between these components without going through additional software layers (e.g., a network protocol stack), which reduces the depth and complexity of the software stack and lowers communication latency.
Based on the structural framework of the BMC multi-core heterogeneous system, the log storage method is described by taking the security core in the BMC multi-core heterogeneous system as an execution main body, referring to fig. 5, fig. 5 is a flowchart of a first log storage method provided by an embodiment of the present application, where the method includes:
S51, the security core acquires the running state of the service core in real time.
After the security core is started, it monitors the running state of the service core in real time. These states may include normal running states and abnormal running states. When the service core is in an abnormal running state, the security core can detect the state change in time. This may be accomplished by an exception signal received by the security core, a timer interrupt, or another mechanism.
S52, under the condition that the service core is in an abnormal operation state, the security core writes a first log in the memory into the nonvolatile storage medium; and under the condition that the service core is restored to the normal running state, the service core writes a first log in the nonvolatile storage medium into the memory.
The first log is generated by the service core in a normal running state and is written into the memory.
When the security core detects that the service core is in an abnormal state, the security core can write the first log in the memory into the nonvolatile storage medium, which ensures that the first log is not lost after the BMC is powered off or is powered down and powered up again to restart.
In an actual application scenario, after the service core enters an abnormal state, the service core may be reset so that it returns to the normal running state. The service core then checks whether a complete first log is present in the nonvolatile storage medium. If the first log exists, the service core reads it from the nonvolatile storage medium and restores it to the memory for subsequent analysis, diagnosis, or re-execution of the relevant operations.
Therefore, the security core mainly bears security management and monitoring tasks of the BMC multi-core heterogeneous system. The method can be operated independently of the service core, and ensures that the safety management function of the BMC multi-core heterogeneous system is still available when the service core is in an abnormal operation state. And the nonvolatile storage medium has the ability to persist data without the first log being lost even in the event that the service core needs to reset the running state.
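The two branches of S52 can be sketched as follows. This is a minimal illustration, assuming the memory and the nonvolatile storage medium can be modeled as two Python lists; the names `LogStore`, `security_core_step`, and `service_core_recover` are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class LogStore:
    memory: list = field(default_factory=list)        # volatile copy of the first log
    nonvolatile: list = field(default_factory=list)   # persistent backup


def security_core_step(store, service_core_abnormal):
    """S52, first branch: dump the in-memory log when the service core is abnormal."""
    if service_core_abnormal:
        store.nonvolatile = list(store.memory)
        return True
    return False


def service_core_recover(store):
    """S52, second branch: after recovery, copy the backup back into memory."""
    if store.nonvolatile:
        store.memory = list(store.nonvolatile)
        return True
    return False


store = LogStore(memory=["boot ok", "fan speed read", "task stalled"])
dumped = security_core_step(store, service_core_abnormal=True)
store.memory.clear()   # simulate losing volatile memory across a power cycle
restored = service_core_recover(store)
```

The dump is performed by the security core and the restore by the service core, mirroring the division of roles in S52.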
Based on the structural framework of the BMC multi-core heterogeneous system and interactions between the components, a log storage method when an abnormality occurs in a service core is described, and fig. 6 is a flowchart of a second log storage method provided by an embodiment of the present application, where the method includes:
S61, when the service core is in an abnormal running state, the security core and the service core receive an interrupt signal sent by the abnormal interrupt triggering module.
The BMC is responsible for monitoring and managing various resources and functions of the server, and can collect various state information of the server through a hardware interface, such as CPU utilization rate, memory occupation, hard disk space, temperature, fan speed and the like. The state information may be accessed and used by the service core for further service logic processing, such as monitoring server state, executing management commands, etc. In the process, the service core can continuously generate logs, and the logs are mainly used for recording the process, state and abnormal information of the service core for executing service operation and are used for subsequent service analysis, problem investigation and performance optimization. By way of example, the log content may include various operations of the business core during execution, such as interface calls, request responses, data processing, and the like. These records provide rich contextual information for subsequent anomaly cause localization.
In an actual application scenario, after the service core is started, various state information of the server can be accessed in real time through IPMI or other protocols, and the state information is analyzed based on a first software system operated by the service core.
The first software system may be a Windows system, a Linux system, or the like. The first software system running on the service core may be responsible for managing various resources of the BMC service core, including but not limited to processors, memory, and network interfaces. It can allocate and reclaim resources according to the needs of the BMC so that the BMC operates efficiently. For example, when the service core needs to perform a task, the first software system may allocate enough processor time and memory resources to it; when the task is completed, the first software system may reclaim these resources for use by other tasks.
And the first software system can ensure that the BMC can respond to various events of the server in real time, such as hardware faults, insufficient memory and the like through task scheduling, and timely take corresponding processing measures. For different types of state information, the first software system can provide corresponding monitoring tasks for the service cores. The monitoring tasks can correspond to different execution programs, and the service core can monitor the change of the state information in real time when executing the monitoring tasks, and alarm and generate logs according to a preset threshold value.
Thus, the first software system running on the service core is an important guarantee of stable operation and real-time management of the server. By efficiently managing resources, scheduling tasks, handling interrupts, and implementing inter-process communication, it ensures that the BMC can monitor and manage the server in real time.
The memory is a volatile medium, such as double data rate (DDR) memory. The log data in the memory grows over time, and the first log refers to the log data stored in the memory. The first log may contain detailed information about an event, such as a time stamp, an event type, and related parameters. The specific events may be preset according to actual requirements, such as a server start or a user request. In this way, specific server events are converted into traceable, analyzable log data, which provides a basis for finding and solving problems in time.
The memory is a storage medium with very high reading and writing speed in the electronic system, and the first log is written into the memory to ensure the timely storage of the log. Meanwhile, the access speed of the memory is far higher than that of other storage media such as a hard disk, so that the delay caused by waiting for storage can be reduced by writing the first log into the memory, and the service core can be ensured to run continuously and efficiently. Writing the first log into memory may also facilitate subsequent log processing and analysis. Once the logs are written into the memory, the first logs can be quickly read out when needed, and filtering, analyzing, aggregating and the like are performed to extract useful information. This is very helpful in troubleshooting, performance optimization, system improvement, and the like.
The service core is responsible for monitoring and managing the various resources and functions of the server. During normal operation, the service core periodically sends a "feed dog" signal to the watchdog (the abnormal interrupt triggering module) to reset the watchdog timer. When the service core encounters an exception while executing a monitoring task, for example when it falls into an endless loop or cannot work properly for some other reason, it fails to send the "feed dog" signal to the watchdog on time. This means that the current process or task on the service core was not completed within the predetermined time, which is a watchdog timeout. In this case an interrupt signal, the "watchdog timeout interrupt", is triggered, and the abnormal interrupt triggering module sends it to both the security core and the service core.
In an actual application scenario, the service core watchdog timeout exception may be triggered by the following cases:
Application or service suspension: an application running on the service core stops responding for some reason and fails to complete its periodic tasks or the sending of heartbeat signals within a specified time, resulting in a watchdog timeout.
Hardware failure: failure of the CPU, memory, or I/O subsystem prevents the service core from executing tasks normally, resulting in a timeout.
Resource competition: resource contention, such as insufficient memory, cache locking, or I/O bottlenecks, prevents the service core from completing the task currently being performed in a timely manner.
Software errors: programming errors, deadlocks, or infinite loops cause processes on the service core to fail to complete on time, causing a timeout.
Overclocking problem: if the service core is overclocked, overheating or stability problems may cause a watchdog timeout.
Power supply problem: an unstable power supply may affect the clock frequency of the CPU, resulting in an extended task completion time.
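The "feed dog" mechanism and the resulting timeout interrupt can be sketched with a toy model. This is an illustration only: the tick-based timer, the `Watchdog` class, and the listener callbacks standing in for the two cores are assumptions, not the hardware design of the embodiment.

```python
class Watchdog:
    """Toy watchdog: counts ticks since the last feed and fans out a timeout interrupt."""

    def __init__(self, timeout_ticks, listeners):
        self.timeout_ticks = timeout_ticks
        self.listeners = listeners   # callables standing in for the two cores
        self.elapsed = 0
        self.fired = False

    def feed(self):
        """Called periodically by a healthy service core ("feeding the dog")."""
        self.elapsed = 0

    def tick(self):
        """Advance time by one tick; on timeout, notify every listener once."""
        self.elapsed += 1
        if self.elapsed >= self.timeout_ticks and not self.fired:
            self.fired = True
            for notify in self.listeners:
                notify("watchdog_timeout")


received = {"security_core": [], "service_core": []}
dog = Watchdog(
    timeout_ticks=3,
    listeners=[received["security_core"].append, received["service_core"].append],
)

for _ in range(2):   # healthy phase: the dog is fed before the timeout elapses
    dog.tick()
    dog.feed()
for _ in range(3):   # hung phase: no feed, so the timer runs out
    dog.tick()
```

Any of the causes listed above would show up in this model simply as a missing `feed()` call; the watchdog itself does not need to know which cause occurred.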
S62, responding to the interrupt signal, and writing the first log in the memory into the nonvolatile storage medium by the security core.
After the security core is started, the running state of the service core is monitored in real time. These states may include normal operating states (e.g., executing applications, processing data, etc.) and abnormal operating states (e.g., crashing, trapping into endless loops, etc.). When the service core is in an abnormal operation state, the security core can timely detect the state change of the service core. This may be accomplished by an exception signal, timer interrupt, or other mechanism. For example, the abnormal running state of the service core may be notified to the security core through the watchdog.
Upon receiving the interrupt signal, the security core acts immediately. First, it writes the first log in the memory into the shared nonvolatile storage medium area, so that the important first log is not lost in the event of a power failure or restart. Then, the security core performs a reset operation on the service core, which is an attempt to restore the normal running state of the service core. The reset can clear the error state or abnormality in the service core so that the service core can restart and return to the normal running state. This is an attempt to quickly recover the BMC multi-core heterogeneous system that contains the service core and to reduce the downtime caused by the service core abnormality.
S63, resetting the service core in response to the first log in the memory being completely written into the nonvolatile storage medium.
After the first log is written into the nonvolatile storage medium, the security core can check the status of the write operation and compare the data to confirm that the first log in the memory has been completely dumped to the nonvolatile storage medium. The service core is reset once the first log in the memory has been completely written into the nonvolatile storage medium.
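The ordering constraint between S62 and S63 (dump first, confirm, then reset) can be sketched as follows. The source does not specify how completeness is confirmed; the length and SHA-256 comparison used here is an assumption, and the names `dump_and_confirm` and `handle_interrupt` are hypothetical.

```python
import hashlib


def digest(entries):
    """SHA-256 over the newline-joined log entries."""
    return hashlib.sha256("\n".join(entries).encode("utf-8")).hexdigest()


def dump_and_confirm(memory_log, nonvolatile):
    """Copy the log, then confirm that length and digest match the in-memory copy."""
    nonvolatile.clear()
    nonvolatile.extend(memory_log)
    return len(nonvolatile) == len(memory_log) and digest(nonvolatile) == digest(memory_log)


def handle_interrupt(memory_log, nonvolatile, reset_service_core):
    """S62 then S63: reset only after the dump is confirmed complete."""
    if dump_and_confirm(memory_log, nonvolatile):
        reset_service_core()
        return True
    return False


resets = []
backup = []
ok = handle_interrupt(["e1", "e2"], backup, lambda: resets.append("reset"))
```

Passing the reset as a callback keeps the sketch independent of how the reset is actually issued, which the reset modes described in this section vary.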
Before resetting the service core, the security core can communicate with the service core, so that the service core can complete necessary cleaning work and preparation work, such as closing an opened file, releasing occupied resources and the like, before resetting.
Wherein, the resetting the service core may include the following resetting methods:
In a first exemplary reset mode, the secure core may initiate a reset operation to the service core after the secure core confirms that the first log in the memory has been completely written to the nonvolatile storage medium. The reset operation means resetting the operation state of the service core, and restoring the service core in the abnormal operation state to the initial state or the preset safety state so as to clear possible error states or potential safety risks. After the reset operation of the service core is completed, the security core needs to monitor the state of the service core continuously. And if the service core is restored to the normal running state, dumping the first log in the nonvolatile storage medium. Namely, under the condition that the service core is reset by the security core and returns to the normal running state, the service core writes the first log in the nonvolatile storage medium into the memory.
In a second exemplary reset mode, the security core applies two measures in order of priority: the reset operation first, and then restarting the service core by powering down and powering up again. First, the security core performs the reset operation on the service core. If the service core still cannot return to the normal running state after the reset operation is finished, the service core needs to be powered down and powered up again to restart it. When the service core returns to the normal running state, the first log in the nonvolatile storage medium is dumped. That is, when the reset operation does not restore the service core to the normal running state and the service core is subsequently restored by powering down and powering up again, the service core writes the first log in the nonvolatile storage medium into the memory.
In a third exemplary reset mode, after the security core confirms that the first log in the memory has been completely written into the nonvolatile storage medium, the security core may directly attempt to restore the service core to the normal running state by powering down and powering up again, and the first log in the nonvolatile storage medium is dumped when the service core returns to the normal running state. That is, when the service core is restored to the normal running state by powering down and powering up again, the service core writes the first log in the nonvolatile storage medium into the memory.
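The escalation used by the second exemplary reset mode can be sketched as a short function. This is only an illustration: the callbacks `try_reset`, `try_power_cycle`, and `is_running_normally` are hypothetical placeholders for platform-specific operations that the source does not detail.

```python
def recover_service_core(try_reset, try_power_cycle, is_running_normally):
    """Return the name of the step that restored the core, or None if neither did."""
    try_reset()                      # first priority: reset operation
    if is_running_normally():
        return "reset"
    try_power_cycle()                # fallback: power down and power up again
    if is_running_normally():
        return "power_cycle"
    return None


# Hypothetical core that only recovers after a power cycle.
state = {"healthy": False, "actions": []}


def reset():
    state["actions"].append("reset")


def power_cycle():
    state["actions"].append("power_cycle")
    state["healthy"] = True


restored_by = recover_service_core(reset, power_cycle, lambda: state["healthy"])
```

The first mode corresponds to stopping after the reset step, and the third mode to skipping it and going straight to the power cycle.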
S64, when the service core determines that a power-on flag exists in the BMC and that the nonvolatile storage medium holds a first log, the service core writes the first log in the nonvolatile storage medium into the memory.
The power-on flag is a flag bit existing in a BMC register and is used for judging whether the BMC finishes the actions of powering down and powering up again. When the BMC is powered up, the value in the register may be automatically cleared by the hardware to a default value. This default value may be 0 or some particular non-zero value. The value of the register is lost only when the BMC is powered down, so for the second exemplary reset mode or the third exemplary reset mode, the service core may further determine whether the current BMC has a power-on flag by reading the flag bit in the register.
Illustratively, when the BMC is powered down and powered up again, the service core reads the value of the register. If the value read is the default value, the service core can determine that the BMC has just been powered up anew, and that the register currently holds the power-on flag. If the value read is not the default value, the service core can determine that the BMC was already powered up and has not completed the power-down and power-up sequence. In this case, the administrator may power the service core down and up again.
Powering down and powering up again is a fault removal means, which can clear the memory state of the BMC, reset all hardware components to the initial state, and try to restore the normal running state of the service core, so as to solve some problems caused by configuration errors or software conflicts. This operation can solve temporary faults or software problems, and restore the service core to normal operation.
From the hardware level, the BMC multi-core heterogeneous system is powered down and powered up again, and the BMC multi-core heterogeneous system can be realized through a power supply module of the BMC. The BMC is a management subsystem independent of the server's host computing node for deploying, diagnosing and managing the server, and may have an independent power module for controlling the BMC's power state. This means that the power up and power down operation of the BMC does not directly affect the power state of the server host computing node. In an actual application scene, the BMC is connected with the power supply unit of the BMC, and the BMC can be powered down by disconnecting the BMC from the power supply unit of the BMC. The BMC can be powered on by connecting the BMC with a power supply unit of the BMC.
In another alternative implementation, the entire server may be powered down and powered up again to reset the running state of the service core in the BMC. The server may be equipped with its own power switch, which allows an administrator to control the power state of the server directly. When an administrator operates the power switch of the server, the administrator is in effect controlling whether the power supply module of the server supplies power to the server. The administrator may turn the power to the server on or off by pressing or toggling this switch.
While the power-down and power-up operation is carried out, the service core determines whether a power-on flag exists in the current BMC multi-core heterogeneous system and checks whether a dumped first log exists in the nonvolatile storage medium. If both conditions are met, the service core restores the first log in the nonvolatile storage medium to the memory and clears the first log in the nonvolatile storage medium and the power-on flag, so that a stale first log does not cause confusion or errors the next time the BMC multi-core heterogeneous system starts.
Therefore, for the scenario in which the service core needs to be powered down, the cooperation of the service core and the security core, together with mechanisms such as the watchdog timeout interrupt, the nonvolatile storage medium, and the flag check, ensures safe storage and recovery of the first log, which facilitates subsequent localization of hard-to-diagnose problems and system maintenance.
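The flag check, restore, and clean-up in S64 can be sketched as follows. The register is modeled as a dictionary, and the constants `DEFAULT_REGISTER_VALUE` and `BOOT_SEEN_VALUE` are assumptions chosen for illustration (the source only says the default may be 0 or a particular non-zero value); how the flag is cleared in real hardware is not specified.

```python
DEFAULT_REGISTER_VALUE = 0   # assumed hardware default after a power cycle
BOOT_SEEN_VALUE = 1          # assumed value written once the flag has been consumed


def restore_after_power_up(register, nonvolatile, memory):
    """S64: restore the log only if the power-on flag is set and a dumped log exists."""
    power_on_flag = register["value"] == DEFAULT_REGISTER_VALUE
    if not (power_on_flag and nonvolatile):
        return False
    memory[:] = nonvolatile                # restore the first log to memory
    nonvolatile.clear()                    # clear the backup copy
    register["value"] = BOOT_SEEN_VALUE    # clear the power-on flag
    return True


register = {"value": DEFAULT_REGISTER_VALUE}
nonvolatile = ["e1", "e2"]
memory = []
first = restore_after_power_up(register, nonvolatile, memory)
second = restore_after_power_up(register, nonvolatile, memory)   # flag already cleared
```

The second call returns `False` because both the flag and the backup have been cleared, which is the behavior intended to keep a stale log from being restored on a later start.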
In an alternative implementation, the business core may also perform some additional operations to ensure the stability and security of the BMC. As shown in fig. 7, fig. 7 is a flowchart of a third log storage method according to an embodiment of the present application, and after step S64, the method further includes:
S65, when the first log in the nonvolatile storage medium is completely written into the memory, the service core clears the first log in the nonvolatile storage medium.
Old first logs in the nonvolatile storage medium that are no longer needed can be purged by the service core to free up storage space and avoid data confusion. In addition, the service core can record the reset event so as to maintain and check the problem of the BMC later.
In an optional implementation manner, after resetting the service core by powering down and powering up again in S63, when the service core is restored to a normal running state, the service core may verify whether the first log in the nonvolatile storage medium is available, as shown in fig. 8, fig. 8 is a flowchart of a fourth log storage method provided by an embodiment of the present application, and after step S63, further includes:
S66, when the service core is restored to the normal running state by powering down and powering up again and the nonvolatile storage medium holds a first log, the service core checks the first log.
When the service core has been restored to the normal running state by powering down and powering up again and a first log is present in the nonvolatile storage medium, the service core may perform at least one of the following steps to check whether the first log in the memory or the nonvolatile storage medium is usable:
Step A1, detecting whether the first log accords with a preset integrity condition.
The preset integrity condition is a limiting condition for verifying whether the first log is complete, and the limiting condition can be obtained according to the hash value, the log length, the log size, the log format and the like. In an actual application scenario, the service core reads a first log in the nonvolatile storage medium and verifies the integrity of the first log according to a preset integrity condition, and the process may include at least one of the following steps:
The hash value of the first log is checked to ensure that the data has not been tampered with or corrupted during transmission or storage. It is checked whether the length or size of the first log meets expectations to prevent data truncation or deletion. The format and structure of the first log are checked for correctness to ensure that they can be properly parsed and analyzed.
Step A2, detecting whether the first log accords with a preset validity condition.
The preset validity condition is a limiting condition for verifying whether the first log is valid, the limiting condition can be obtained according to a time stamp, a serial number, a generating sequence, a generating time and the like, and in an actual application scene, the service core detects the validity of the first log based on the preset validity condition and can comprise at least one of the following steps:
The time stamps or sequence numbers in the first log are checked to confirm whether the entries were generated in the correct order and at the correct times. The content of the first log is analyzed to find out whether it contains abnormality, error, or warning information.
Based on the integrity and validity checks described above, the security core may determine the reliability of the first log currently in the memory or the nonvolatile storage medium. If the first log is complete and valid, the security core may consider it usable for locating the cause of the system abnormality.
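Steps A1 and A2 can be sketched with a hash and length check for integrity and a time stamp ordering check for validity. This is an illustration under assumptions: the entry layout `(timestamp, message)`, the metadata stored at dump time, and the function names are hypothetical, and SHA-256 is used only as one possible hash.

```python
import hashlib


def payload_digest(entries):
    """SHA-256 over a canonical text form of the (timestamp, message) entries."""
    text = "\n".join(f"{ts}|{msg}" for ts, msg in entries)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_complete(entries, expected_count, expected_digest):
    """Step A1: length check and hash check."""
    return len(entries) == expected_count and payload_digest(entries) == expected_digest


def is_valid(entries):
    """Step A2: time stamps must be non-decreasing."""
    stamps = [ts for ts, _ in entries]
    return all(a <= b for a, b in zip(stamps, stamps[1:]))


log = [(100, "boot"), (105, "sensor read"), (109, "task stalled")]
meta = {"count": len(log), "digest": payload_digest(log)}   # assumed to be stored at dump time

good = is_complete(log, meta["count"], meta["digest"]) and is_valid(log)
bad_complete = is_complete(log[:2], meta["count"], meta["digest"])   # truncated log
bad_valid = is_valid([log[1], log[0], log[2]])                       # out-of-order log
```

A truncated log fails the integrity check and a reordered log fails the validity check, which correspond to the abnormal detection results that lead to the alarm step.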
Step S67 is executed if the detection result is abnormal, and step S68 is executed if the detection result is not abnormal.
S67, under the condition that the detection result is abnormal, the service core generates alarm information.
The alarm information informs the administrator that the first log in the nonvolatile storage medium is currently unavailable, which avoids misdiagnosis based on an unreliable first log.
It should be noted that an abnormal detection result indicates that the data in the first log is unavailable. In an actual application scenario, step A1 and step A2 may be performed sequentially or simultaneously.
S68, under the condition that the detection result is not abnormal, the service core writes the first log in the nonvolatile storage medium into the memory.
A detection result with no abnormality indicates that the data in the first log is available, so the service core can read the first log in the nonvolatile storage medium and write it completely into the memory.
On the one hand, after the service core becomes abnormal, the first log is restored from the nonvolatile storage medium to the memory. Reading the first log can help the service core quickly return to the running state it had before the failure, and the service core can perform fault diagnosis directly on the first log in the memory, which avoids the delay of repeatedly reading the log from the nonvolatile storage medium.
On the other hand, write and read operations of non-volatile storage media are typically much slower than memory operations, and frequent disk I/O operations can increase the load and power consumption of the system. The access speed of the memory is far higher than that of nonvolatile storage media such as a hard disk, a flash memory and the like. The service core restores the first log in the nonvolatile storage medium to the memory because the log data is required to be read subsequently to analyze the abnormal reasons, so that the access times to the nonvolatile storage medium can be reduced, and the delay of data access can be obviously reduced.
S69, when the first log in the nonvolatile storage medium is completely written into the memory, the service core clears the first log in the nonvolatile storage medium.
When the first log in the nonvolatile storage medium has been completely written into the memory, the first log in the nonvolatile storage medium is identical to the first log in the memory. To release storage space in the nonvolatile storage medium, only the first log in the memory may be retained for subsequent fault cause analysis.
In an alternative implementation manner, when the service core is reset in the first exemplary reset manner, that is, the service core performs the reset operation through the security core, the service core is restored to the normal running state after the security core performs the reset operation, at this time, power-down and power-up are not required, and the first log in the memory is not lost, that is, the logs exist in the memory and the nonvolatile storage medium area. In this case, the first log in the memory or nonvolatile storage medium may be directly used for abnormality cause localization. Referring to fig. 9, fig. 9 is a flowchart of a fifth log storage method according to an embodiment of the present application, and after steps S61 to S63, the method further includes:
And S70, after the service core is reset to a normal running state by the security core, the service core clears the first log in the nonvolatile storage medium.
In other words, in the case that the service core is reset to the normal running state by the security core by adopting the first exemplary reset mode, the first log in the memory is not lost, that is, the logs exist in the memory and the nonvolatile storage medium area. In this case only the first log in memory may be reserved for subsequent failure cause analysis. The first log in the non-volatile storage medium is purged to ensure that the non-volatile storage medium has sufficient space available.
In another optional implementation, when the service core is restored to the normal running state after the security core performs the reset operation, the log exists in both the memory and the nonvolatile storage medium. To ensure the accuracy and security of the log, the service core may compare the log in the nonvolatile storage medium with the log in the memory; when the two are inconsistent, the log in the nonvolatile storage medium prevails. As shown in fig. 10, fig. 10 is a flowchart of a sixth log storage method according to an embodiment of the present application. After steps S61 to S63, the method further includes:
S81, in response to the security core successfully resetting the service core, the service core compares the first log in the nonvolatile storage medium with the log in the memory.
Prior to the comparison, the service core may preprocess the first log read from the two different sources. The preprocessing may include data format conversion, timestamp alignment, data cleansing, and the like, to ensure that the two data sets can be compared on the same basis.
The service core compares the logs in the nonvolatile storage medium and in the memory entry by entry. This typically involves comparing key information in each log entry, such as the timestamp, event description, and parameter values. During the comparison, the security core detects first log entries that are inconsistent or in conflict. Such inconsistencies may appear as differing timestamps, differing event descriptions, or changed parameter values.
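The preprocessing and entry-by-entry comparison described above might look like the following sketch. The field names `timestamp`, `event`, and `params`, the normalization rules, and the function names are all assumptions made for illustration; the embodiment does not specify a log format.

```python
from typing import Dict, List, Optional, Tuple

LogEntry = Dict[str, str]


def normalize(entry: LogEntry) -> Tuple[str, str, str]:
    """Preprocess one entry into a canonical form before comparison."""
    return (
        entry["timestamp"].strip(),
        entry["event"].strip().lower(),
        entry.get("params", "").strip(),
    )


def diff_logs(nvm_log: List[LogEntry], mem_log: List[LogEntry]) -> List[int]:
    """Return the indices at which the two logs differ."""
    mismatches: List[int] = []
    for i in range(max(len(nvm_log), len(mem_log))):
        a: Optional[Tuple[str, str, str]] = (
            normalize(nvm_log[i]) if i < len(nvm_log) else None
        )
        b: Optional[Tuple[str, str, str]] = (
            normalize(mem_log[i]) if i < len(mem_log) else None
        )
        if a != b:  # a missing entry on either side also counts as a mismatch
            mismatches.append(i)
    return mismatches


nvm_log = [
    {"timestamp": "100", "event": "Fan OK", "params": "rpm=4000"},
    {"timestamp": "101", "event": "Temp High", "params": "c=90"},
]
mem_log = [
    {"timestamp": "100 ", "event": "fan ok", "params": "rpm=4000"},
    {"timestamp": "101", "event": "Temp High", "params": "c=85"},
]
mismatched = diff_logs(nvm_log, mem_log)
```

Here the first pair differs only in whitespace and letter case, which normalization removes, while the second pair differs in a parameter value and is reported as inconsistent.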
When the comparison result represents that the first log in the nonvolatile storage medium is consistent with the log in the memory, executing step S82; when the comparison result indicates that the first log in the nonvolatile storage medium is inconsistent with the log in the memory, step S83 is executed.
S82, when the comparison result shows that the first log in the nonvolatile storage medium is consistent with the log in the memory, the service core clears the first log in the nonvolatile storage medium.
The first log in the nonvolatile storage medium is cleared, and only the first log in the memory is retained for subsequent fault cause analysis, to ensure that the nonvolatile storage medium has enough available space.
S83, when the comparison result shows that the first log in the nonvolatile storage medium is inconsistent with the log in the memory, the service core clears the log in the memory, and writes the first log in the nonvolatile storage medium into the memory.
If the first log is confirmed to be inconsistent, the service core treats the log in the nonvolatile storage medium as authoritative: it clears the log in the memory and writes the log from the nonvolatile storage medium into the memory for use in locating the cause of the abnormality.
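The branch between steps S82 and S83 can be sketched as below. This is an illustrative model under stated assumptions: logs are plain lists of strings, equality of the lists stands in for the comparison of S81, and the function name `reconcile` is hypothetical.

```python
from typing import List


def reconcile(nvm_log: List[str], mem_log: List[str]) -> str:
    """Apply S82 or S83 in place and report which step ran."""
    if nvm_log == mem_log:
        nvm_log.clear()       # S82: copies agree, release nonvolatile space
        return "S82"
    mem_log[:] = nvm_log      # S83: clear memory log, nonvolatile copy prevails
    return "S83"


# Consistent case: the nonvolatile copy is cleared.
nvm_a, mem_a = ["e1", "e2"], ["e1", "e2"]
step_a = reconcile(nvm_a, mem_a)

# Inconsistent case: the memory copy is replaced by the nonvolatile copy.
nvm_b, mem_b = ["e1", "e2"], ["e1", "eX"]
step_b = reconcile(nvm_b, mem_b)
```

In the inconsistent case the sketch leaves the nonvolatile copy in place, since step S83 as described only replaces the log in the memory.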
In an optional implementation, the service core may trigger an alert mechanism immediately upon detecting that the log in the nonvolatile storage medium is inconsistent with the log in the memory. Alert information may be sent to system administrators or a designated security team to inform them that the dumped log is inconsistent. The alert information may include details of the inconsistency, its location, and the relevant timestamps so that an administrator can quickly locate and resolve the problem.
Meanwhile, the service core may record the inconsistency event. The service core stores the inconsistent log from the nonvolatile storage medium, the comparison result, and the information related to the triggered alert in the memory, so that a system administrator or security team can later review and analyze them in detail to understand the cause and effect of the inconsistency.
In an optional implementation, the service core may further verify the comparison between the first log in the nonvolatile storage medium and the log in the memory. That is, without erasing the log in the memory, the service core writes the first log from the nonvolatile storage medium into the memory, places the logs from the two storage media in different storage areas, and compares them again to rule out possible read errors or data corruption. If the verification still shows an inconsistency, the service core can analyze the inconsistent part in detail.
The service core can examine key information, such as the timestamp, event description, and parameter values, of each inconsistent log entry. By comparing the logs from the two storage media, the service core can make an initial determination of the nature and possible cause of the inconsistency, for example a data transmission error, a storage medium failure, or malicious tampering.
To verify the cause of the inconsistency, the service core may use preset technical means. For example, the service core may use a hash algorithm to verify the log written in the nonvolatile storage medium and confirm whether the data was tampered with during transmission or storage. In addition, the security core may use digital signature techniques to verify the integrity and authenticity of the data.
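A hash-based check of the dumped log could be sketched as follows. The embodiment does not name a specific hash algorithm or say where the digest is kept; SHA-256, the idea of storing the digest alongside the dump, and the function names are assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Tuple


def digest(log_bytes: bytes) -> str:
    """Compute a SHA-256 digest of the serialized log (assumed algorithm)."""
    return hashlib.sha256(log_bytes).hexdigest()


def dump_with_digest(log_bytes: bytes) -> Tuple[bytes, str]:
    """Model the dump: keep the log together with a digest taken at dump time."""
    return log_bytes, digest(log_bytes)


def is_intact(stored: bytes, stored_digest: str) -> bool:
    """Recompute the digest on read-back and compare in constant time."""
    return hmac.compare_digest(digest(stored), stored_digest)


payload, recorded = dump_with_digest(b"t=3 watchdog timeout\n")
ok = is_intact(payload, recorded)
tampered_ok = is_intact(payload + b"t=4 forged entry\n", recorded)
```

A plain digest detects accidental corruption, but an attacker who can rewrite the log can also rewrite the digest; that is why the text additionally mentions digital signatures for authenticity.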
Besides technical means, the service core can also combine the system's context information and service logic for a comprehensive analysis. For example, it may examine other relevant logs of the service core over the same period to learn the running state of the BMC and the sequence of events, and thereby infer the likely source and effect of the inconsistent logs.
Based on the results of the above analysis and verification, the security core can formulate corresponding handling measures. If the inconsistency is confirmed to be caused by a system fault or a data transmission error, corresponding repair measures can be taken; if malicious tampering or an attack is suspected, corresponding security measures can be taken to protect the system.
Referring to fig. 11, fig. 11 is a flowchart of a seventh log storage method according to an embodiment of the present application. The parts of the figure are described below:
The service core is mainly responsible for the management functions of the BMC and for interaction with the server hardware, such as monitoring server status and executing management commands. In step "1. The service core operates normally and generates logs" in the figure, the service core executes tasks and, while running, continuously generates logs and writes them into the memory. Illustratively, the memory may be a DDR (Double Data Rate) memory, a high-speed dynamic random access memory that stores data generated during system operation, including the logs generated by the service core. In the figure, the DDR serves as the temporary storage medium for the logs generated by the service core.
The service core watchdog is the watchdog that monitors the working condition of the service core. If the service core does not send a signal to the service core watchdog within a specified time, the watchdog triggers a timeout interrupt and sends interrupt signals to both the security core and the service core, namely step "2. Watchdog timeout interrupt" in the figure. In practice, a service core in an abnormal operation state may mask the interrupt signal.
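The watchdog behavior described above can be modelled with a short sketch. The class name, the explicit `now` argument used in place of a hardware timer, and the target names in the interrupt list are assumptions for illustration; a real watchdog is a hardware peripheral.

```python
from typing import List


class ServiceCoreWatchdog:
    """Toy model of the watchdog that monitors the service core."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_feed = 0.0
        self.interrupts: List[str] = []   # cores notified on timeout

    def feed(self, now: float) -> None:
        """The service core signals that it is still alive."""
        self.last_feed = now

    def poll(self, now: float) -> bool:
        """Raise the timeout interrupt if no signal arrived in time."""
        if now - self.last_feed > self.timeout_s:
            # The interrupt goes to both cores; an abnormal service core
            # may mask it, so the security core is the one that acts.
            self.interrupts = ["security_core", "service_core"]
            return True
        return False


wd = ServiceCoreWatchdog(timeout_s=5.0)
wd.feed(now=0.0)
early = wd.poll(now=4.0)   # still within the window
late = wd.poll(now=6.0)    # the service core stopped feeding
```

Passing time in explicitly keeps the sketch deterministic; the second poll reports a timeout because more than five seconds have passed since the last feed.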
After the service core watchdog times out and sends an interrupt signal to the security core, the security core responds to the interrupt signal by dumping the first log in the DDR, as in step "3. Write the first log in the memory into the nonvolatile storage medium" in the figure. Writing the first log from the volatile storage medium into the nonvolatile storage medium ensures that the data persists. The nonvolatile storage medium may be an eMMC.
After the dump is completed, the security core may attempt to reset the service core. When the reset is unsuccessful, step "4. After the service core reset fails, the service core is started by powering down and powering up" in the figure is executed; that is, the BMC must be powered down and powered up again so that the service core restarts and returns to a normal running state. The service core then executes step "5. Determine that the BMC has a power-on flag and that the first log is in the nonvolatile storage medium, and write the first log in the nonvolatile storage medium into the memory" in the figure: on determining that the power-on flag exists in the BMC and that the first log is present in the nonvolatile storage medium, the service core writes the first log from the nonvolatile storage medium into the memory.
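Steps 3 to 5 of the figure can be tied together in a small simulation. This is a sketch under stated assumptions, not the embodiment's implementation: the `Bmc` class, its method names, and the `reset_succeeds` switch that simulates the reset outcome are hypothetical, and the DDR and eMMC are modelled as lists.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bmc:
    ddr: List[str] = field(default_factory=list)    # volatile memory
    emmc: List[str] = field(default_factory=list)   # nonvolatile medium
    power_on_flag: bool = False
    reset_succeeds: bool = False                     # simulated reset outcome

    def on_watchdog_interrupt(self) -> None:
        """Step 3: the security core dumps the first log from DDR to eMMC."""
        self.emmc = list(self.ddr)

    def try_reset(self) -> bool:
        """The security core attempts to reset the service core."""
        return self.reset_succeeds

    def power_cycle(self) -> None:
        """Step 4: power down and up again; DDR contents are lost."""
        self.ddr = []
        self.power_on_flag = True

    def on_service_core_start(self) -> None:
        """Step 5: restore the log if the flag and a dumped log both exist."""
        if self.power_on_flag and self.emmc:
            self.ddr = list(self.emmc)


def recover(bmc: Bmc) -> None:
    bmc.on_watchdog_interrupt()
    if not bmc.try_reset():
        bmc.power_cycle()
        bmc.on_service_core_start()


bmc = Bmc(ddr=["t=1 boot", "t=2 hang detected"])
recover(bmc)
```

Because the simulated reset fails, the DDR is emptied by the power cycle and then repopulated from the eMMC copy, which is the case the dump is designed to cover.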
Therefore, the embodiment of the application has the following beneficial effects:
In a BMC multi-core heterogeneous system, the security core is responsible for saving the first log in the service core's memory to the nonvolatile storage medium when an abnormal interrupt occurs. Even if the service core fails to reset and the system is powered down or restarted, the first log is not lost. After the system restarts, the first log can be recovered from the nonvolatile storage medium and loaded into the memory. The cause of the abnormal interrupt event can then be found by analyzing the recovered first log, and corresponding handling measures can be taken. This approach helps improve system stability and reliability and reduces the risk of data loss due to system failure.
On the basis of the method described in the above embodiment, correspondingly, the application further provides a computer readable storage medium, which stores a computer program, which when run on a processor causes the processor to execute the method in the above embodiment.
Based on the method in the above embodiments, an embodiment of the present application provides a computer program product, characterized in that the computer program product, when run on a processor, causes the processor to perform the method in the above embodiments.
Based on the method in the above embodiment, the embodiment of the application provides a computing device, which includes a motherboard and a chip. The chip is integrated on the main board and comprises at least one memory for storing programs; at least one processor for executing the programs stored in the memory, the processor being adapted to perform the methods of the embodiments described above when the programs stored in the memory are executed. In an embodiment of the present application, the computing device may be a network device such as a server, a host, or the like. The chip may be a BMC, a chip storing BIOS, or the like. The type of computing device and the type of chip are not limited in embodiments of the application.
It should be noted that, in other embodiments, the BMC may be referred to differently in different computing devices. For example, the BMC of an xFusion server is called BMC, the BMC of an HPE server is called iLO, and the BMC of a DELL server is called iDRAC.
Based on the method in the above embodiment, the embodiment of the present application further provides a chip. Referring to fig. 12, fig. 12 is a schematic structural diagram of a chip according to an embodiment of the application. As shown in fig. 12, the chip 1100 includes one or more processors 1101, an interface circuit 1102, and a memory 1103. The processor 1101 may include a service core and a security core. The memory 1103 may include a memory and a nonvolatile storage medium. The processor 1101, the interface circuit 1102, and the memory 1103 may be connected by a bus or by other means.
The processor 1101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 1101 or by instructions in the form of software. The processor 1101 may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods and steps disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor.
The interface circuit 1102 may be used for transmitting or receiving data, instructions or information, and the processor 1101 may process using the data, instructions or other information received by the interface circuit 1102 and may transmit process completion information through the interface circuit 1102.
The functions corresponding to the processor 1101 and the interface circuit 1102 may be implemented by a hardware design, a software design, or a combination of hardware and software, which is not limited herein.
It should be understood that, the sequence number of each step in the foregoing embodiment does not mean the execution sequence, and the execution sequence of each process should be determined by the function and the internal logic, and should not limit the implementation process of the embodiment of the present application. In addition, in some possible implementations, each step in the foregoing embodiments may be selectively performed according to practical situations, and may be partially performed or may be performed entirely, which is not limited herein.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).
In this specification, the embodiments are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the description of the method embodiments may be consulted for the relevant points. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present application without undue burden.
The foregoing description of the exemplary embodiments of the application is merely illustrative of the application and is not intended to limit the scope of the application.

Claims (10)

1. A log storage method, applied to a baseboard management controller BMC, the BMC including a service core, a security core, a memory and a nonvolatile storage medium, the method comprising:
the security core acquires the running state of the service core in real time;
under the condition that the service core is in an abnormal operation state, the security core writes a first log in the memory into the nonvolatile storage medium;
under the condition that the service core is restored to a normal running state, the service core writes a first log in the nonvolatile storage medium into the memory; the first log is generated by the service core in a normal running state and is written into the memory.
2. The method of claim 1, wherein the BMC further comprises an abnormal interrupt module, and wherein the security core acquiring the running state of the service core in real time comprises:
the security core acquires signals sent by an abnormal interrupt module in real time, and the abnormal interrupt module is used for judging whether the service core is in an abnormal operation state or not and sending interrupt signals to the security core when judging that the service core is in the abnormal operation state;
the security core writing the first log in the memory into the nonvolatile storage medium under the condition that the service core is in an abnormal operation state comprises:
under the condition that the security core receives an interrupt signal sent by the abnormal interrupt module, the security core writes the first log in the memory into the nonvolatile storage medium.
3. The method of claim 1, wherein the restoration of the service core to a normal operating state comprises:
the service core is restored to a normal running state by powering down and powering up again.
4. The method of claim 1, wherein the restoration of the service core to a normal operating state comprises:
the security core performs a reset operation on the service core, the reset operation being used to reset the running state of the service core.
5. The method according to claim 4, wherein the method further comprises:
under the condition that the reset operation of the security core does not restore the service core to the normal running state, the service core is restored to the normal running state by powering down and powering up again.
6. The method of claim 4, wherein the service core writing the first log in the non-volatile storage medium to the memory if the security core performs a reset operation to the service core and the service core returns to a normal operating state, comprising:
The service core compares a first log in a nonvolatile storage medium with a log in a memory;
When the comparison result shows that the first log in the nonvolatile storage medium is inconsistent with the log in the memory, the service core clears the log in the memory, and writes the first log in the nonvolatile storage medium into the memory.
7. The method of any of claims 1 to 5, wherein the service core writing a first log in the non-volatile storage medium to the memory comprises:
the service core detects whether the first log in the nonvolatile storage medium meets a preset validity condition and/or a preset integrity condition;
when the first log in the nonvolatile storage medium meets the preset validity condition and/or the preset integrity condition, the service core writes the first log in the nonvolatile storage medium into the memory.
8. The method of any of claims 1 to 6, wherein after the service core writes the first log in the nonvolatile storage medium to the memory, further comprising:
when the first log in the nonvolatile storage medium is completely written into the memory, the service core clears the first log in the nonvolatile storage medium.
9. A BMC, characterized in that it comprises a processor and a memory, said memory being adapted to store a computer program, said processor being adapted to execute said computer program to implement the log storage method according to any of claims 1 to 8.
10. A computing device comprising a processor, a memory, and a BMC to perform the log storage method of any of claims 1 to 8.
CN202411045056.9A, priority date 2024-07-31, filing date 2024-07-31: A log storage method, BMC and computing device. Status: Pending. Publication: CN119025030A (en)

Priority Applications (1)

Application Number: CN202411045056.9A (CN119025030A (en)); Priority Date: 2024-07-31; Filing Date: 2024-07-31; Title: A log storage method, BMC and computing device

Applications Claiming Priority (1)

Application Number: CN202411045056.9A (CN119025030A (en)); Priority Date: 2024-07-31; Filing Date: 2024-07-31; Title: A log storage method, BMC and computing device

Publications (1)

Publication Number: CN119025030A (en); Publication Date: 2024-11-26

Family

ID=93528122

Family Applications (1)

Application Number: CN202411045056.9A; Title: A log storage method, BMC and computing device; Status: Pending; Publication: CN119025030A (en); Priority Date: 2024-07-31; Filing Date: 2024-07-31

Country Status (1)

Country: CN (1); Link: CN119025030A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination