WO2011054861A1 - Monitoring and management of heterogeneous network events - Google Patents
Monitoring and management of heterogeneous network events
- Publication number
- WO2011054861A1 (PCT/EP2010/066730)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- network
- events
- fault
- correlation
- event
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0604—Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/50—Network service management, e.g. ensuring proper service fulfilment according to agreements
- H04L41/5061—Network service management, e.g. ensuring proper service fulfilment according to agreements characterised by the interaction between service providers and their network customers, e.g. customer relationship management
- H04L41/5074—Handling of user complaints or trouble tickets
Definitions
- the present invention applies to the field of telecommunications and, more specifically, to the monitoring, operation and maintenance in telecommunications networks.
- European patent EP-0686329-B1 discloses a method for event correlation occurring in a telecommunications network formed by individual network elements. At least one of the network elements is represented in a computer by means of individual program modules. Each program module has a data set related to the network elements which represents and includes the data set of the module, for example, a list of network elements having input associations with respect to the event correlation with the network element represented by the module.
- Spanish patent application ES-2216727 describes a system for automated remote access based on a centralized management system.
- the system has means of linking with transmitter with remote concentrator cards (RCC) located in the clients' headquarters, in which the equipment maintenance interfaces can be connected.
- RCC remote concentrator cards
- European patent EP-0686336-B1 discloses a system which automatically processes alarm signals and performs their storage and correlation. To that end, this system uses an empirical approach for identifying alarm-type events. It specifically proposes a method for automatically processing alarm signals in a telecommunications network, in which alarm-type event signals are transmitted from the telecommunications network to a network management system. For automatic processing, the historical data relating to the times in which the alarm conditions occur in the network are stored for a certain time, the alarm conditions which occur in a certain temporal window are identified, the identified alarm conditions are correlated by analyzing historical data in order to determine the probability that pairs of identified alarm conditions randomly occur within the same temporal window and these probabilities are viewed.
- European patent EP-1031208-B1 discloses a telecommunications network monitoring method comprising a large number of network elements. Said method comprises the steps of collecting, in a centralized manner, alarms coming from the network elements, displaying the alarms as network-element-specific graphical presentations on a graphical user interface, collecting, in a centralized manner, network-element-specific performance information of each network element and the visual representation simultaneously with the alarms of the respective network element's performance on said graphical user interface.
- the present invention seeks to cover this deficiency by means of automating the treatment of events and solution of possible faults.
- a monitoring and operating method comprising the steps of: receiving and normalizing a plurality of heterogeneous events generated by at least one network element of a telecommunications network, obtaining a plurality of events with a normalized field structure; for each event with a normalized field structure, generating a flow of information of attribute-value pairs, after having applied one or more filtering criteria; from each flow of information formed by a set of attribute-value pairs, generating a correlation object; correlating one or more correlation objects by means of at least one criterion included in an inference engine, determining, if at least one correlation criterion is met, that a fault has occurred in the network; executing a task associated with the treatment of said fault; executing an agent associated with said task, such that: if the fault can be resolved remotely, acting on the network; if the fault cannot be resolved remotely, sending a trouble ticket message so that a person can manually repair the fault.
- the step of generating a flow of information from each event with a normalized field structure preferably comprises filtering said events by comparison between the fields of the event and previously configured filtering patterns, such that only the events which have passed said filtering are provided to the step of correlation.
- the step of generating a flow of information from each event with a normalized field structure further comprises at least one of the following actions: inhibiting events, temporarily retaining events and enriching the information comprised in events. If the fault cannot be resolved remotely and a trouble ticket message is sent so that a person can manually repair the fault, said trouble ticket is sent to a trouble ticketing system so that there is a record of said task.
- This step of controlling the trouble tickets comprises at least one of the following activities: distributing trouble tickets by different criteria; interface with the trouble ticketing system for creating the trouble tickets; storing and displaying the trouble tickets in a WEB interface.
- the correlation criterion is included in a configuration file used by the inference engine and is preferably chosen from among the following: association of activations and terminations of correlation objects generated from alarm-type events; association by different criteria of objects generated from events; counting of objects generated from the flow of information formed by attribute-value pairs for ordering the execution of a task; intermittence of the objects generated from alarms in a time period; and decision-making by time slot.
- the step of generating a flow of information from each event with a normalized field structure furthermore preferably comprises storing said events with a normalized field structure in a repository.
- the step of launching a message comprising an identifier of an agent capable of executing a task associated with said fault is performed after at least one of the following actions: controlling the limit of tasks which can be executed on the network; retaining the jobs while determined criteria are met; inhibiting jobs on the network; and establishing priorities between the tasks.
- the method is furthermore preferably configured to hierarchically find the root cause of the trouble tickets.
- the method is preferably configured to execute tasks programmed in the network without having to receive any external stimulus.
- a monitoring and operating system configured to carry out the steps of the aforementioned method.
- This monitoring and operating system preferably comprises at least one server.
- the system comprises a trouble ticketing module.
- a computer program is provided, comprising computer program code means suitable for performing the steps of the method described above when the program is executed in a computer, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, a microprocessor, a microcontroller or any other type of programmable hardware, even in a distributed manner.
- the invention provides a structured monitoring and operating system for managing agents for any type of network, and consisting of a set of processes capable of treating very heterogeneous network fault events, characterized by having a method capable of correlating the events and attempting to detect and, where appropriate, to resolve the fault automatically, and to otherwise generate a trouble ticket for being treated manually by specialized technicians.
- the system and method is susceptible to incorporating new types of jobs to the system dynamically, by simply configuring the job in the database, and configuring the rule change in order to launch it.
- the network monitoring system and method is susceptible to adding information referring to the fault in the manual actuations so that it is easier to resolve.
- This information that is inserted in the trouble ticket is extracted from the information of the event, and from queries made from the agent on the network element.
- the network monitoring system and method can be applied to any type of network which has as an output source events indicating the status of the network. It is further scalable in different architectures, depending on the type of the networks to be managed.
- It also has the capacity to hierarchically find the root cause of the trouble tickets, i.e., it can determine that a fault in broadband is a consequence of a fault in the transport network.
- the network monitoring system and method also have the special capacity to execute jobs programmed in the network without having to receive any external stimulus. It allows auto-diagnosing the working of the processes making up the system in order to attempt to assure the correct working thereof, as well as a capacity to export events to other systems.
- Figure 1 shows a flow chart of the steps of execution of the method according to an embodiment of the present invention.
- agent must be understood as a specific process in charge of hosting and implementing the logic for attempting to solve a specific type of fault, or for generating a trouble ticket in the event that the fault could not be solved automatically and requires manual intervention.
- This agent can be implemented, for example, as a C++ executable, or a script written in KSH.
- task must be understood as the set of actions involved in performing an activity, and that are defined in the agent itself.
- Figure 1 shows a flow chart of the steps of execution 110, 120, 130, 140, 150, 160 of the method of the invention.
- the method is implemented in a monitoring and operating system which is executed in a distributed manner in a set of servers.
- reference numbers 110 to 160 indicate the steps of the method.
- Figure 1 further shows a block 101 which schematically represents the services on which said steps 110, 120, 130, 140, 150, 160 are supported, a block 300 which represents the telecommunications network or networks under monitoring (operation and maintenance) and a block 200 which represents a trouble ticketing system of the network.
- the method, which facilitates automation in the operation and maintenance of telecommunications networks 300, comprises six steps 110, 120, 130, 140, 150, 160, with their corresponding functional blocks.
- the first step 110 is receiving and normalizing the events 1 received from the different monitored networks 300. These events 1 are received in one or more servers forming the monitoring and operating system. There are three types of events 1:
- Alarm-type event Message with activation and termination.
- Non-limiting examples of these protocols used are those based on Socket TCP/IP (Transmission Control Protocol and Internet Protocol), SNMP (Simple Network Management Protocol) and TEMIP (Telecommunications Management Information Platform).
- networks 300 from which events are received are: mobile access networks, such as GSM and UMTS technologies, receiving the events by means of TCP/IP protocol; switching networks receiving the events by means of TCP/IP protocol; broadband networks, such as ATM (Asynchronous Transfer Mode) or the RIMA network (Telefonica's IP network), by means of TCP/IP protocol; transport networks, such as the Synchronous Digital Hierarchy (SDH)
- ANAS Automatic Network Answering Service
- Normalized events 2 which have a homogenous structure are formed with the same structure from the events 1 of the different networks 300, such that they can be analyzed.
- the fields that are normalized are:
- sender indicates the module of the system that sends the alarm to the alarm receiving module.
- id_event_bd unique identifier of the alarm in the platform.
- family technology of the network element.
- the preceding fields represent a non-limiting example. In other words, it is up to the manager or supervisor of the operating and maintenance platform to modify, by reducing or increasing, the number of fields that are normalized, the alarm-type event or any other event.
- the normalization of the events 1 is performed by applying patterns or rules to the information received in the event 1 , which are used to complete the aforementioned normalized fields. These rules are based on the knowledge of a network expert, whereby the normalization is parameterized.
- alarms from a transport network and alarms from a broadband network which can be related in the event of a fault, can be pointed out.
- a cutoff in a PDH transmission line is related to a cutoff in broadband communications, or mobile communications, going over said transmission line.
- the different alarms can be related through one or more normalized fields.
- the different events 1 of this step 110 have a different configuration for each of the domains (understanding domain to be a technology or a particularization of a type of network; for example, within the broadband network, the ATM access network domain, the IP access network domain, and the IP network domain), hence the need to normalize events 1.
- the result of this step 110 (the events with a normalized field structure 2) is reported to the process of the second step 120. It is further possible to receive the events through different connection channels, one supporting the other, eliminating event repetitions by means of comparing the received text.
- the second step 120 is the step of pre-processing and filtering the normalized events 2 which are received from the previous step 110.
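As an illustration of step 110, the following is a minimal sketch of rule-based normalization followed by removal of repeated events by text comparison. The raw event format, the rule table, the domain name and the helper names are assumptions made for this example and are not taken from the described system; only the field names sender, id_event_bd, family, origin and text appear in the text.

```python
import re

# Hypothetical per-domain rules: each normalized field is filled by a regular
# expression applied to the raw event text. Patterns and raw format are
# illustrative assumptions only.
NORMALIZATION_RULES = {
    "mobile_access": {
        "origin": re.compile(r"NE=(?P<value>\S+)"),
        "text": re.compile(r"MSG=(?P<value>.+)$"),
    },
}


def normalize(raw_text, domain, sender, id_event_bd, family):
    """Return a normalized event: a dict with the same fields for every domain."""
    event = {
        "sender": sender,
        "id_event_bd": id_event_bd,
        "family": family,
        "origin": "",
        "text": "",
    }
    for field, pattern in NORMALIZATION_RULES[domain].items():
        match = pattern.search(raw_text)
        if match:
            event[field] = match.group("value").strip()
    return event


def deduplicate(raw_texts):
    """Drop repetitions received over redundant channels by comparing the text."""
    seen = set()
    unique = []
    for text in raw_texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# The same raw event arrives twice, once per connection channel.
raw = ["NE=BSC01 MSG=Link down", "NE=BSC01 MSG=Link down"]
events = [
    normalize(text, "mobile_access", "receiver-1", number, "GSM")
    for number, text in enumerate(deduplicate(raw), start=1)
]
```

Keeping the rules in a table per domain reflects the idea that normalization is parameterized by expert knowledge rather than hard-coded.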
- Preprocessing is understood as one or more actions carried out on normalized events 2, among those which are described below.
- the step of pre-processing 120 is useful so that the monitoring and operating system is not saturated with insignificant events.
- Filtering is understood as the action of non-progression of determined events which are not considered important for the remaining steps or for the operation and maintenance of the network 300.
- a particular case of filtering is inhibition, which relates to a more temporary filtering situation, for example, the non-progression of determined events known to be received, for example, due to performing a job in the network.
- the criteria for filtering events 2 are the following: soc, origin, type, text (allows regular expressions), text_missing (allows regular expressions, if the text does not appear in the text field of the event, it is filtered), numberSequence, family, group, category, class, nature, observations, text (explains and summarizes so that the filter works), key_cluster, progression, data_auxiliary (wild filtering field), expiration.
- This filtering is performed by comparison between the fields of the normalized event 2 and previously configured filtering patterns (for example, it can be configured as a filtering pattern for filtering the events of a determined origin). This task allows filtering normalized events 2 before beginning the subsequent step of correlation 130.
- the configuration of the filtering criteria is performed through forms available by means of a WEB interface.
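The comparison between event fields and configured filtering patterns can be sketched as follows. This is only an assumed reading of the criteria: the pattern values are invented, and the rule that an event is filtered when it matches every criterion of at least one pattern is a choice made for this example.

```python
import re

# Hypothetical filtering patterns. An event is filtered (does not progress)
# when it matches every criterion of at least one pattern.
FILTER_PATTERNS = [
    {"origin": "LAB-ROUTER-1"},                      # exact match on a field
    {"family": "ATM", "text": r"test\s+alarm"},      # regular expression on text
    {"family": "GSM", "text_missing": r"CRITICAL"},  # filtered if text lacks it
]

# Criteria whose configured value is treated as a regular expression.
REGEX_CRITERIA = {"text"}


def matches(event, pattern):
    """Return True when the event satisfies every criterion of the pattern."""
    for criterion, expected in pattern.items():
        if criterion == "text_missing":
            if re.search(expected, event.get("text", "")):
                return False
        elif criterion in REGEX_CRITERIA:
            if not re.search(expected, event.get(criterion, "")):
                return False
        elif event.get(criterion) != expected:
            return False
    return True


def progress(events, patterns=FILTER_PATTERNS):
    """Return only the events that pass filtering and go on to correlation."""
    return [e for e in events if not any(matches(e, p) for p in patterns)]


sample = [
    {"origin": "LAB-ROUTER-1", "family": "IP", "text": "Link down"},
    {"origin": "BSC01", "family": "GSM", "text": "CRITICAL: link down"},
    {"origin": "BSC02", "family": "GSM", "text": "minor warning"},
    {"origin": "ATM7", "family": "ATM", "text": "periodic test  alarm"},
]
kept = progress(sample)
```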
- c. Enriching the information of the events. This consists of adding information which is not included in the event which is received from the network 300.
- the actions of enriching the information are configured in a data file reflecting mainly: (a) the queries to the inventory or repository that must be made in order to obtain more complementary information, and (b) the commands which must be executed on the corresponding network for the same purpose (queries to the network).
- query when an event arrives from a transmission element, a query can be made in the inventory as to the circuits passing through said element.
- the file is particular for each domain, and for each "type" of event within the domain. The management of this configuration is performed by the system administrator.
- As an example of a command when an event arrives from a router, the information received in the event can be enriched with the petition to the router for the version by means of the command "show version". This enrichment is optional.
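A minimal sketch of the enrichment action, assuming an in-memory inventory and a placeholder in place of a real command channel to the network. The configuration layout, the element names and the canned response are hypothetical; only the "show version" command and the inventory query for circuits come from the text.

```python
# Hypothetical in-memory stand-ins for the inventory and the network.
INVENTORY = {"TX-NODE-7": ["circuit-101", "circuit-102"]}


def run_on_network(element, command):
    """Placeholder for a real command channel to the network element."""
    canned = {("ROUTER-3", "show version"): "version 12.4"}
    return canned.get((element, command), "")


# Enrichment configuration per (domain, event type): inventory queries and
# network commands, mirroring parts (a) and (b) of the data file.
ENRICHMENT_CONFIG = {
    ("transport", "alarm"): {"inventory": True, "commands": []},
    ("ip", "alarm"): {"inventory": False, "commands": ["show version"]},
}


def enrich(event):
    """Return a copy of the event with complementary information added."""
    config = ENRICHMENT_CONFIG.get((event["domain"], event["type"]))
    enriched = dict(event)
    if config is None:
        return enriched  # enrichment is optional
    if config["inventory"]:
        enriched["circuits"] = INVENTORY.get(event["origin"], [])
    for command in config["commands"]:
        enriched[command] = run_on_network(event["origin"], command)
    return enriched


tx_event = enrich({"domain": "transport", "type": "alarm", "origin": "TX-NODE-7"})
ip_event = enrich({"domain": "ip", "type": "alarm", "origin": "ROUTER-3"})
other = enrich({"domain": "switching", "type": "alarm", "origin": "SW-1"})
```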
- the third step 130 is the step of correlation of the events. More specifically, this step comprises creating correlation objects and correlating said objects.
- correlation objects are automatically generated from the data received (a flow of information formed by attribute-value pairs 3).
- These correlation objects are classes, the structure of attributes of which is defined in configuration files, and the values of which are completed from the data received in the previous step.
- the correlation objects are generated according to a series of criteria included in the configuration files, indicating which correlation object must be formed and how.
- An inference engine is used during this step.
- the inference engine is a reasoning module which allows applying a series of reasoning criteria (hereinafter referred to as "rules") for making different decisions, depending on the input data.
- the inference engine is based on rules defined in a data file.
- the inference engine is implemented by means of a UNIX process. It is implemented by means of a conventional RETE algorithm, the content of which is outside the scope of the invention.
- the inference engine is in charge of correlating said objects, always for the purpose of reducing the manual treatment of events generated by the network.
- These correlation objects are particular for each network domain.
- an order is generated directed to the following step 140 for performing a task, task being understood as the set of actions involved in performing an activity.
- the result of this step 130 is a message 4 in string form comprising an order to launch a task.
- This message 4 which is sent to the following step comprises the parameters which identify the task that must be executed, as well as some parameters which serve for the execution thereof. As stated, this message is sent as a string.
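The text states only that message 4 is a string carrying the task identifier and its execution parameters. The sketch below assumes a simple "KEY=VALUE;KEY=VALUE" layout, which is an invention for this example and not the format used by the described system.

```python
def build_task_message(task_id, **params):
    """Encode a task launch order as a single string (hypothetical format)."""
    parts = ["TASK=%s" % task_id]
    parts += ["%s=%s" % (key.upper(), value) for key, value in sorted(params.items())]
    return ";".join(parts)


def parse_task_message(message):
    """Decode the string back into the task identifier and its parameters."""
    fields = dict(part.split("=", 1) for part in message.split(";"))
    task_id = fields.pop("TASK")
    return task_id, fields


message = build_task_message("RESET_CARD", element="BSC01", repetitions=3)
task_id, parameters = parse_task_message(message)
```

With this layout, parameter values must not contain ";" and all values come back as strings, so a real format would need escaping and typing rules.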
- Correlation criteria (from the inference engine) are applied to the correlation objects generated.
- the following correlation criteria are always applied:
- a. - Association of activations and terminations of correlation objects generated from the alarms (alarm-type events).
- alarm activations and terminations are correlated.
- a general-type event for example, an informative event
- the two alarm-type events are ruled out because the problem is considered to be concluded.
- the relationship between the activation and the termination of the correlation objects is done from a field called "Key Event".
- b. - Association by different criteria of objects generated from events. In this case, tasks are executed as the result of relating a determined set of objects.
- f. - Decision-making by time slot. For example, a different severity is given to the faults depending on the time slot in which they occur, and thus they can be prioritized for their resolution.
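Two of the criteria above, association of activations with terminations and decision-making by time slot, can be sketched with plain Python data structures. This is not a RETE inference engine: it is a simplified stand-in, and the key_event and task field names, the order string layout, the hours used for the time slots and the severity labels are assumptions for this example.

```python
class ActivationTerminationCorrelator:
    """Pairs alarm activations with terminations sharing the same key event."""

    def __init__(self, persistence_seconds):
        self.persistence = persistence_seconds
        self.pending = {}  # key_event -> (activation time, correlation object)

    def on_object(self, obj, now):
        key = obj["key_event"]
        if obj["kind"] == "activation":
            self.pending.setdefault(key, (now, obj))
        elif obj["kind"] == "termination":
            # The problem is considered concluded: both objects are ruled out.
            self.pending.pop(key, None)

    def due_tasks(self, now):
        """Return launch orders for activations that outlived the persistence."""
        orders = []
        for key, (started, obj) in list(self.pending.items()):
            if now - started >= self.persistence:
                orders.append("TASK=%s;KEY_EVENT=%s" % (obj["task"], key))
                del self.pending[key]
        return orders


def severity_by_time_slot(hour):
    """Hypothetical policy: faults during daytime hours get a higher severity."""
    return "major" if 8 <= hour < 20 else "minor"


correlator = ActivationTerminationCorrelator(persistence_seconds=7200)
correlator.on_object(
    {"kind": "activation", "key_event": "BSC01/link", "task": "RESET_LINK"}, now=0
)
correlator.on_object(
    {"kind": "activation", "key_event": "BSC02/link", "task": "RESET_LINK"}, now=10
)
correlator.on_object({"kind": "termination", "key_event": "BSC01/link"}, now=600)
orders = correlator.due_tasks(now=7300)
```

Only the activation that never received its termination produces an order, which is the reduction in manual treatment that the correlation step aims for.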
- the fourth step 140 is the step of task management. According to the correlation performed in the previous step, in which several events 1 generally participate which are symptoms of the occurrence of a determined network fault, a task is launched which attempts to solve the fault. Specifically, in this step 140 a message 5 is launched which identifies an agent capable of executing a task, action or activity. There is a univocal, not biunivocal, relationship between tasks which must be taken on and agents configured for executing those tasks. In other words, there is an agent for each task. The possible tasks are a consequence of a determined object correlation (and since the correlation objects come from events 1 , it can be said that the possible tasks are a consequence of a determined event correlation). Before launching the task, certain checks are performed in this step which prevent possible problems in the network due to launching these tasks. For that purpose, during this step 140, the following functions will be performed:
- Control of limits of tasks on the network i.e., it checks that the maximum number of tasks that can be performed on a network element is not exceeded.
- Retentions It allows retaining a determined task on the network, while certain criteria are met, such as during a predetermined time interval for example.
- the configuration of the different tasks that can be carried out is included in the repository or database and is configurable by the system administrator.
- the corresponding agent is identified. For that purpose, a message 5 is sent to the following step 150 with the identifier of the agent to be launched. Since there is direct correspondence between tasks and agents, said correspondence in the database is read in this step 140.
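The checks performed before launching a task, and the lookup of the agent for that task, might be sketched as follows. The class, its method names and the use of a dictionary in place of the database are assumptions for this example.

```python
class TaskManager:
    """Checks task limits and retentions, then resolves the agent for a task."""

    def __init__(self, task_to_agent, max_tasks_per_element):
        self.task_to_agent = task_to_agent  # stands in for the database
        self.max_tasks = max_tasks_per_element
        self.running = {}         # network element -> number of running tasks
        self.retained_until = {}  # (task, element) -> time the retention ends

    def retain(self, task, element, until):
        """Retain a task on an element until the given time."""
        self.retained_until[(task, element)] = until

    def request(self, task, element, now):
        """Return the agent identifier to launch, or None if the task must wait."""
        if now < self.retained_until.get((task, element), 0):
            return None  # the task is being retained
        if self.running.get(element, 0) >= self.max_tasks:
            return None  # limit of tasks on this network element reached
        self.running[element] = self.running.get(element, 0) + 1
        return self.task_to_agent[task]

    def finished(self, element):
        """Record that a task on the element has ended."""
        self.running[element] -= 1


manager = TaskManager({"RESET_CARD": "agent_reset_card"}, max_tasks_per_element=1)
first = manager.request("RESET_CARD", "BSC01", now=0)
second = manager.request("RESET_CARD", "BSC01", now=1)  # limit reached
manager.finished("BSC01")
manager.retain("RESET_CARD", "BSC01", until=100)
third = manager.request("RESET_CARD", "BSC01", now=50)   # retained
fourth = manager.request("RESET_CARD", "BSC01", now=100)
```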
- the fifth step 150 corresponds to receiving the identifier of the agent to be launched and its subsequent launching or execution. For launching it, a process is started up with the executable of the corresponding agent. The sequence of actions performing a determined task is implemented in this agent.
- This agent is preferably a C++ executable or a KSH script, which is started from step 150 by means of a fork operating system call.
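Launching an agent as a separate process can be sketched with the Python standard library, which creates the child process through subprocess rather than calling fork directly. The stand-in agent below is a one-line Python program, used here only because no real C++ or KSH agent is available; the function name and the timeout value are assumptions.

```python
import subprocess
import sys


def launch_agent(executable_args, timeout=30):
    """Start the agent as a child process and return (exit code, output)."""
    completed = subprocess.run(
        executable_args, capture_output=True, text=True, timeout=timeout
    )
    return completed.returncode, completed.stdout.strip()


# Stand-in agent: a one-line Python program instead of a real agent executable.
fake_agent = [sys.executable, "-c", "print('fault repaired')"]
exit_code, output = launch_agent(fake_agent)
```

Running each agent in its own process keeps a failing or slow agent from blocking the step that launched it, and the exit code gives a simple way to report whether the repair succeeded.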
- the agents are implemented by experts, who know the sequence of actions necessary for solving a fault.
- the purpose of an agent is to attempt to automatically repair the fault by means of action 6 on the network 300 (for example, resetting equipment), and optionally, in the event that this is not possible, to send a message indicative of trouble 7 to the following step 160, so that an operator can manually treat the fault.
- the message indicative of trouble 7 is preferably enriched with all the complementary information that the agent may have obtained on the fault when executing the step 150.
- the agent provides all the information with the result of the actions executed on the network 300 (for example, the set of commands and responses), and all the information it has queried on the data repositories (by means of queries for obtaining information related to the faulty network element).
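The behaviour of an agent, attempting a repair and otherwise producing an enriched trouble message, might look like the sketch below. The function names, the repository content and the rule that only one element recovers after a reset are all hypothetical.

```python
# Hypothetical data repository queried for the faulty network element.
REPOSITORY = {"BSC02": {"site": "Site-9"}}


def query_repository(element):
    return REPOSITORY.get(element, {})


def reset_equipment(element, log):
    """Hypothetical repair action: pretend only BSC01 recovers after a reset."""
    recovered = element == "BSC01"
    log.append(("reset " + element, "ok" if recovered else "no response"))
    return recovered


def run_agent(element, repair, query):
    """Attempt an automatic repair; otherwise return a trouble message."""
    log = []  # commands executed on the network and their responses
    if repair(element, log):
        return {"status": "repaired", "element": element, "actions": log}
    return {
        "status": "trouble",
        "element": element,
        "actions": log,
        "repository": query(element),
    }


ok = run_agent("BSC01", reset_equipment, query_repository)
ko = run_agent("BSC02", reset_equipment, query_repository)
```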
- the agents are different in each of the network domains, and also specific for the different faults.
- An example of an agent is the one implemented for unblocking a user in the Intelligent Network Platform. All the agents use the same repository query facilities, facilities for performing actions on the network 300 and trouble ticket creation petition facilities (all being optional).
- the logic defined in the agents is structured in the following parts:
- the sixth step 160 is optional and corresponds to the control functionality of the aforementioned trouble tickets 7. This step is only executed if the fault has not been automatically repaired in step 150. This step 160 is also carried out by means of agents (different from those described above intended for attempting to execute a task or activity), which are in charge of completing the different fields of the trouble ticket that must be created, depending on the network domain, and causing the sending of the petition to create the trouble ticket 8 to the trouble ticketing system 200, through the corresponding interface (for example Web Services).
- the information used for creating the trouble ticket is read from a database, previously stored by the agent.
- the root cause of the trouble tickets is also deduced, for example, a transmission problem, also generating an alarm in the mobile access network.
- the two trouble tickets are created but such that it is deduced that the transmission trouble is the root cause of the mobile access trouble.
- the inventory information linking the mobile access circuits which go through the transmission link is used.
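The deduction of a root cause from inventory information can be sketched as follows. The inventory content, the ticket fields and the function name are assumptions for this example; the idea that both trouble tickets are kept and the mobile access one points to the transmission one follows the text.

```python
# Hypothetical inventory: transmission link -> mobile access elements over it.
LINK_TO_CIRCUITS = {"PDH-LINK-4": {"BTS-11", "BTS-12"}}


def assign_root_causes(tickets):
    """Return copies of the tickets with a root_cause field filled in."""
    faulty_links = {
        t["element"]: t["id"] for t in tickets if t["domain"] == "transport"
    }
    result = []
    for ticket in tickets:
        linked = dict(ticket, root_cause=None)
        if ticket["domain"] == "mobile_access":
            for link, circuits in LINK_TO_CIRCUITS.items():
                if ticket["element"] in circuits and link in faulty_links:
                    linked["root_cause"] = faulty_links[link]
        result.append(linked)
    return result


tickets = [
    {"id": "TT-1", "domain": "transport", "element": "PDH-LINK-4"},
    {"id": "TT-2", "domain": "mobile_access", "element": "BTS-11"},
    {"id": "TT-3", "domain": "mobile_access", "element": "BTS-99"},
]
linked_tickets = assign_root_causes(tickets)
```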
- Database query service which allows obtaining information from input data.
- a query can be made to these services 101 in order to obtain the circuits affected by an event, or in order to enrich a certain event, or in order to encapsulate the connection with the network, or in order to encapsulate a connection with the trouble ticketing system, etc.
- the method of the invention provides the following advantages over conventional methods: launching tasks automatically, which facilitates a high degree of automation of the operations on the networks; the capacity to act on the network, which allows resolving faults automatically, and if this is not possible, providing greater information on the fault so that it can be manually resolved; the possibility of the user configuring the correlation rules in files, which have significant dependence on the business and on the network, or of describing them in agents, such that the modifications are easy to perform and have less impact on the rest of the system; since it has a modular architecture, the ease of integrating new networks according to the implementation of the protocols for receiving events from the network and protocols for acting on it, and to the files and agents describing the new business algorithms; and it is a scalable architecture according to the different networks or domains integrated in the system.
- the present invention provides the execution of operating and maintenance automatisms from receiving events from different networks which may indicate faults, in order to attempt to solve them, and if this is not possible, generating a warning so that those faults are solved manually by a specialized technician.
- the method and system of the invention use event normalizing techniques, correlation and agent launching, all in an automatic and configurable manner.
- the method and system have, as input information, the events generated in the network.
- This input information can be an alarm that can indicate a fault (for example, a fault in a communications line card) or a spontaneous message (for example, a message informing about a status change in the plant).
- the method and system decides which agent should treat the fault. That agent, by means of a certain programming, attempts to solve it automatically and, if it cannot, or if there is no possible treatment or automatic action, it notifies so that a manual action is performed on the network or affected network element.
- BSCs base station controllers
- around 670 agents, of greater or lesser complexity, are defined for facilitating the operation and maintenance of the networks.
- Step 1 the format of the alarm which is received from the network manager is shown. In this case it is an alarm which is received through the network element manager (Mobile SOC) for mobile access network Motorola technology.
- This information is the input information 1 of step 110. It basically translates the information which serves to identify the fault in the equipment.
- The output 2 of step 110 is shown below. In this case, it is the alarm in normalized format in which there are no empty fields. The important information of the alarm is basically maintained, formatting the information in determined fields.
- the correlation object which is automatically created in step 130 as a result of receiving the alarm is shown below.
- This correlation object is the one that will serve for performing the correlation in the inference engine by applying the rules previously defined in a file.
- configuration information of the alarms which the system previously has is added.
- An example of this is the persistence, i.e., the time that the engine retains the alarm while waiting to receive a termination, before launching a task, in this case 7,200 seconds.
- the data which will serve step 140 for launching the task are shown below. These data are those which are received from step 130. Among these data, there are some which serve as information to be taken into account when the task is launched, for example, in this case the field indicating "REPETITIONS".
- This invention can apply in networks of any type which require monitoring, operation and maintenance. It has been tested and provides excellent results in telecommunication networks, particularly in the operating and maintenance processes for said networks for telecommunications operators. It also is extensible to other technological areas which can provide an input stimulus to the system and/or which can be operated on, such as, for example, energy management control by means of analyzing the energy data received through this system.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Business, Economics & Management (AREA)
- General Business, Economics & Management (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Monitoring And Testing Of Exchanges (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention relates to a monitoring and operating method, characterized in that it comprises the steps of: receiving and normalizing (110) a plurality of heterogeneous events (1) generated by at least one network element of a telecommunications network (300), obtaining a plurality of events with a normalized field structure (2); for each event with a normalized field structure (2), generating a flow of information of attribute-value pairs (3), after having applied one or more filtering criteria; from each flow of information formed by a set of attribute-value pairs (3), generating a correlation object; correlating (130) one or more correlation objects by means of at least one criterion included in an inference engine, determining, if said at least one correlation criterion is met, that a fault has occurred in the network (300); executing a task associated with the treatment of said fault; executing (150) an agent associated with said task, such that: if the fault can be resolved remotely, acting (6) on the network (300); if the fault cannot be resolved remotely, sending a trouble ticket message (7) so that a person can manually repair the fault.
Description
MONITORING AND MANAGEMENT OF HETEROGENEOUS NETWORK EVENTS
Field of the Invention
The present invention applies to the field of telecommunications and, more specifically, to the monitoring, operation and maintenance in telecommunications networks.
Background of the Invention
Until now, the operation and maintenance initiatives in telecommunications networks have had a high manual component, due in large part to the diversity of events these networks generate and to the complexity involved in relating them, both within a network and between different networks.
In this field, European patent EP-0686329-B1 discloses a method for event correlation occurring in a telecommunications network formed by individual network elements. At least one of the network elements is represented in a computer by means of individual program modules. Each program module has a data set related to the network elements which represents and includes the data set of the module, for example, a list of network elements having input associations with respect to the event correlation with the network element represented by the module.
Spanish patent application ES-2216727 describes a system for automated remote access based on a centralized management system. The system has means of linking with transmitter with remote concentrator cards (RCC) located in the clients' headquarters, in which the equipment maintenance interfaces can be connected.
European patent EP-0686336-B1 discloses a system which automatically processes alarm signals and performs their storage and correlation. To that end, this system uses an empirical approach for identifying alarm-type events. It specifically proposes a method for automatically processing alarm signals in a telecommunications network, in which alarm-type event signals are transmitted from the telecommunications network to a network management system. For automatic processing, the historical data relating to the times in which the alarm conditions occur in the network are stored for a certain time, the alarm conditions which occur in a certain temporal window are identified, the identified alarm conditions are correlated by analyzing historical data in order to determine the probability that pairs of identified alarm conditions randomly occur within the same temporal window and these probabilities are viewed.
Finally, European patent EP-1031208-B1 discloses a method for monitoring a telecommunications network comprising a large number of network elements. Said method comprises the steps of collecting, in a centralized manner, alarms coming from the network elements, displaying the alarms as network-element-specific graphical presentations on a graphical user interface, collecting, in a centralized manner, network-element-specific performance information of each network element and visually representing the performance of the respective network element, simultaneously with its alarms, on said graphical user interface.
As can be observed, the mentioned patents are aimed at correlating alarms, but they have the deficiency that they do not contemplate automating the treatment of the alarms and the solution of the faults.
Summary of the Invention
The present invention seeks to cover this deficiency by means of automating the treatment of events and solution of possible faults.
Thus, in one aspect of the present invention, a monitoring and operating method is provided, comprising the steps of: receiving and normalizing a plurality of heterogeneous events generated by at least one network element of a telecommunications network, obtaining a plurality of events with a normalized field structure; for each event with a normalized field structure, generating a flow of information of attribute-value pairs, after having applied one or more filtering criteria; from each flow of information formed by a set of attribute-value pairs, generating a correlation object; correlating one or more correlation objects by means of at least one criterion included in an inference engine, determining, if at least one correlation criterion is met, that a fault has occurred in the network; executing a task associated with the treatment of said fault; executing an agent associated with said task, such that: if the fault can be resolved remotely, acting on the network; if the fault cannot be resolved remotely, sending a trouble ticket message so that a person can manually repair the fault.
The step of generating a flow of information from each event with a normalized field structure preferably comprises filtering said events by comparison between the fields of the event and previously configured filtering patterns, such that only the events which have passed said filtering are provided to the step of correlation.
The step of generating a flow of information from each event with a normalized field structure further comprises at least one of the following actions: inhibiting events, temporarily retaining events and enriching the information comprised in events.
If the fault cannot be resolved remotely and a trouble ticket message is sent so that a person can manually repair the fault, said trouble ticket is sent to a trouble ticketing system so that there is a record of said task. This step of controlling the trouble tickets comprises at least one of the following activities: distributing trouble tickets by different criteria; interface with the trouble ticketing system for creating the trouble tickets; storing and displaying the trouble tickets in a WEB interface.
There is at least one correlation criterion included in a configuration file used by the inference engine which is preferably chosen from among the following: association of activities and terminations of correlation objects generated from alarm-type events, association by different criteria of objects generated from events, counting of objects generated from the flow of information formed by attribute-value pairs for ordering the execution of a task, intermittence of the objects generated from alarms in a time period and decision-making by time slot.
The step of generating a flow of information from each event with a normalized field structure furthermore preferably comprises storing said events with a normalized field structure in a repository.
The step of launching a message comprising an identifier of an agent capable of executing a task associated with said fault is performed after at least one of the following actions: controlling the limit of tasks which can be executed on the network; retaining the jobs while determined criteria are met; inhibiting jobs on the network; and establishing priorities between the tasks.
The method is furthermore preferably configured to hierarchically find the root cause of the trouble tickets.
The method is preferably configured to execute tasks programmed in the network without having to receive any external stimulus.
In another aspect of the present invention, a monitoring and operating system configured to carry out the steps of the aforementioned method is provided. This monitoring and operating system preferably comprises at least one server. In a particular embodiment, the system comprises a trouble ticketing module.
Finally, in another aspect of the invention, a computer program is provided, comprising computer program code means suitable for performing the steps of the method described above when the mentioned program is executed in a computer, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, a microprocessor, a microcontroller or any other type of programmable hardware, even in a distributed manner.
In summary, the invention provides a structured monitoring and operating system for managing agents for any type of network, and consisting of a set of processes capable of treating very heterogeneous network fault events, characterized by having a method capable of correlating the events and attempting to detect and, where appropriate, to resolve the fault automatically, and to otherwise generate a trouble ticket for being treated manually by specialized technicians.
The system and method allow new types of jobs to be incorporated into the system dynamically, by simply configuring the job in the database and configuring the rule change in order to launch it.
Furthermore, the network monitoring system and method is susceptible to adding information referring to the fault in the manual actuations so that it is easier to resolve. This information that is inserted in the trouble ticket is extracted from the information of the event, and from queries made from the agent on the network element.
Furthermore, the network monitoring system and method can be applied to any type of network which has as an output source events indicating the status of the network. It is further scalable in different architectures, depending on the type of the networks to be managed.
It also has the capacity to hierarchically find the root cause of the trouble tickets, i.e., it can determine that a fault in broadband is a consequence of a fault in the transport network.
The network monitoring system and method also have the special capacity to execute jobs programmed in the network without having to receive any external stimulus. It allows auto-diagnosing the working of the processes making up the system in order to attempt to assure the correct working thereof, as well as a capacity to export events to other systems.
These and other advantages will be apparent in view of the detailed description of the invention.
Brief Description of the Drawings
For the purpose of aiding a better understanding of the features of the invention according to a preferred practical embodiment thereof, and in order to complement this description, the following figure is attached as an integral part thereof with an illustrative and non-limiting character:
Figure 1 shows a flow chart of the steps of execution of the method according to an embodiment of the present invention.
Detailed Description of the Invention
In the context of the present invention, the term "agent" must be understood as a specific process in charge of hosting and implementing the logic for attempting to solve a specific type of fault, or for generating a trouble ticket in the event that the fault could not be automatically solved and requires manual intervention. This agent can be implemented, for example, as a C++ executable, or a script written in KSH.
In the context of the present invention, the term "task" must be understood as the set of actions involved in performing an activity, and that are defined in the agent itself.
In the context of the present invention, it must be understood that the term "approximately" and the terms of its family (such as "approximate", "approximation", etc.) indicate values or forms very close to those accompanying the aforementioned term. In other words, a deviation within reasonable limits with respect to an exact value or form must be accepted because the person skilled in the art will understand that such deviation with respect to the indicated values or forms is inevitable due to measuring inaccuracies, etc. The same is applied to the terms "around" and "about".
In this text, the term "comprises" and derivations thereof (such as "comprising", etc.) must not be understood in an excluding sense, i.e., these terms must not be interpreted as excluding the possibility that what is described and defined may include more elements, steps, etc.
Figure 1 shows a flow chart of the steps of execution 110, 120, 130, 140, 150, 160 of the method of the invention. The method is implemented in a monitoring and operating system which is executed in a distributed manner in a set of servers. In this figure, reference numbers 110 to 160 indicate the steps of the method. Figure 1 further shows a block 101 which schematically represents the services on which said steps 110, 120, 130, 140, 150, 160 are supported, a block 300 which represents the telecommunications network or networks under monitoring (operation and maintenance) and a block 200 which represents a trouble ticketing system of the network. The method, which facilitates automation in the operation and maintenance of telecommunications networks 300, comprises six steps 110, 120, 130, 140, 150, 160, with their corresponding functional blocks.
The first step 110 is receiving and normalizing the events 1 received from the different monitored networks 300. These events 1 are received in one or more servers forming the monitoring and operating system. There are three types of events 1:
• Alarm-type event: Message with activation and termination.
• Spontaneous-type event: Informative message without an associated termination.
• Status-type event: Information about equipment status.
In this step 110, the following tasks are performed:
a. By means of different communication protocols, the events 1 of the different networks 300 are received in the system. Non-limiting examples of these protocols used are those based on Socket TCP/IP (Transmission Control Protocol and Internet Protocol), SNMP (Simple Network Management Protocol) and TEMIP (Telecommunications Management Information Platform). Non-limiting examples of networks 300 from which events are received are: mobile access networks, such as GSM and UMTS technologies, receiving the events by means of TCP/IP protocol; switching networks receiving the events by means of TCP/IP protocol; broadband networks, such as ATM (Asynchronous Transfer Mode) or the RIMA network (Telefonica's IP network), by means of TCP/IP protocol; transport networks, such as the Synchronous Digital Hierarchy (SDH) network, by means of WEB Services protocol; narrowband access network by means of TCP/IP and SNMP protocols; and speech platforms such as ANAS (Automatic Network Answering Service) by means of SNMP protocol. Each of these networks 300 has one or more network elements.
b. Normalizing and adapting all the events 1 received from the different networks 300. Normalized events 2, which all share the same homogeneous structure, are formed from the events 1 of the different networks 300, such that they can be analyzed. As an example, for alarm-type events, the fields that are normalized are:
• sender: indicates the module of the system that sends the alarm to the alarm receiving module.
• soc: manager sending the alarm.
• origin: network element of the network 300 generating the alarm.
• numberSequence: sequence number of the alarm.
• key_event: identifier used for the association of activation-termination.
• id_event_bd: unique identifier of the alarm in the platform.
• family: technology of the network element.
• category: category of the alarm (severity).
• class: class of the alarm (momentary, i.e. without termination, or permanent, with termination).
• date: date of generating the alarm.
• hour: hour of generating the alarm.
• group: cluster to which the reported problem belongs.
• type: type to which the reported problem belongs.
• nature: nature or activation.
• observations: observations of the alarm.
• key_cluster: has no defined usefulness.
• progression: indicates if it is sent to another system by SNMP.
• data_auxiliary: has no defined usefulness.
• expiration: time in which the alarm must be automatically terminated.
• text: text of the alarm.
The preceding fields represent a non-limiting example. In other words, it is up to the manager or supervisor of the operating and maintenance platform to reduce or increase the number of fields that are normalized, for the alarm-type event or for any other event.
The normalization of the events 1 is performed by applying patterns or rules to the information received in the event 1, which are used to complete the aforementioned normalized fields. These rules are based on the knowledge of a network expert, whereby the normalization is parameterized. An example of the relationship between alarms is that between alarms from a transport network and alarms from a broadband network, which can be related in the event of a fault: a cutoff in a PDH transmission line is related to a cutoff in broadband communications, or mobile communications, going over said transmission line. The different alarms can be related through one or more normalized fields.
The different events 1 of this step 110 have a different configuration for each of the domains (a domain being understood as a technology or a particularization of a type of network; for example, within the broadband network, the ATM access network domain, the IP access network domain, and the IP network domain), which is why it is necessary to normalize the events 1. The result of this step 110 (the events with a normalized field structure 2) is reported to the process of the second step 120. It is further possible to receive the events through different connection channels, one supporting the other, eliminating event repetitions by means of comparing the received text.
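The normalization and repetition-elimination tasks of step 110 can be illustrated with a minimal Python sketch. The raw message layouts, the regular expressions and the names `RULES`, `normalize` and `deduplicate` are assumptions made for illustration only; the description does not specify them. Repetitions are detected here by comparing the origin together with the received text, which is likewise an assumption.

```python
import re

# Hypothetical normalization rules, one regular expression per domain.
# The raw message layouts below are invented for this sketch.
RULES = {
    "transport": re.compile(r"^(?P<origin>[^|]+)\|(?P<category>[^|]+)\|(?P<text>.*)$"),
    "broadband": re.compile(r"^sev=(?P<category>\w+) node=(?P<origin>\S+) msg=(?P<text>.*)$"),
}


def normalize(domain, raw):
    """Return a dict of normalized fields, or None if the rule does not match."""
    match = RULES[domain].match(raw)
    if match is None:
        return None
    event = match.groupdict()
    event["family"] = domain  # technology of the network element
    return event


def deduplicate(events):
    """Drop repetitions, e.g. the same event received over a second channel."""
    seen = set()
    unique = []
    for event in events:
        key = (event["origin"], event["text"])  # assumed comparison key
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

Under these assumptions, two differently formatted messages from a transport element and a broadband element end up with the same field names, so the later steps can treat them uniformly.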
The second step 120 is the step of pre-processing and filtering the normalized events 2 which are received from the previous step 110. "Pre-processing" is understood as one or more actions carried out on normalized events 2, among those which are described below. The step of pre-processing 120 is useful so that the monitoring and operating system is not saturated with insignificant events.
a. - Filtering events. Filtering is understood as the action of non-progression of determined events which are not considered important for the remaining steps or for the operation and maintenance of the network 300. A particular case of filtering is inhibition, which relates to a more temporary filtering situation, for example, the non-progression of determined events known to be received, for example, due to performing a job in the network. As an example, the criteria for filtering events 2 are the following: soc, origin, type, text (allows regular expressions), text_missing (allows regular expressions, if the text does not appear in the text field of the event, it is filtered), numberSequence, family, group, category, class, nature, observations, text (explains and summarizes so that the filter works), key_cluster, progression, data_auxiliary (wild filtering field), expiration.
This filtering is performed by comparison between the fields of the normalized event 2 and previously configured filtering patterns (for example, it can be configured as a filtering pattern for filtering the events of a determined origin). This task allows filtering normalized events 2 before beginning the subsequent step of correlation 130. The configuration of the filtering criteria is performed through forms available by means of a WEB interface.
b. - Temporarily retaining events by means of configurable criteria. The events 2 are retained and, therefore, they do not progress to the following steps until the criteria defined by configuration are no longer met. This task affects the decisions made in the subsequent step of correlation 130, because it modifies the order of the events 1 which are received from the network 300. This temporary retention is optional.
c. - Enriching the information of the events. This consists of adding information which is not included in the event which is received from the network 300. The actions of enriching the information are configured in a data file reflecting mainly: (a) the queries to the inventory or repository that must be made in order to obtain more complementary information, and (b) the commands which must be executed on the corresponding network for the same purpose (queries to the network). As an example of a query, when an event arrives from a transmission element, a query can be made in the inventory as to the circuits passing through said element. The file is particular for each domain, and for each "type" of event within the domain. The management of this configuration is performed by the system administrator. As an example of a command, when an event arrives from a router, the information received in the event can be enriched with the petition to the router for the version by means of the command "show version". This enrichment is optional.
d. - Storing the normalized events 2 in the repository of the system.
e. - Sending the data necessary for generating the correlation object to the following step of correlation 130. These data are passed on in a string, such as a list of FIELD/VALUE tuple records. In other words, for each event with a normalized field structure 2, a flow of information formed by a set of attribute-value pairs is passed to the following step 130.
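The filtering task of step 120 compares the fields of each normalized event with configured filtering patterns. A minimal Python sketch of that comparison follows. The pattern layout (a dict of field name to regular expression) and the names `pattern_matches` and `preprocess` are assumptions for illustration; the special key `text_missing` follows the behaviour described above, where an event is filtered if the expression does not appear in its text field.

```python
import re


def pattern_matches(pattern, event):
    """True if every criterion of one filtering pattern applies to the event."""
    for field, regex in pattern.items():
        if field == "text_missing":
            # Criterion holds only when the expression is absent from the text.
            if re.search(regex, event.get("text", "")):
                return False
        elif not re.search(regex, str(event.get(field, ""))):
            return False
    return True


def preprocess(events, filters):
    """Return only the events that no filtering pattern matches."""
    return [e for e in events if not any(pattern_matches(p, e) for p in filters)]
```

For example, a pattern such as `{"origin": r"^LAB-"}` would stop every event whose origin starts with `LAB-` from progressing to the step of correlation, while events from other origins pass through unchanged.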
The third step 130 is the step of correlation of the events. More specifically, this step comprises creating correlation objects and correlating said objects. In this step, correlation objects are automatically generated from the data received (a flow of information formed by attribute-value pairs 3). These correlation objects are classes, the structure of attributes of which is defined in configuration files, and the values of which are completed from the data received in the previous step. In other words, the correlation objects are generated according to a series of criteria included in the configuration files, indicating which correlation object must be formed and how. An inference engine is used during this step. The inference engine is a reasoning module which allows applying a series of reasoning criteria (hereinafter referred to as "rules") for making different decisions, depending on the input data. These decisions can have a certain degree of uncertainty, i.e., they are not 100% reliable. The inference engine is based on rules defined in a data file. The inference engine is implemented by means of a UNIX process. It is implemented by means of a conventional RETE algorithm, the content of which is outside the scope of the invention.
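As an illustration of how a correlation object might be built from the flow of attribute-value pairs 3, the following Python sketch decodes a string and completes the attributes named in a configuration. The `field=value;field=value` encoding, the `OBJECT_CONFIG` contents and the function names are assumptions; the description only states that the data travel as a string of FIELD/VALUE records and that the attribute structure is defined in configuration files.

```python
from dataclasses import make_dataclass

# Hypothetical configuration: attributes of each correlation object class.
OBJECT_CONFIG = {"Alarm": ["origin", "key_event", "nature", "category"]}

# One class per configured object type, created once at start-up.
CLASSES = {kind: make_dataclass(kind, attrs) for kind, attrs in OBJECT_CONFIG.items()}


def encode(event):
    """Serialize a normalized event (assumes values contain no ';')."""
    return ";".join(f"{field}={value}" for field, value in event.items())


def decode(flow):
    """Parse the string back into a dict of attribute-value pairs."""
    return dict(pair.split("=", 1) for pair in flow.split(";") if pair)


def build_object(kind, flow):
    """Create a correlation object holding only the configured attributes."""
    values = decode(flow)
    return CLASSES[kind](**{attr: values.get(attr) for attr in OBJECT_CONFIG[kind]})
```

Fields present in the flow but absent from the configuration (for instance the alarm text in this sketch) are simply not carried into the object, which keeps the objects handled by the inference engine small.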
Once the correlation objects are created, the inference engine is in charge of correlating said objects, always for the purpose of reducing the manual treatment of events generated by the network. These correlation objects are particular for each network domain. In this step 130, according to the business logic configured in the rules, an order is generated directed to the following step 140 for performing a task, task being understood as the set of actions involved in performing an activity. In other words, the result of this step 130 is a message 4 in string form comprising an order to launch a task. This message 4 which is sent to the following step comprises the parameters which identify the task that must be executed, as well as some parameters which serve for the execution thereof. As stated, this message is sent as a string.
Correlation criteria (from the inference engine) are applied to the correlation objects generated. The following correlation criteria are always applied:
a. - Association of activations and terminations of correlation objects generated from the alarms (alarm-type events). In other words, alarm activations and terminations are correlated. It must be observed that a general-type event, for example, an informative event, does not have activation and termination, so it makes no sense to apply this correlation criterion to events other than alarms. If, after the activation of an alarm arrives, the termination of the corresponding alarm is received, the two alarm-type events are ruled out because the problem is considered to be concluded. The relationship between the activation and the termination of the correlation objects is established through a field called "Key Event".
b. - Association by different criteria of objects generated from events. In this case, tasks are executed as the result of relating a determined set of objects.
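Criterion a. can be sketched in a few lines of Python. The sketch is not the RETE-based inference engine mentioned above; it only illustrates the pairing logic. The dict layout of the objects and the values `"activation"` and `"termination"` of the `nature` attribute are assumptions.

```python
def pending_alarms(objects):
    """Pair activations with terminations by key_event; return what stays active."""
    active = {}
    for obj in objects:  # objects are assumed to arrive in reception order
        key = obj["key_event"]
        if obj["nature"] == "activation":
            active[key] = obj
        elif obj["nature"] == "termination":
            # Both alarm-type events are ruled out: the problem is concluded.
            active.pop(key, None)
    return list(active.values())
```

Only the activations left in the result would remain candidates for the other correlation criteria.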
Other optional correlation criteria are shown below. The following list is not exhaustive:
c. - Persistence of objects generated from any type of event during a determined time. This scenario allows executing a task when determined objects persist for a determined time interval.
d. - Counting of objects generated from data received 3 by any attribute of the objects defining the domain, for example, the attribute "Origin" or "Manager", extracted from the event that caused the creation of the object. This scenario allows ordering the execution of tasks according to the number of times an object occurs, or according to certain criteria, such as for example, the number of objects caused in an origin.
e. - Intermittence (activations/terminations) of the objects generated from the alarms in a time period. This is the case in which a network element has an intermittent fault, i.e., it is activated and deactivated in a time period in a continuous manner.
f. - Decision-making by time slot. For example, a different severity is given to the faults depending on the time slot in which they occur, and thus they can be prioritized for their resolution.
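The optional criteria d., e. and f. can be illustrated with the following Python sketch. The thresholds, the time units (seconds and hours of the day) and the function names are assumptions chosen for illustration, not values taken from the description.

```python
from collections import Counter


def origins_over_threshold(objects, threshold):
    """Criterion d: origins that produced at least `threshold` objects."""
    counts = Counter(obj["origin"] for obj in objects)
    return sorted(origin for origin, n in counts.items() if n >= threshold)


def is_intermittent(activation_times, window, min_activations):
    """Criterion e: at least `min_activations` activations within `window` seconds."""
    times = sorted(activation_times)
    start = 0
    for end in range(len(times)):
        while times[end] - times[start] > window:
            start += 1  # slide the window forward
        if end - start + 1 >= min_activations:
            return True
    return False


def severity_by_slot(hour, slots, default):
    """Criterion f: severity depends on the time slot (start <= hour < end)."""
    for start, end, severity in slots:
        if start <= hour < end:
            return severity
    return default
```

A rule could, for instance, order a task only when `is_intermittent` holds for a given key event, and use `severity_by_slot` to decide how urgently the task is queued.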
The fourth step 140 is the step of task management. According to the correlation performed in the previous step, in which several events 1 generally participate which are symptoms of the occurrence of a determined network fault, a task is launched which attempts to solve the fault. Specifically, in this step 140 a message 5 is launched which identifies an agent capable of executing a task, action or activity. There is a univocal, not biunivocal, relationship between tasks which must be taken on and agents configured for executing those tasks. In other words, there is an agent for each task. The possible tasks are a consequence of a determined object correlation (and since the correlation objects come from events 1, it can be said that the possible tasks are a consequence of a determined event correlation). Before launching the task, certain checks are performed in this step which prevent possible problems in the network due to launching these tasks. For that purpose, during this step 140, the following functions will be performed:
• Control of limits of tasks on the network, i.e., it checks that the maximum number of tasks that can be performed on a network element is not exceeded.
• Retentions. It allows retaining a determined task on the network, while certain criteria are met, such as during a predetermined time interval for example.
• Inhibitions. It allows not executing a determined task on the network.
• Handling priorities between tasks, first executing the highest priority task by configuration.
The configuration of the different tasks that can be carried out is included in the repository or database and is configurable by the system administrator.
In this step, after having executed the previous functions, and provided that some or all of the previously listed functions or controls have been passed, the corresponding agent is identified. For that purpose, a message 5 is sent to the following step 150 with the identifier of the agent to be launched. Since there is direct correspondence between tasks and agents, said correspondence in the database is read in this step 140.
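The checks of step 140 can be sketched as a small Python class. The task names, agent identifiers, the limit of two tasks per network element and the convention that a lower number means a higher priority are all assumptions; in the described system this configuration is read from the database.

```python
import heapq

# Hypothetical configuration, standing in for the database contents.
TASK_TO_AGENT = {"reset_port": "agent_reset_port", "open_ticket": "agent_ticket"}
MAX_TASKS_PER_ELEMENT = 2
INHIBITED_TASKS = {"reboot_node"}


class TaskManager:
    def __init__(self):
        self.running = {}  # network element -> number of tasks in progress
        self.queue = []    # heap of (priority, sequence, task, element)
        self.sequence = 0

    def submit(self, task, element, priority):
        """Queue a task unless it is inhibited; lower number = higher priority."""
        if task in INHIBITED_TASKS:
            return False
        heapq.heappush(self.queue, (priority, self.sequence, task, element))
        self.sequence += 1
        return True

    def next_agent(self):
        """Return (agent identifier, element) for the best launchable task, or None."""
        retained, result = [], None
        while self.queue:
            item = heapq.heappop(self.queue)
            _, _, task, element = item
            if self.running.get(element, 0) < MAX_TASKS_PER_ELEMENT:
                self.running[element] = self.running.get(element, 0) + 1
                result = (TASK_TO_AGENT[task], element)
                break
            retained.append(item)  # limit reached: retain the task for later
        for item in retained:
            heapq.heappush(self.queue, item)
        return result

    def finished(self, element):
        """Record that one task on the element has ended."""
        self.running[element] -= 1
```

The pair returned by `next_agent` plays the role of message 5 in this sketch: it carries the identifier of the agent to be launched in the following step.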
The fifth step 150 corresponds to receiving the identifier of the agent to be launched and its subsequent launching or execution. For launching it, a process is started up with the executable of the corresponding agent. The sequence of actions performing a determined task is implemented in this agent. This agent is preferably a C++ executable or a KSH script, which is started from step 150 by means of a fork operating system call. The agents are implemented by experts, who know the sequence of actions necessary for solving a fault. The purpose of an agent is to attempt to automatically repair the fault by means of action 6 on the network 300 (for example, resetting equipment), and optionally, in the event that this is not possible, sending a message indicative of trouble 7 to the following step 160, so that an operator can manually treat the fault. In this latter case, the message indicative of trouble 7 is preferably enriched with all the complementary information that the agent may have obtained on the fault when executing the step 150. For that purpose, the agent provides all the information with the result of the actions executed on the network 300 (for example, the set of commands and responses), and all the information it has queried on the data repositories (by means of queries for obtaining information related to the faulty network element). The agents are different in each of the network domains, and also specific for the different faults. An example of an agent is the one implemented for unblocking a user in the Intelligent Network Platform. All the agents use the same repository query facilities, facilities for performing actions on the network 300 and trouble ticket creation petition facilities (all being optional). The logic defined in the agents is structured in the following parts:
• Start, where the agent starts in terms of the platform services, in order to subsequently be able to invoke them.
• Termination of the agent indicating the result of the execution.
Furthermore, optionally:
• Data query in repositories, in order to obtain events or information necessary for executing the task.
• Action on the network to solve the fault or to obtain network data. For example, if the fault is hardware, the agent does not usually attempt to act, but rather it is limited to creating a trouble ticket. If, in contrast, the fault is software, the agent first attempts to act and if it is unable to fix it, it creates a trouble ticket.
• Petition to create/modify a trouble ticket.
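The structure of an agent (start, optional query, optional action, optional trouble ticket petition, termination with a result) can be sketched as follows. The description states that agents are preferably C++ executables or KSH scripts; Python is used here only for illustration. The three services are passed in as callables, and their signatures, the `"reset"` command, the `"ok"` response and the `kind` field are assumptions.

```python
def run_agent(fault, query_repository, act_on_network, create_ticket):
    """Hypothetical agent: try to repair a fault, otherwise request a trouble ticket."""
    # Start: the platform services are received as callables.
    log = []

    # Optional data query in the repositories.
    context = query_repository(fault["origin"])
    log.append(("query", context))

    # Optional action: a hardware fault is not acted on, a software fault is.
    if fault["kind"] == "software":
        response = act_on_network(fault["origin"], "reset")
        log.append(("action", response))
        if response == "ok":
            return {"result": "repaired", "log": log}  # termination with result

    # Petition to create a trouble ticket, enriched with what was gathered.
    ticket_id = create_ticket({"origin": fault["origin"], "details": list(log)})
    return {"result": "ticket", "ticket": ticket_id, "log": log}
```

Passing the services as parameters mirrors the idea that all agents share the same query, action and trouble ticket facilities while keeping fault-specific logic inside the agent.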
The sixth step 160 is optional and corresponds to the control functionality of the aforementioned trouble tickets 7. This step is only executed if the fault has not been automatically repaired in step 150. This step 160 is also carried out by means of agents (different from those described above intended for attempting to execute a task or activity), which are in charge of completing the different fields of the trouble ticket that must be created, depending on the network domain, and causing the sending of the petition to create the trouble ticket 8 to the trouble ticketing system 200, through the corresponding interface (for example Web Services). The information used for creating the trouble ticket is read from a database, where it was previously stored by the agent. Some or all of the following functions can be performed in this step 160:
• Control of the trouble tickets so as to not repeat them, as a result of different alarms received which are related. This function is based on previously checking the previously created trouble tickets.
• Distributing trouble tickets by user/task group, depending on the hour, shift, holiday, province, ...
• Interface with external systems, by means of WEB Services protocols, API ARUSER, SMS, or LOTUS mail for creating trouble tickets 9.
• Storing and displaying the trouble tickets in the WEB interface, which allows, as a contingency to an external system crash (trouble ticketing system, for example), to continue treating said trouble tickets, and when the trouble ticketing system has recovered, sending all the trouble tickets that had been stored during the crash.
In this step, the root cause of the trouble tickets is also deduced, for example, when a transmission problem also generates an alarm in the mobile access network. In this case, the two trouble tickets are created, but in such a way that the transmission trouble is deduced to be the root cause of the mobile access trouble. To that end, the inventory information linking the mobile access circuits which go through the transmission link is used.
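A minimal Python sketch of two functions of step 160, avoiding repeated trouble tickets and deducing the root cause from inventory information, is given here. The inventory contents, the element names and the ticket layout are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical inventory: transmission link -> mobile access elements it carries.
INVENTORY = {"PDH-LINK-3": ["BTS-11", "BTS-12"]}


def open_tickets(fault_origins):
    """Create one ticket per faulty element and link dependents to their root cause."""
    tickets = {}
    for origin in fault_origins:
        # Control of trouble tickets: do not repeat a ticket for the same element.
        tickets.setdefault(origin, {"origin": origin, "root_cause": None})
    for link, dependents in INVENTORY.items():
        if link in tickets:
            for element in dependents:
                if element in tickets:
                    tickets[element]["root_cause"] = link
    return list(tickets.values())
```

In this sketch both tickets are still created, as in the description, and the mobile access ticket simply carries a reference to the transmission ticket as its root cause.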
All the steps defined above make use 10 of a series of services 101 available for the method of the invention and for the platform or system on which it is implemented, which are defined below:
• Database query service, which allows obtaining information from input data.
• Service for action on the network, which allows sending commands on the network, in order to obtain more information or to attempt to solve a fault. This service in turn has the interface with the network 300 for sending said commands, and receiving the responses, for example, action via Telnet.
• Service for creating a message indicative of trouble, in the moment that a warning that the fault requires manual action is to be generated.
By way of example, in step 120 a query can be made to these services 101 in order to obtain the circuits affected by an event, or in order to enrich a certain event, or in order to encapsulate the connection with the network, or in order to encapsulate a connection with the trouble ticketing system, etc.
Through the previous steps, the method of the invention provides the following advantages over conventional methods: launching tasks automatically, which facilitates a high degree of automation of the operations on the networks; the capacity to act on the network, which allows resolving faults automatically, and if this is not possible, providing greater information on the fault so that it can be manually resolved; the possibility of the user configuring the correlation rules in files, which have significant dependence on the business and on the network, or of describing them in agents, such that the modifications are easy to perform and have less impact on the rest of the system; since it has a modular architecture, the ease of integrating new networks according to the implementation of the protocols for receiving events from the network and protocols for acting on it, and to the files and agents describing the new business algorithms; and it is a scalable architecture according to the different networks or domains integrated in the system.
As can be observed, the present invention executes automated operating and maintenance procedures upon receiving events from different networks which may indicate faults, in order to attempt to solve them and, if this is not possible, generates a warning so that those faults are solved manually by a specialized technician.
To that end, the method and system of the invention use event normalization, correlation and agent launching techniques, all in an automatic and configurable manner.
As has been seen, the method and system have, as input information, the events generated in the network. This input information can be an alarm that can indicate a fault (for example, a fault in a communications line card) or a spontaneous message (for example, a message informing about a status change in the plant). From configuration parameters and correlation rule files, the method and system decide which agent should treat the fault. That agent, by means of its programming, attempts to solve the fault automatically and, if it cannot, or if there is no possible treatment or automatic action, it issues a notification so that a manual action is performed on the network or affected network element.
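By way of illustration only, this decision could be sketched as follows. The function `treat_event`, the agent signature and the dictionaries that stand in for the configuration parameters are assumptions made for this sketch; the original does not specify how the selection is implemented.

```python
from typing import Callable, Dict, Optional

# Hypothetical agent signature: returns True if the fault was solved automatically.
Agent = Callable[[dict], bool]


def treat_event(event: dict,
                agent_for_class: Dict[str, str],
                agents: Dict[str, Agent],
                notify: Callable[[dict], None]) -> str:
    """Select the agent configured for the event class; fall back to a manual warning."""
    agent_name: Optional[str] = agent_for_class.get(event["class"])
    agent = agents.get(agent_name) if agent_name else None
    if agent is not None and agent(event):
        return "solved"
    # No automatic treatment exists, or it did not succeed: request manual action.
    notify(event)
    return "manual"
```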
An example of a network domain integrated in the system is shown below. It is the mobile access network domain. The network elements of this domain are the base station controllers (BSCs).
In this example, all the steps described above are performed. The following agents can be highlighted as examples for this domain:
• AgtGsmMotoGclk_Tree1. This is the agent which treats certain GCLK alarms of the BSS and RCXCDR systems for decision tree 1: if certain alarms are repeated 5 times in one hour, a trouble ticket is generated for manual treatment. In this case, after receiving the alarms, the agent automatically creates a trouble ticket.
• AgtGsmMotoGclk_Tree4. This is the agent which treats certain GCLK alarms of the BSS and RCXCDR systems for decision tree 4: if certain alarms are repeated 3 times in one hour, remaining active for more than 5 minutes, an action (reset) is performed on the XBL device. If this action is successful and the alarm disappears, nothing else has to be done. If, in contrast, the alarm persists, a trouble ticket is generated for manual treatment. In this case, if the action has gone well, the problem will have been solved automatically; otherwise, all the information will be available in the trouble ticket for manual resolution.
These are only two examples among around 670 agents, of greater or lesser complexity, which are defined for facilitating the operation and maintenance of the networks.
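By way of illustration only, decision tree 4 could be sketched as follows. The function `decision_tree_4` and its callables are hypothetical, and the reading of the rule (at least three alarms raised within the last hour, each active for more than five minutes) is an assumption, since the original does not define the condition more precisely.

```python
from typing import Callable, List, Tuple

HOUR = 3600          # observation window, in seconds
MIN_ACTIVE = 5 * 60  # minimum time an alarm must stay active, in seconds


def decision_tree_4(alarms: List[Tuple[float, float]],
                    now: float,
                    reset: Callable[[], bool],
                    alarm_active: Callable[[], bool],
                    create_ticket: Callable[[], None]) -> str:
    """`alarms` holds (start_time, duration_seconds) pairs for one network element."""
    # Count the alarms raised in the last hour that stayed active for more than 5 minutes.
    qualifying = [a for a in alarms if now - a[0] <= HOUR and a[1] > MIN_ACTIVE]
    if len(qualifying) < 3:
        return "no_action"
    # Attempt the automatic repair (the reset of the XBL device in the example above).
    if reset() and not alarm_active():
        return "solved"
    # The alarm persists: hand the fault over for manual treatment.
    create_ticket()
    return "ticket"
```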
An example of the different data and flows between steps is shown below for better understanding of the invention:
• First, the format of the alarm received from the network manager is shown. In this case it is an alarm received through the network element manager (Mobile SOC) for the Motorola-technology mobile access network. This information is the input information (1) of step 110. It basically conveys the information which serves to identify the fault in the equipment.
- Example of Motorola Alarm:
AMBER START EXTENSION
START CELSIG 08-41-11-01-1 END CELSIG
START CGI 214-07-4101-16064 END CGI
START TYPE_UNKOWN ELEMENT END TYPE ELEMENT
START TYPE PROBLEM *NONE*. equipmentFailureEvent
END TYPE PROBLEM
START ADMINISTRATIVE NUMBER 41100001018169
END ADMINISTRATIVE NUMBER
START NETWORK ID 2-23-25 END NETWORK ID
AMBER END EXTENSION
#6643 - NEW - *NONE*. equipmentFailureEvent - IAS - BSS-SE-Bellavista-ll(SE:Ruben_Dario_DCS-25): 25 IAS 0 - 19/09/2007 12:33:36. [105] Fan Tray 2 Failure - FMIC - Investigate -/-. SITE 25 :
(SE:Ruben_Dario_DCS-25): Investigate
The output (2) of step 110 is shown below. In this case, it is the alarm in normalized format in which there are no empty fields. The important information of the alarm is basically maintained, formatting the information in determined fields.
- Normalized Alarm :
ID_EVENT. 37093906
DATE_EVENT. 1190198022
SOC. OMC_SEVILLA
ORIGIN. BSS-SE-Bellavista-ll
DATE. 19/09/2007 12:33:36
TYPE. IAS
TEXT. AMBER START EXTENSION
START CELSIG 08-41-11-01-1 END CELSIG
START CGI 214-07-4101-16064 END CGI
START TYPE_UNKOWN ELEMENT END TYPE ELEMENT
START PROBLEM TYPE *NONE*. equipmentFailureEvent
END PROBLEM TYPE
START ADMINISTRATIVE NUMBER 41100001018169
END ADMINISTRATIVE NUMBER
START NETWORK_ID 2-23-25 END NETWORK_ID
AMBER END EXTENSION
#6643 - NEW - *NONE*. equipmentFailureEvent - IAS - BSS-SE-Bellavista-ll(SE:Ruben_Dario_DCS-25): 25 IAS 0 - 19/09/2007 12:33:36. [105] Fan Tray 2 Failure - FMIC - Investigate -/-. SITE 25 :
(SE:Ruben_Dario_DCS-25): Investigate
NUMBERSEQUENCE. 6643
FAMILY. GSM MOTOROLA
GROUP. equipmentFailureEvent
CATEGORY. 1
CLASS. P
NATURE. A
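By way of illustration only, part of this normalization could be sketched as follows: the summary line of the received alarm is parsed and some of the normalized fields are filled from it. The regular expression `SUMMARY` and the function `normalize` are assumptions written against the single example shown above; the actual normalization rules of step 110 are not given in the original.

```python
import re

# Hypothetical pattern for the summary line of the example alarm.
SUMMARY = re.compile(
    r"#(?P<seq>\d+) - (?P<state>\w+) - \S+ (?P<group>\w+) - (?P<type>\w+) - "
    r"(?P<origin>[^(]+)\((?P<bts>[^)]+)\): .*? - "
    r"(?P<date>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})"
)


def normalize(summary: str, soc: str) -> dict:
    """Return some of the normalized fields for one alarm summary line."""
    m = SUMMARY.search(summary)
    if m is None:
        raise ValueError("unrecognized alarm format")
    return {
        "SOC": soc,                       # manager through which the alarm arrived
        "ORIGIN": m["origin"],
        "DATE": m["date"],
        "TYPE": m["type"],
        "NUMBERSEQUENCE": m["seq"],
        "GROUP": m["group"],
    }
```

The remaining fields of the normalized alarm (for example FAMILY, CATEGORY, CLASS and NATURE) are not derived here, since the original does not state how they are obtained.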
The correlation object which is automatically created in step 130 as a result of receiving the alarm is shown below. This correlation object is the one used for performing the correlation in the inference engine by applying the rules previously defined in a file. Besides the information contained in the alarm, the alarm configuration information which the system already holds is added. An example of this is the persistence, i.e., the time that the engine retains the alarm while waiting to receive a termination before launching a task, in this case 7,200 seconds.
Object Correlation Fields:
TYPE_EVENT. A
CLASS_EVENT. GSM_MOTOROLA_IAS
ORIGIN_EVENT. BSS-SE-Bellavista-ll
FAMILY_EVENT. GSM MOTOROLA
KEY_EVENT. OMC_SEVILLA_6643
ID_EVENT. 37093906:1190198022
GROUP_EVENT. C
DATE_EVENT. 19/09/2007
HOUR_EVENT. 12:33:36
NATURE. A
ORIGIN. BSS-SE-Bellavista-ll
BTS. SE:Ruben_Dario_DCS-25
ID_BTS. 25
ID_EVENT_OMC. 6643
GRI. YES
TREAT. YES
PARAMETERS. 105
PERSISTENCE. 7200
SHORTPERIOD. 3600
SHORTREP. 8
LONGPERIOD. 86400
LONGREP. 200
VERYLONGPERIOD. 0
VERYLONGREP. 0
BLOCK_CLUSTER. IAS
BLOCK_REPET. BSS-SE-Bellavista-ll_25_105_IAS_0
AGENT. AgtGsmMotolas
DISP_COMPLETE. 0
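By way of illustration only, a repetition count of the kind suggested by the SHORTPERIOD and SHORTREP fields could be sketched as follows. The interpretation that a task is ordered when SHORTREP events with the same repetition key arrive within SHORTPERIOD seconds is an assumption, and the class `RepetitionCounter` is hypothetical; the original does not define these fields.

```python
from collections import deque
from typing import Deque


class RepetitionCounter:
    """Counts the events sharing one repetition key inside a sliding time window."""

    def __init__(self, period_s: int, max_repetitions: int) -> None:
        self.period_s = period_s                  # e.g. SHORTPERIOD = 3600
        self.max_repetitions = max_repetitions    # e.g. SHORTREP = 8
        self.times: Deque[float] = deque()

    def add(self, timestamp: float) -> bool:
        """Record one event; return True when the threshold is reached."""
        self.times.append(timestamp)
        # Drop the events that have fallen out of the window.
        while self.times and timestamp - self.times[0] > self.period_s:
            self.times.popleft()
        return len(self.times) >= self.max_repetitions
```

One counter per BLOCK_REPET value would be kept in such a design, so that repetitions are counted separately for each network element and alarm code.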
The data which will serve step 140 for launching the task are shown below. These data are those which are received from step 130. Among these data, there are some which serve as information to be taken into account when the task is launched, for example, in this case the field indicating "REPETITIONS".
- Task request
ID_ORIGIN. 8135130:1256293829
ORIGIN. EOC_MOVILES
CENTRAL. BSS-SE-Bellavista-ll
SYSTEM. GSM MOTOROLA
GROUP. CORRECTIVE
TYPE. AgtGsmMotolas
PRIORITY. 0
ID_POSTAGE. 19
ID_EXTERNAL. 8135130:1256293829
DAT_EXT1.
DAT_EXT2.
DAT_EXT3.
DAT_EXT4.
DAT_EXT5.
DAT_EXT6.
DAT_EXT7. REPETITIONS
DAT_EXT8.
DAT_EXT9.
DAT_EXT10. MANAGER_FAILURES_001
The following is the result of the start-up of the AgtGsmMotolas process by step 150. The log lines show, for example, the time taken to execute the task (20 seconds), as well as the process identifier (Pid).
-Launch of the agent:
[22/10/2009 16:01:00] Pid(22558): Launching DYNAMIC_AGENT 004 (pid 15951)
../OmegaGsmMotorola bin/AgtGsmMotolas 1 1278205 8135130:1256293829
[22/10/2009 16:01:20] Pid(22558): The execution of DYNAMIC_AGENT 004 (pid=15951) has concluded
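By way of illustration only, the execution time mentioned above can be recovered from the two timestamps of the launch log as follows. The functions `parse_stamp` and `duration_seconds` are hypothetical helpers written for this sketch; the pattern tolerates stray spaces around the colons of the time.

```python
import re
from datetime import datetime

# Timestamp in square brackets at the start of each log line, day/month/year order.
STAMP = re.compile(r"\[(\d{2}/\d{2}/\d{4}) (\d{2}):\s*(\d{2})\s*:\s*(\d{2})\]")


def parse_stamp(line: str) -> datetime:
    """Extract the bracketed timestamp from one log line."""
    m = STAMP.search(line)
    if m is None:
        raise ValueError("no timestamp found")
    day, hh, mm, ss = m.groups()
    return datetime.strptime(f"{day} {hh}:{mm}:{ss}", "%d/%m/%Y %H:%M:%S")


def duration_seconds(start_line: str, end_line: str) -> float:
    """Elapsed time between the launch line and the conclusion line."""
    return (parse_stamp(end_line) - parse_stamp(start_line)).total_seconds()
```

Applied to the launch line and the conclusion line shown above, `duration_seconds` returns 20.0.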
This invention can be applied in networks of any type which require monitoring, operation and maintenance. It has been tested and provides excellent results in telecommunication networks, particularly in the operating and maintenance processes for said networks for telecommunications operators. It is also extensible to other technological areas which can provide an input stimulus to the system and/or which can be operated on, such as, for example, energy management control by means of analyzing the energy data received through this system.
In view of this description and drawing, the person skilled in the art will understand that the invention has been described according to several preferred embodiments thereof, but that multiple variations can be introduced in said preferred embodiments without departing from the object of the invention as it has been claimed.
Claims
1. Monitoring and operating method, characterized in that it comprises the steps of:
- receiving and normalizing (110) a plurality of heterogeneous events (1) generated by at least one network element of a telecommunications network (300), obtaining a plurality of events with a normalized field structure (2);
- for each event with a normalized field structure (2), generating a flow of information of attribute-value pairs (3), after having applied one or more filtering criteria;
- from each flow of information formed by a set of attribute-value pairs (3), generating a correlation object;
- correlating (130) one or more correlation objects by means of at least one criterion included in an inference engine, determining, if said at least one correlation criterion is met, that a fault has occurred in the network (300);
- executing a task associated with the treatment of said fault;
- executing (150) an agent associated with said task, such that:
- if the fault can be resolved remotely, acting (6) on the network (300);
- if the fault cannot be resolved remotely, sending a trouble ticket message (7) so that a person can manually repair the fault.
2. Method of claim 1, wherein said step of generating a flow of information (120) from each event with a normalized field structure (2) comprises filtering said events (2) by comparison between the fields of the event (2) and previously configured filtering patterns, such that only the events which have passed said filtering are provided to the step of correlation.
3. Method of any of the previous claims, wherein said step of generating a flow of information (120) from each event with a normalized field structure (2) further comprises at least one of the following actions: inhibiting events (2), temporarily retaining events (2) and enriching the information comprised in events (2).
4. Method of any of the previous claims, wherein if the fault cannot be resolved remotely and a trouble ticket message (7) is sent so that a person can manually repair the fault, sending (8) said trouble ticket to a trouble ticketing system (200) so that there is a record of said task.
5. Method of claim 4, wherein said step of controlling the trouble tickets (160) comprises at least one of the following activities:
- distributing trouble tickets by different criteria;
- interface with the trouble ticketing system (200) for creating the trouble tickets;
- storing and displaying the trouble tickets in a WEB interface.
6. Method of any of the previous claims, wherein said at least one correlation criterion included in a configuration file used by the inference engine is chosen from among the following: association of activities and terminations of correlation objects generated from alarm-type events, association by different criteria of objects generated from events, counting of objects generated from the flow of information formed by attribute-value pairs for ordering the execution of a task, intermittence of the objects generated from alarms in a time period and decision-making by time slot.
7. Method of any of the previous claims, wherein said step of generating a flow of information (120) from each event with a normalized field structure (2) further comprises storing said events with a normalized field structure (2) in a repository.
8. Method of any of the previous claims, wherein said step of sending (140) a message (4) comprising an identifier of an agent capable of executing a task associated with said fault is performed after at least one of the following actions:
- controlling the limit of tasks which can be executed on the network (300);
- retaining the jobs while determined criteria are met;
- inhibiting jobs on the network (300); and
- establishing priorities between the tasks.
9. Method of any of the previous claims, configured to hierarchically find the root cause of the trouble tickets.
10. Method of any of the previous claims, configured to execute tasks programmed in the network (300), without having to receive any external stimulus.
11. System for monitoring and operation configured to carry out the steps of the method of any of claims 1 to 10.
12. System for operation and maintenance of claim 11, comprising at least one server.
13. System for operation and maintenance of claim 11 or 12, comprising a trouble ticketing module (200).
14. A computer program comprising computer program code means suitable for performing the steps of the method of any of claims 1 to 10 when the mentioned program is executed in a computer, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, a microprocessor, a microcontroller or any other type of programmable hardware, even in a distributed manner.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
ESP200930945 | 2009-11-03 | ||
ES200930945A ES2376212B1 (en) | 2009-11-03 | 2009-11-03 | METHOD AND SYSTEM OF SUPERVISION AND OPERATION OF TELECOMMUNICATIONS NETWORKS. |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2011054861A1 true WO2011054861A1 (en) | 2011-05-12 |
Family
ID=43477913
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2010/066730 WO2011054861A1 (en) | 2009-11-03 | 2010-11-03 | Monitoring and management of heterogeneous network events |
Country Status (4)
Country | Link |
---|---|
AR (1) | AR078877A1 (en) |
ES (1) | ES2376212B1 (en) |
UY (1) | UY32984A (en) |
WO (1) | WO2011054861A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017000690A1 (en) * | 2015-06-29 | 2017-01-05 | 中兴通讯股份有限公司 | Worksheet processing method and device and terminal network management system |
CN112953825A (en) * | 2021-01-26 | 2021-06-11 | 中山大学 | Attribute heterogeneous network embedding method, device, equipment and medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0686329B1 (en) | 1993-02-23 | 1997-09-03 | BRITISH TELECOMMUNICATIONS public limited company | Event correlation in telecommunications networks |
EP0686336B1 (en) | 1993-02-23 | 1998-05-20 | BRITISH TELECOMMUNICATIONS public limited company | Event correlation |
ES2216727A1 (en) | 2004-05-17 | 2004-10-16 | Teytel, S.A. | Automated remote access system for electronic devices, has maintenance part connected with custom link token ring concentrator, and interfaces connected with equipment maintenance part to manage corresponding signals of alarms |
EP1031208B1 (en) | 1997-10-14 | 2005-09-21 | Nokia Corporation | Network monitoring method for telecommunications network |
US20080082661A1 (en) * | 2006-10-02 | 2008-04-03 | Siemens Medical Solutions Usa, Inc. | Method and Apparatus for Network Monitoring of Communications Networks |
US7607169B1 (en) * | 2002-12-02 | 2009-10-20 | Arcsight, Inc. | User interface for network security console |
-
2009
- 2009-11-03 ES ES200930945A patent/ES2376212B1/en not_active Withdrawn - After Issue
-
2010
- 2010-10-28 UY UY0001032984A patent/UY32984A/en not_active Application Discontinuation
- 2010-11-02 AR ARP100104053A patent/AR078877A1/en unknown
- 2010-11-03 WO PCT/EP2010/066730 patent/WO2011054861A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
UY32984A (en) | 2010-12-31 |
ES2376212A1 (en) | 2012-03-12 |
ES2376212B1 (en) | 2013-01-29 |
AR078877A1 (en) | 2011-12-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10778959 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 10778959 Country of ref document: EP Kind code of ref document: A1 |