WO2011054861A1 - Monitoring and management of heterogeneous network events

Info

Publication number: WO2011054861A1
Application number: PCT/EP2010/066730
Authority: WO
Inventors: Juan Pedro Alcaide Sanz; Juan Antonio Guzman Sanchez; José Alberto ESTEBAN GOMEZ; José María ALAMEDA GARCIA; Isidro De Ancos Tantes; Fernando Esquinas Lozano; Luis Montejo Calvo; Vicente Martinez Gallardo; Ruben Garcia Azuara; Pablo Manzano Garcia
Original assignee: Telefonica, S.A.
Priority date: 2009-11-03
Filing date: 2010-11-03
Publication date: 2011-05-12
Also published as: UY32984A; ES2376212A1; ES2376212B1; AR078877A1

Abstract

The invention relates to a monitoring and operating method, characterized in that it comprises the steps of: receiving and normalizing (110) a plurality of heterogeneous events (1) generated by at least one network element of a telecommunications network (300), obtaining a plurality of events with a normalized field structure (2); for each event with a normalized field structure (2), generating a flow of information of attribute-value pairs (3), after having applied one or more filtering criteria; from each flow of information formed by a set of attribute-value pairs (3), generating a correlation object; correlating (130) one or more correlation objects by means of at least one criterion included in an inference engine, determining, if said at least one correlation criterion is met, that a fault has occurred in the network (300); executing a task associated with the treatment of said fault; executing (150) an agent associated with said task, such that: if the fault can be resolved remotely, acting (6) on the network (300); if the fault cannot be resolved remotely, sending a trouble ticket message (7) so that a person can manually repair the fault.

Description

MONITORING AND MANAGEMENT OF HETEROGENEOUS NETWORK EVENTS

Field of the Invention

The present invention applies to the field of telecommunications and, more specifically, to the monitoring, operation and maintenance in telecommunications networks.

Background of the Invention

Until now, operation and maintenance initiatives in telecommunications networks have had a large manual component, due in large part to the diversity of events that these networks generate and to the complexity involved in relating them, both within a network and between different networks.

In this field, European patent EP-0686329-B1 discloses a method for correlating events occurring in a telecommunications network formed by individual network elements. At least one of the network elements is represented in a computer by means of individual program modules. Each program module has a data set related to the network element which it represents; the data set of the module includes, for example, a list of network elements having input associations with respect to the event correlation with the network element represented by the module.

Spanish patent application ES-2216727 describes a system for automated remote access based on a centralized management system. The system has means of linking with transmitter with remote concentrator cards (RCC) located in the clients' headquarters, in which the equipment maintenance interfaces can be connected.

European patent EP-0686336-B1 discloses a system which automatically processes alarm signals and performs their storage and correlation. To that end, this system uses an empirical approach for identifying alarm-type events. It specifically proposes a method for automatically processing alarm signals in a telecommunications network, in which alarm-type event signals are transmitted from the telecommunications network to a network management system. For automatic processing, the historical data relating to the times in which the alarm conditions occur in the network are stored for a certain time, the alarm conditions which occur in a certain temporal window are identified, the identified alarm conditions are correlated by analyzing historical data in order to determine the probability that pairs of identified alarm conditions randomly occur within the same temporal window and these probabilities are viewed.

Finally, European patent EP-1031208-B1 discloses a telecommunications network monitoring method comprising a large number of network elements. Said method comprises the steps of collecting, in a centralized manner, alarms coming from the network elements, displaying the alarms as network-element-specific graphical presentations on a graphical user interface, collecting, in a centralized manner, network-element-specific performance information of each network element and the visual representation simultaneously with the alarms of the respective network element's performance on said graphical user interface.

As can be observed, the mentioned patents are aimed at correlating alarms, but they have the deficiency that they do not contemplate automating the treatment of the alarms and the solution of the faults.

Summary of the Invention

The present invention seeks to cover this deficiency by means of automating the treatment of events and solution of possible faults.

Thus, in one aspect of the present invention, a monitoring and operating method is provided, comprising the steps of: receiving and normalizing a plurality of heterogeneous events generated by at least one network element of a telecommunications network, obtaining a plurality of events with a normalized field structure; for each event with a normalized field structure, generating a flow of information of attribute-value pairs, after having applied one or more filtering criteria; from each flow of information formed by a set of attribute-value pairs, generating a correlation object; correlating one or more correlation objects by means of at least one criterion included in an inference engine, determining, if at least one correlation criterion is met, that a fault has occurred in the network; executing a task associated with the treatment of said fault; executing an agent associated with said task, such that: if the fault can be resolved remotely, acting on the network; if the fault cannot be resolved remotely, sending a trouble ticket message so that a person can manually repair the fault.

The step of generating a flow of information from each event with a normalized field structure preferably comprises filtering said events by comparison between the fields of the event and previously configured filtering patterns, such that only the events which have passed said filtering are provided to the step of correlation.

The step of generating a flow of information from each event with a normalized field structure further comprises at least one of the following actions: inhibiting events, temporarily retaining events and enriching the information comprised in events.

If the fault cannot be resolved remotely and a trouble ticket message is sent so that a person can manually repair the fault, said trouble ticket is sent to a trouble ticketing system so that there is a record of said task. This step of controlling the trouble tickets comprises at least one of the following activities: distributing trouble tickets by different criteria; interfacing with the trouble ticketing system for creating the trouble tickets; storing and displaying the trouble tickets in a WEB interface.

At least one correlation criterion is included in a configuration file used by the inference engine and is preferably chosen from among the following: association of activations and terminations of correlation objects generated from alarm-type events, association by different criteria of objects generated from events, counting of objects generated from the flow of information formed by attribute-value pairs for ordering the execution of a task, intermittence of the objects generated from alarms in a time period and decision-making by time slot.

The step of generating a flow of information from each event with a normalized field structure furthermore preferably comprises storing said events with a normalized field structure in a repository.

The step of launching a message comprising an identifier of an agent capable of executing a task associated with said fault is performed after at least one of the following actions: controlling the limit of tasks which can be executed on the network; retaining the jobs while determined criteria are met; inhibiting jobs on the network; and establishing priorities between the tasks.

The method is furthermore preferably configured to hierarchically find the root cause of the trouble tickets.

The method is preferably configured to execute tasks programmed in the network without having to receive any external stimulus.

In another aspect of the present invention, a monitoring and operating system configured to carry out the steps of the aforementioned method is provided. This monitoring and operating system preferably comprises at least one server. In a particular embodiment, the system comprises a trouble ticketing module.

Finally, in another aspect of the invention, a computer program is provided, comprising computer program code means suitable for performing the steps of the method described above when the mentioned program is executed in a computer, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, a microprocessor, a microcontroller or any other type of programmable hardware, even in a distributed manner.

In summary, the invention provides a structured monitoring and operating system for managing agents for any type of network, and consisting of a set of processes capable of treating very heterogeneous network fault events, characterized by having a method capable of correlating the events and attempting to detect and, where appropriate, to resolve the fault automatically, and to otherwise generate a trouble ticket for being treated manually by specialized technicians.

The system and method allow new types of jobs to be incorporated into the system dynamically, simply by configuring the job in the database and configuring the rule change needed to launch it.

Furthermore, the network monitoring system and method can add information about the fault to the manual interventions so that the fault is easier to resolve. This information, which is inserted in the trouble ticket, is extracted from the information of the event and from queries made by the agent on the network element.

Furthermore, the network monitoring system and method can be applied to any type of network which has as an output source events indicating the status of the network. It is further scalable in different architectures, depending on the type of the networks to be managed.

It also has the capacity to hierarchically find the root cause of the trouble tickets, i.e., it can determine that a fault in broadband is a consequence of a fault in the transport network.

The network monitoring system and method also have the special capacity to execute jobs programmed in the network without having to receive any external stimulus. It allows auto-diagnosing the working of the processes making up the system in order to attempt to assure the correct working thereof, as well as a capacity to export events to other systems.

These and other advantages will be apparent in view of the detailed description of the invention.

Brief Description of the Drawings

For the purpose of aiding a better understanding of the features of the invention according to a preferred practical embodiment thereof, and in order to complement this description, the following figure is attached as an integral part thereof with an illustrative and non-limiting character:

Figure 1 shows a flow chart of the steps of execution of the method according to an embodiment of the present invention.

Detailed Description of the Invention

In the context of the present invention, the term "agent" must be understood as a specific process in charge of hosting and implementing the logic for attempting to solve a specific type of fault or for generating a trouble ticket in the event that the fault could not be solved automatically and requires manual intervention. This agent can be implemented, for example, as a C++ executable, or a script written in KSH.

In the context of the present invention, the term "task" must be understood as the set of actions involved in performing an activity, and that are defined in the agent itself.

In the context of the present invention, it must be understood that the term "approximately" and the terms of its family (such as "approximate", "approximation", etc.) indicate values or forms very close to those accompanying the aforementioned term. In other words, a deviation within reasonable limits with respect to an exact value or form must be accepted because the person skilled in the art will understand that such deviation with respect to the indicated values or forms is inevitable due to measuring inaccuracies, etc. The same is applied to the terms "around" and "about".

In this text, the term "comprises" and derivations thereof (such as "comprising", etc.) must not be understood in an excluding sense, i.e., these terms must not be interpreted as excluding the possibility that what is described and defined may include more elements, steps, etc.

Figure 1 shows a flow chart of the steps of execution 110, 120, 130, 140, 150, 160 of the method of the invention. The method is implemented in a monitoring and operating system which is executed in a distributed manner in a set of servers. In this figure, reference numbers 110 to 160 indicate the steps of the method. Figure 1 further shows a block 101 which schematically represents the services on which said steps 110, 120, 130, 140, 150, 160 are supported, a block 300 which represents the telecommunications network or networks under monitoring (operation and maintenance) and a block 200 which represents a trouble ticketing system of the network. The method, which facilitates automation in the operation and maintenance of telecommunications networks 300, comprises six steps 110, 120, 130, 140, 150, 160, with their corresponding functional blocks.

The first step 110 is receiving and normalizing the events 1 received from the different monitored networks 300. These events 1 are received in one or more servers forming the monitoring and operating system. There are three types of events 1:

• Alarm-type event: Message with activation and termination.

• Spontaneous-type event: Informative message without an associated termination.

• Status-type event: Information about equipment status.

In this step 110, the following tasks are performed:

a. By means of different communication protocols, the events 1 of the different networks 300 are received in the system. Non-limiting examples of the protocols used are those based on Socket TCP/IP (Transmission Control Protocol and Internet Protocol), SNMP (Simple Network Management Protocol) and TEMIP (Telecommunications Management Information Platform). Non-limiting examples of networks 300 from which events are received are: mobile access networks, such as those using GSM and UMTS technologies, receiving the events by means of TCP/IP protocol; switching networks receiving the events by means of TCP/IP protocol; broadband networks, such as ATM (Asynchronous Transfer Mode) or the RIMA network (Telefonica's IP network), by means of TCP/IP protocol; transport networks, such as the Synchronous Digital Hierarchy (SDH) network, by means of WEB Services protocol; narrowband access network by means of TCP/IP and SNMP protocols; and speech platforms such as ANAS (Automatic Network Answering Service) by means of SNMP protocol. Each of these networks 300 has one or more network elements.

b. Normalizing and adapting all the events 1 received from the different networks 300. Normalized events 2, which all have the same homogeneous structure, are formed from the events 1 of the different networks 300, such that they can be analyzed. As an example, for alarm-type events, the fields that are normalized are:

• sender: indicates the module of the system that sends the alarm to the alarm receiving module.

• soc: manager sending the alarm.

• origin: network element of the network 300 generating the alarm.

• numberSequence: sequence number of the alarm.

• key_event: identifier used for the association of activation-termination.

• id_event_bd: unique identifier of the alarm in the platform.

• family: technology of the network element.

• category: category of the alarm (severity)

• class: class of the alarm (momentary, i.e. without termination, or permanent, with termination)

• date: date of generating the alarm

• hour: hour of generating the alarm

• group: cluster to which the reported problem belongs

• type: type to which the reported problem belongs

• nature: nature or activation

• observations: observations of the alarm

• key_cluster: has no defined usefulness

• progression: indicates if it is sent to another system by SNMP

• data_auxiliary: has no defined usefulness

• expiration: time in which the alarm must be automatically terminated

• text: text of the alarm

The preceding fields represent a non-limiting example. In other words, it is up to the manager or supervisor of the operating and maintenance platform to modify, by reducing or increasing, the number of fields that are normalized for the alarm-type event or for any other event.

The normalization of the events 1 is performed by applying patterns or rules to the information received in the event 1, which are used to complete the aforementioned normalized fields. These rules are based on the knowledge of a network expert, whereby the normalization is parameterized. As an example of the relationship between alarms, alarms from a transport network and alarms from a broadband network can be related in the event of a fault: a cutoff in a PDH transmission line is related to a cutoff in the broadband communications, or mobile communications, going over said transmission line. The different alarms can be related through one or more normalized fields.
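By way of a non-limiting illustration, the rule-based normalization described above can be sketched in Python as follows. The raw message layouts, the rule table RULES, the subset of fields and the function name normalize_event are hypothetical and chosen only for this sketch; the description does not specify them.

```python
import re

# Hypothetical rules: one regular expression per source domain. Named groups
# map directly onto normalized field names. The raw formats are assumptions.
RULES = {
    "sdh": re.compile(
        r"^(?P<origin>[^|]+)\|(?P<category>\w+)\|(?P<nature>ACT|TERM)\|(?P<text>.*)$"
    ),
    "atm": re.compile(
        r"^sev=(?P<category>\w+) node=(?P<origin>\S+) "
        r"state=(?P<nature>ACT|TERM) msg=(?P<text>.*)$"
    ),
}

# A subset of the normalized fields, kept short for the example.
NORMALIZED_FIELDS = ("soc", "origin", "family", "category", "nature", "key_event", "text")


def normalize_event(domain, soc, raw):
    """Return a dict with the same field structure for every source domain."""
    match = RULES[domain].match(raw)
    if match is None:
        raise ValueError(f"no normalization rule matches: {raw!r}")
    event = dict.fromkeys(NORMALIZED_FIELDS, "")
    event.update(match.groupdict())
    event["soc"] = soc
    event["family"] = domain
    # key_event ties an activation to its termination; deriving it from the
    # origin and the alarm text is an assumption of this sketch.
    event["key_event"] = f"{event['origin']}:{event['text']}"
    return event


a = normalize_event("sdh", "mgr1", "NE-7|critical|ACT|Loss of signal")
b = normalize_event("atm", "mgr2", "sev=major node=SW-3 state=ACT msg=Port down")
```

Both results carry exactly the same keys, which is what allows the later steps to treat events from different networks uniformly.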

The different events 1 of this step 110 have a different configuration for each of the domains (a domain being understood as a technology or a particularization of a type of network; for example, within the broadband network, the ATM access network domain, the IP access network domain, and the IP network domain), which is the reason why it is necessary to normalize the events 1. The result of this step 110 (the events with a normalized field structure 2) is reported to the process of the second step 120. It is further possible to receive the events through different connection channels, one backing up the other, eliminating event repetitions by comparing the received text.
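As an illustrative sketch only, the elimination of repetitions received over redundant connection channels could be approached by remembering the text of recently received events. The class name DuplicateSuppressor, the channel labels and the bounded memory size are assumptions of this example.

```python
class DuplicateSuppressor:
    """Drops an event whose text has already been received on any channel."""

    def __init__(self, max_remembered=10000):
        self._seen = {}  # insertion-ordered: text -> channel that delivered it first
        self._max = max_remembered

    def accept(self, channel, text):
        """Return True if the event is new, False if it is a repetition."""
        if text in self._seen:
            return False
        self._seen[text] = channel
        if len(self._seen) > self._max:
            # Forget the oldest text to bound memory use.
            del self._seen[next(iter(self._seen))]
        return True
```

A caller would invoke accept once per received event and forward the event to normalization only when it returns True.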

The second step 120 is the step of pre-processing and filtering the normalized events 2 which are received from the previous step 110. "Pre-processing" is understood as one or more actions carried out on normalized events 2, among those which are described below. The step of pre-processing 120 is useful so that the monitoring and operating system is not saturated with insignificant events.

a. - Filtering events. Filtering is understood as the action of non-progression of determined events which are not considered important for the remaining steps or for the operation and maintenance of the network 300. A particular case of filtering is inhibition, which relates to a more temporary filtering situation, such as the non-progression of determined events which are known to be received due to, for example, a job being performed in the network. As an example, the criteria for filtering events 2 are the following: soc, origin, type, text (allows regular expressions), text_missing (allows regular expressions; if the text does not appear in the text field of the event, it is filtered), numberSequence, family, group, category, class, nature, observations, text (explains and summarizes so that the filter works), key_cluster, progression, data_auxiliary (wild filtering field), expiration.

This filtering is performed by comparison between the fields of the normalized event 2 and previously configured filtering patterns (for example, it can be configured as a filtering pattern for filtering the events of a determined origin). This task allows filtering normalized events 2 before beginning the subsequent step of correlation 130. The configuration of the filtering criteria is performed through forms available by means of a WEB interface.
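The comparison between event fields and configured filtering patterns can be sketched as follows, by way of example only. The pattern list, the helper names criterion_matches and is_filtered, and the rule that an event is filtered when all criteria of any one pattern match are assumptions made for this illustration.

```python
import re

# Hypothetical filtering patterns, as an administrator might configure them.
# An event is filtered (not progressed) when every criterion of any one
# pattern matches. "text" and "text_missing" hold regular expressions.
FILTER_PATTERNS = [
    {"origin": "LAB-ROUTER-1"},
    {"family": "atm", "text": r"heartbeat"},
    {"soc": "mgr9", "text_missing": r"CRITICAL"},
]


def criterion_matches(name, expected, event):
    if name == "text":
        return re.search(expected, event.get("text", "")) is not None
    if name == "text_missing":
        # Matches when the expression does NOT appear in the event text.
        return re.search(expected, event.get("text", "")) is None
    return event.get(name) == expected


def is_filtered(event, patterns=FILTER_PATTERNS):
    return any(
        all(criterion_matches(name, value, event) for name, value in pattern.items())
        for pattern in patterns
    )


events = [
    {"soc": "mgr1", "origin": "LAB-ROUTER-1", "family": "ip", "text": "Link down"},
    {"soc": "mgr2", "origin": "SW-3", "family": "atm", "text": "heartbeat ok"},
    {"soc": "mgr9", "origin": "NE-7", "family": "sdh", "text": "CRITICAL: loss of signal"},
    {"soc": "mgr9", "origin": "NE-8", "family": "sdh", "text": "minor drift"},
]
# Only the events that pass the filter progress to correlation.
passed = [event for event in events if not is_filtered(event)]
```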

b. - Temporarily retaining events by means of configurable criteria. The events 2 are retained and, therefore, they do not progress to the following steps until the criteria defined by configuration are no longer met. This task affects the decisions made in the subsequent step of correlation 130, because it modifies the order of the events 1 which are received from the network 300. This temporary retention is optional.

c. - Enriching the information of the events. This consists of adding information which is not included in the event which is received from the network 300. The actions of enriching the information are configured in a data file reflecting mainly: (a) the queries to the inventory or repository that must be made in order to obtain complementary information, and (b) the commands which must be executed on the corresponding network for the same purpose (queries to the network). As an example of a query, when an event arrives from a transmission element, a query can be made in the inventory as to the circuits passing through said element. The file is particular for each domain, and for each "type" of event within the domain. The management of this configuration is performed by the system administrator. As an example of a command, when an event arrives from a router, the information received in the event can be enriched by requesting the version from the router by means of the command "show version". This enrichment is optional.

d. - Storing the normalized events 2 in the repository of the system.

e. - Sending the data necessary for generating the correlation object to the following step of correlation 130. These data are passed on in a string, such as a list of FIELD/VALUE tuple records. In other words, for each event with a normalized field structure 2, a flow of information formed by a set of attribute-value pairs is passed to the following step 130.
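One possible way of passing the attribute-value pairs as a string is sketched below. The concrete wire format (FIELD=VALUE items separated by semicolons, with percent-encoding of special characters) and the function names to_flow and from_flow are assumptions of this example; the description only states that a list of FIELD/VALUE tuples is sent as a string.

```python
from urllib.parse import quote, unquote


def to_flow(event):
    """Serialize a normalized event as 'FIELD=VALUE;FIELD=VALUE;...'."""
    return ";".join(
        f"{quote(str(field), safe='')}={quote(str(value), safe='')}"
        for field, value in event.items()
    )


def from_flow(flow):
    """Rebuild the attribute-value pairs on the receiving side."""
    pairs = (item.split("=", 1) for item in flow.split(";") if item)
    return {unquote(field): unquote(value) for field, value in pairs}
```

Percent-encoding keeps the separators unambiguous even when a value, such as the alarm text, itself contains a semicolon or an equals sign.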

The third step 130 is the step of correlation of the events. More specifically, this step comprises creating correlation objects and correlating said objects. In this step, correlation objects are automatically generated from the data received (a flow of information formed by attribute-value pairs 3). These correlation objects are classes, the structure of attributes of which is defined in configuration files, and the values of which are completed from the data received in the previous step. In other words, the correlation objects are generated according to a series of criteria included in the configuration files, indicating which correlation object must be formed and how. An inference engine is used during this step. The inference engine is a reasoning module which allows applying a series of reasoning criteria (hereinafter referred to as "rules") for making different decisions, depending on the input data. These decisions can have a certain degree of uncertainty, i.e., they are not 100% reliable. The inference engine is based on rules defined in a data file. The inference engine is implemented by means of a UNIX process. It is implemented by means of a conventional RETE algorithm, the content of which is outside the scope of the invention.

Once the correlation objects are created, the inference engine is in charge of correlating said objects, always for the purpose of reducing the manual treatment of events generated by the network. These correlation objects are particular for each network domain. In this step 130, according to the business logic configured in the rules, an order is generated directed to the following step 140 for performing a task, task being understood as the set of actions involved in performing an activity. In other words, the result of this step 130 is a message 4 in string form comprising an order to launch a task. This message 4 which is sent to the following step comprises the parameters which identify the task that must be executed, as well as some parameters which serve for the execution thereof. As stated, this message is sent as a string.
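The creation of correlation objects from configuration and the generation of a task-launch message can be illustrated with the simplified sketch below. It deliberately uses a naive linear scan over the rules rather than a RETE algorithm, and the configuration table, the rule format, the task name and the message layout are all hypothetical.

```python
# Hypothetical configuration: which attributes each correlation object carries.
OBJECT_CONFIG = {"LinkAlarm": ("origin", "category", "nature", "key_event")}


def build_object(kind, pairs):
    """Create a correlation object holding only the configured attributes."""
    attributes = OBJECT_CONFIG[kind]
    return {"kind": kind, **{name: pairs.get(name, "") for name in attributes}}


# A rule is a (condition, task name) pair. A real inference engine would
# match rules incrementally; this sketch evaluates every rule each time.
RULES = [
    (
        lambda o: o["kind"] == "LinkAlarm"
        and o["category"] == "critical"
        and o["nature"] == "ACT",
        "reset_port",
    ),
]


def evaluate(obj, rules=RULES):
    """Return one task-launch message (a string) per rule the object meets."""
    return [
        f"task={task};origin={obj['origin']}"
        for condition, task in rules
        if condition(obj)
    ]
```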

Correlation criteria (from the inference engine) are applied to the correlation objects generated. The following correlation criteria are always applied:

a. - Association of activations and terminations of correlation objects generated from the alarms (alarm-type events). In other words, alarm activations and terminations are correlated. It must be observed that a general-type event, for example, an informative event, does not have activation and termination, so it makes no sense to apply this correlation criterion to events other than alarms. If the termination of the corresponding alarm is received after the arrival of an activation of an alarm, the two alarm-type events are ruled out because the problem is considered to be concluded. The relationship between the activation and the termination of the correlation objects is established by means of a field called "Key Event".

b. - Association by different criteria of objects generated from events. In this case, tasks are executed as the result of relating a determined set of objects.
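The association of activations and terminations through the key event field (criterion a) can be sketched, purely as an example, as follows. The class name and the values "ACT" and "TERM" used for the nature field are assumptions of this illustration.

```python
class ActivationTerminationCorrelator:
    """Pairs alarm activations with their terminations by key_event."""

    def __init__(self):
        self.active = {}  # key_event -> correlation object awaiting termination

    def feed(self, obj):
        key = obj["key_event"]
        if obj["nature"] == "ACT":
            self.active[key] = obj
        elif obj["nature"] == "TERM":
            # Termination after activation: both are ruled out, because the
            # problem is considered to be concluded.
            self.active.pop(key, None)

    def open_alarms(self):
        """Return the activations that have not yet been terminated."""
        return list(self.active.values())
```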

Other optional correlation criteria are shown below. The following list is not exhaustive:

c. - Persistence of objects generated from any type of event during a determined time. This scenario allows executing a task when determined objects persist for a determined time interval.

d. - Counting of objects generated from data received 3 by any attribute of the objects defining the domain, for example, the attribute "Origin" or "Manager", extracted from the event that caused the creation of the object. This scenario allows ordering the execution of tasks according to the number of times an object occurs, or according to certain criteria, such as for example, the number of objects caused in an origin.

e. - Intermittence (activations/terminations) of the objects generated from the alarms in a time period. This is the case in which a network element has an intermittent fault, i.e., it is activated and deactivated in a time period in a continuous manner.

f. - Decision-making by time slot. For example, a different severity is given to the faults depending on the time slot in which they occur, and thus they can be prioritized for their resolution.
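As a non-limiting sketch, criteria d, e and f above can be approximated with a sliding-window counter and a time-slot function. The window lengths, thresholds, severity names and the assumed business hours are illustrative values only.

```python
from collections import defaultdict, deque


class WindowCounter:
    """Counts occurrences per key inside a sliding time window (seconds)."""

    def __init__(self, window, threshold):
        self.window = window
        self.threshold = threshold
        self._times = defaultdict(deque)

    def add(self, key, timestamp):
        """Record one occurrence; return True when the threshold is reached."""
        times = self._times[key]
        times.append(timestamp)
        # Discard occurrences that have fallen out of the window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


def severity_for(hour, base="minor"):
    """Time-slot decision: raise the severity during assumed business hours."""
    return "major" if 8 <= hour < 20 else base


# Counting (criterion d): three objects from the same origin within 60 s.
by_origin = WindowCounter(window=60, threshold=3)
# Intermittence (criterion e): four activation/termination changes of the
# same key_event within 300 s.
flapping = WindowCounter(window=300, threshold=4)
```

A rule would call by_origin.add(obj["origin"], t) for each object, or flapping.add(obj["key_event"], t) for each activation or termination, and order a task when the call returns True.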

The fourth step 140 is the step of task management. According to the correlation performed in the previous step, in which several events 1 generally participate which are symptoms of the occurrence of a determined network fault, a task is launched which attempts to solve the fault. Specifically, in this step 140 a message 5 is launched which identifies an agent capable of executing a task, action or activity. There is a univocal, not biunivocal, relationship between tasks which must be taken on and agents configured for executing those tasks. In other words, there is an agent for each task. The possible tasks are a consequence of a determined object correlation (and since the correlation objects come from events 1, it can be said that the possible tasks are a consequence of a determined event correlation). Before launching the task, certain checks are performed in this step which prevent possible problems in the network due to launching these tasks. For that purpose, during this step 140, the following functions will be performed:

• Control of limits of tasks on the network, i.e., it checks that the maximum number of tasks that can be performed on a network element is not exceeded.

• Retentions. It allows retaining a determined task on the network, while certain criteria are met, such as during a predetermined time interval for example.

• Inhibitions. It allows not executing a determined task on the network.

• Handling priorities between tasks, executing first the task with the highest configured priority.

The configuration of the different tasks that can be carried out is included in the repository or database and is configurable by the system administrator.
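The checks of step 140 (limits, retentions, inhibitions and priorities) can be sketched as follows, by way of example only. The class TaskManager, the layout of a task as a dict with name, element and priority keys, and the convention that a lower number means a higher priority are assumptions of this illustration.

```python
import heapq


class TaskManager:
    def __init__(self, max_tasks_per_element):
        self.max_tasks = max_tasks_per_element
        self.running = {}         # network element -> tasks in progress
        self.inhibited = set()    # task names that must not be executed
        self.retained_until = {}  # task name -> time before which it is held
        self._queue = []          # entries: (priority, sequence, task)
        self._seq = 0

    def submit(self, task):
        # Lower number = higher priority; the sequence keeps FIFO order on ties.
        heapq.heappush(self._queue, (task["priority"], self._seq, task))
        self._seq += 1

    def next_task(self, now):
        """Return the highest-priority task that passes all checks, or None."""
        held, chosen = [], None
        while self._queue:
            entry = heapq.heappop(self._queue)
            task = entry[2]
            if task["name"] in self.inhibited:
                continue  # inhibited: dropped without execution
            over_limit = self.running.get(task["element"], 0) >= self.max_tasks
            retained = now < self.retained_until.get(task["name"], 0)
            if over_limit or retained:
                held.append(entry)  # keep for a later attempt
                continue
            chosen = task
            break
        for entry in held:
            heapq.heappush(self._queue, entry)
        if chosen is not None:
            element = chosen["element"]
            self.running[element] = self.running.get(element, 0) + 1
        return chosen
```

In this sketch the caller is expected to decrement the running count of an element when the corresponding agent terminates.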

In this step, after having executed the previous functions, and provided that some or all of the previously listed functions or controls have been passed, the corresponding agent is identified. For that purpose, a message 5 is sent to the following step 150 with the identifier of the agent to be launched. Since there is direct correspondence between tasks and agents, said correspondence in the database is read in this step 140.

The fifth step 150 corresponds to receiving the identifier of the agent to be launched and its subsequent launching or execution. To launch it, a process is started with the executable of the corresponding agent. The sequence of actions performing a determined task is implemented in this agent. This agent is preferably a C++ executable or a KSH script, which is started from step 150 by means of a fork operating system call. The agents are implemented by experts, who know the sequence of actions necessary for solving a fault. The purpose of an agent is to attempt to automatically repair the fault by means of action 6 on the network 300 (for example, resetting equipment), and optionally, in the event that it is not possible, sending a message indicative of trouble 7 to the following step 160, so that an operator can manually treat the fault. In this latter case, the message indicative of trouble 7 is preferably enriched with all the complementary information that the agent may have obtained on the fault when executing the step 150. For that purpose, the agent provides all the information with the result of the actions executed on the network 300 (for example, the set of commands and responses), and all the information it has queried on the data repositories (by means of queries for obtaining information related to the faulty network element). The agents are different in each of the network domains, and also specific for the different faults. An example of an agent is the one implemented for unblocking a user in the Intelligent Network Platform. All the agents use the same repository query facilities, facilities for performing actions on the network 300 and trouble ticket creation petition facilities (all being optional). The logic defined in the agents is structured in the following parts:

• Start, where the agent starts in terms of the platform services, in order to subsequently be able to invoke them.

• Termination of the agent indicating the result of the execution.

Furthermore, optionally:

• Data query in repositories, in order to obtain events or information necessary for executing the task.

• Action on the network to solve the fault or to obtain network data. For example, if the fault is hardware, the agent does not usually attempt to act, but rather it is limited to creating a trouble ticket. If, in contrast, the fault is software, the agent first attempts to act and if it is unable to fix it, it creates a trouble ticket.

• Petition to create/modify a trouble ticket.
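The structure of an agent listed above (start, optional repository query, optional action on the network, optional trouble ticket petition, and termination) can be sketched as follows. Although the description mentions C++ executables and KSH scripts, Python is used here only for illustration; the function run_agent, the three injected callables standing in for the platform services, and the "reset" command are assumptions of this example.

```python
def run_agent(fault, query_repository, act_on_network, create_ticket):
    """Skeleton of an agent: start, optional steps, then termination.

    The three callables stand in for the platform services; their names and
    signatures are assumptions for this sketch.
    """
    # Start
    log = [f"start: {fault['type']} on {fault['origin']}"]

    # Data query in repositories
    context = query_repository(fault["origin"])
    log.append(f"repository: {context}")

    repaired = False
    if fault["type"] == "software":
        # Software fault: attempt a remote action first.
        repaired, response = act_on_network(fault["origin"], "reset")
        log.append(f"network: {response}")

    if not repaired:
        # Hardware fault, or the remote action failed: request a trouble
        # ticket enriched with everything gathered so far.
        create_ticket({"origin": fault["origin"], "details": list(log)})

    # Termination, indicating the result of the execution
    return "repaired" if repaired else "ticket_created"
```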

The sixth step 160 is optional and corresponds to the control functionality of the aforementioned trouble tickets 7. This step is only executed if the fault has not been automatically repaired in step 150. This step 160 is also carried out by means of agents (different from those described above intended for attempting to execute a task or activity), which are in charge of completing the different fields of the trouble ticket that must be created, depending on the network domain, and causing the sending of the petition to create the trouble ticket 8 to the trouble ticketing system 200, through the corresponding interface (for example Web Services). The information used for creating the trouble ticket is read from a database, previously stored by the agent. Some or all of the following functions can be performed in this step 160:

• Control of the trouble tickets so that they are not duplicated when several related alarms are received. This function relies on checking the trouble tickets that have already been created.

• Distributing trouble tickets by user/task group, depending on the hour, shift, holiday, province, ...

• Interface with external systems, by means of WEB Services protocols, API ARUSER, SMS, or LOTUS mail for creating trouble tickets 9.

• Storing and displaying the trouble tickets in the WEB interface. As a contingency against the crash of an external system (the trouble ticketing system, for example), this allows said trouble tickets to continue to be treated and, once the trouble ticketing system has recovered, all the trouble tickets stored during the crash to be sent.
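Two of the functions listed above, avoiding repeated trouble tickets and storing them during an outage of the external system, can be sketched together as follows. This is an illustrative sketch only: the class name, the use of a string key to recognize related alarms and the use of ConnectionError to signal an unreachable ticketing system are assumptions.

```python
class TicketControl:
    """Avoids duplicate tickets and buffers them while the external
    trouble ticketing system is unreachable."""

    def __init__(self, send):
        self._send = send   # callable; raises ConnectionError when down
        self._seen = set()  # keys of tickets already created
        self.pending = []   # tickets stored during an outage

    def submit(self, key, ticket):
        if key in self._seen:
            return "duplicate"
        self._seen.add(key)
        try:
            self._send(ticket)
            return "sent"
        except ConnectionError:
            self.pending.append(ticket)
            return "stored"

    def flush(self):
        """Resend stored tickets once the external system has recovered."""
        sent = 0
        while self.pending:
            self._send(self.pending[0])  # on failure the ticket stays queued
            self.pending.pop(0)
            sent += 1
        return sent
```

A ticket is removed from the pending list only after it has been sent, so a second outage during flush() does not lose it.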

In this step, the root cause of the trouble tickets is also deduced. For example, a transmission problem may also generate an alarm in the mobile access network. In this case, both trouble tickets are created, but it is deduced that the transmission trouble is the root cause of the mobile access trouble. To that end, the inventory information linking the mobile access circuits to the transmission link they go through is used.
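The root cause deduction can be illustrated with a minimal sketch. The ticket fields ("id", "domain", "element"), the domain labels and the shape of the inventory (a mapping from each mobile access circuit to the transmission link it goes through) are assumptions made for this example.

```python
def deduce_root_causes(tickets, inventory):
    """Return {ticket_id: root_ticket_id or None}.

    tickets: list of dicts with "id", "domain" and "element".
    inventory: maps a mobile access circuit to the transmission link
    it goes through (hypothetical structure).
    """
    # Index the transmission tickets by the link they refer to.
    by_link = {
        t["element"]: t["id"]
        for t in tickets
        if t["domain"] == "transmission"
    }
    roots = {}
    for t in tickets:
        if t["domain"] == "mobile_access":
            link = inventory.get(t["element"])
            # None when no transmission ticket exists for that link.
            roots[t["id"]] = by_link.get(link)
        else:
            roots[t["id"]] = None
    return roots
```

Both tickets are kept; the mobile access ticket is merely annotated with the transmission ticket as its root cause.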

All the steps defined above make use 10 of a series of services 101 available for the method of the invention and for the platform or system on which it is implemented, which are defined below:

• Database query service, which allows obtaining information from input data.

• Service for action on the network, which allows sending commands on the network, in order to obtain more information or to attempt to solve a fault. This service in turn has the interface with the network 300 for sending said commands, and receiving the responses, for example, action via Telnet.

• Service for creating a message indicative of trouble, in the moment that a warning that the fault requires manual action is to be generated.

By way of example, in step 120 a query can be made to these services 101 in order to obtain the circuits affected by an event, or in order to enrich a certain event, or in order to encapsulate the connection with the network, or in order to encapsulate a connection with the trouble ticketing system, etc.

Through the previous steps, the method of the invention provides the following advantages over conventional methods:

• Tasks are launched automatically, which facilitates a high degree of automation of the operations on the networks.

• The capacity to act on the network allows faults to be resolved automatically and, if this is not possible, provides greater information on the fault so that it can be resolved manually.

• The user can configure the correlation rules, which depend significantly on the business and on the network, in files, or describe them in agents, such that modifications are easy to perform and have less impact on the rest of the system.

• Since the architecture is modular, new networks are easy to integrate by implementing the protocols for receiving events from the network and for acting on it, together with the files and agents describing the new business algorithms.

• The architecture is scalable according to the different networks or domains integrated in the system.

As can be observed, the present invention executes automated operating and maintenance procedures in response to events received from different networks which may indicate faults, in order to attempt to solve them and, if this is not possible, generates a warning so that those faults are solved manually by a specialized technician.

To that end, the method and system of the invention use event normalizing techniques, correlation and agent launching, all in an automatic and configurable manner.

As has been seen, the method and system have, as input information, the events generated in the network. This input information can be an alarm that can indicate a fault (for example, a fault in a communications line card) or a spontaneous message (for example, a message informing about a status change in the plant). From configuration parameters and correlation rule files, the method and system decide which agent should treat the fault. That agent, following its programmed logic, attempts to solve the fault automatically and, if it cannot, or if there is no possible treatment or automatic action, it issues a notification so that a manual action is performed on the network or affected network element.

An example of a network domain integrated in the system is shown below: the mobile access network domain. The network elements of this domain are the base station controllers (BSCs).

In this example, all the steps described above are performed. The following can be highlighted as examples of agents for this domain:

• AgtGsmMotoGclk_Tree1. This is the agent which treats certain GCLK alarms of the BSS and RCXCDR systems for decision tree 1, which consists of generating a trouble ticket for manual treatment if certain alarms are repeated 5 times in one hour. In this case, after receiving the alarms, the agent automatically creates a trouble ticket.

• AgtGsmMotoGclk_Tree4. This is the agent which treats certain GCLK alarms of the BSS and RCXCDR systems for decision tree 4, which consists of performing an action on the XBL device (reset) if certain alarms are repeated 3 times in one hour and remain active for more than 5 minutes. If this action is successful and the alarm disappears, nothing else has to be done. If, in contrast, the alarm persists, a trouble ticket is generated for its manual treatment. In this case, if the action has gone well, the problem will have been solved automatically; otherwise, all the information will be available in the trouble ticket for its manual resolution.

These are only two examples of agents, among around 670 agents of greater or lesser complexity, which are defined for facilitating the operation and maintenance of the networks.
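The logic of decision tree 4 can be sketched as follows, for illustration only. The sketch takes one possible reading of the rule (at least 3 alarm occurrences within the last hour, with the current activation lasting more than 5 minutes); the function name, the parameter names and the returned labels are hypothetical.

```python
def decision_tree_4(alarm_starts, now, alarm_active_since, reset_device,
                    min_repeats=3, window_s=3600, min_active_s=300):
    """Return "no_action", "repaired" or "ticket".

    alarm_starts: epoch seconds at which the alarm was raised.
    alarm_active_since: epoch seconds of the current activation, or None.
    reset_device: callable returning True if the alarm cleared after
    the reset of the device.
    """
    # Count the occurrences that fall inside the one-hour window.
    recent = [t for t in alarm_starts if now - t <= window_s]
    active_long_enough = (
        alarm_active_since is not None
        and now - alarm_active_since > min_active_s
    )
    if len(recent) < min_repeats or not active_long_enough:
        return "no_action"
    # Act first; fall back to a trouble ticket if the alarm persists.
    return "repaired" if reset_device() else "ticket"
```

Decision tree 1 would follow the same counting pattern with a threshold of 5 repetitions and a trouble ticket as the only outcome.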

An example of the different data and flows between steps is shown below for better understanding of the invention:

• First, the format of the alarm which is received from the network manager is shown. In this case it is an alarm which is received through the network element manager (Mobile SOC) for mobile access network Motorola technology. This information is the input information 1 of step 110. It basically conveys the information which serves to identify the fault in the equipment.

- Example of Motorola Alarm:

AMBER START EXTENSION

START CELSIG 08-41-11-01-1 END CELSIG

START CGI 214-07-4101-16064 END CGI

START TYPE_UNKOWN ELEMENT END TYPE ELEMENT

START TYPE PROBLEM ^*NONE^*. equipmentFailureEvent

END TYPE PROBLEM

START ADMINISTRATIVE NUMBER 41100001018169

END ADMINISTRATIVE NUMBER

START NETWORK ID 2-23-25 END NETWORK ID

AMBER END EXTENSION

#6643 - NEW - ^*NONE^*. equipmentFailureEvent - IAS - BSS-SE-Bellavista-ll(SE:Ruben_Dario_DCS-25): 25 IAS 0 - 19/09/2007 12:33:36. [105] Fan Tray 2 Failure - FMIC - Investigate -/-. SITE 25 : (SE:Ruben_Dario_DCS-25): Investigate

The output 2 of step 110 is shown below. In this case, it is the alarm in normalized format, in which there are no empty fields. The important information of the alarm is basically maintained, with the information formatted into specific fields.

- Normalized Alarm :

ID_EVENT. 37093906

DATE_EVENT. 1190198022

SOC. OMC_SEVILLA

ORIGIN. BSS-SE-Bellavista-ll

DATE. 19/09/2007 12:33:36

TYPE. IAS

TEXT. AMBER START EXTENSION

START CELSIG 08-41-11-01-1 END CELSIG

START CGI 214-07-4101-16064 END CGI

START TYPE_UNKOWN ELEMENT END TYPE ELEMENT

START PROBLEM TYPE ^*NONE^*. equipmentFailureEvent

END PROBLEM TYPE

START ADMINISTRATIVE NUMBER 41100001018169

END ADMINISTRATIVE NUMBER

START NETWORK_ID 2-23-25 END NETWORK_ID

AMBER END EXTENSION

#6643 - NEW - ^*NONE^*. equipmentFailureEvent - IAS - BSS-SE-Bellavista-ll(SE:Ruben_Dario_DCS-25): 25 IAS 0 - 19/09/2007 12:33:36. [105] Fan Tray 2 Failure - FMIC - Investigate -/-. SITE 25 : (SE:Ruben_Dario_DCS-25): Investigate

NUMBERSEQUENCE. 6643

FAMILY. GSM MOTOROLA

GROUP. equipmentFailureEvent

CATEGORY. 1

CLASS. P

NATURE. A
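As an illustration of how some of the normalized fields could be derived from the received alarm, the following sketch parses the summary line of the example alarm with a regular expression. The pattern is inferred from this single example and is an assumption, as are the function name and the choice of fields; real alarm formats may differ.

```python
import re

SUMMARY = (
    "#6643 - NEW - ^*NONE^*. equipmentFailureEvent - IAS - "
    "BSS-SE-Bellavista-ll(SE:Ruben_Dario_DCS-25): 25 IAS 0 - "
    "19/09/2007 12:33:36. [105] Fan Tray 2 Failure - FMIC - Investigate"
)

# Pattern inferred from the single example alarm (assumption).
SUMMARY_RE = re.compile(
    r"#(?P<seq>\d+) - (?P<state>\w+) - (?P<problem>.+?) - (?P<type>\w+) - "
    r"(?P<origin>[^()]+)\((?P<bts>[^)]+)\): (?P<id_bts>\d+) \w+ \d+ - "
    r"(?P<date>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\. \[(?P<code>\d+)\]"
)


def parse_summary(line):
    """Extract a few normalized fields, or return None if no match."""
    m = SUMMARY_RE.search(line)
    if m is None:
        return None
    return {
        "NUMBERSEQUENCE": m["seq"],
        "ORIGIN": m["origin"],
        "DATE": m["date"],
        "TYPE": m["type"],
        # The group is the text after the last dot of the problem field.
        "GROUP": m["problem"].rsplit(".", 1)[-1].strip(),
    }
```

Applied to SUMMARY, the function yields the NUMBERSEQUENCE, ORIGIN, DATE, TYPE and GROUP values listed in the normalized alarm.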

The correlation object which is automatically created in step 130 as a result of receiving the alarm is shown below. This correlation object is the one that will serve for performing the correlation in the inference engine by applying the rules previously defined in a file. Besides the information contained in the alarm, configuration information on the alarms which the system already holds is added. An example of this is the persistence, i.e., the time that the engine retains the alarm while waiting to receive a termination before launching a task, in this case 7,200 seconds.

- Correlation Object Fields:

TYPE_EVENT. A

CLASS_EVENT. GSM_MOTOROLA_IAS

ORIGIN_EVENT. BSS-SE-Bellavista-ll

FAMILY_EVENT. GSM MOTOROLA

KEY_EVENT. OMC_SEVILLA_6643

ID_EVENT. 37093906:1190198022

GROUP_EVENT. C

DATE_EVENT. 19/09/2007

HOUR_EVENT. 12:33:36

NATURE. A

ORIGIN. BSS-SE-Bellavista-ll

BTS. SE:Ruben_Dario_DCS-25

ID_BTS. 25

ID_EVENT_OMC. 6643

GRI. YES

TREAT. YES

PARAMETERS. 105

PERSISTENCE. 7200

SHORTPERIOD. 3600

SHORTREP. 8

LONGPERIOD. 86400

LONGREP. 200

VERYLONGPERIOD. 0

VERYLONGREP. 0

BLOCK_CLUSTER. IAS

BLOCK_REPET. BSS-SE-Bellavista-ll_25_105_IAS_0

AGENT. AgtGsmMotolas

DISP_COMPLETE: 0
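The role of the PERSISTENCE field can be illustrated with a minimal sketch: the alarm is retained until either a termination arrives or the persistence time elapses, and only in the latter case is a task launched. The class, the function name and the inclusive boundary (a task is launched exactly at 7,200 seconds) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CorrelationObject:
    key_event: str
    raised_at: int                 # epoch seconds when the alarm arrived
    persistence: int = 7200        # seconds to wait for a termination
    agent: str = "AgtGsmMotolas"   # agent name as given in the example
    terminated_at: Optional[int] = None


def task_to_launch(obj, now):
    """Return the agent name once the persistence time has elapsed
    without a termination, else None."""
    if obj.terminated_at is not None:
        return None  # the alarm cleared on its own
    if now - obj.raised_at < obj.persistence:
        return None  # still being retained by the engine
    return obj.agent
```

Retaining the alarm in this way avoids launching a task for faults that clear by themselves within the configured time.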

The data used by step 140 for launching the task are shown below. These data are received from step 130. Some of them serve as information to be taken into account when the task is launched, for example, in this case, the field indicating "REPETITIONS".

- Task request

ID_ORIGIN. 8135130:1256293829

ORIGIN. EOC_MOVILES

CENTRAL. BSS-SE-Bellavista-ll

SYSTEM. GSM MOTOROLA

GROUP. CORRECTIVE

TYPE. AgtGsmMotolas

PRIORITY. 0

ID_POSTAGE. 19

ID_EXTERNAL. 8135130:1256293829

DAT_EXT1.

DAT_EXT2.

DAT_EXT3.

DAT_EXT4.

DAT_EXT5.

DAT_EXT6.

DAT_EXT7. REPETITIONS

DAT_EXT8.

DAT_EXT9.

DAT_EXT10. MANAGER_FAILURES_001

The following is the result of the start-up of the AgtGsmMotolas process by step 150. The traces show, for example, the time it has taken to execute the task (20 seconds), as well as the Process Identifier (Pid).

- Launch of the agent:

[22/10/2009 16:01:00] Pid(22558): Launching DYNAMIC_AGENT 004 (pid 15951) ../OmegaGsmMotorola bin/AgtGsmMotolas 1 1278205 8135130:1256293829

[22/10/2009 16:01:20] Pid(22558): The execution of DYNAMIC_AGENT 004 (pid=15951) has concluded
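The execution time of 20 seconds can be recovered from the two timestamps of the trace, as the following illustrative sketch shows. The trace lines are retyped here without stray spaces and without the command line of the agent; the function name and the regular expression are assumptions.

```python
import re
from datetime import datetime

LOG = """\
[22/10/2009 16:01:00] Pid(22558): Launching DYNAMIC_AGENT 004 (pid 15951)
[22/10/2009 16:01:20] Pid(22558): The execution of DYNAMIC_AGENT 004 (pid=15951) has concluded
"""

# Timestamp in square brackets at the start of a trace line.
STAMP = re.compile(r"^\[(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\]")


def execution_seconds(log_text):
    """Seconds between the first and last timestamped trace lines."""
    stamps = [
        datetime.strptime(m.group(1), "%d/%m/%Y %H:%M:%S")
        for line in log_text.splitlines()
        if (m := STAMP.match(line))
    ]
    return int((stamps[-1] - stamps[0]).total_seconds())


print(execution_seconds(LOG))  # 20
```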

This invention can be applied in networks of any type which require monitoring, operation and maintenance. It has been tested and provides excellent results in telecommunication networks, particularly in the operating and maintenance processes of telecommunications operators for said networks. It is also extensible to other technological areas which can provide an input stimulus to the system and/or which can be operated on, such as, for example, energy management control by means of analyzing the energy data received through this system.

In view of this description and drawing, the person skilled in the art will understand that the invention has been described according to several preferred embodiments thereof, but that multiple variations can be introduced in said preferred embodiments without departing from the object of the invention as it has been claimed.

Claims

1. Monitoring and operating method, characterized in that it comprises the steps of:

- receiving and normalizing (110) a plurality of heterogeneous events (1) generated by at least one network element of a telecommunications network (300), obtaining a plurality of events with a normalized field structure (2);

- for each event with a normalized field structure (2), generating a flow of information of attribute-value pairs (3), after having applied one or more filtering criteria;

- from each flow of information formed by a set of attribute-value pairs (3), generating a correlation object;

- correlating (130) one or more correlation objects by means of at least one criterion included in an inference engine, determining, if said at least one correlation criterion is met, that a fault has occurred in the network (300);

- executing a task associated with the treatment of said fault;

- executing (150) an agent associated with said task, such that:

- if the fault can be resolved remotely, acting (6) on the network (300);

- if the fault cannot be resolved remotely, sending a trouble ticket message (7) so that a person can manually repair the fault.

2. Method of claim 1, wherein said step of generating a flow of information (120) from each event with a normalized field structure (2) comprises filtering said events (2) by comparison between the fields of the event (2) and previously configured filtering patterns, such that only the events which have passed said filtering are provided to the step of correlation.

3. Method of any of the previous claims, wherein said step of generating a flow of information (120) from each event with a normalized field structure (2) further comprises at least one of the following actions: inhibiting events (2), temporarily retaining events (2) and enriching the information comprised in events (2).

4. Method of any of the previous claims, wherein if the fault cannot be resolved remotely and a trouble ticket message (7) is sent so that a person can manually repair the fault, sending (8) said trouble ticket to a trouble ticketing system (200) so that there is a record of said task.

5. Method of claim 4, wherein said step of controlling the trouble tickets (160) comprises at least one of the following activities:

- distributing trouble tickets by different criteria;

- interface with the trouble ticketing system (200) for creating the trouble tickets;

- storing and displaying the trouble tickets in a WEB interface.

6. Method of any of the previous claims, wherein said at least one correlation criterion included in a configuration file used by the inference engine is chosen from among the following: association of activities and terminations of correlation objects generated from alarm-type events, association by different criteria of objects generated from events, counting of objects generated from the flow of information formed by attribute-value pairs for ordering the execution of a task, intermittence of the objects generated from alarms in a time period and decision-making by time slot.

7. Method of any of the previous claims, wherein said step of generating a flow of information (120) from each event with a normalized field structure (2) further comprises storing said events with a normalized field structure (2) in a repository.

8. Method of any of the previous claims, wherein said step of sending (140) a message (4) comprising an identifier of an agent capable of executing a task associated with said fault is performed after at least one of the following actions:

- controlling the limit of tasks which can be executed on the network (300);

- retaining the jobs while determined criteria are met;

- inhibiting jobs on the network (300); and

- establishing priorities between the tasks.

9. Method of any of the previous claims, configured to hierarchically find the root cause of the trouble tickets.

10. Method of any of the previous claims, configured to execute tasks programmed in the network (300), without having to receive any external stimulus.

11. System for monitoring and operation configured to carry out the steps of the method of any of claims 1 to 10.

12. System for operation and maintenance of claim 11, comprising at least one server.

13. System for operation and maintenance of claim 11 or 12, comprising a trouble ticketing module (200).

14. A computer program comprising computer program code means suitable for performing the steps of the method of any of claims 1 to 10 when the mentioned program is executed in a computer, a digital signal processor, a field-programmable gate array, an application-specific integrated circuit, a microprocessor, a microcontroller or any other type of programmable hardware, even in a distributed manner.