
GWD-I (Informational)
B. Tierney, Lawrence Berkeley National Laboratory


R. Aydt, University of Illinois at Urbana-Champaign
D. Gunter, Lawrence Berkeley National Laboratory
W. Smith, NASA Ames Research Center
M. Swany, University of California, Santa Barbara
V. Taylor, Northwestern University
R. Wolski, University of California, Santa Barbara

GGF Performance Working Group


March 2000
Revised 16-January-2002

A Grid Monitoring Architecture

Status of this Memo

This memo provides information to the Grid community regarding the Grid Monitoring Architecture
(GMA) being developed by the Global Grid Forum Performance Working Group. The goal of the
architecture is to provide a minimal specification that will support required functionality and allow
interoperability. Distribution is unlimited.

Copyright Notice

Copyright © Global Grid Forum (2002). All Rights Reserved.

Abstract

Large distributed systems such as Computational and Data Grids require that a
substantial amount of monitoring data be collected for various tasks such as fault
detection, performance analysis, performance tuning, performance prediction, and
scheduling. Some tools are currently available and others are being developed for
collecting and forwarding this data. The goal of this paper is to describe the major
components of a Grid monitoring architecture and their essential interactions. By
adopting standard terminology and describing the minimal specification to support
required functionality, we hope to encourage the development of interoperable high-
quality performance tools for the Grid. To motivate the Grid Monitoring Architecture
(GMA) design and to guide implementation, we also present the characteristics that are
critical to proper functioning of a performance monitoring system for the Grid.

bltierney@lbl.gov 1
Contents

1. Introduction
2. Design Considerations
3. Architecture and Terminology
3.1 Directory Service Interactions
3.2 Producer/Consumer Interactions
4. Components and Interfaces
4.1 Directory Service
4.2 Producer
4.3 Consumer
4.4 Intermediaries
4.5 Sources of Event Data
5. Sample Use
6. Implementation Issues
7. Security Considerations
8. Glossary
9. Author Information
10. Acknowledgements
11. Intellectual Property Statement
12. Full Copyright Notice
13. References


1. Introduction

Performance monitoring of distributed components is critical for enabling high-performance
distributed computing. Monitoring data is needed to determine the source of performance
problems and to tune the system and application. Fault detection and recovery mechanisms need
monitoring data to determine whether a server is down and to decide whether to restart the server
or to redirect service requests elsewhere [4][7]. A performance prediction service takes
monitoring data as input to a prediction model [9], which is in turn used by a scheduler to
determine which resources to assign to a job.

Several groups are developing Grid monitoring systems [3][6][7][9]. These groups, along with
others in the Global Grid Forum community, recognize the need to interoperate. To facilitate this
interoperation, we have developed an architecture of monitoring components that specifically
addresses the characteristics of Grid platforms. A Grid monitoring system is differentiated from a
general monitoring system in that it must be scalable across wide-area networks and encompass
a large number of heterogeneous resources. The monitoring system’s naming and security
mechanisms must also be integrated with other Grid middleware.

In this document we describe the core Grid Monitoring Architecture (GMA) components and
models for high-level communication between components of different types. This document
does not address component creation or management (coordination and control), which are
crucial in a production-level Grid monitoring system. We hope that these issues will be covered
in future documents.

2. Design Considerations

With the potential for thousands of resources at geographically distant sites and tens-of-
thousands of simultaneous Grid users, it is critical that data collection and distribution
mechanisms scale. A general-purpose information management system such as an off-the-shelf
database or directory service cannot efficiently meet this requirement because the characteristics
of performance information are fundamentally different from the characteristics of the data these
types of systems were designed to handle. In general, the following characteristics distinguish
performance-monitoring information from other forms of system or program-produced data.

o Performance information has a fixed, often short, lifetime of utility. Most monitoring
data goes stale quickly, making rapid read access important but obviating the need for
long-term storage. The notable exception to this is data that gets archived for accounting
or postmortem analysis.
o Updates are frequent. Unlike the more static forms of “metadata,” dynamic
performance information is typically updated more frequently than it is read. Since most
extant information-base technologies are optimized for query and not for update, they are
potentially unsuitable for dynamic information storage.
o Performance information is often stochastic. It is frequently impossible to characterize
the performance of a resource or an application component by using a single value.
Therefore, dynamic performance information may carry quality-of-information metrics
quantifying its accuracy, distribution, lifetime, and so forth, which may need to be
calculated from the raw data.

Systems that collect and distribute performance information should satisfy certain requirements:

o Low latency. As previously stated, performance data is typically relevant for only a short
period of time. Therefore, it must be transmitted from where it is measured to where it is
needed with low latency.


o High data rate. Performance data can be generated at high data rates. The performance
monitoring system should be able to handle such operating conditions.
o Minimal measurement overhead. If measurements are taken often, the measurement
itself must not be intrusive. Further, there must be a way for monitoring facilities to limit
their intrusiveness to an acceptable fraction of the available resources. If no mechanism
for managing performance monitors is provided, performance measurements may simply
measure the load introduced by other performance monitors.
o Secure. Typical user actions will include queries to the directory service concerning
event data availability, subscriptions for event data, and requests to instantiate new event
monitors or to adjust collection parameters on existing monitors. The data gathered by
the system may itself have access restrictions placed upon it by the owners of the
monitors. The monitoring system must be able to ensure its own integrity and to preserve
the access control policies imposed by the ultimate owners of the data.
o Scalable. Because there are potentially thousands of resources, services, and
applications to monitor and thousands of potential entities that would like to receive this
information, it is important that a performance monitoring system provide scalable
measurement, transmission of information, and security.

In order to meet these requirements, a monitoring system must have precise local control of the
overhead and latency associated with gathering and delivering the data. We believe that data
discovery needs to be separated from data transfer if this level of control is to be achieved.

In the Grid, the amount of available performance information will be very large, and searches of
this space will have unpredictable latencies. These potentially large latencies must not impact
every request for performance information. Instead, searches should be used only to locate an
appropriate information source or sink, whereas operations with a more predictable latency
should be used to transfer the actual performance information. In this way, individual
producer/consumer pairs can do “impedance matching” based on negotiated requirements, and
the amount of data flowing through the system can be controlled in a precise and distributed
fashion based on current local load considerations.

In order to separate data discovery from data transfer, metadata must be abstracted and placed
in a universally accessible location, called here a “directory service,” along with enough
information to bootstrap the communication between the data source and sink. Scalability results
from restricting and organizing the metadata so that the directory service itself may be
distributed and so that the rate of communication between distributed nodes increases slowly
relative to the total amount of data transferred. This model differs from the “event channel”
model of the CORBA Event Service [1], which combines the mechanism for finding the data that
should be transferred with the actual transfer into a single “searchable” channel of
information. In contrast, in our design performance event data, which makes up the majority of
the communication traffic, travels directly from the producers of the data to the consumers of
the data.

3. Architecture and Terminology

[Figure 1: Grid Monitoring Architecture Components]

The Grid Monitoring Architecture consists of three
types of components, shown in Figure 1:

o Directory Service: supports information publication and discovery


o Producer: makes performance data available (performance event source)


o Consumer: receives performance data (performance event sink)

The GMA is designed to handle performance data transmitted as time-stamped (performance)
events. An event is a typed collection of data with a specific structure that is defined by an
event schema. Performance event data is always sent directly from a producer to a consumer.
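As an illustration only, an event can be thought of as a typed, time-stamped record. The sketch below is a hypothetical representation; the GMA does not prescribe one, and the field names ("cpu_load", "host", "load_1min") and schema layout are assumptions:

```python
# Hypothetical sketch of a GMA "event": a typed collection of data whose
# structure is defined by an event schema. All field names are illustrative
# assumptions; the GMA does not mandate a representation.
import time
from dataclasses import dataclass

@dataclass
class Event:
    event_type: str   # names the event schema this event conforms to
    timestamp: float  # time stamp, here seconds since the epoch
    data: dict        # fields whose structure the schema defines

cpu_event = Event(event_type="cpu_load",
                  timestamp=time.time(),
                  data={"host": "node01.example.org", "load_1min": 0.42})
```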

3.1 Directory Service Interactions

Producers (and consumers that accept control messages) publish their existence in directory
service entries. The directory service, or registry, is used to locate producers and consumers.
Note that the term “directory service” is not meant to imply a hierarchical service such as LDAP
[8]; any lookup service could be used. The directory service serves to bootstrap communication
between consumers and producers, as entries are populated with information about understood
wire protocols and security mechanisms.

Consumers can use the directory service to discover producers of interest, and producers can
use the directory service to discover consumers of interest. Either a producer or a consumer may
initiate the interaction with a discovered peer. In either case, communication of control messages
and transfer of performance data occur directly between each consumer/producer pair without
further involvement of the directory service.

3.2 Producer/Consumer Interactions

The GMA architecture supports three interactions for transferring data between producers and
consumers: publish/subscribe, query/response, and notification.

The GMA publish/subscribe interaction has three stages. In the first stage, the initiator of the
interaction (this may be either a producer or consumer) contacts the “server” (if the initiator is a
consumer, the server is a producer, and vice versa) indicating interest in some set of events. The
mechanism for specifying events of interest is not addressed in this document. Additional
parameters needed to control the data transfer are also negotiated in this stage. These may
include where to send the performance events, how to encode or encrypt the performance
events, and how often to send the performance events, buffer sizes, and timeouts. The initial
contact and other communication during this stage are done via an exchange of control
messages between the initiator and the server. At this point, there is state in both the producer
and consumer, called a subscription. In the next stage of the interaction, the producer (which may
have been the initiator or the server for this interaction) sends one or more performance events to
the consumer. In the final stage, either the producer or consumer terminates the subscription,
possibly with additional control messages.

For the GMA query/response interaction, the initiator must be a consumer. The interaction
consists of two stages. The first stage sets up the transfer, similar to the first stage of
publish/subscribe. Then, instead of performance event transfer followed at some later time by a
terminating unsubscribe, the producer transfers all the performance events to the consumer in a
single response. This interaction maps particularly well to request/reply protocols such as HTTP.

The GMA notification interaction is a one-stage interaction, and the initiator must be a producer.
In this type of producer/consumer interaction, the producer transfers all the performance events
to a consumer in a single notification.

Protocols for control and event data channels are not specified by the GMA. Moreover, the wire
protocol used to communicate control information between the producer and consumer and the
wire protocol used to transfer performance events (data) may be completely different. System
implementers may support one or more wire protocols, for example, SOAP/HTTP, LDAP, or
XML/BXXP, choosing those best suited to their own requirements.


Delivery guarantees are also not specified by the GMA. Implementations may support at-most-
once, at-least-once, and exactly-once delivery of performance events.
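The three interactions might be sketched as abstract interfaces. This is a hypothetical rendering: the GMA specifies interactions and roles, not a concrete API, so the method names below are assumptions chosen to mirror the terminology of Section 4:

```python
# Hypothetical sketch of the GMA producer/consumer interactions as abstract
# interfaces. Method names mirror the document's terminology but are
# illustrative assumptions, not a specified API.
from abc import ABC, abstractmethod

class ProducerInterface(ABC):
    @abstractmethod
    def accept_query(self, criteria) -> list:
        """Query/response: return all matching events in a single reply."""

    @abstractmethod
    def accept_subscribe(self, criteria, consumer) -> str:
        """Publish/subscribe, first stage: negotiate the transfer and return
        a subscription identifier; events then flow to the consumer until
        the subscription is terminated."""

    @abstractmethod
    def accept_unsubscribe(self, subscription_id) -> None:
        """Final stage: terminate the subscription."""

class ConsumerInterface(ABC):
    @abstractmethod
    def accept_notification(self, events) -> None:
        """Notification: the producer pushes a single set of events in one
        stage, with no prior subscription."""
```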

4. Components and Interfaces

In this section we further define the functionality and interfaces of the directory service, producer,
and consumer components. We also introduce the notion of “compound components” and
discuss potential sources of measurement data.

4.1 Directory Service

In order to describe and discover performance data on the Grid, a distributed directory service for
publishing and searching must be available. The GMA directory service stores information about
producers and consumers that accept requests. When producers and consumers publish their
existence in the directory service, they typically also publish information regarding the types of
events they produce or consume, along with the meta-information about accepted protocols,
security mechanisms, and so forth, as described in Section 3.1. This publication information, or
registration, allows other producers and consumers to discover the types of event data that are
currently available or accepted, the characteristics of that data, and ways to gain access to that
data. The directory service is not responsible for the storage of event data itself – it contains only
per-publication information about which event instances can be provided or accepted. The event
schema may, optionally, be available through the directory service.

Four functions are supported by the directory service.

1. Add: add an entry to the directory.
2. Update: change an entry in the directory.
3. Remove: remove an entry from the directory.
4. Search: perform a search for a producer or consumer based on some selection criteria.
The client should indicate whether a single match or multiple matches, if available,
should be returned. An optional extension would allow the client to get multiple matches
one at a time using a “get next” query in subsequent searches.
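A minimal in-memory sketch of these four functions is given below. This is illustrative only: a real GMA directory service would be distributed, and the entry attributes shown ("kind", "event_types", "protocol") are assumptions:

```python
# Hypothetical in-memory sketch of the four directory-service functions
# (Add, Update, Remove, Search). A real GMA registry would be distributed;
# the entry attributes used here are illustrative assumptions.
class DirectoryService:
    def __init__(self):
        self._entries = {}  # entry id -> attribute dictionary

    def add(self, entry_id, attributes):
        self._entries[entry_id] = dict(attributes)

    def update(self, entry_id, attributes):
        self._entries[entry_id].update(attributes)

    def remove(self, entry_id):
        del self._entries[entry_id]

    def search(self, multiple=True, **criteria):
        # Return ids of entries whose attributes match all criteria;
        # the caller indicates whether one or many matches are wanted.
        matches = [eid for eid, attrs in self._entries.items()
                   if all(attrs.get(k) == v for k, v in criteria.items())]
        return matches if multiple else matches[:1]

ds = DirectoryService()
ds.add("prod-1", {"kind": "producer", "event_types": "cpu_load",
                  "protocol": "http"})
ds.add("cons-1", {"kind": "consumer", "event_types": "cpu_load"})
producers = ds.search(kind="producer")
```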

4.2 Producer

A producer is any component that uses the producer interface to send events to a consumer. A
given component may have multiple producer interfaces, each acting independently and sending
events. The term producer is used interchangeably, and inexactly, to refer both to a single
producer interface and to a component that contains at least one producer interface.

The core interaction functions that may be supported by a producer are listed below.

1. Maintain Registration: add/update/remove directory service entry or entries describing
events that the producer will send to a consumer. Corresponds to Directory Service Add,
Update, and Remove.
2. Accept Query: accept a query request from a consumer. One or more events are sent to
the consumer in response to the query. Corresponds to Consumer Initiate Query.
3. Accept Subscribe: accept a subscribe request from a consumer. Further details about the
subscription are returned in the reply. If the subscription is successfully established, the
producer sends events to the consumer until the subscription is terminated. Corresponds
to Consumer Initiate Subscribe.


4. Accept Unsubscribe: accept an unsubscribe request from the consumer. If this succeeds,
the corresponding subscription will be closed and no more events will be sent for this
subscription. Corresponds to Consumer Initiate Unsubscribe.
5. Locate Consumer: search the directory service for a consumer. Corresponds to Directory
Service Search.
6. Notify: send a single set of event(s) to a consumer. Corresponds to Consumer Accept
Notification.
7. Initiate Subscribe: request that a consumer accept events from this producer. Further
subscription are returned in the reply. If the subscription is successfully established, the
producer sends events to the consumer until the subscription is terminated. Corresponds
to Consumer Accept Subscribe.
8. Initiate Unsubscribe: terminate a subscription with a consumer. If this succeeds, the
subscription will be closed, and no more events will be sent for this subscription.
Corresponds to Consumer Accept Unsubscribe.

Producers can provide access control to the event data, allowing different access to different
classes of users. Since Grids typically have multiple organizations controlling the resources being
monitored, there may be different access policies (possibly enforced by firewalls), different
measurement, and different performance details for consumers “inside” or “outside” the
organization running the resource. For example, some sites may allow internal access to real-
time event streams, while providing only summary data off-site. The producers can enforce these
policy decisions.

In addition to the core GMA producer functionality described above, producers could provide
many other services. Examples of these include event filtering, caching, and intermediate
processing of the raw data as requested by a consumer. For example, a scheduling consumer
might request that a prediction model be applied to the CPU load measurement history from a
particular compute resource, and be notified only if the predicted load falls below a specified
threshold, indicating that the resource is ready to accept new tasks. A “smart” producer could
apply the model supplied by the consumer with the subscription request, and send events only
when the resulting load predictions are below the threshold value.

Information on the services supported by a given producer would be published in the directory
service, along with the event information.
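The “smart” producer example above can be sketched as follows. The moving-average stand-in for the consumer-supplied prediction model and the threshold value are illustrative assumptions; a real implementation would apply whatever model the subscription negotiated:

```python
# Hypothetical sketch of a "smart" producer applying a prediction model to a
# CPU-load history and sending an event only when the predicted load falls
# below the consumer's threshold. The moving-average model is a trivial
# illustrative stand-in for a real prediction model.
def predicted_load(history):
    # Stand-in model: mean of the three most recent samples.
    recent = history[-3:]
    return sum(recent) / len(recent)

def maybe_notify(history, threshold, send_event):
    # Send a derived event only if the prediction is below the threshold,
    # indicating the resource is ready to accept new tasks.
    load = predicted_load(history)
    if load < threshold:
        send_event({"event_type": "predicted_cpu_load", "value": load})
        return True
    return False
```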

4.3 Consumer

A consumer is any component that uses the consumer interface to receive event data from a
producer. A given component may have multiple consumer interfaces, each acting independently
and receiving events. The term consumer is used interchangeably, and inexactly, to refer both to
a single consumer interface and to a component that contains at least one consumer interface.

The core interaction functions that may be supported by a consumer are listed below.

1. Locate Producer: search the directory service for a producer. Corresponds to Directory
Service Search.
2. Initiate Query: request one or more events from a producer, which are delivered as part
of the reply. Corresponds to Producer Accept Query.
3. Initiate Subscribe: request establishment of a subscription with a producer. Further
details about the subscription are returned in the reply. If the subscription is successfully
established, the producer sends events until the subscription is terminated. Corresponds
to Producer Accept Subscribe.


4. Initiate Unsubscribe: terminate a subscription. If this succeeds, the corresponding
subscription will be closed, and no more events will be accepted for this subscription.
Corresponds to Producer Accept Unsubscribe.
5. Maintain Registration: add/update/remove directory service entry or entries describing
events that the consumer will accept from the producer. Corresponds to Directory Service
Add, Update, and Remove.
6. Accept Notification: accept a single set of event(s) from a producer. Corresponds to
Producer Notify.
7. Accept Subscribe: accept a subscribe request from a producer. Further details about the
subscription are returned in the reply. If the subscription is successfully established, the
producer sends events until the subscription is terminated. Corresponds to Producer
Initiate Subscribe.
8. Accept Unsubscribe: accept an unsubscribe request from the producer. If this succeeds,
the subscription will be closed, and no more events will be accepted for this subscription.
Corresponds to Producer Initiate Unsubscribe.
9. Locate Event Schema: search request to the schema repository for a given event type.
The schema repository may be the GMA directory service.

Many types of consumers are possible. A few are listed here as illustrative examples.

o Archiver: aggregate and store event data in long-term storage for later retrieval or
analysis. An archiver may also act as a GMA producer when the data is retrieved from
storage.
o Real-time monitor: collect monitoring data in real time for use by online analysis tools. An
example is a tool that graphs cpu_load information in real-time.
o Overview monitor: collect events from several sources and use the combined information
to make a decision that could not be made on the basis of data from only one producer.
For example, one might trigger a call to the system administrator's pager at 2:00 am if
both the primary and backup servers are down.

4.4 Intermediaries

A consequence of the separation of data discovery from data transfer is that the protocols
used to perform the publish/subscribe, query/response, and notification interactions described
in Section 3.2 can be used to construct intermediaries that forward, broadcast, filter, or
cache the performance events.

The building block for these advanced services is the compound producer/consumer, which is a
single component that implements both producer and consumer interfaces. For example, a
consumer interface might collect event data from several producers, use that data to generate
a new derived event data type, and make that available to other consumers through a producer
interface, as shown in Figure 2. Many Grid services may in fact be both consumers and
producers of monitoring events. For example, event archives would likely implement both
producer and consumer interfaces.

[Figure 2: Compound Producer/Consumer]


Use of these intermediate components can lessen the load on producers of event data that is of
interest to many consumers, with subsequent reductions in the network traffic, as the
intermediaries can be placed “near” the data consumers.
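A compound producer/consumer that derives and republishes an averaged event might be sketched as below. All names and the windowed-average derivation are illustrative assumptions; the point is only that one component exposes both interfaces:

```python
# Hypothetical sketch of a compound producer/consumer: a component whose
# consumer interface collects raw events and whose producer interface
# republishes a derived (averaged) event type. Names are illustrative.
class AveragingIntermediary:
    def __init__(self, window=3):
        self.window = window
        self._buffer = []        # raw values pending aggregation
        self._subscribers = []   # downstream consumers (callables)

    # --- consumer interface: accept events from upstream producers ---
    def accept_notification(self, event):
        self._buffer.append(event["value"])
        if len(self._buffer) >= self.window:
            self._publish_derived()

    # --- producer interface: deliver derived events downstream ---
    def accept_subscribe(self, callback):
        self._subscribers.append(callback)

    def _publish_derived(self):
        derived = {"event_type": "cpu_load_avg",
                   "value": sum(self._buffer) / len(self._buffer)}
        self._buffer.clear()
        for send in self._subscribers:
            send(derived)
```

Placed “near” the consumers it serves, such a component absorbs repeated requests that would otherwise all reach the original producer.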

4.5 Sources of Event Data

The data used to construct events can be gathered from many sources. Hardware or software
sensors that sample performance metrics in real time constitute one type of data source. Another
is a database with a query interface, which can provide historical data. Entire monitoring systems,
such as the Network Weather Service [9], can serve as a source of events. Additionally,
application timings from tools such as Autopilot [3] or NetLogger [5] can provide events related to
a specific application.

A producer may be associated with a single source, all sources on a given host, all sources on a
given subnet, or an arbitrary group of sources. Figure 3 shows one possible configuration, but the
architecture allows the performance system implementers to choose the configuration that best
suits their scalability and reliability needs. The GMA specifies neither the relationship nor the
interface between the measurement data sources and the GMA producer.

[Figure 3: Sources of Event Data]

5. Sample Use

A sample use of the GMA is shown in Figure 4. Event data is collected on the two hosts and at
the network routers between them. The host and network sensors are the sources of the
measurement data, which is managed by a producer. The producer registers the availability of
the host and network events in the directory service. A real-time monitoring consumer subscribes
to all available event data for real-time visualization and performance analysis. The producer is
capable of computing summaries of network throughput and latency data based on parameters
provided by a “network-aware” client. This client uses the summarized network information to
optimally set its TCP buffer size. The producer’s event data is also sent to an archive.


[Figure 4: Sample GMA Usage]

6. Implementation Issues

The purpose of a monitoring system is to reliably deliver timely and accurate information without
perturbing the system. We believe that the architecture described can be implemented with
acceptable levels of performance, scalability, and security. Unsatisfactory implementations are,
however, also possible.

In this section we present implementation characteristics that have emerged from development
experiences as being important to the success of monitoring and dynamic performance analysis
systems. These characteristics are presented as a guide to developers producing or intending to
produce implementations of the GMA. We recommend that the following strategies be considered
and incorporated in implementations that are intended to serve as more than proof-of-concept
investigations.

o System components must be fault tolerant. In large-scale distributed systems, failures
will occur. For example, computer systems with monitoring servers will go down, and
these monitoring servers should be restarted automatically from check-pointed internal
data. Directory servers will go down, and the data in these servers should be replicated in
other servers. Networks can go down, and monitoring components must automatically
reconnect and synchronize. The components of a monitoring system must be able to
tolerate and recover from failures, and building fault tolerance into the monitoring system
from the start will save effort later.
o The data management system must adapt to changing performance conditions.
Dynamic performance data is often used to determine whether the shared Grid resources
are performing well (e.g., fault diagnosis) or whether the current Grid load will admit a
particular application (e.g., resource allocation and scheduling). To assess dynamic
performance fluctuation, the data management system cannot itself be rendered
inoperable or inaccessible by the very fluctuations it seeks to capture. As such, the data


management system must use the data it gathers to control its own execution and
resources in the face of dynamically changing conditions.
o All system components must scale. The monitoring system must be able to operate
ubiquitously with respect to the resources or application components being monitored.
To facilitate this scaling, one must be able to add additional producers and additional
directory servers as needed, reducing the load where necessary. In the case where many
consumers are requesting the same event data, the use of a producer reduces the
amount of work on and the amount of network traffic from the host being monitored.
Another important consideration is hierarchical control mechanisms for coordinating the
resource load generated by producers; the pool of producers should not be managed as
a “flat” collection. Moreover, distributed caching can be implemented by special-purpose
consumer-producer nodes that are programmed to store and forward data as a way of
relieving congestion and contention for particular data.
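The store-and-forward consumer-producer nodes described above might look like the following minimal sketch (Python; the class and method names are invented for illustration):

```python
class CachingRepublisher:
    """Consumer-producer node that consumes from one upstream producer
    and republishes to many consumers (all names are illustrative)."""

    def __init__(self, upstream):
        self.upstream = upstream     # callable returning (name, value)
        self.cache = {}              # event name -> most recent value
        self.subscribers = []        # downstream consumer callbacks
        self.upstream_reads = 0

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def poll_upstream(self):
        # A single upstream read fans out to every subscriber, so the
        # monitored host sees one consumer regardless of actual demand.
        name, value = self.upstream()
        self.upstream_reads += 1
        self.cache[name] = value
        for callback in self.subscribers:
            callback(name, value)

    def query(self, name):
        # Queries for recent data are answered from the cache and
        # generate no traffic back to the monitored host.
        return self.cache.get(name)
```

Layering such nodes into a hierarchy gives the coordinated, non-flat producer pool described above: each level absorbs the demand of the level below it.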
o Monitoring data must be managed in a distributed fashion. Having a single,
centralized repository for dynamic data (however short its lifespan) causes two distinct
performance problems. The first is that the centralized repository for information or
control represents a single point of failure for the entire system. If the monitoring system
is to be used to detect network failure, and a network failure isolates a centralized
controller from separate system components, it will be unable to fulfill its role. All
components must be able to function when temporarily disconnected or unreachable
because of network or host failure. The second problem with centralized data
management is that it forms a performance bottleneck. For dynamic data, writes often
outnumber reads with measurements taken every few seconds or minutes. Experience
has shown that a centralized data repository simply cannot handle the load generated by
actively monitored resources at Grid scales.
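One way to keep the write path local is sketched below, under the assumption that each producer buffers its own recent measurements in a bounded store and answers consumer reads directly (Python; names are illustrative):

```python
from collections import deque


class LocalEventStore:
    """Bounded per-producer event buffer: writes stay local to the
    monitored host, and consumers read from the producer directly
    rather than through a central repository (names illustrative)."""

    def __init__(self, maxlen=1000):
        # A bounded deque drops the oldest entries automatically, so
        # frequent writes can never exhaust local storage.
        self.events = deque(maxlen=maxlen)

    def write(self, timestamp, name, value):
        self.events.append((timestamp, name, value))

    def read_since(self, since):
        # Reads are rare relative to writes, so a linear scan suffices.
        return [e for e in self.events if e[0] >= since]
```

Because each host holds only its own measurements, a network partition disables monitoring only of the unreachable hosts, and the frequent writes never cross the network at all.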
o System components must control their intrusiveness on the resources they
monitor. Different resources experience varying amounts of sensitivity to the load
introduced by monitoring. A two-megabyte disk footprint may be insignificant within a
10-terabyte storage system but extremely significant on a palm-top device or a RAM
disk. In general, performance monitors and other system components must have tunable
CPU, communication, memory, and storage requirements.
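One tuning strategy is for a sensor to measure the cost of each of its own samples and space subsequent samples so as to stay under a configured duty cycle. A sketch in Python (the class and its parameters are hypothetical):

```python
import time


class ThrottledSensor:
    """Sensor wrapper that bounds its own CPU intrusiveness by spacing
    measurements to stay under a duty-cycle budget (names hypothetical)."""

    def __init__(self, measure, max_duty_cycle=0.01, min_interval=1.0):
        self.measure = measure                # the underlying measurement
        self.max_duty_cycle = max_duty_cycle  # e.g. 0.01 = at most 1% CPU
        self.min_interval = min_interval      # floor on sampling interval

    def next_interval(self, last_cost):
        # If one measurement costs C seconds, waiting C / duty_cycle
        # seconds between measurements keeps utilization under budget.
        return max(self.min_interval, last_cost / self.max_duty_cycle)

    def sample(self):
        start = time.perf_counter()
        value = self.measure()
        cost = time.perf_counter() - start
        return value, self.next_interval(cost)
```

The same pattern extends to memory and network budgets: the sensor observes its own resource consumption and adapts, rather than relying on a fixed configuration chosen at deployment time.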
o Efficiency/ease-of-use tradeoffs for data formats should be carefully considered. In
choosing a data format, there are tradeoffs between ease-of-use and compactness.
While the easiest and most portable format may be ASCII text, including both event item
descriptions and event item data in each transmission, this is also the least compact.
Compressed binary representations fall at the opposite end of the ease/compactness
spectrum. Another approach is transmitting only the item data values and using a data
structure obtained separately to interpret the data. Implementers should carefully
consider the requirements of their monitoring system when selecting data formats.
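The tradeoff can be made concrete with a small comparison (Python; the event fields and the schema representation are invented for illustration). The same event is encoded once as self-describing ASCII and once as compact binary whose interpretation depends on a schema obtained separately:

```python
import json
import struct

# One event: a timestamp and a CPU load reading (fields are invented).
event = {"timestamp": 1011225600.0, "cpu.load": 0.75}

# Self-describing ASCII: portable and human-readable, but every message
# repeats the field names alongside the values.
ascii_msg = json.dumps(event).encode("ascii")

# Compact binary: only the values are transmitted; the receiver needs a
# schema (format string and field order) obtained separately.
SCHEMA = ("!dd", ["timestamp", "cpu.load"])
binary_msg = struct.pack(SCHEMA[0], event["timestamp"], event["cpu.load"])


def decode_binary(message, schema):
    fmt, fields = schema
    return dict(zip(fields, struct.unpack(fmt, message)))
```

Here the binary form is a fixed 16 bytes (two 8-byte doubles), while the self-describing form is several times larger; the cost of the savings is that producer and consumer must agree on the schema out of band.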
o Security standards are useful. Public key-based X.509 identity certificates [2] are a
recognized solution for cross-realm identification of users. When the certificate is
presented through a secure protocol such as SSL (Secure Socket Layer), the server side
can be assured that the connection is indeed to the legitimate user named in the
certificate. User (consumer) access at each of the points mentioned above (directory
lookup and requests to a producer) would require an identity certificate passed though a
secure protocol (e.g., SSL). A wrapper to the directory server and the producer could
both call the same authorization interface with the user’s identity and the name of the
resource the user wants to access. This authorization interface could return a list of
allowed actions or simply deny access if the user is unauthorized. Communication
between the producer and the sensors may also need to be controlled, so that a
malicious user cannot communicate directly with the monitoring process.
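The shared authorization interface suggested above might be sketched as follows (Python; the ACL contents and identity strings are purely illustrative):

```python
# A table mapping (identity, resource) to allowed actions; the identities
# and resources below are purely illustrative.
ACL = {
    ("/O=Grid/CN=Jane Doe", "cpu.load"): ["lookup", "subscribe"],
    ("/O=Grid/CN=Jane Doe", "disk.usage"): ["lookup"],
}


class NotAuthorized(Exception):
    pass


def authorize(identity, resource):
    """Return the actions the authenticated identity may perform on the
    named resource, or raise NotAuthorized. The identity is assumed to
    come from an X.509 certificate presented over a secure protocol;
    both the directory-server wrapper and the producer wrapper call
    this same interface."""
    actions = ACL.get((identity, resource))
    if not actions:
        raise NotAuthorized("%s may not access %s" % (identity, resource))
    return list(actions)
```

Centralizing the decision in one interface means directory lookup and producer requests cannot drift apart in policy, and the policy store behind it can be replaced without touching either wrapper.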

7. Security Considerations

A Grid monitoring system requires security mechanisms to ensure the integrity and privacy of
both the monitoring system and the event data itself. Specifying these mechanisms is, for the
most part, beyond the scope of this document, although the high-level discussion of security
considerations and standards in the preceding section applies.

8. Author Information

Brian Tierney
Lawrence Berkeley National Laboratory
bltierney@lbl.gov
ph 1-510-486-7381
fx 1-510-486-6363

Ruth Aydt
University of Illinois at Urbana-Champaign
aydt@uiuc.edu

Dan Gunter
Lawrence Berkeley National Laboratory
dkgunter@lbl.gov

Warren Smith
NASA Ames Research Center
wwsmith@arc.nasa.gov

Martin Swany
University of California, Santa Barbara
swany@cs.ucsb.edu

Valerie Taylor
Northwestern University
taylor@ece.nwu.edu

Rich Wolski
University of California, Santa Barbara
rich@cs.ucsb.edu

Glossary

GMA Grid Monitoring Architecture, as defined by the Global Grid Forum
Performance Working Group.

Acknowledgments

Input from many people went into this document, including almost all attendees of the various
Grid Forum meetings.

Dan Gunter and Brian Tierney are supported by the U.S. Dept. of Energy, Office of Science,
Office of Computational and Technology Research, Mathematical, Information, and
Computational Sciences Division, under contract DE-AC03-76SF00098 with the University of
California. Ruth Aydt is supported in part by the Department of Energy under contract DOE
W-7405-ENG-36 and by the National Science Foundation under Grant No. 9975020.

Any opinions, findings, and conclusions or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of the funding agencies.

Intellectual Property Statement

The GGF takes no position regarding the validity or scope of any intellectual property or other
rights that might be claimed to pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights might or might not be
available; neither does it represent that it has made any effort to identify any such rights. Copies
of claims of rights made available for publication and any assurances of licenses to be made
available, or the result of an attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this specification can be obtained from the
GGF Secretariat.

The GGF invites any interested party to bring to its attention any copyrights, patents or patent
applications, or other proprietary rights that may cover technology that may be required to
practice this recommendation. Please address the information to the GGF Executive Director.

Full Copyright Notice

Copyright (C) Global Grid Forum (2002). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works
that comment on or otherwise explain it or assist in its implementation may be prepared, copied,
published and distributed, in whole or in part, without restriction of any kind, provided that the
above copyright notice and this paragraph are included on all such copies and derivative works.
However, this document itself may not be modified in any way, such as by removing the copyright
notice or references to the GGF or other organizations, except as needed for the purpose of
developing Grid Recommendations in which case the procedures for copyrights defined in the
GGF Document process must be followed, or as required to translate it into languages other than
English.

The limited permissions granted above are perpetual and will not be revoked by the GGF or its
successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE
GLOBAL GRID FORUM DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN
WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY
OR FITNESS FOR A PARTICULAR PURPOSE.

References

[1] CORBA. Systems Management: Event Management Service. X/Open Document Number:
P437. http://www.opengroup.org/onlinepubs/008356299/.
[2] R. Housley, W. Ford, W. Polk, and D. Solo. Internet X.509 Public Key Infrastructure
Certificate and CRL Profile. IETF RFC 2459, January 1999.
[3] R. Ribler, J. Vetter, H. Simitci, and D. Reed. Autopilot: Adaptive Control of Distributed
Applications. Proceedings of the 7th IEEE Symposium on High-Performance Distributed
Computing, Chicago, July 1998.
[4] W. Smith. A Framework for Control and Observation in Distributed Environments. NASA
Advanced Supercomputing Division, NASA Ames Research Center, Moffett Field, CA.
NAS-01-006, June 2001.
[5] B. Tierney, W. Johnston, B. Crowley, G. Hoo, C. Brooks, and D. Gunter. The NetLogger
Methodology for High Performance Distributed Systems Performance Analysis.
Proceedings of IEEE High Performance Distributed Computing Conference, July 1998.
http://www-didc.lbl.gov/NetLogger/.
[6] B. Tierney, B. Crowley, D. Gunter, M. Holding, J. Lee, and M. Thompson. A Monitoring
Sensor Management System for Grid Environments. Proceedings of the IEEE High
Performance Distributed Computing Conference (HPDC-9), August 2000.
[7] A. Waheed, W. Smith, J. George, and J. Yan. An Infrastructure for Monitoring and
Management in Computational Grids. In Proceedings of the 2000 Conference on
Languages, Compilers, and Runtime Systems, 2000.
[8] M. Wahl, T. Howes, and S. Kille. Lightweight Directory Access Protocol (v3). Available from
ftp://ftp.isi.edu/in-notes/rfc2251.txt.
[9] R. Wolski, N. Spring, and J. Hayes. The Network Weather Service: A Distributed Resource
Performance Forecasting Service for Metacomputing. Future Generation Computing
Systems, 1999. http://nws.npaci.edu/.
