Observability

Observability is the ability to understand the internal state of a system by examining its outputs. In software, those outputs are telemetry data: traces, metrics, and logs.

To make a system observable, it must be instrumented. That is, the code must emit traces, metrics, or logs. The instrumented data must then be sent to an observability backend.

What is OpenTelemetry?

OpenTelemetry (OTel) is an open-source observability framework that provides a standardized way to collect metrics, logs, and traces from your applications and systems, so you can monitor performance and diagnose problems across distributed systems.

  • An observability framework and toolkit designed to facilitate the generation, export, and collection of telemetry data such as traces, metrics, and logs.
  • Open source, as well as vendor- and tool-agnostic, meaning that it can be used with a broad variety of observability backends, including open source tools like Jaeger and Prometheus, as well as commercial offerings. OpenTelemetry is not an observability backend itself.

Why OpenTelemetry Exists

Before OTel, different vendors had their own proprietary SDKs for telemetry. This made it painful to switch or combine tools.
OTel solves this by providing vendor-neutral APIs and SDKs for:

  • Tracing – Following a request across microservices.
  • Metrics – Measuring performance and resource usage.
  • Logging – Recording discrete events.

You can collect the data once and export it to any backend (Prometheus, Jaeger, Grafana Tempo, Elasticsearch, etc.) without rewriting instrumentation.

Key Concepts

| Concept | Meaning |
|---|---|
| Instrumentation | Adding OTel SDK calls to your code to collect telemetry. |
| Span | A timed unit of work (e.g., a "GET /users" request). |
| Trace | A collection of spans that represents the journey of a single request. |
| Context Propagation | Passing trace context between services so traces can be correlated. |
| Exporter | Sends collected telemetry data to a backend (Jaeger, Prometheus, etc.). |
| Collector | A separate service that receives, processes, and exports telemetry data from multiple apps. |
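
Context propagation is what ties spans from different services into one trace: the caller injects the current trace context into outgoing request headers (the W3C traceparent format) and the callee extracts it before starting its own spans. A minimal sketch, assuming opentelemetry 0.30 with the SDK's TraceContextPropagator, using a plain HashMap as a stand-in for HTTP headers:

use std::collections::HashMap;

use opentelemetry::{global, Context};
use opentelemetry::trace::Tracer;
use opentelemetry_sdk::propagation::TraceContextPropagator;

// Register the W3C Trace Context propagator once at startup.
global::set_text_map_propagator(TraceContextPropagator::new());

// Caller side: serialize the current trace context into a carrier
// (a HashMap here, standing in for outgoing HTTP headers).
let mut headers: HashMap<String, String> = HashMap::new();
global::get_text_map_propagator(|propagator| {
    propagator.inject_context(&Context::current(), &mut headers)
});

// Callee side: extract the context from the incoming headers and start
// a span that joins the caller's trace.
let parent_cx = global::get_text_map_propagator(|propagator| propagator.extract(&headers));
let tracer = global::tracer("checkout-service");
let _span = tracer.start_with_context("handle /checkout", &parent_cx);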

How It Works

Flow:

  1. Your app is instrumented with the OTel SDK or auto-instrumentation.
  2. It creates spans, metrics, and logs.
  3. Data is sent to the OpenTelemetry Collector.
  4. The collector processes and sends it to your observability backend.

📊 The Three Pillars of Observability

Trace – the whole storyline

A trace is the full journey of one request or transaction through your system. Traces add to the observability picture by telling you what happens at each step or action along a request's path; they provide the map of where something is going wrong.

  • In the restaurant analogy:
    One customer's entire visit — from entering the restaurant, ordering food, eating, to paying and leaving.

  • In OTel:
    A trace has a unique Trace ID and contains all the spans that happened as part of that request.

  • Example in an HTTP API:
    A POST /checkout request triggers:

    1. Web server receives request
    2. Calls inventory service
    3. Calls payment service
    4. Writes to database
      → All these are part of one trace.

Span – the scenes in the storyline

A span is a single operation or unit of work inside a trace.

  • In the restaurant analogy:
    One scene in the storyline, e.g., "Server takes the order", "Chef cooks main course", "Cashier processes payment".

  • In OTel:

    • Has a start time & end time
    • Can have attributes (db.statement, http.method, etc.)
    • Can be nested (parent/child relationship)
  • Example in the API trace:

    • Span 1: HTTP POST /checkout (parent)

      • Span 1.1: SELECT inventory (child)
      • Span 1.2: Process payment (child)
      • Span 1.3: INSERT order record (child)

📌 Traces are made up of spans. Every span knows:

  • Which trace it belongs to (trace_id)
  • Which span called it (parent_span_id)
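
A minimal sketch of this parent/child relationship, assuming opentelemetry 0.30 (span and attribute names are illustrative); the child span shares the parent's trace_id and records the parent's span_id as its parent_span_id:

use opentelemetry::{global, Context, KeyValue};
use opentelemetry::trace::{Span, TraceContextExt, Tracer};

let tracer = global::tracer("checkout-service");

// Parent span: the incoming HTTP request.
let parent = tracer.start("HTTP POST /checkout");
println!("trace_id = {:?}", parent.span_context().trace_id());
let parent_cx = Context::current_with_span(parent);

// Child span: one unit of work inside the request, started under the
// parent's context so the two are linked in the same trace.
let mut child = tracer.start_with_context("SELECT inventory", &parent_cx);
child.set_attribute(KeyValue::new("db.statement", "SELECT * FROM inventory"));
child.end();

// End the parent after its children have finished.
parent_cx.span().end();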

Metric – the scoreboard

A metric is a numerical measurement over time. Metrics provide a high-level picture of the state of a system. Because metrics are numeric values that can be compared against known thresholds, they are the foundation of alerting.

  • In the restaurant analogy:
    The scoreboard showing:

    • Number of customers served per hour
    • Average wait time
    • Revenue per day
  • In OTel:

    • Common types: Counter, Gauge, Histogram

    • Examples:

      • http.server.request_count (counter)
      • memory_usage_bytes (gauge)
      • http.request.duration (histogram)
  • Metrics are aggregated — you don't look at every single event, you look at totals, averages, percentiles over time.
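
A short sketch of the three instrument types, assuming opentelemetry 0.30's meter API (instrument names follow the examples above; values are illustrative):

use opentelemetry::{global, KeyValue};

let meter = global::meter("checkout-service");

// Counter: a value that only goes up (e.g., total requests served).
let request_count = meter.u64_counter("http.server.request_count").build();
request_count.add(1, &[KeyValue::new("http.route", "/checkout")]);

// Gauge: a point-in-time value (e.g., current memory usage).
let memory_usage = meter.u64_gauge("memory_usage_bytes").build();
memory_usage.record(128 * 1024 * 1024, &[]);

// Histogram: a distribution of values (e.g., request duration in seconds).
let request_duration = meter.f64_histogram("http.request.duration").build();
request_duration.record(0.042, &[KeyValue::new("http.route", "/checkout")]);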

Logs – the detailed diary

Logs provide an audit trail of activity from a single process, creating informational context. Logs act as atomic events, detailing what is occurring in the services in your application.

  • In the restaurant analogy:
    The detailed diary entries:

    • "10:15 AM: Customer #42 requested extra cheese"
    • "10:16 AM: Kitchen started preparing order #42"
    • "10:17 AM: ERROR: Ran out of mozzarella cheese"
  • In OTel:

    • Structured logs with timestamps and levels

    • Examples:

      info!("Starting Salvo server with OpenTelemetry");
      warn!("Database connection retrying...");
      error!("Failed to process payment: {}", error_msg);
  • Log Levels: TRACE, DEBUG, INFO, WARN, ERROR

  • Correlation: Logs can include trace_id and span_id to correlate with traces

  • Context: Rich structured data (user_id, request_id, etc.)
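
A short sketch of structured, leveled log events using the tracing macros shown above; the field names are illustrative, and with the opentelemetry-appender-tracing bridge listed in the Cargo.toml later in this README these records can be exported as OTel log records alongside traces and metrics:

use tracing::{error, info, warn};

// Structured fields (user_id, order_id, etc.) travel with the event,
// so the backend can filter on them instead of parsing message strings.
info!(user_id = 42, order_id = "A-1001", "Starting checkout");
warn!(attempt = 2, "Database connection retrying...");
error!(reason = "out of mozzarella", order_id = "A-1001", "Failed to process order");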

📊 Putting it together

Restaurant analogy summary:

| OTel Concept | Restaurant Analogy | Example |
|---|---|---|
| Trace | The full dining experience | Customer's entire dinner visit |
| Span | A single step in that visit | "Server takes order" |
| Metric | The stats across many visits | Avg. cooking time today |
| Log | A diary entry about one event | "10:17 AM: Ran out of mozzarella cheese" |

Tech example (HTTP service):

| OTel Concept | Example |
|---|---|
| Trace | One POST /checkout journey |
| Span | ValidateCart() function call |
| Metric | Average request latency over 5 minutes |
| Log | A "Failed to process payment" error event |

🏗️ Observability Backend Systems

Tempo - Distributed Tracing

Tempo is an open source, easy-to-use, and high-scale distributed tracing backend. Tempo is cost-efficient, requiring only object storage to operate, and is deeply integrated with Grafana, Prometheus, and Loki. Tempo can ingest common open source tracing protocols, including Jaeger, Zipkin, and OpenTelemetry.

Prometheus - Metrics Storage

Prometheus is a powerful open-source monitoring and alerting system designed for reliability and scalability. It collects metrics from configured targets at given intervals, evaluates rule expressions, displays results, and can trigger alerts when specified conditions are met. Prometheus stores all data as time series and uses a powerful query language (PromQL) for analysis.

Loki - Log Aggregation

Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. It is designed to be very cost effective and easy to operate, as it does not index the contents of the logs, but rather a set of labels for each log stream.

Alloy - OpenTelemetry Collector

Alloy is a flexible, high-performance, vendor-neutral distribution of the OpenTelemetry Collector. It is fully compatible with the most popular open source observability standards, such as OpenTelemetry and Prometheus. Alloy also supersedes Promtail: it takes over the log collector/scraper role traditionally filled by Promtail, Grafana Agent, or an OTel Collector agent.


🏛️ System Architecture

graph TB
    subgraph "Rust Application"
        App[Rust Server<br/>Port: 5800]
        App --> |Logs| OtelLogs[OpenTelemetry<br/>Logs Provider]
        App --> |Traces| OtelTraces[OpenTelemetry<br/>Trace Provider]
        App --> |Metrics| OtelMetrics[OpenTelemetry<br/>Metrics Provider]
    end
    
    subgraph "Data Collection"
        OtelLogs --> |HTTP/4318| Alloy
        OtelTraces --> |HTTP/4318| Alloy
        OtelMetrics --> |HTTP/4318| Alloy
        Alloy[Grafana Alloy<br/>OTLP Receiver<br/>Ports: 4317/4318]
    end
    
    subgraph "Storage Backends"
        Alloy --> |Forward Traces| Tempo[Tempo<br/>Trace Storage<br/>Port: 4317]
        Alloy --> |Forward Logs| Loki[Loki<br/>Log Storage<br/>Port: 3100]
        Alloy --> |Forward Metrics| Prometheus[Prometheus<br/>Metrics Storage<br/>Port: 9090]
    end
    
    subgraph "Visualization"
        Grafana[Grafana Dashboard<br/>Port: 3000]
        Grafana --> |Query| Tempo
        Grafana --> |Query| Loki
        Grafana --> |Query| Prometheus
    end
    
    subgraph "Kubernetes"
        K8s[K8s Cluster<br/>Namespace: monitoring]
        K8s -.-> Alloy
        K8s -.-> Tempo
        K8s -.-> Loki
        K8s -.-> Prometheus
        K8s -.-> Grafana
    end

The stack creates a complete observability pipeline:

  • Applications → Alloy:4318 (OTLP HTTP port)
  • Alloy → Tempo:4317 (forwarded traces)
  • Alloy → Loki:3100 (forwarded logs)
  • Alloy → Prometheus:9090 (forwarded metrics)
  • Grafana ↔ All backends (unified observability dashboard)

Ports 4317 (gRPC) and 4318 (HTTP) are the standard OTLP ports, much like port 80 for HTTP; they carry all OpenTelemetry telemetry (traces, metrics, and logs), not just traces.

📁 Project Structure

observability/
├── 📁 backend/              # Rust application with OpenTelemetry
│   ├── Cargo.toml           # Dependencies and project configuration
│   ├── Cargo.lock           # Lock file for reproducible builds
│   ├── Dockerfile           # Container image for the Rust server
│   └── src/
│       └── main.rs          # Main application with OTel instrumentation
├── 📁 k8s/                  # Kubernetes manifests
│   ├── deployment.yml       # Rust server deployment configuration
│   └── service.yml          # Kubernetes service for rust-server
├── 📁 helm/                 # Helm values for observability stack
│   ├── alloy/
│   │   └── values.yml       # Alloy (OTel Collector) configuration
│   ├── grafana/
│   │   └── values.yml       # Grafana dashboard configuration
│   ├── loki/
│   │   └── values.yml       # Loki (logs) configuration
│   ├── prometheus/
│   │   └── values.yml       # Prometheus (metrics) configuration
│   └── tempo/
│       └── values.yml       # Tempo (traces) configuration
├── skaffold.yaml            # Development workflow automation
├── Makefile                 # Convenient commands for development
└── README.md                # This documentation

🚀 Key Components

| Component | Purpose | Port | Configuration |
|---|---|---|---|
| rust-server | Demo Rust app with OTel instrumentation | 5800 | backend/src/main.rs |
| Alloy | OpenTelemetry Collector (data pipeline) | 4317/4318 | helm/alloy/values.yml |
| Grafana | Visualization dashboard | 3000 | helm/grafana/values.yml |
| Loki | Log aggregation system | 3100 | helm/loki/values.yml |
| Prometheus | Metrics storage | 9090 | helm/prometheus/values.yml |
| Tempo | Distributed tracing backend | 3200 | helm/tempo/values.yml |

🛠️ Implementation Guide

Set up OpenTelemetry

Add the following crates to your Cargo.toml file.

dotenv = "0.15.0"
opentelemetry = { version = "0.30.0", features = ["logs", "metrics", "trace"] }
opentelemetry-appender-tracing = "0.30.1"
opentelemetry-otlp = { version = "0.30.0", features = ["logs", "metrics", "trace", "tokio"] }
opentelemetry-semantic-conventions = "0.30.0"
opentelemetry_sdk = { version = "0.30.0", features = ["logs", "metrics", "trace"] }
salvo = { version = "0.82.0", features = ["cors", "logging", "otel", "session"] }
tokio = { version = "1.44.1", features = ["full"] }
tracing = "0.1.41"
tracing-subscriber = { version = "0.3.19", features = ["env-filter", "fmt", "json", "registry", "tracing"] }

Note: Check main.rs for full code.

Understanding OpenTelemetry Crates

1️⃣ opentelemetry – the core API

  • What it is:
    The core, vendor-agnostic API layer of OpenTelemetry for Rust.
  • Role:
    Defines the traits, data types, and basic functions for creating traces, spans, and metrics — but doesn't decide how they're exported or processed.

Think of this as the interface that knows what a Trace/Span/Metric is, but not where it goes.

2️⃣ opentelemetry-sdk

  • What it is:
    The default SDK implementation of the OTel API for Rust.

  • Role:

    • Actually stores spans in memory until export.
    • Handles batching, sampling, aggregation.
    • Lets you configure resources (service name, version, etc.).
    • Connects API calls from opentelemetry to exporters like Jaeger or OTLP.

3️⃣ opentelemetry-otlp

  • What it is:
    An exporter implementation that sends telemetry data to an OTLP (OpenTelemetry Protocol) endpoint — usually the OTel Collector.
  • Role:
    Converts spans/metrics into OTLP gRPC or HTTP Protobuf format and ships them off.

Without this crate, you could still create spans in memory — but they'd never leave your app.
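
A minimal sketch of wiring these three crates together at startup; it assumes the 0.30 APIs with the default gRPC (tonic) transport, and the endpoint and service name are illustrative (see main.rs for the project's actual setup):

use opentelemetry::global;
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::{trace::SdkTracerProvider, Resource};

// Exporter (opentelemetry-otlp): ships spans to an OTLP endpoint,
// e.g. the Alloy collector from the architecture above.
let exporter = opentelemetry_otlp::SpanExporter::builder()
    .with_tonic()
    .with_endpoint("http://alloy.monitoring.svc:4317")
    .build()
    .expect("failed to build OTLP span exporter");

// SDK (opentelemetry_sdk): owns the pipeline, resource attributes, and
// export strategy (a simple exporter here; batching is typical in production).
let provider = SdkTracerProvider::builder()
    .with_resource(Resource::builder().with_service_name("rust-server").build())
    .with_simple_exporter(exporter)
    .build();

// API (opentelemetry): register the provider globally so that
// global::tracer("...") hands out tracers backed by this pipeline.
global::set_tracer_provider(provider);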

4️⃣ opentelemetry_semantic_conventions

  • What it is:
    A crate that contains standard attribute names & values defined by the OpenTelemetry spec.

  • Role:
    Ensures your telemetry data is consistent and portable between systems and languages.

  • Example:
    Instead of:

    KeyValue::new("service", "salvo-app")

    You'd use:

    use opentelemetry_semantic_conventions::resource::SERVICE_NAME;
    KeyValue::new(SERVICE_NAME, "salvo-app")

    → This way, Jaeger/Prometheus/Grafana knows exactly how to interpret service.name, http.method, net.peer.ip, etc.

If you make up your own attribute keys ("foo"), they may not show up in dashboards or get special treatment.

Instrument your application

Add tracing

use opentelemetry::{global, KeyValue};
use opentelemetry::trace::{Span, SpanKind, Tracer};

// First, we need to get a tracer object.
let tracer = global::tracer("my-tracer");

// With the tracer, we can now start new spans.
let mut span = tracer
    .span_builder("Call to /myendpoint")
    .with_kind(SpanKind::Internal)
    .start(&tracer);
span.set_attribute(KeyValue::new("http.method", "GET"));
span.set_attribute(KeyValue::new("net.protocol.version", "1.1"));

// TODO: Your code goes here

span.end();

In the above code, we:

  • Create a new span and name it "Call to /myendpoint"
  • Add two attributes, following the semantic naming conventions, that are specific to this span's action: the HTTP method and protocol version
  • Add a TODO in place of the eventual business logic
  • Call the span's end() method to complete the span
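
When the business logic can fail, it is common to record the error on the span and set an error status before ending it. A minimal sketch, assuming opentelemetry 0.30's Span API; do_work() is a hypothetical stand-in for the TODO above, and this replaces the plain span.end() call:

use opentelemetry::trace::{Span, Status};

// Hypothetical fallible operation standing in for the business logic.
fn do_work() -> Result<(), std::io::Error> {
    Ok(())
}

match do_work() {
    Ok(_) => span.set_status(Status::Ok),
    Err(err) => {
        // Record the error as a span event and mark the span as failed.
        span.record_error(&err);
        span.set_status(Status::error(err.to_string()));
    }
}
span.end();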

Collect metrics

use opentelemetry::global;

// First, we need to get a meter object.
let meter = global::meter("request_counter");

// With the meter, we can now create individual instruments, such as a counter.
let updown_counter = meter.i64_up_down_counter("request_counter").build();

// We can now invoke the add() method of updown_counter to record new values with the counter.
updown_counter.add(1, &[]);
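
Attributes can be attached when recording a value so the backend can break the metric down by dimension, and an up/down counter can also decrement (for example, to track in-flight requests). A small sketch; the http.route label is illustrative:

use opentelemetry::KeyValue;

// Increment when a request starts, decrement when it finishes,
// labelled so Prometheus/Grafana can group the series by route.
updown_counter.add(1, &[KeyValue::new("http.route", "/checkout")]);
updown_counter.add(-1, &[KeyValue::new("http.route", "/checkout")]);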

🚀 Deployment Guide

☸️ Kubernetes Deployment

Quick Start

# Start minikube and deploy everything
make start
make dev

Access Services

Manual Setup

# Deploy observability stack with Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
kubectl create namespace monitoring

# Deploy components (see skaffold.yaml for full configuration)
skaffold run

⛵ Skaffold Development

Skaffold automates builds, deployments, and port forwarding:

# Development with hot reload
make dev

# Deploy once
skaffold run

# Clean up
skaffold delete

Edit backend/src/main.rs → Skaffold rebuilds → Auto-deploy to K8s

🎯 Helm Configuration

Customize observability components via helm/*/values.yml:

  • Alloy: OTLP receivers, processors, exporters
  • Grafana: Data sources, dashboards, auth
  • Loki/Prometheus/Tempo: Storage, retention, resources

🔧 Development Commands

# Quick start
make start      # Start minikube
make dev        # Deploy with hot reload
make status     # Show cluster status

# Cleanup
make stop       # Stop cluster
make delete     # Delete cluster

🚀 Production Notes

  • Resources: Min 4 CPU, 8GB RAM
  • Security: Enable RBAC, use secrets, TLS
  • Scaling: Use HPA, distributed deployments