Operations: Distributed Systems Troubleshooting

Overview

Distributed systems scatter request handling across multiple services, each with its own log format and conventions. Troubleshooting requires correlating events across services using correlation IDs (request_id, trace_id). Normalizing logs from all services into a unified format enables request tracing and root cause analysis.

Core Problem Statement

"Distributed requests span multiple services with incompatible log formats." To understand why a request failed, you need to follow it through the gateway, order service, and payment service, but each logs differently and uses different field names for correlation IDs.

Example Scenario

Your e-commerce platform has three services:

API Gateway: Routes requests, logs request_id
Order Service: Manages orders, logs trace_id
Payment Service: Processes payments, logs correlation_id

All three refer to the same correlation ID but use different names and formats. A request fails with "Payment service unavailable" and you need to trace the full flow to understand what happened.

Input Data

API Gateway Logs

2024-11-15 10:00:01.123 [api-gateway] request_id=req-abc123 method=POST path=/orders user=alice@example.com
2024-11-15 10:00:01.456 [api-gateway] request_id=req-abc123 upstream=order-service status=200 duration=145ms
2024-11-15 10:00:02.234 [api-gateway] request_id=req-def456 method=GET path=/orders/5001 user=alice@example.com
2024-11-15 10:00:02.567 [api-gateway] request_id=req-def456 upstream=order-service status=200 duration=23ms
2024-11-15 10:00:03.345 [api-gateway] request_id=req-ghi789 method=POST path=/orders user=bob@example.com
2024-11-15 10:00:03.678 [api-gateway] request_id=req-ghi789 upstream=order-service status=500 duration=523ms error="Payment service unavailable"

Gateway logs with request_id and upstream service responses.

Order Service Logs

2024-11-15T10:00:01.234Z [order-service] trace_id=req-abc123 event=order_created order_id=5001 user_id=123
2024-11-15T10:00:01.345Z [order-service] trace_id=req-abc123 event=calling_payment_service order_id=5001 amount=99.99
2024-11-15T10:00:01.445Z [order-service] trace_id=req-abc123 event=payment_success order_id=5001
2024-11-15T10:00:02.245Z [order-service] trace_id=req-def456 event=order_lookup order_id=5001
2024-11-15T10:00:02.555Z [order-service] trace_id=req-def456 event=order_found order_id=5001 status=paid
2024-11-15T10:00:03.356Z [order-service] trace_id=req-ghi789 event=order_created order_id=5002 user_id=456
2024-11-15T10:00:03.467Z [order-service] trace_id=req-ghi789 event=calling_payment_service order_id=5002 amount=149.99
2024-11-15T10:00:03.667Z [order-service] trace_id=req-ghi789 event=payment_failed order_id=5002 error="Connection timeout"

Order service logs with trace_id and business events.

Payment Service Logs

2024-11-15T10:00:01.356Z payment-svc correlation_id=req-abc123 action=process_payment order=5001 amount=99.99
2024-11-15T10:00:01.423Z payment-svc correlation_id=req-abc123 action=payment_authorized order=5001 txn_id=txn-xyz789
2024-11-15T10:00:03.478Z payment-svc correlation_id=req-ghi789 action=process_payment order=5002 amount=149.99
2024-11-15T10:00:03.656Z payment-svc correlation_id=req-ghi789 action=payment_error order=5002 error="Payment gateway timeout"

Payment service logs with correlation_id and payment processing events.

Normalization Rules

Create rules that extract events and normalize correlation ID names:

Distributed Tracing Rules

rules:
  # API Gateway: Incoming request
  - name: gateway_request
    pattern:
      - field: timestamp
      - text: " [api-gateway] request_id="
      - field: request_id
      - text: " method="
      - field: method
      - text: " path="
      - field: path
      - text: " user="
      - field: user
    output: "[{request_id}] gateway:request:{method}:{path}"

  # API Gateway: Upstream response (success)
  - name: gateway_response_success
    pattern:
      - field: timestamp
      - text: " [api-gateway] request_id="
      - field: request_id
      - text: " upstream="
      - field: upstream
      - text: " status="
      - field: status
      - text: " duration="
      - field: duration
    output: "[{request_id}] gateway:response:status={status}"

  # API Gateway: Upstream response (error)
  - name: gateway_response_error
    pattern:
      - field: timestamp
      - text: " [api-gateway] request_id="
      - field: request_id
      - text: " upstream="
      - field: upstream
      - text: " status="
      - field: status
      - text: " duration="
      - field: duration
      - text: " error="
      - field: error
    output: "[{request_id}] gateway:error:status={status}"

  # Order Service: Order created
  - name: order_created
    pattern:
      - field: timestamp
      - text: " [order-service] trace_id="
      - field: trace_id
      - text: " event=order_created order_id="
      - field: order_id
      - text: " user_id="
      - field: user_id
    output: "[{trace_id}] order:created:order_id={order_id}"

  # Order Service: Calling payment
  - name: order_calling_payment
    pattern:
      - field: timestamp
      - text: " [order-service] trace_id="
      - field: trace_id
      - text: " event=calling_payment_service order_id="
      - field: order_id
      - text: " amount="
      - field: amount
    output: "[{trace_id}] order:calling_payment"

  # Order Service: Payment success
  - name: order_payment_success
    pattern:
      - field: timestamp
      - text: " [order-service] trace_id="
      - field: trace_id
      - text: " event=payment_success order_id="
      - field: order_id
    output: "[{trace_id}] order:payment_ok"

  # Order Service: Payment failed
  - name: order_payment_failed
    pattern:
      - field: timestamp
      - text: " [order-service] trace_id="
      - field: trace_id
      - text: " event=payment_failed order_id="
      - field: order_id
      - text: " error="
      - field: error
    output: "[{trace_id}] order:payment_failed"

  # Order Service: Order lookup
  - name: order_lookup
    pattern:
      - field: timestamp
      - text: " [order-service] trace_id="
      - field: trace_id
      - text: " event=order_lookup order_id="
      - field: order_id
    output: "[{trace_id}] order:lookup:order_id={order_id}"

  # Order Service: Order found
  - name: order_found
    pattern:
      - field: timestamp
      - text: " [order-service] trace_id="
      - field: trace_id
      - text: " event=order_found order_id="
      - field: order_id
      - text: " status="
      - field: status
    output: "[{trace_id}] order:found:status={status}"

  # Payment Service: Process payment
  - name: payment_process
    pattern:
      - field: timestamp
      - text: " payment-svc correlation_id="
      - field: correlation_id
      - text: " action=process_payment order="
      - field: order
      - text: " amount="
      - field: amount
    output: "[{correlation_id}] payment:processing"

  # Payment Service: Payment authorized
  - name: payment_authorized
    pattern:
      - field: timestamp
      - text: " payment-svc correlation_id="
      - field: correlation_id
      - text: " action=payment_authorized order="
      - field: order
      - text: " txn_id="
      - field: txn_id
    output: "[{correlation_id}] payment:authorized"

  # Payment Service: Payment error
  - name: payment_error
    pattern:
      - field: timestamp
      - text: " payment-svc correlation_id="
      - field: correlation_id
      - text: " action=payment_error order="
      - field: order
      - text: " error="
      - field: error
    output: "[{correlation_id}] payment:error"

Rules preserve: correlation ID (normalized across services), service name, event type. Rules ignore: timestamps, durations, transaction IDs, error messages (extract separately).

Implementation

CLIPython

# Combine logs from all services
cat api-gateway.log order-service.log payment-service.log | \
    patterndb-yaml --rules distributed-rules.yaml --quiet > combined.log

# Trace a specific request
request_id="req-abc123"
echo "Request trace for $request_id:"
grep "^\[$request_id\]" combined.log

# Find failed requests
echo "\nFailed requests:"
grep 'error\|failed' combined.log | cut -d']' -f1 | sort -u | \
    while read req_id; do
        echo "\n$req_id]:"
        grep "^$req_id\]" combined.log
    done

# Analyze error distribution
echo "\nError types:"
grep 'error\|failed' combined.log | cut -d' ' -f2 | sort | uniq -c

import sys
from patterndb_yaml import PatterndbYaml
from pathlib import Path
from collections import defaultdict
import re

# Redirect stdout to file for testing
_original_stdout = sys.stdout
output_file = open("output.txt", "w")
sys.stdout = output_file

# Combine and normalize all service logs
processor = PatterndbYaml(rules_path=Path("distributed-rules.yaml"))

log_files = ["api-gateway.log", "order-service.log", "payment-service.log"]

all_events = []
for log_file in log_files:
    with open(log_file) as f:
        from io import StringIO
        output = StringIO()
        processor.process(f, output)
        output.seek(0)

        for line in output:
            line = line.strip()
            if line:
                # Extract correlation ID and event
                if match := re.match(r'\[([^\]]+)\] (.+)', line):
                    correlation_id, event = match.groups()
                    all_events.append({
                        'correlation_id': correlation_id,
                        'event': event,
                        'raw': line
                    })

# Group by correlation ID
traces = defaultdict(list)
for event in all_events:
    traces[event['correlation_id']].append(event)

# Analyze traces
print("Distributed Request Analysis:\n")

for correlation_id, events in sorted(traces.items()):
    # Check for errors
    has_error = any('error' in e['event'] or 'failed' in e['event']
                   for e in events)

    status = "✗ FAILED" if has_error else "✓ SUCCESS"
    print(f"{status} {correlation_id} ({len(events)} events)")

    # Show trace
    for event in events:
        has_issue = 'error' in event['event'] or \
            'failed' in event['event']
        prefix = "  ⚠" if has_issue else "   "
        print(f"{prefix} {event['event']}")

    print()

# Restore stdout and close output file
sys.stdout = _original_stdout
output_file.close()

Expected Output

Normalized Distributed Trace

[req-abc123] gateway:request:POST:/orders
[req-abc123] order:created:order_id=5001
[req-abc123] order:calling_payment
[req-abc123] payment:processing
[req-abc123] payment:authorized
[req-abc123] order:payment_ok
[req-abc123] gateway:response:status=200
[req-def456] gateway:request:GET:/orders/5001
[req-def456] order:lookup:order_id=5001
[req-def456] order:found:status=paid
[req-def456] gateway:response:status=200
[req-ghi789] gateway:request:POST:/orders
[req-ghi789] order:created:order_id=5002
[req-ghi789] order:calling_payment
[req-ghi789] payment:processing
[req-ghi789] payment:error
[req-ghi789] order:payment_failed
[req-ghi789] gateway:error:status=500

All services unified with correlation IDs, showing complete request flows.

Request Flows

req-abc123 (Success): 1. Gateway receives POST /orders 2. Order service creates order 5001 3. Order service calls payment 4. Payment service processes payment 5. Payment service authorizes payment 6. Order service receives payment success 7. Gateway returns 200

req-def456 (Success): 1. Gateway receives GET /orders/5001 2. Order service looks up order 3. Order service finds order (status=paid) 4. Gateway returns 200

req-ghi789 (Failure): 1. Gateway receives POST /orders 2. Order service creates order 5002 3. Order service calls payment 4. Payment service processes payment 5. Payment service error (timeout) 6. Order service receives payment failure 7. Gateway returns 500

Practical Workflows

1. Root Cause Analysis

Trace failed requests to find root cause:

import sys
from patterndb_yaml import PatterndbYaml
from pathlib import Path
import re

# Redirect stdout to file for testing
_original_stdout = sys.stdout
output_file = open("output.txt", "w")
sys.stdout = output_file

processor = PatterndbYaml(rules_path=Path("distributed-rules.yaml"))

# Combine logs
log_files = ["api-gateway.log", "order-service.log", "payment-service.log"]
combined_logs = []

for log_file in log_files:
    with open(log_file) as f:
        combined_logs.extend(f.readlines())

# Normalize
from io import StringIO
input_data = StringIO("".join(combined_logs))
output_data = StringIO()
processor.process(input_data, output_data)
output_data.seek(0)

# Find failed requests
failed_requests = set()
for line in output_data:
    if 'error' in line.lower() or 'failed' in line.lower():
        if match := re.match(r'\[([^\]]+)\]', line):
            failed_requests.add(match.group(1))

# Analyze each failure
output_data.seek(0)
print("Root Cause Analysis:\n")

for request_id in sorted(failed_requests):
    print(f"Request: {request_id}")
    print("Timeline:")

    output_data.seek(0)
    for line in output_data:
        if line.startswith(f"[{request_id}]"):
            # Highlight errors
            if 'error' in line.lower() or 'failed' in line.lower():
                print(f"  → {line.strip()}  <-- ROOT CAUSE")
            else:
                print(f"    {line.strip()}")

    print()

# Restore stdout and close output file
sys.stdout = _original_stdout
output_file.close()

2. Service Dependency Mapping

Identify service call patterns:

import sys
from patterndb_yaml import PatterndbYaml
from pathlib import Path
import re

# Redirect stdout to file for testing
_original_stdout = sys.stdout
output_file = open("output.txt", "w")
sys.stdout = output_file

processor = PatterndbYaml(rules_path=Path("distributed-rules.yaml"))

# Process all logs
log_files = ["api-gateway.log", "order-service.log", "payment-service.log"]
events = []

for log_file in log_files:
    with open(log_file) as f:
        from io import StringIO
        output = StringIO()
        processor.process(f, output)
        output.seek(0)

        for line in output:
            if match := re.match(r'\[([^\]]+)\] ([^:]+):(.+)', line.strip()):
                req_id, service, event = match.groups()
                events.append((req_id, service, event))

# Build dependency graph
from collections import defaultdict
dependencies = defaultdict(set)

# Group events by request
from itertools import groupby
events.sort(key=lambda x: x[0])

for req_id, group in groupby(events, key=lambda x: x[0]):
    services_in_request = []
    for _, service, _ in group:
        if service not in services_in_request:
            services_in_request.append(service)

    # Record dependencies (sequential calls)
    for i in range(len(services_in_request) - 1):
        dependencies[services_in_request[i]].add(services_in_request[i+1])

# Display dependency graph
print("Service Dependency Map:\n")
for caller, callees in sorted(dependencies.items()):
    print(f"{caller} →")
    for callee in sorted(callees):
        print(f"    {callee}")

# Restore stdout and close output file
sys.stdout = _original_stdout
output_file.close()

3. Latency Bottleneck Detection

Find slow services in request path:

# Extract service events with timing (requires raw logs for timestamps)
cat api-gateway.log order-service.log payment-service.log | \
    patterndb-yaml --rules distributed-rules.yaml --quiet > normalized.log

# For each request, calculate service-level latency
# (This example assumes normalized output preserves original timestamps)

grep '^\[req-' normalized.log | \
    while read line; do
        req_id=$(echo "$line" | grep -o '^\[[^]]*\]')
        service=$(echo "$line" | cut -d' ' -f2 | cut -d':' -f1)
        echo "$req_id $service"
    done | \
    sort | uniq -c | \
    awk '{print $2 " " $3 ": " $1 " events"}'

4. Error Rate by Service

Calculate error rates for each service:

import sys
from patterndb_yaml import PatterndbYaml
from pathlib import Path
from collections import Counter
import re

# Redirect stdout to file for testing
_original_stdout = sys.stdout
output_file = open("output.txt", "w")
sys.stdout = output_file

processor = PatterndbYaml(rules_path=Path("distributed-rules.yaml"))

# Process logs
log_files = ["api-gateway.log", "order-service.log", "payment-service.log"]
service_events = Counter()
service_errors = Counter()

for log_file in log_files:
    with open(log_file) as f:
        from io import StringIO
        output = StringIO()
        processor.process(f, output)
        output.seek(0)

        for line in output:
            if match := re.match(r'\[([^\]]+)\] ([^:]+):(.+)', line.strip()):
                _, service, event = match.groups()
                service_events[service] += 1

                if 'error' in event.lower() or 'failed' in event.lower():
                    service_errors[service] += 1

# Calculate error rates
print("Error Rates by Service:\n")
print(f"{'Service':<20} {'Total':<10} {'Errors':<10} {'Rate':<10}")
print("-" * 50)

for service in sorted(service_events.keys()):
    total = service_events[service]
    errors = service_errors[service]
    rate = (errors / total * 100) if total > 0 else 0

    status = "⚠" if rate > 10 else "✓"
    print(f"{status} {service:<18} {total:<10} {errors:<10} {rate:>6.1f}%")

# Restore stdout and close output file
sys.stdout = _original_stdout
output_file.close()

5. Request Correlation Dashboard

Generate summary of all requests:

import sys
from patterndb_yaml import PatterndbYaml
from pathlib import Path
import re

# Redirect stdout to file for testing
_original_stdout = sys.stdout
output_file = open("output.txt", "w")
sys.stdout = output_file

processor = PatterndbYaml(rules_path=Path("distributed-rules.yaml"))

# Combine and process
log_files = ["api-gateway.log", "order-service.log", "payment-service.log"]
all_logs = []
for log_file in log_files:
    with open(log_file) as f:
        all_logs.extend(f.readlines())

from io import StringIO
output = StringIO()
processor.process(StringIO("".join(all_logs)), output)
output.seek(0)

# Group by request
from collections import defaultdict
requests = defaultdict(
    lambda: {'services': set(), 'events': [], 'status': 'unknown'}
)

for line in output:
    if match := re.match(r'\[([^\]]+)\] ([^:]+):(.+)', line.strip()):
        req_id, service, event = match.groups()
        requests[req_id]['services'].add(service)
        requests[req_id]['events'].append(f"{service}:{event}")

        # Determine status
        if 'error' in event.lower() or 'failed' in event.lower():
            requests[req_id]['status'] = 'failed'
        elif requests[req_id]['status'] == 'unknown':
            requests[req_id]['status'] = 'success'

# Display dashboard
print("Request Dashboard:\n")
print(f"{'Request ID':<15} {'Status':<10} {'Services':<8} {'Events':<8}")
print("-" * 50)

for req_id in sorted(requests.keys()):
    req = requests[req_id]
    status_icon = "✓" if req['status'] == 'success' else "✗"
    service_count = len(req['services'])
    event_count = len(req['events'])

    print(f"{req_id:<15} {status_icon} {req['status']:<9} "
          f"{service_count:<8} {event_count:<8}")

# Summary stats
total = len(requests)
failed = sum(1 for r in requests.values() if r['status'] == 'failed')
success = total - failed
failure_pct = failed / total * 100

print(f"\nSummary: {total} requests, {success} success, "
      f"{failed} failed ({failure_pct:.0f}% failure rate)")

# Restore stdout and close output file
sys.stdout = _original_stdout
output_file.close()

Key Benefits

End-to-end tracing: Follow requests across all services
Root cause identification: Pinpoint which service caused failures
Service dependency mapping: Understand service call patterns
Error correlation: Link errors across service boundaries
Performance analysis: Identify bottlenecks in distributed flows

Rules - Pattern matching and normalization
Statistics - Measure match coverage
Explain Mode - Debug pattern matching