DevOps: CI/CD Build Reproducibility

Overview

Reproducible builds are critical for supply chain security and debugging. Build logs contain ephemeral data (timestamps, PIDs, temp paths) that obscure whether builds are truly identical. Normalizing logs reveals real differences while ignoring noise.

Core Problem Statement

"Build logs vary with every execution due to ephemeral details, making it impossible to verify builds are reproducible." You need to distinguish between cosmetic differences (timestamps, process IDs) and substantive changes (dependency versions, compiler settings, build artifacts).

Example Scenario

Your CI/CD pipeline runs builds on every commit. You want to verify that:

  • Rebuilding the same commit produces identical outputs
  • Different commits with actual changes are detected
  • Build environment changes (compiler updates, dependency changes) are caught

Input Data

Build 1 (14:32)
[2024-11-15 14:32:11] Build started (PID: 28471)
[2024-11-15 14:32:11] Working directory: /tmp/build-a3f9e2d1
[2024-11-15 14:32:11] Compiler: gcc 11.4.0
[2024-11-15 14:32:12] Dependency: libssl 3.0.2
[2024-11-15 14:32:12] Dependency: zlib 1.2.11
[2024-11-15 14:32:13] Compiling src/main.c... success (0.234s)
[2024-11-15 14:32:13] Compiling src/utils.c... success (0.189s)
[2024-11-15 14:32:14] Linking binary... success
[2024-11-15 14:32:14] Binary hash: sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b
[2024-11-15 14:32:14] Build completed (total: 3.012s)

First build, with its own timestamp, PID, and temp directory.

Build 2 (15:47)
[2024-11-15 15:47:23] Build started (PID: 29834)
[2024-11-15 15:47:23] Working directory: /tmp/build-c7b4f8e2
[2024-11-15 15:47:23] Compiler: gcc 11.4.0
[2024-11-15 15:47:24] Dependency: libssl 3.0.2
[2024-11-15 15:47:24] Dependency: zlib 1.2.11
[2024-11-15 15:47:25] Compiling src/main.c... success (0.221s)
[2024-11-15 15:47:25] Compiling src/utils.c... success (0.198s)
[2024-11-15 15:47:26] Linking binary... success
[2024-11-15 15:47:26] Binary hash: sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b
[2024-11-15 15:47:26] Build completed (total: 3.145s)

Second build of the same code: different timestamp, PID, and temp directory, but the same dependencies and output.

Build 3 (16:22)
[2024-11-15 16:22:09] Build started (PID: 30192)
[2024-11-15 16:22:09] Working directory: /tmp/build-e9d3a1f7
[2024-11-15 16:22:09] Compiler: gcc 11.4.0
[2024-11-15 16:22:10] Dependency: libssl 3.0.7
[2024-11-15 16:22:10] Dependency: zlib 1.2.11
[2024-11-15 16:22:11] Compiling src/main.c... success (0.245s)
[2024-11-15 16:22:11] Compiling src/utils.c... success (0.201s)
[2024-11-15 16:22:12] Linking binary... success
[2024-11-15 16:22:12] Binary hash: sha256:8b4f9e3d0c2e5f7a1b3c6d9e4f8a2b5c
[2024-11-15 16:22:12] Build completed (total: 3.089s)

Third build with an updated dependency (libssl 3.0.7 instead of 3.0.2) and a different binary hash; it should be detected as different.
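
The difference between cosmetic and substantive changes can be seen on a single line taken from each of the three builds above. The following is a minimal Python sketch and not part of patterndb-yaml; the helper name `strip_timestamp` and the timestamp regex are assumptions made for illustration.

```python
import re

# Assumption: every log line starts with "[YYYY-MM-DD HH:MM:SS] ", as in the samples above.
TIMESTAMP_PREFIX = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\] ")


def strip_timestamp(line: str) -> str:
    """Drop the leading timestamp so only the message text is compared."""
    return TIMESTAMP_PREFIX.sub("", line)


# The libssl dependency line from builds 1, 2, and 3.
build1 = "[2024-11-15 14:32:12] Dependency: libssl 3.0.2"
build2 = "[2024-11-15 15:47:24] Dependency: libssl 3.0.2"
build3 = "[2024-11-15 16:22:10] Dependency: libssl 3.0.7"

# Raw lines differ even for identical builds, because of the timestamp.
print(build1 == build2)                                    # False
# Cosmetic difference: disappears once the timestamp is removed.
print(strip_timestamp(build1) == strip_timestamp(build2))  # True
# Substantive difference: the version change survives normalization.
print(strip_timestamp(build2) == strip_timestamp(build3))  # False
```

The normalization rules in the next section apply the same idea to every line type, and additionally drop PIDs, temp paths, and durations.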

Normalization Rules

Create rules that preserve meaningful data while filtering ephemeral details:

Build Normalization Rules
rules:
  # Build started - ignore timestamp and PID
  - name: build_started
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Build started (PID: "
      - field: pid
      - text: ")"
    output: "[build-started]"

  # Working directory - ignore temp path
  - name: working_directory
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Working directory: "
      - field: path
    output: "[working-directory]"

  # Compiler version - keep this, it matters
  - name: compiler
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Compiler: "
      - field: name
      - text: " "
      - field: version
    output: "[compiler:{name},{version}]"

  # Dependencies - keep these, they matter
  - name: dependency
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Dependency: "
      - field: name
      - text: " "
      - field: version
    output: "[dependency:{name},{version}]"

  # Compilation - keep file, ignore timing
  - name: compilation
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Compiling "
      - field: file
      - text: "... success ("
      - field: duration
      - text: ")"
    output: "[compiled:{file}]"

  # Linking - ignore timestamp (the line carries no other variable data)
  - name: linking
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Linking binary... success"
    output: "[linked]"

  # Binary hash - keep this, it's the build output signature
  - name: binary_hash
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Binary hash: "
      - field: hash
    output: "[binary-hash:{hash}]"

  # Build completed - ignore total time
  - name: build_completed
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Build completed (total: "
      - field: duration
      - text: ")"
    output: "[build-completed]"

The rules extract and preserve the compiler version, dependency versions, compiled file names, and binary hash. They discard timestamps, PIDs, temp paths, and timing measurements.
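
To make the effect of the rules concrete, here is a rough Python approximation that uses only the standard library. It is a sketch of the idea, not how patterndb-yaml matches lines internally: the regular expressions, the `RULES` table, and the `normalize_line` helper are hypothetical, and passing unmatched lines through unchanged is an assumption of this sketch.

```python
import re

# Hypothetical regex stand-ins for the eight YAML rules above.
TS = r"\[[^\]]+\] "  # the bracketed timestamp, matched but never captured
RULES = [
    (re.compile(TS + r"Build started \(PID: \d+\)$"), "[build-started]"),
    (re.compile(TS + r"Working directory: \S+$"), "[working-directory]"),
    (re.compile(TS + r"Compiler: (?P<name>\S+) (?P<version>\S+)$"),
     "[compiler:{name},{version}]"),
    (re.compile(TS + r"Dependency: (?P<name>\S+) (?P<version>\S+)$"),
     "[dependency:{name},{version}]"),
    (re.compile(TS + r"Compiling (?P<file>\S+)\.\.\. success \([^)]*\)$"),
     "[compiled:{file}]"),
    (re.compile(TS + r"Linking binary\.\.\. success$"), "[linked]"),
    (re.compile(TS + r"Binary hash: (?P<hash>\S+)$"), "[binary-hash:{hash}]"),
    (re.compile(TS + r"Build completed \(total: [^)]*\)$"), "[build-completed]"),
]


def normalize_line(line: str) -> str:
    """Return the normalized form of one log line."""
    line = line.rstrip("\n")
    for pattern, template in RULES:
        match = pattern.match(line)
        if match:
            # Only the named groups (the fields we keep) reach the output.
            return template.format(**match.groupdict())
    return line  # assumption: lines that match no rule are left as-is


print(normalize_line("[2024-11-15 14:32:11] Build started (PID: 28471)"))
print(normalize_line("[2024-11-15 14:32:12] Dependency: libssl 3.0.2"))
print(normalize_line("[2024-11-15 14:32:13] Compiling src/main.c... success (0.234s)"))
```

Fields that appear in the pattern but not in the output template (timestamp, PID, path, duration) are simply dropped, which is what makes two runs of the same build comparable.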

Implementation

# Normalize all three builds
patterndb-yaml --rules build-reproducibility-rules.yaml build1.log \
    --quiet > normalized-build1.log

patterndb-yaml --rules build-reproducibility-rules.yaml build2.log \
    --quiet > normalized-build2.log

patterndb-yaml --rules build-reproducibility-rules.yaml build3.log \
    --quiet > normalized-build3.log

# Compare builds 1 and 2 (should be identical)
if diff -q normalized-build1.log normalized-build2.log; then
    echo "✓ Builds 1 and 2 are reproducible"
fi

# Compare builds 2 and 3 (should differ)
if ! diff -q normalized-build2.log normalized-build3.log; then
    echo "✗ Build 3 has changes:"
    diff normalized-build2.log normalized-build3.log
fi

The same workflow using the Python API:

import sys
from patterndb_yaml import PatterndbYaml
from pathlib import Path
import subprocess

# Redirect stdout to file for testing
_original_stdout = sys.stdout
output_file = open("output.txt", "w")
sys.stdout = output_file

# Normalize all three builds
processor = PatterndbYaml(
    rules_path=Path("build-reproducibility-rules.yaml")
)

for build_num in [1, 2, 3]:
    with open(f"build{build_num}.log") as f:
        with open(f"normalized-build{build_num}.log", "w") as out:
            processor.process(f, out)

# Compare builds 1 and 2
result = subprocess.run(
    ["diff", "-q", "normalized-build1.log", "normalized-build2.log"],
    capture_output=True
)

if result.returncode == 0:
    print("✓ Builds 1 and 2 are reproducible")
else:
    print("✗ Builds differ unexpectedly")

# Compare builds 2 and 3
result = subprocess.run(
    ["diff", "normalized-build2.log", "normalized-build3.log"],
    capture_output=True,
    text=True
)

if result.returncode != 0:
    print("✗ Build 3 has changes:")
    print(result.stdout)

# Restore stdout and close output file
sys.stdout = _original_stdout
output_file.close()

Expected Output

Builds 1 and 2 (Identical - Reproducible)
[build-started]
[working-directory]
[compiler:gcc,11.4.0]
[dependency:libssl,3.0.2]
[dependency:zlib,1.2.11]
[compiled:src/main.c]
[compiled:src/utils.c]
[linked]
[binary-hash:sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b]
[build-completed]

Builds 1 and 2 produce identical normalized output, confirming reproducibility.

Build 3 (Different - Dependency Update)
[build-started]
[working-directory]
[compiler:gcc,11.4.0]
[dependency:libssl,3.0.7]
[dependency:zlib,1.2.11]
[compiled:src/main.c]
[compiled:src/utils.c]
[linked]
[binary-hash:sha256:8b4f9e3d0c2e5f7a1b3c6d9e4f8a2b5c]
[build-completed]

Build 3 shows a different dependency version (libssl 3.0.7) and a different binary hash.
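
The comparison between the two normalized outputs above can also be done in-process. The sketch below uses Python's standard `difflib` module on the normalized lines shown above; the variable names are illustrative, and the lines are inlined rather than read from the normalized log files.

```python
import difflib

# Normalized output of build 2, as listed above.
build2 = [
    "[build-started]",
    "[working-directory]",
    "[compiler:gcc,11.4.0]",
    "[dependency:libssl,3.0.2]",
    "[dependency:zlib,1.2.11]",
    "[compiled:src/main.c]",
    "[compiled:src/utils.c]",
    "[linked]",
    "[binary-hash:sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b]",
    "[build-completed]",
]

# Build 3 differs only in the libssl version and the binary hash.
build3 = list(build2)
build3[3] = "[dependency:libssl,3.0.7]"
build3[8] = "[binary-hash:sha256:8b4f9e3d0c2e5f7a1b3c6d9e4f8a2b5c]"

diff = difflib.unified_diff(
    build2, build3,
    fromfile="normalized-build2.log", tofile="normalized-build3.log",
    n=0, lineterm="",
)

# Keep only added/removed lines; skip the "---"/"+++" file headers and "@@" hunk markers.
changed = [
    line for line in diff
    if line[0] in "+-" and not line.startswith(("+++", "---"))
]

for line in changed:
    print(line)
# -[dependency:libssl,3.0.2]
# +[dependency:libssl,3.0.7]
# -[binary-hash:sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b]
# +[binary-hash:sha256:8b4f9e3d0c2e5f7a1b3c6d9e4f8a2b5c]
```

Only the two substantive lines appear in the diff; everything that varied between raw logs has already been normalized away.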

Practical Workflows

1. CI/CD Reproducibility Verification

Automatically verify builds are reproducible in your pipeline:

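#!/bin/bash
# Stop on errors; pipefail keeps tee from masking a failed docker build
set -eo pipefail

# Build twice from same commit
git checkout "$COMMIT_SHA"
# --no-cache forces a real rebuild instead of reusing cached layers
docker build --no-cache -t app:build1 . 2>&1 | tee build1.log
docker build --no-cache -t app:build2 . 2>&1 | tee build2.log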

# Normalize both builds
patterndb-yaml --rules build-rules.yaml build1.log --quiet > norm1.log
patterndb-yaml --rules build-rules.yaml build2.log --quiet > norm2.log

# Verify reproducibility
if ! diff -q norm1.log norm2.log; then
    echo "ERROR: Build is not reproducible"
    diff norm1.log norm2.log
    exit 1
fi

echo "✓ Build is reproducible"
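
If the two builds run in separate CI jobs, comparing the normalized files directly may be inconvenient. One possible variation, which is an assumption of this note and not something the workflow above prescribes, is to reduce each normalized log to a single digest and compare the digests. The `fingerprint` helper below is hypothetical and uses only Python's standard `hashlib`.

```python
import hashlib


def fingerprint(normalized_lines):
    """Return a SHA-256 hex digest of an iterable of normalized log lines."""
    digest = hashlib.sha256()
    for line in normalized_lines:
        # Strip line endings so CRLF vs. LF does not change the digest.
        digest.update(line.rstrip("\r\n").encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Excerpts of normalized output: two identical builds and one with a newer libssl.
norm1 = ["[compiler:gcc,11.4.0]", "[dependency:libssl,3.0.2]"]
norm2 = ["[compiler:gcc,11.4.0]\n", "[dependency:libssl,3.0.2]\n"]
norm3 = ["[compiler:gcc,11.4.0]", "[dependency:libssl,3.0.7]"]

print(fingerprint(norm1) == fingerprint(norm2))  # True
print(fingerprint(norm1) == fingerprint(norm3))  # False
```

A digest only says whether two builds match; when they do not, the normalized logs themselves are still needed to see which line changed.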

2. Dependency Change Detection

Detect when dependencies change between builds:

# Normalize baseline build
patterndb-yaml --rules build-rules.yaml baseline-build.log \
    --quiet > baseline-norm.log

# Normalize current build
patterndb-yaml --rules build-rules.yaml current-build.log \
    --quiet > current-norm.log

# Extract and compare dependencies
grep '^\[dependency:' baseline-norm.log | sort > baseline-deps.txt
grep '^\[dependency:' current-norm.log | sort > current-deps.txt

if ! diff -q baseline-deps.txt current-deps.txt; then
    echo "Dependency changes detected:"
    diff baseline-deps.txt current-deps.txt
fi
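
A plain diff of the dependency lines shows that something changed, but not whether a dependency was added, removed, or moved to another version. The sketch below classifies the changes, assuming the `[dependency:{name},{version}]` output format defined by the rules earlier; the function names are illustrative, and the input is inlined instead of read from `baseline-norm.log` and `current-norm.log`.

```python
import re

# Matches the normalized form "[dependency:<name>,<version>]".
DEP_LINE = re.compile(r"^\[dependency:(?P<name>[^,\]]+),(?P<version>[^\]]+)\]$")


def parse_dependencies(lines):
    """Map dependency name -> version for every dependency line."""
    deps = {}
    for line in lines:
        match = DEP_LINE.match(line.strip())
        if match:
            deps[match["name"]] = match["version"]
    return deps


def diff_dependencies(baseline, current):
    """Describe added, removed, and changed dependencies, sorted by name."""
    changes = []
    for name in sorted(baseline.keys() | current.keys()):
        old, new = baseline.get(name), current.get(name)
        if old == new:
            continue
        if old is None:
            changes.append(f"added {name} {new}")
        elif new is None:
            changes.append(f"removed {name} {old}")
        else:
            changes.append(f"changed {name} {old} -> {new}")
    return changes


baseline = parse_dependencies(
    ["[compiler:gcc,11.4.0]", "[dependency:libssl,3.0.2]", "[dependency:zlib,1.2.11]"]
)
current = parse_dependencies(
    ["[compiler:gcc,11.4.0]", "[dependency:libssl,3.0.7]", "[dependency:zlib,1.2.11]"]
)

print(diff_dependencies(baseline, current))
# ['changed libssl 3.0.2 -> 3.0.7']
```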

3. Build Artifact Verification

Verify build artifacts match across environments:

import sys
from patterndb_yaml import PatterndbYaml
from pathlib import Path
import re

# Redirect stdout to file for testing
_original_stdout = sys.stdout
output_file = open("output.txt", "w")
sys.stdout = output_file

processor = PatterndbYaml(rules_path=Path("build-rules.yaml"))

# Normalize builds from dev, staging, and prod environments
for env in ['dev', 'staging', 'prod']:
    with open(f"{env}-build.log") as f:
        with open(f"{env}-normalized.log", "w") as out:
            processor.process(f, out)

# Extract binary hashes
def get_binary_hash(normalized_log):
    with open(normalized_log) as f:
        for line in f:
            if match := re.match(r'\[binary-hash:(.*)\]', line):
                return match.group(1)
    return None

dev_hash = get_binary_hash("dev-normalized.log")
staging_hash = get_binary_hash("staging-normalized.log")
prod_hash = get_binary_hash("prod-normalized.log")

# A missing hash (None) in every log must not be reported as a match
if dev_hash is not None and dev_hash == staging_hash == prod_hash:
    print(f"✓ All environments produced identical binary: {dev_hash}")
else:
    print("✗ Binary hash mismatch across environments")
    print(f"  Dev:     {dev_hash}")
    print(f"  Staging: {staging_hash}")
    print(f"  Prod:    {prod_hash}")

# Restore stdout and close output file
sys.stdout = _original_stdout
output_file.close()

4. Historical Build Comparison

Compare current builds against historical baselines:

# Archive normalized build as baseline
patterndb-yaml --rules build-rules.yaml build.log --quiet > baseline.log
git add baseline.log
git commit -m "Archive build baseline"

# Later, compare new builds against baseline
patterndb-yaml --rules build-rules.yaml new-build.log --quiet > new-norm.log

# Show what changed
echo "Changes since baseline:"
diff baseline.log new-norm.log | grep '^[<>]' | while IFS= read -r line; do
    case "$line" in
        \<*) echo "  Removed: ${line:2}" ;;
        \>*) echo "  Added:   ${line:2}" ;;
    esac
done

5. Supply Chain Verification

Verify that builds in your supply chain match expectations:

#!/bin/bash
# Normalize vendor-provided build log
patterndb-yaml --rules build-rules.yaml vendor-build.log \
    --quiet > vendor-norm.log

# Normalize your own rebuild
patterndb-yaml --rules build-rules.yaml local-build.log \
    --quiet > local-norm.log

# Extract and compare critical fields
for field in compiler dependency binary-hash; do
    echo "Comparing $field..."
    grep "^\[$field:" vendor-norm.log | sort > vendor-$field.txt
    grep "^\[$field:" local-norm.log | sort > local-$field.txt

    if ! diff -q vendor-$field.txt local-$field.txt; then
        echo "⚠ $field mismatch:"
        diff vendor-$field.txt local-$field.txt
    else
        echo "✓ $field matches"
    fi
done
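
The per-field comparison above can be sketched in Python as well, which avoids the intermediate `vendor-*.txt` and `local-*.txt` files. This is an illustrative stand-in that assumes the `[field:value]` output format from the rules earlier; `collect` and `compare_fields` are hypothetical helper names, and the sample input is inlined instead of read from `vendor-norm.log` and `local-norm.log`.

```python
import re
from collections import Counter

# Matches normalized lines of the form "[<field>:<value>]".
TAGGED = re.compile(r"^\[(?P<field>[a-z-]+):(?P<value>.*)\]$")

CRITICAL_FIELDS = ("compiler", "dependency", "binary-hash")


def collect(lines, fields=CRITICAL_FIELDS):
    """Count the values seen for each critical field (order-independent)."""
    found = {field: Counter() for field in fields}
    for line in lines:
        match = TAGGED.match(line.strip())
        if match and match["field"] in found:
            found[match["field"]][match["value"]] += 1
    return found


def compare_fields(vendor_lines, local_lines, fields=CRITICAL_FIELDS):
    """Return {field: True if vendor and local values match}."""
    vendor, local = collect(vendor_lines, fields), collect(local_lines, fields)
    return {field: vendor[field] == local[field] for field in fields}


vendor = [
    "[compiler:gcc,11.4.0]",
    "[dependency:libssl,3.0.2]",
    "[dependency:zlib,1.2.11]",
    "[binary-hash:sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b]",
]
local = [
    "[compiler:gcc,11.4.0]",
    "[dependency:zlib,1.2.11]",
    "[dependency:libssl,3.0.7]",
    "[binary-hash:sha256:8b4f9e3d0c2e5f7a1b3c6d9e4f8a2b5c]",
]

for field, ok in compare_fields(vendor, local).items():
    print(f"{field}: {'matches' if ok else 'MISMATCH'}")
# compiler: matches
# dependency: MISMATCH
# binary-hash: MISMATCH
```

Using a `Counter` per field plays the role of `sort` in the shell version: the order in which dependencies are logged does not affect the result.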

Key Benefits

  • Verify reproducibility: Confirm identical commits produce identical builds
  • Detect real changes: Distinguish dependency updates from noise
  • Supply chain security: Verify vendor builds match your rebuilds
  • Debug build issues: Compare successful vs. failed builds meaningfully
  • Environment parity: Ensure dev/staging/prod build consistently