DevOps: CI/CD Build Reproducibility
Overview
Reproducible builds are critical for supply chain security and debugging. Build logs contain ephemeral data (timestamps, PIDs, temp paths) that obscure whether builds are truly identical. Normalizing logs reveals real differences while ignoring noise.
Core Problem Statement
"Build logs vary with every execution due to ephemeral details, making it impossible to verify builds are reproducible." You need to distinguish between cosmetic differences (timestamps, process IDs) and substantive changes (dependency versions, compiler settings, build artifacts).
Example Scenario
Your CI/CD pipeline runs builds on every commit. You want to verify that:
- Rebuilding the same commit produces identical outputs
- Different commits with actual changes are detected
- Build environment changes (compiler updates, dependency changes) are caught
Input Data
Build 1 (14:32)
[2024-11-15 14:32:11] Build started (PID: 28471)
[2024-11-15 14:32:11] Working directory: /tmp/build-a3f9e2d1
[2024-11-15 14:32:11] Compiler: gcc 11.4.0
[2024-11-15 14:32:12] Dependency: libssl 3.0.2
[2024-11-15 14:32:12] Dependency: zlib 1.2.11
[2024-11-15 14:32:13] Compiling src/main.c... success (0.234s)
[2024-11-15 14:32:13] Compiling src/utils.c... success (0.189s)
[2024-11-15 14:32:14] Linking binary... success
[2024-11-15 14:32:14] Binary hash: sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b
[2024-11-15 14:32:14] Build completed (total: 3.012s)
First build with specific timestamp, PID, and temp directory.
Build 2 (15:47)
[2024-11-15 15:47:23] Build started (PID: 29834)
[2024-11-15 15:47:23] Working directory: /tmp/build-c7b4f8e2
[2024-11-15 15:47:23] Compiler: gcc 11.4.0
[2024-11-15 15:47:24] Dependency: libssl 3.0.2
[2024-11-15 15:47:24] Dependency: zlib 1.2.11
[2024-11-15 15:47:25] Compiling src/main.c... success (0.221s)
[2024-11-15 15:47:25] Compiling src/utils.c... success (0.198s)
[2024-11-15 15:47:26] Linking binary... success
[2024-11-15 15:47:26] Binary hash: sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b
[2024-11-15 15:47:26] Build completed (total: 3.145s)
Second build of same code - different timestamp, PID, temp directory, but same dependencies and output.
Build 3 (16:22)
[2024-11-15 16:22:09] Build started (PID: 30192)
[2024-11-15 16:22:09] Working directory: /tmp/build-e9d3a1f7
[2024-11-15 16:22:09] Compiler: gcc 11.4.0
[2024-11-15 16:22:10] Dependency: libssl 3.0.7
[2024-11-15 16:22:10] Dependency: zlib 1.2.11
[2024-11-15 16:22:11] Compiling src/main.c... success (0.245s)
[2024-11-15 16:22:11] Compiling src/utils.c... success (0.201s)
[2024-11-15 16:22:12] Linking binary... success
[2024-11-15 16:22:12] Binary hash: sha256:8b4f9e3d0c2e5f7a1b3c6d9e4f8a2b5c
[2024-11-15 16:22:12] Build completed (total: 3.089s)
Third build with updated dependency (libssl 3.0.7) - should be detected as different.
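To make the noise concrete, here is a minimal sketch that compares the first three raw lines of builds 1 and 2, using only the Python standard library. The variable names and the timestamp regular expression are illustrative assumptions, not part of patterndb-yaml.

```python
import re

# First three raw lines of builds 1 and 2, copied from the logs above.
build1 = [
    "[2024-11-15 14:32:11] Build started (PID: 28471)",
    "[2024-11-15 14:32:11] Working directory: /tmp/build-a3f9e2d1",
    "[2024-11-15 14:32:11] Compiler: gcc 11.4.0",
]
build2 = [
    "[2024-11-15 15:47:23] Build started (PID: 29834)",
    "[2024-11-15 15:47:23] Working directory: /tmp/build-c7b4f8e2",
    "[2024-11-15 15:47:23] Compiler: gcc 11.4.0",
]

# Verbatim comparison: every line differs because of the timestamp.
raw_differing = sum(a != b for a, b in zip(build1, build2))

# Stripping only the leading timestamp still leaves the PID and temp path.
TIMESTAMP = re.compile(r"^\[\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\] ")
stripped_differing = sum(
    TIMESTAMP.sub("", a) != TIMESTAMP.sub("", b) for a, b in zip(build1, build2)
)

print(f"raw: {raw_differing} of {len(build1)} lines differ")
print(f"timestamps stripped: {stripped_differing} of {len(build1)} lines differ")
```

Only the compiler line matches once timestamps are removed, which is why the rules below also drop the PID and the working directory.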
Normalization Rules
Create rules that preserve meaningful data while filtering ephemeral details:
Build Normalization Rules
rules:
  # Build started - ignore timestamp and PID
  - name: build_started
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Build started (PID: "
      - field: pid
      - text: ")"
    output: "[build-started]"
  # Working directory - ignore temp path
  - name: working_directory
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Working directory: "
      - field: path
    output: "[working-directory]"
  # Compiler version - keep this, it matters
  - name: compiler
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Compiler: "
      - field: name
      - text: " "
      - field: version
    output: "[compiler:{name},{version}]"
  # Dependencies - keep these, they matter
  - name: dependency
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Dependency: "
      - field: name
      - text: " "
      - field: version
    output: "[dependency:{name},{version}]"
  # Compilation - keep file, ignore timing
  - name: compilation
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Compiling "
      - field: file
      - text: "... success ("
      - field: duration
      - text: ")"
    output: "[compiled:{file}]"
  # Linking - ignore timestamp
  - name: linking
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Linking binary... success"
    output: "[linked]"
  # Binary hash - keep this, it's the build output signature
  - name: binary_hash
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Binary hash: "
      - field: hash
    output: "[binary-hash:{hash}]"
  # Build completed - ignore total time
  - name: build_completed
    pattern:
      - text: "["
      - field: timestamp
      - text: "] Build completed (total: "
      - field: duration
      - text: ")"
    output: "[build-completed]"
Rules extract and preserve: compiler version, dependency versions, compiled files, and binary hash. Rules ignore: timestamps, PIDs, temp paths, and timing measurements.
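As a rough illustration of how text/field rules like these could work, the sketch below compiles two of the rules into regular expressions and applies them to log lines. It is a standalone approximation under stated assumptions, not the actual patterndb-yaml implementation: the `RULES` structure, `compile_rule`, `normalize`, the lazy field matching, and the pass-through of unmatched lines are all hypothetical.

```python
import re

# Hypothetical in-memory form of two of the rules above.
RULES = [
    {
        "name": "dependency",
        "pattern": [
            {"text": "["}, {"field": "timestamp"},
            {"text": "] Dependency: "}, {"field": "name"},
            {"text": " "}, {"field": "version"},
        ],
        "output": "[dependency:{name},{version}]",
    },
    {
        "name": "build_started",
        "pattern": [
            {"text": "["}, {"field": "timestamp"},
            {"text": "] Build started (PID: "}, {"field": "pid"},
            {"text": ")"},
        ],
        "output": "[build-started]",
    },
]


def compile_rule(rule):
    """Turn the text/field segments of a rule into one anchored regex."""
    parts = []
    for segment in rule["pattern"]:
        if "text" in segment:
            # Literal text must match exactly.
            parts.append(re.escape(segment["text"]))
        else:
            # Assumption: a field matches lazily up to the next literal text.
            parts.append(f"(?P<{segment['field']}>.+?)")
    return re.compile("^" + "".join(parts) + "$")


COMPILED = [(compile_rule(rule), rule["output"]) for rule in RULES]


def normalize(line):
    """Return the first matching rule's output; assumption: unmatched lines pass through."""
    for regex, template in COMPILED:
        match = regex.match(line)
        if match:
            # Fields not named in the template (timestamp, pid) are dropped.
            return template.format(**match.groupdict())
    return line


print(normalize("[2024-11-15 14:32:12] Dependency: libssl 3.0.2"))
print(normalize("[2024-11-15 15:47:23] Build started (PID: 29834)"))
```

The key design point is that a field is captured only so it can be either reused in the output template or silently discarded, which is how the same rule syntax both keeps versions and drops timestamps.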
Implementation
# Normalize all three builds
patterndb-yaml --rules build-reproducibility-rules.yaml build1.log \
    --quiet > normalized-build1.log
patterndb-yaml --rules build-reproducibility-rules.yaml build2.log \
    --quiet > normalized-build2.log
patterndb-yaml --rules build-reproducibility-rules.yaml build3.log \
    --quiet > normalized-build3.log
# Compare builds 1 and 2 (should be identical)
if diff -q normalized-build1.log normalized-build2.log; then
    echo "✓ Builds 1 and 2 are reproducible"
fi
# Compare builds 2 and 3 (should differ)
if ! diff -q normalized-build2.log normalized-build3.log; then
    echo "✗ Build 3 has changes:"
    diff normalized-build2.log normalized-build3.log
fi
import sys
from patterndb_yaml import PatterndbYaml
from pathlib import Path
import subprocess
# Redirect stdout to file for testing
_original_stdout = sys.stdout
output_file = open("output.txt", "w")
sys.stdout = output_file
# Normalize all three builds
processor = PatterndbYaml(
    rules_path=Path("build-reproducibility-rules.yaml")
)
for build_num in [1, 2, 3]:
    with open(f"build{build_num}.log") as f:
        with open(f"normalized-build{build_num}.log", "w") as out:
            processor.process(f, out)
# Compare builds 1 and 2
result = subprocess.run(
    ["diff", "-q", "normalized-build1.log", "normalized-build2.log"],
    capture_output=True
)
if result.returncode == 0:
    print("✓ Builds 1 and 2 are reproducible")
else:
    print("✗ Builds differ unexpectedly")
# Compare builds 2 and 3
result = subprocess.run(
    ["diff", "normalized-build2.log", "normalized-build3.log"],
    capture_output=True,
    text=True
)
if result.returncode != 0:
    print("✗ Build 3 has changes:")
    print(result.stdout)
# Restore stdout and close output file
sys.stdout = _original_stdout
output_file.close()
Expected Output
Builds 1 and 2 (Identical - Reproducible)
[build-started]
[working-directory]
[compiler:gcc,11.4.0]
[dependency:libssl,3.0.2]
[dependency:zlib,1.2.11]
[compiled:src/main.c]
[compiled:src/utils.c]
[linked]
[binary-hash:sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b]
[build-completed]
Builds 1 and 2 produce identical normalized output, confirming reproducibility.
Build 3 (Different - Dependency Update)
[build-started]
[working-directory]
[compiler:gcc,11.4.0]
[dependency:libssl,3.0.7]
[dependency:zlib,1.2.11]
[compiled:src/main.c]
[compiled:src/utils.c]
[linked]
[binary-hash:sha256:8b4f9e3d0c2e5f7a1b3c6d9e4f8a2b5c]
[build-completed]
Build 3 shows different dependency version (libssl 3.0.7) and different binary hash.
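The difference between builds 2 and 3 can also be extracted programmatically. The following sketch uses `difflib.SequenceMatcher` from the Python standard library on the normalized lines shown above; the variable names are illustrative.

```python
import difflib

# Normalized output of builds 2 and 3, copied from the listings above.
normalized_build2 = [
    "[build-started]",
    "[working-directory]",
    "[compiler:gcc,11.4.0]",
    "[dependency:libssl,3.0.2]",
    "[dependency:zlib,1.2.11]",
    "[compiled:src/main.c]",
    "[compiled:src/utils.c]",
    "[linked]",
    "[binary-hash:sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b]",
    "[build-completed]",
]
normalized_build3 = list(normalized_build2)
normalized_build3[3] = "[dependency:libssl,3.0.7]"
normalized_build3[8] = "[binary-hash:sha256:8b4f9e3d0c2e5f7a1b3c6d9e4f8a2b5c]"

# Collect every line that is not part of an unchanged ("equal") block.
matcher = difflib.SequenceMatcher(
    a=normalized_build2, b=normalized_build3, autojunk=False
)
removed, added = [], []
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        removed.extend(normalized_build2[i1:i2])
        added.extend(normalized_build3[j1:j2])

for line in removed:
    print(f"- {line}")
for line in added:
    print(f"+ {line}")
```

Only the libssl version and the binary hash show up, matching the two substantive changes described above.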
Practical Workflows
1. CI/CD Reproducibility Verification
Automatically verify builds are reproducible in your pipeline:
#!/bin/bash
# Build twice from the same commit. --no-cache keeps the second build
# from reusing cached layers, which would make the comparison trivial.
git checkout "$COMMIT_SHA"
docker build --no-cache -t app:build1 . 2>&1 | tee build1.log
docker build --no-cache -t app:build2 . 2>&1 | tee build2.log
# Normalize both builds
patterndb-yaml --rules build-rules.yaml build1.log --quiet > norm1.log
patterndb-yaml --rules build-rules.yaml build2.log --quiet > norm2.log
# Verify reproducibility
if ! diff -q norm1.log norm2.log; then
    echo "ERROR: Build is not reproducible"
    diff norm1.log norm2.log
    exit 1
fi
echo "✓ Build is reproducible"
2. Dependency Change Detection
Detect when dependencies change between builds:
# Normalize baseline build
patterndb-yaml --rules build-rules.yaml baseline-build.log \
    --quiet > baseline-norm.log
# Normalize current build
patterndb-yaml --rules build-rules.yaml current-build.log \
    --quiet > current-norm.log
# Extract and compare dependencies
grep '^\[dependency:' baseline-norm.log | sort > baseline-deps.txt
grep '^\[dependency:' current-norm.log | sort > current-deps.txt
if ! diff -q baseline-deps.txt current-deps.txt; then
    echo "Dependency changes detected:"
    diff baseline-deps.txt current-deps.txt
fi
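For a report that names each changed dependency rather than showing a raw diff, the normalized `[dependency:name,version]` lines can be parsed into dictionaries. This is a minimal standard-library sketch; `parse_dependencies` and `dependency_changes` are hypothetical helper names, and the sample lines come from builds 2 and 3 above.

```python
import re

# Matches normalized lines of the form [dependency:<name>,<version>].
DEPENDENCY = re.compile(r"^\[dependency:(?P<name>[^,\]]+),(?P<version>[^\]]+)\]$")


def parse_dependencies(lines):
    """Map dependency name to version for every dependency line."""
    deps = {}
    for line in lines:
        match = DEPENDENCY.match(line.strip())
        if match:
            deps[match["name"]] = match["version"]
    return deps


def dependency_changes(baseline, current):
    """Return (name, old_version, new_version) tuples; None marks added or removed."""
    changes = []
    for name in sorted(baseline.keys() | current.keys()):
        old, new = baseline.get(name), current.get(name)
        if old != new:
            changes.append((name, old, new))
    return changes


baseline = parse_dependencies([
    "[compiler:gcc,11.4.0]",
    "[dependency:libssl,3.0.2]",
    "[dependency:zlib,1.2.11]",
])
current = parse_dependencies([
    "[compiler:gcc,11.4.0]",
    "[dependency:libssl,3.0.7]",
    "[dependency:zlib,1.2.11]",
])

changes = dependency_changes(baseline, current)
for name, old, new in changes:
    print(f"{name}: {old} -> {new}")
```

Keying on the dependency name means a version bump is reported as one change instead of a removed line plus an added line.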
3. Build Artifact Verification
Verify build artifacts match across environments:
import sys
from patterndb_yaml import PatterndbYaml
from pathlib import Path
import re
# Redirect stdout to file for testing
_original_stdout = sys.stdout
output_file = open("output.txt", "w")
sys.stdout = output_file
processor = PatterndbYaml(rules_path=Path("build-rules.yaml"))
# Normalize builds from dev, staging, and prod environments
for env in ['dev', 'staging', 'prod']:
    with open(f"{env}-build.log") as f:
        with open(f"{env}-normalized.log", "w") as out:
            processor.process(f, out)
# Extract binary hashes
def get_binary_hash(normalized_log):
    """Return the hash from the first [binary-hash:...] line, or None."""
    with open(normalized_log) as f:
        for line in f:
            if match := re.match(r'\[binary-hash:(.*)\]', line):
                return match.group(1)
    return None
dev_hash = get_binary_hash("dev-normalized.log")
staging_hash = get_binary_hash("staging-normalized.log")
prod_hash = get_binary_hash("prod-normalized.log")
# A hash that is missing (None) in every log must not count as a match
if dev_hash is not None and dev_hash == staging_hash == prod_hash:
    print(f"✓ All environments produced identical binary: {dev_hash}")
else:
    print("✗ Binary hash mismatch across environments")
    print(f" Dev: {dev_hash}")
    print(f" Staging: {staging_hash}")
    print(f" Prod: {prod_hash}")
# Restore stdout and close output file
sys.stdout = _original_stdout
output_file.close()
4. Historical Build Comparison
Compare current builds against historical baselines:
# Archive normalized build as baseline
patterndb-yaml --rules build-rules.yaml build.log --quiet > baseline.log
git add baseline.log
git commit -m "Archive build baseline"
# Later, compare new builds against baseline
patterndb-yaml --rules build-rules.yaml new-build.log --quiet > new-norm.log
# Show what changed
echo "Changes since baseline:"
# IFS= and -r keep whitespace and backslashes in each diff line intact
diff baseline.log new-norm.log | grep '^[<>]' | while IFS= read -r line; do
    case "$line" in
        \<*) echo " Removed: ${line:2}" ;;
        \>*) echo " Added: ${line:2}" ;;
    esac
done
5. Supply Chain Verification
Verify builds in supply chain match expectations:
#!/bin/bash
# Normalize vendor-provided build log
patterndb-yaml --rules build-rules.yaml vendor-build.log \
    --quiet > vendor-norm.log
# Normalize your own rebuild
patterndb-yaml --rules build-rules.yaml local-build.log \
    --quiet > local-norm.log
# Extract and compare critical fields
for field in compiler dependency binary-hash; do
    echo "Comparing $field..."
    grep "^\[$field:" vendor-norm.log | sort > vendor-$field.txt
    grep "^\[$field:" local-norm.log | sort > local-$field.txt
    if ! diff -q vendor-$field.txt local-$field.txt; then
        echo "⚠ $field mismatch:"
        diff vendor-$field.txt local-$field.txt
    else
        echo "✓ $field matches"
    fi
done
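The same per-field comparison can be sketched in Python without temporary files. The helper names `critical_fields` and `mismatched_fields` are hypothetical, and the sample data reuses the normalized lines of builds 2 and 3 as stand-ins for a vendor log and a local rebuild.

```python
# Fields whose values must match between the vendor build and the local rebuild.
CRITICAL_FIELDS = ("compiler", "dependency", "binary-hash")


def critical_fields(lines):
    """Group the values of critical fields by field name, sorted for comparison."""
    fields = {name: [] for name in CRITICAL_FIELDS}
    for line in lines:
        line = line.strip()
        for name in CRITICAL_FIELDS:
            prefix = f"[{name}:"
            if line.startswith(prefix) and line.endswith("]"):
                # Keep only the value between the prefix and the closing bracket.
                fields[name].append(line[len(prefix):-1])
    return {name: sorted(values) for name, values in fields.items()}


def mismatched_fields(vendor_lines, local_lines):
    """Return the names of critical fields whose values differ."""
    vendor = critical_fields(vendor_lines)
    local = critical_fields(local_lines)
    return [name for name in CRITICAL_FIELDS if vendor[name] != local[name]]


vendor_log = [
    "[compiler:gcc,11.4.0]",
    "[dependency:libssl,3.0.2]",
    "[dependency:zlib,1.2.11]",
    "[binary-hash:sha256:7a3f8e2c9b1d4e6f0a2b5c8d3e7f1a4b]",
]
local_log = [
    "[compiler:gcc,11.4.0]",
    "[dependency:libssl,3.0.7]",
    "[dependency:zlib,1.2.11]",
    "[binary-hash:sha256:8b4f9e3d0c2e5f7a1b3c6d9e4f8a2b5c]",
]

mismatches = mismatched_fields(vendor_log, local_log)
for name in CRITICAL_FIELDS:
    status = "mismatch" if name in mismatches else "matches"
    print(f"{name}: {status}")
```

Sorting the values mirrors the `sort` step in the shell version, so a different ordering of dependency lines alone is not reported as a mismatch.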
Key Benefits
- Verify reproducibility: Confirm identical commits produce identical builds
- Detect real changes: Distinguish dependency updates from noise
- Supply chain security: Verify vendor builds match your rebuilds
- Debug build issues: Compare successful vs. failed builds meaningfully
- Environment parity: Ensure dev/staging/prod build consistently
Related Topics
- Rules - Pattern matching and normalization
- Statistics - Measure match coverage
- Explain Mode - Debug pattern matching