Performance Guide
Optimize patterndb-yaml for your use case by understanding its architecture and syslog-ng's pattern matching engine.
Architecture
patterndb-yaml delegates pattern matching to syslog-ng's patterndb engine.
Pattern Matching Algorithm
syslog-ng uses a radix tree (longest prefix match) for pattern matching:
- Character-by-character matching from the beginning of messages
- Tree structure groups patterns by shared prefix
- Lookup cost depends mainly on message length, not on the total number of patterns
- Efficient subset evaluation - only relevant branches checked at each position
See "How pattern matching works" in the syslog-ng documentation for details on the radix tree approach.
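The longest-prefix lookup described above can be sketched with a plain character trie. This is an illustration only, not syslog-ng's implementation: a real radix tree also merges single-child chains and supports parsers, and the `PrefixTree` class and rule names here are hypothetical.

```python
class PrefixTree:
    """Character trie; a radix tree additionally merges single-child chains."""

    def __init__(self):
        self.children = {}
        self.rule = None  # rule name if a literal prefix ends at this node

    def insert(self, prefix, rule):
        node = self
        for ch in prefix:
            node = node.children.setdefault(ch, PrefixTree())
        node.rule = rule

    def longest_match(self, message):
        """Walk the message from its first character, keeping the deepest rule seen."""
        node, best = self, None
        for ch in message:
            node = node.children.get(ch)
            if node is None:
                break  # no stored prefix continues with this character
            if node.rule is not None:
                best = node.rule
        return best


tree = PrefixTree()
tree.insert("ERROR", "error_log")
tree.insert("ERROR: disk", "disk_error")
tree.insert("INFO", "info_log")

print(tree.longest_match("ERROR: disk full on /dev/sda1"))  # disk_error
print(tree.longest_match("ERROR: timeout"))                 # error_log
print(tree.longest_match("DEBUG: noise"))                   # None
```

The walk visits at most one node per message character, which is why the lookup cost follows message length rather than the number of stored prefixes.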
Memory Architecture
patterndb-yaml processes logs line by line without loading entire files.
Memory components:
- Python runtime + syslog-ng pattern matcher (base overhead)
- Rule definitions (scales with number/complexity of patterns)
- LRU cache (65,536 entry limit for repeated lines)
- Sequence buffer (holds multi-line sequences in progress)
Because input is streamed, memory use stays bounded by these components rather than growing with file size.
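As a rough illustration of streaming combined with a bounded cache, the sketch below uses Python's `functools.lru_cache` with the same 65,536 entry limit. The `normalize` function is a hypothetical stand-in for the real matcher, and passing unmatched lines through unchanged is an assumption made for the example.

```python
import io
from functools import lru_cache


@lru_cache(maxsize=65_536)  # same entry limit as the cache described above
def normalize(line):
    """Hypothetical stand-in for the real pattern match on one line."""
    for tag in ("[INFO]", "[ERROR]"):
        if f" {tag} " in line:
            return tag
    return line  # unmatched lines pass through unchanged (an assumption)


def stream(lines):
    """Yield normalized lines one at a time; the input is never held in full."""
    for line in lines:
        yield normalize(line.rstrip("\n"))


source = io.StringIO(
    "10:00 [INFO] started\n"
    "10:00 [INFO] started\n"  # repeated line: served from the cache
    "10:01 [ERROR] failed\n"
)
for out in stream(source):
    print(out)

print(normalize.cache_info())
```

Only the current line and the cache entries stay in memory, so the same loop works for a file object of any size.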
Optimization Strategies
1. Order Rules by Frequency
Put frequently-matched patterns first in your rules file.
Why: patterndb-yaml tries rules sequentially until a match is found (first match wins).
Example:
rules:
  # Most common pattern first
  - name: info_log
    pattern:
      - field: timestamp
      - text: " [INFO] "
      - field: message
    output: "[INFO]"
  # Less common patterns after
  - name: error_log
    pattern:
      - field: timestamp
      - text: " [ERROR] "
      - field: message
    output: "[ERROR]"
Measure impact:
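A toy model of first-match-wins evaluation shows why ordering matters. It counts rule attempts rather than wall-clock time, and it reduces each rule to a literal marker, which is a simplification of the real patterns. The `checks_needed` helper and the sample data are hypothetical.

```python
def checks_needed(lines, ordered_markers):
    """Count rule attempts when rules are tried in order until one matches."""
    total = 0
    for line in lines:
        for marker in ordered_markers:
            total += 1
            if marker in line:
                break  # first match wins
    return total


# Nine INFO lines for every ERROR line
sample = ["10:00 [INFO] ok"] * 9 + ["10:01 [ERROR] failed"]

common_first = checks_needed(sample, [" [INFO] ", " [ERROR] "])
rare_first = checks_needed(sample, [" [ERROR] ", " [INFO] "])
print(common_first, rare_first)  # 11 19
```

With the common rule first, nine of ten lines match on the first attempt. Run the same comparison on a sample of your own logs before reordering a large rules file.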
2. Use Specific Patterns
Avoid overly-general patterns that match everything:
Too general:
Better:
Real-World Scenarios
Large File Processing
Approach:
# Quiet mode for batch processing
patterndb-yaml --rules rules.yaml --quiet large.log > normalized.log
# Show progress for very large files
patterndb-yaml --rules rules.yaml --progress huge.log > normalized.log
Why:
- --quiet: Eliminates statistics display overhead
- --progress: Provides feedback without significant slowdown
- Streaming handles files of any size
Real-Time Stream Processing
Approach:
# Stream processing
tail -f /var/log/app.log | patterndb-yaml --rules rules.yaml --quiet
# With filtering
tail -f /var/log/app.log | \
patterndb-yaml --rules rules.yaml --quiet | grep '\[ERROR\]'
Why:
- Line-by-line processing minimizes latency
- No buffering delays
- Cache benefits repeated patterns
Batch Processing Multiple Files
Approach:
# Serial processing
for log in logs/*.log; do
    patterndb-yaml --rules rules.yaml --quiet "$log" > \
        "normalized_$(basename "$log")"
done
# Parallel processing (4 at a time)
printf '%s\0' logs/*.log | xargs -0 -P 4 -I {} sh -c \
    'patterndb-yaml --rules rules.yaml --quiet "$1" > \
        "normalized_$(basename "$1")"' _ {}
Why:
- Each process is independent (no shared state)
- Parallel processing utilizes multiple cores
- Linear memory scaling with number of processes
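The same fan-out can be driven from Python. The sketch below assumes only the `--rules` and `--quiet` flags shown above. The helper names and the output directory are hypothetical. Threads are sufficient here because the work happens in the child processes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Command prefix taken from the shell examples above; the log path is appended.
DEFAULT_CMD = ("patterndb-yaml", "--rules", "rules.yaml", "--quiet")


def normalize_file(log_path, out_dir, command=DEFAULT_CMD):
    """Run one independent process and write its stdout to normalized_<name>."""
    out_path = Path(out_dir) / f"normalized_{Path(log_path).name}"
    with open(out_path, "w") as out:
        subprocess.run([*command, str(log_path)], stdout=out, check=True)
    return out_path


def normalize_all(log_paths, out_dir, workers=4, command=DEFAULT_CMD):
    """Process files concurrently, at most `workers` child processes at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: normalize_file(p, out_dir, command), log_paths))
```

A call such as `normalize_all(sorted(Path("logs").glob("*.log")), "out")` mirrors the `xargs -P 4` example. Lowering `workers` reduces peak memory, since each child process carries its own rules and cache.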
Performance Monitoring
Track Statistics
Output:
Key metrics:
- lines_processed: Total throughput
- match_rate: Pattern coverage (low values indicate missing patterns)
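As a minimal sketch of how a match rate relates to the line count, assuming it is the fraction of processed lines that matched some pattern: the `lines_matched` input and the 90% threshold below are illustrative assumptions, not values reported or recommended by the tool.

```python
def match_rate(lines_processed, lines_matched):
    """Fraction of processed lines that matched a pattern (0.0 for empty input)."""
    if lines_processed == 0:
        return 0.0
    return lines_matched / lines_processed


rate = match_rate(lines_processed=2000, lines_matched=1500)
print(f"match rate: {rate:.1%}")  # match rate: 75.0%

# Illustrative threshold only; pick one that suits your own logs.
if rate < 0.9:
    print("low coverage: look for log formats without a pattern")
```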
Benchmark Your Data
#!/bin/bash
# Arguments: LOG_FILE [RULES_FILE]
LOG_FILE="${1:?usage: $0 LOG_FILE [RULES_FILE]}"
RULES_FILE="${2:-rules.yaml}"
echo "Benchmarking: $LOG_FILE with $RULES_FILE"
# Count lines (LINE_COUNT avoids clashing with the shell's LINES variable)
LINE_COUNT=$(wc -l < "$LOG_FILE")
echo "Input lines: $LINE_COUNT"
# Measure time (%N needs GNU date; BSD/macOS date does not support it)
START=$(date +%s.%N)
patterndb-yaml --rules "$RULES_FILE" --quiet "$LOG_FILE" > /dev/null
END=$(date +%s.%N)
# Calculate throughput
ELAPSED=$(echo "$END - $START" | bc)
THROUGHPUT=$(echo "scale=0; $LINE_COUNT / $ELAPSED" | bc)
echo "Elapsed: ${ELAPSED}s"
echo "Throughput: ${THROUGHPUT} lines/sec"
Troubleshooting
Low Match Rate
Diagnosis:
# Find unmatched lines
patterndb-yaml --rules rules.yaml --explain test.log 2>&1 | \
grep "No pattern matched"
Solutions:
- Add missing patterns for unmatched log formats
- Check whitespace (patterns must match exactly)
- Verify pattern order (specific before general)
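Whitespace mismatches are easy to miss by eye. The helper below is a hypothetical diagnostic, not part of patterndb-yaml: it flags a line where a literal from a rule fails to match exactly but would match if runs of whitespace were collapsed.

```python
def whitespace_mismatch(literal, line):
    """True if `literal` is absent from `line` only because of whitespace differences."""
    if literal in line:
        return False  # exact match, nothing to report

    def collapse(text):
        # Replace every run of spaces/tabs with a single space and trim the ends
        return " ".join(text.split())

    return collapse(literal) in collapse(line)


print(whitespace_mismatch(" [INFO] ", "10:00\t[INFO]  started"))  # True
print(whitespace_mismatch(" [INFO] ", "10:00 [INFO] started"))    # False
print(whitespace_mismatch(" [INFO] ", "10:00 [WARN] started"))    # False
```

Printing a suspect line with `repr()` then shows the exact characters, such as `\t`, that the pattern needs to account for.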
High Memory Usage
Diagnosis:
- Check for long multi-line sequences (consume buffer memory)
- Monitor with:
ps aux | grep patterndb-yaml
- Check if running multiple instances
Solutions:
- Keep sequences short where possible
- Reduce parallel process count
- Restart between large batches to clear cache
Best Practices
- Measure first - Benchmark with real data before optimizing
- Order by frequency - Common patterns first
- Simplify alternatives - Use field extraction when possible
- Profile with real data - Test with production logs
- Use quiet mode - --quiet for batch processing
- Monitor match rate - Low rates indicate missing patterns
See Also
- syslog-ng Pattern Database Documentation - Details on the radix tree algorithm
- Algorithm Details - How patterndb-yaml works internally
- Troubleshooting - Solving performance problems
- Common Patterns - Efficient pattern examples