Benchmarking RAGLib Techniques¶

RAGLib includes a comprehensive benchmarking system that allows you to evaluate and compare different RAG techniques on question-answering datasets. This guide shows how to use the benchmarking infrastructure to measure technique performance.

Overview¶

The benchmarking system provides:

BenchmarkHarness: Core evaluation engine that runs techniques on QA datasets
Dataset loaders: Support for JSONL and CSV formats
Metrics computation: Automatic calculation of exact match, F1, and overlap scores
Result output: JSON format for analysis and comparison
CLI tools: Ready-to-use scripts for running benchmarks

Quick Start¶

Running a Basic Benchmark¶

from raglib.benchmark import BenchmarkHarness
from raglib.registry import TechniqueRegistry
from raglib.schemas import Document

# Create your document corpus
corpus = [
    Document(id="1", text="Paris is the capital of France"),
    Document(id="2", text="To make tea, boil water and steep leaves"),
    Document(id="3", text="Photosynthesis converts sunlight to energy")
]

# Get a technique to test
technique = TechniqueRegistry.get("bm25")

# Run benchmark
harness = BenchmarkHarness(quick_mode=True, verbose=True)
results = harness.run_benchmark(
    technique=technique,
    dataset_path="tests/data/tiny_qa_dataset.jsonl",
    corpus_docs=corpus,
    output_path="results/benchmark_results.json"
)

print(f"F1 Score: {results['metrics']['f1']:.3f}")

Using the CLI Tool¶

# Run quick benchmark on default techniques
python examples/benchmark_run.py --quick

# Test specific techniques
python examples/benchmark_run.py --techniques bm25,dense_retriever

# Use custom dataset and output location
python examples/benchmark_run.py \
    --dataset my_qa_dataset.jsonl \
    --output results/my_benchmark

# List all available techniques
python examples/benchmark_run.py --list-techniques

Dataset Format¶

JSONL Format (Recommended)¶

Each line contains a JSON object with question-answer pairs:

{"question": "What is the capital of France?", "answer": "Paris", "context": "France is in Europe..."}
{"question": "How do you make tea?", "answer": "Boil water and steep leaves", "context": "Tea preparation..."}
{"question": "What is photosynthesis?", "answer": "Process converting sunlight to energy"}

Required fields: - question: The question to ask - answer: Expected answer for evaluation

Optional fields: - context: Additional context information - Any other fields will be stored in metadata

CSV Format¶

CSV files with headers for question, answer, and optional fields:

question,answer,context
"What is the capital of France?","Paris","France is in Europe..."
"How do you make tea?","Boil water and steep leaves","Tea preparation involves..."

Sparse Retrieval Benchmarking¶

RAGLib includes a comprehensive set of sparse retrieval techniques that can be benchmarked and compared. These techniques use lexical matching and term-based scoring rather than dense embeddings.

Available Sparse Techniques¶

Technique	Description	Use Case
BM25	Classic probabilistic ranking function	Baseline sparse retrieval, exact term matching
TF-IDF	Term frequency-inverse document frequency	Statistical term weighting, cosine similarity
Lexical Matcher	Configurable lexical matching modes	Flexible text matching (exact, substring, token overlap)
SPLADE	Sparse lexical and dense expansion	Hybrid approach with term expansion
Lexical Transformer	Transformer-aware lexical retrieval	Attention weighting and positional encoding

Running Sparse Retrieval Benchmark¶

Use the dedicated sparse retrieval benchmark script:

# Benchmark all sparse retrieval techniques
python examples/sparse_retrieval_benchmark.py

# Test specific technique
python examples/sparse_retrieval_benchmark.py --technique bm25

# Quick mode with fewer test queries  
python examples/sparse_retrieval_benchmark.py --quick

# Save results to file
python examples/sparse_retrieval_benchmark.py --output results/sparse_benchmark.json

Example Output:

Sparse Retrieval Comparison:
BM25:
  - Results: 5, Top score: 2.845
  - Best match: tech3
TF-IDF:
  - Results: 5, Top score: 0.712
  - Best match: tech3

Vector Retrieval Benchmarking¶

RAGLib includes advanced vector/dense retrieval techniques that use semantic embeddings for finding relevant information. These techniques provide more nuanced understanding of query-document relationships.

Available Vector Techniques¶

Technique	Description	Use Case
FAISS Retriever	High-performance vector search with FAISS	Large-scale similarity search, production systems
Dual Encoder	Asymmetric query/document encoding	Different encoders for queries vs documents
ColBERT Retriever	Token-level late interaction	Fine-grained matching, high precision
Multi-Query Retriever	Query expansion with result fusion	Improved recall through query diversity
Multi-Vector Retriever	Document segmentation representation	Long documents, granular matching
Dense Retriever	Basic embedding-based retrieval	Baseline vector retrieval

Running Vector Retrieval Benchmark¶

Use the dedicated vector retrieval benchmark script:

# Benchmark all vector retrieval techniques
python examples/vector_retrieval_benchmark.py

# Test specific technique
python examples/vector_retrieval_benchmark.py --technique faiss_retriever

# Quick mode with fewer test queries
python examples/vector_retrieval_benchmark.py --quick

# Save results to custom file
python examples/vector_retrieval_benchmark.py --output my_results.json

Example Output:

Performance Comparison:
Technique            Precision  Recall   F1       NDCG     Avg Time
----------------------------------------------------------------------
faiss_retriever      0.667      0.600    0.632    0.745    0.012s
dual_encoder         0.733      0.667    0.698    0.789    0.015s
colbert_retriever    0.800      0.700    0.747    0.823    0.028s
multi_vector_retriever 0.667    0.633    0.649    0.756    0.021s

Enhanced Benchmark Comparison¶

Compare all retrieval types (sparse and vector) side-by-side:

# Compare all retrieval techniques
python examples/enhanced_benchmark.py

# Quick comparison mode
python examples/enhanced_benchmark.py --quick

Technique Comparison Example¶

from raglib.registry import TechniqueRegistry
from raglib.schemas import Document

# Create test corpus
corpus = [
    Document(id="1", text="Machine learning algorithms learn from data"),
    Document(id="2", text="Natural language processing handles text"),
    Document(id="3", text="Information retrieval finds relevant documents")
]

# Test multiple sparse techniques
sparse_techniques = TechniqueRegistry.find_by_category('sparse_retrieval')
query = "machine learning"

for name, TechniqueClass in sparse_techniques.items():
    technique = TechniqueClass(docs=corpus)
    result = technique.apply(query=query, top_k=3)

    if result.success:
        hits = result.payload['hits']
        print(f"{name}: {len(hits)} results, top score: {hits[0].score:.3f}")

Metrics¶

The benchmark system computes three main metrics:

Exact Match (EM)¶

Binary score indicating whether the expected answer appears exactly in the retrieved text (case-insensitive).

# Example: Expected="Paris", Retrieved="Paris is the capital" → EM=1.0
# Example: Expected="Paris", Retrieved="London is the capital" → EM=0.0

F1 Score¶

Token-level F1 score measuring precision and recall of word overlap between expected and retrieved text.

# Example: Expected="machine learning", Retrieved="machine learning algorithms" 
# Precision=2/3, Recall=2/2 → F1=0.8

Overlap Score¶

Ratio of expected answer tokens found in retrieved text.

# Example: Expected="deep neural networks", Retrieved="neural networks are powerful"
# Overlap = 2/3 = 0.67 (found "neural networks" out of "deep neural networks")

BenchmarkHarness API¶

Initialization¶

harness = BenchmarkHarness(
    quick_mode=False,    # If True, limit to first 3 examples for fast testing
    verbose=True         # Print progress information
)

Running Benchmarks¶

results = harness.run_benchmark(
    technique=technique,           # RAGTechnique instance to evaluate
    dataset_path="data.jsonl",    # Path to QA dataset
    corpus_docs=documents,        # List of Document objects to search
    output_path="results.json",   # Optional: save results to file
    top_k=5                       # Number of documents to retrieve
)

Result Structure¶

The benchmark returns a comprehensive result dictionary:

{
    "timestamp": "2024-01-15T10:30:00",
    "technique": "BM25Simple", 
    "dataset_path": "/path/to/dataset.jsonl",
    "corpus_size": 100,
    "num_questions": 25,
    "quick_mode": false,
    "top_k": 5,
    "runtime_seconds": 2.34,
    "metrics": {
        "exact_match": 0.72,
        "f1": 0.85,
        "overlap": 0.78
    },
    "individual_results": [
        {
            "question": "What is the capital of France?",
            "expected_answer": "Paris", 
            "retrieved_count": 3,
            "metrics": {"exact_match": 1.0, "f1": 0.9, "overlap": 0.8},
            "metadata": {}
        }
        // ... more individual results
    ]
}

Advanced Usage¶

Custom Evaluation Metrics¶

You can extend the BenchmarkHarness to compute custom metrics:

class CustomBenchmarkHarness(BenchmarkHarness):
    def _compute_metrics(self, qa_item, result):
        # Get standard metrics
        metrics = super()._compute_metrics(qa_item, result)

        # Add custom metrics
        metrics["custom_score"] = self._compute_custom_score(qa_item, result)

        return metrics

    def _compute_custom_score(self, qa_item, result):
        # Your custom scoring logic
        return 0.85

Batch Evaluation¶

Compare multiple techniques systematically:

techniques_to_test = ["bm25", "dense_retriever", "hyde"]
results_list = []

for technique_name in techniques_to_test:
    technique = TechniqueRegistry.get(technique_name)
    result = harness.run_benchmark(
        technique=technique,
        dataset_path=dataset_path,
        corpus_docs=corpus
    )
    results_list.append(result)

# Find best performing technique
best_technique = max(results_list, key=lambda r: r['metrics']['f1'])
print(f"Best technique: {best_technique['technique']} (F1: {best_technique['metrics']['f1']:.3f})")

Error Handling¶

The benchmark system handles errors gracefully:

# Individual question errors are captured in results
result = harness.run_benchmark(technique, dataset_path, corpus)

for individual_result in result["individual_results"]:
    if "error" in individual_result:
        print(f"Error processing question: {individual_result['error']}")

Performance Considerations¶

Quick Mode¶

For rapid iteration and testing, use quick mode:

# Only processes first 3 examples
harness = BenchmarkHarness(quick_mode=True)

Large Datasets¶

For large datasets, consider:

Processing in batches
Using multiprocessing for technique comparisons
Sampling representative subsets
Caching intermediate results

Memory Management¶

Monitor memory usage when working with large corpora:

# Process documents in chunks for very large corpora
chunk_size = 1000
for i in range(0, len(large_corpus), chunk_size):
    chunk = large_corpus[i:i+chunk_size]
    # Process chunk...

Integration with Development Workflow¶

Automated Testing¶

Include benchmark runs in your CI/CD pipeline:

# .github/workflows/ci.yml
- name: Run benchmark smoke test
  run: python examples/benchmark_run.py --quick
  env:
    RUN_BENCH: "true"

Regression Detection¶

Track performance over time:

# Compare against baseline results
baseline_f1 = 0.75
current_f1 = results['metrics']['f1']

if current_f1 < baseline_f1 - 0.05:
    print("⚠️  Performance regression detected!")

A/B Testing New Techniques¶

# Compare new technique against current best
baseline_technique = TechniqueRegistry.get("current_best")
new_technique = TechniqueRegistry.get("experimental")

baseline_results = harness.run_benchmark(baseline_technique, dataset, corpus)
new_results = harness.run_benchmark(new_technique, dataset, corpus)

improvement = new_results['metrics']['f1'] - baseline_results['metrics']['f1']
print(f"F1 improvement: {improvement:+.3f}")

Best Practices¶

Use representative datasets: Ensure your QA dataset reflects real use cases
Multiple metrics: Don't rely on a single metric; consider all three
Cross-validation: Test on multiple datasets for robust evaluation
Statistical significance: Run multiple trials for statistical validity
Document assumptions: Note any preprocessing or filtering applied
Version control results: Track benchmark results over time
Reproducible runs: Set random seeds and document environment

Troubleshooting¶

Common Issues¶

Dataset loading errors:

# Check file format and encoding
python -c "from raglib.benchmark import load_qa_dataset; print(load_qa_dataset('data.jsonl')[:2])"

Memory errors with large corpora:

# Use generators or chunking for large datasets
def chunk_corpus(corpus, chunk_size=1000):
    for i in range(0, len(corpus), chunk_size):
        yield corpus[i:i+chunk_size]

Technique not found:

# List available techniques
print("Available:", TechniqueRegistry.list())

Low scores across all techniques: - Check dataset quality and answer format - Verify corpus contains relevant information - Review technique parameters and configuration

For more detailed examples, see examples/benchmark_run.py and the test files in tests/test_benchmark.py.