Evaluating LLMs on Custom Benchmarks: Lessons from a Hybrid Evaluator
Large language models are impressive, but measuring their real capabilities requires more than simple string matching. Recently, we built a custom evaluation framework for a diverse “easy” benchmark covering pattern recognition, commonsense reasoning, logic, math, language, and knowledge. Here’s what we learned, and why per-category hybrid scoring matters.
The Benchmark
The dataset contains hundreds of straightforward questions across 15+ categories, grouped into six families:
- Pattern-matching: Analogies, odd-one-out, sequence completion.
- Commonsense: Simulation, causality, everyday reasoning.
- Logic: Consistency, deduction, formal patterns.
- Math: Arithmetic, reasoning, number patterns.
- Language: Structure, transformation, comprehension.
- Knowledge: Definitions and basic facts.
Each entry follows a simple JSON structure with category, difficulty, question, and answer. Many answers are semantic, making naive exact-match evaluation misleading.
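As a rough illustration of that structure, here is a minimal sketch of loading entries into a typed record. The four field names come from the description above; the class name `BenchmarkEntry`, the helper `load_entries`, and the sample values (including the `"pattern-matching"` label spelling) are assumptions made for the example, not the actual dataset.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkEntry:
    """One benchmark item; field names follow the four keys described above."""
    category: str
    difficulty: str
    question: str
    answer: str


def load_entries(raw_json: str) -> list[BenchmarkEntry]:
    """Parse a JSON array of entries.

    Unpacking each object into the dataclass raises TypeError when a key
    is missing or unexpected, so malformed entries fail loudly.
    """
    return [BenchmarkEntry(**item) for item in json.loads(raw_json)]


# Hypothetical sample entry, invented for illustration only.
sample = """[
  {"category": "pattern-matching", "difficulty": "easy",
   "question": "cat is to kitten as dog is to ?", "answer": "puppy"}
]"""

entries = load_entries(sample)
print(entries[0].category, "->", entries[0].answer)
```

Validating the shape at load time keeps schema problems out of the scoring stage, where they would otherwise show up as unexplained zeros.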
Why Standard Tools Fall Short
Exact string matching severely underestimates performance on reasoning tasks. A model might correctly grasp “cat is to kitten as dog is to ?” but respond with an explanation or slight phrasing variation.
Key challenges:
- Semantic equivalence vs. literal text
- Verbose model outputs that bury the actual answer
- Category-specific correctness criteria
- Dataset duplicates (especially in logic)
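The duplicate problem in particular is easy to check before any model is run. The sketch below is one possible approach, not the framework's actual code: it groups entries by category and a normalized form of the question. The helper names `question_key` and `find_duplicates` are hypothetical, and the normalization (lowercase, drop punctuation, collapse whitespace) is an assumption that only catches near-verbatim repeats, not paraphrases.

```python
import re
from collections import defaultdict


def question_key(question: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", question.lower())
    return " ".join(text.split())


def find_duplicates(entries: list[dict]) -> dict[tuple[str, str], list[int]]:
    """Map (category, normalized question) to the indices that share it.

    Only groups with more than one index are returned.
    """
    groups: dict[tuple[str, str], list[int]] = defaultdict(list)
    for idx, entry in enumerate(entries):
        key = (entry["category"], question_key(entry["question"]))
        groups[key].append(idx)
    return {key: idxs for key, idxs in groups.items() if len(idxs) > 1}


# Hypothetical entries, invented for illustration only.
demo = [
    {"category": "logic", "question": "If A then B. A. Therefore?"},
    {"category": "logic", "question": "if a then b; a; therefore"},
    {"category": "math", "question": "2 + 2 = ?"},
]
for key, idxs in find_duplicates(demo).items():
    print(key, idxs)
```

Left in place, duplicates inflate or deflate a category average depending on whether the model happens to get the repeated item right.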
The Hybrid Evaluation System
We implemented a category-aware evaluator router with mixed 0–1 scoring:
| Category Type | Scoring Method | Examples |
|---|---|---|
| Strict Exact | Normalized string equality | Math-arithmetic, Logic, Knowledge-basic |
| Flexible Exact | Lowercase + punctuation removal | Pattern-matching, Language-structure |
| Semantic | Embedding cosine similarity (threshold ~0.88) | Commonsense, Language-comprehension |
| Hybrid | Combination of above | Language-transformation |
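
The routing in the table could be sketched roughly as follows. This is an illustration under stated assumptions rather than the framework's actual code: the category label spellings, the exact normalization rules, and the choice to treat "hybrid" as "flexible match or semantic match" are all guesses. The embedding function is passed in as a parameter, and `toy_embed` is only a letter-count stand-in so the example runs without a model.

```python
import math
import string
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]
SEMANTIC_THRESHOLD = 0.88  # approximate value quoted in the table
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strict_norm(text: str) -> str:
    """Assumed 'normalized' form: trim and collapse whitespace only."""
    return " ".join(text.split())


def flexible_norm(text: str) -> str:
    """Lowercase and remove ASCII punctuation, then collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_strict(pred: str, gold: str, embed: Embedder) -> float:
    return float(strict_norm(pred) == strict_norm(gold))


def score_flexible(pred: str, gold: str, embed: Embedder) -> float:
    return float(flexible_norm(pred) == flexible_norm(gold))


def score_semantic(pred: str, gold: str, embed: Embedder) -> float:
    return float(cosine(embed(pred), embed(gold)) >= SEMANTIC_THRESHOLD)


def score_hybrid(pred: str, gold: str, embed: Embedder) -> float:
    # Assumption: accept either a flexible exact match or a semantic match.
    return max(score_flexible(pred, gold, embed), score_semantic(pred, gold, embed))


SCORERS = {"strict": score_strict, "flexible": score_flexible,
           "semantic": score_semantic, "hybrid": score_hybrid}

# Hypothetical label spellings, based on the examples column of the table.
CATEGORY_TYPE = {
    "math-arithmetic": "strict", "logic": "strict", "knowledge-basic": "strict",
    "pattern-matching": "flexible", "language-structure": "flexible",
    "commonsense": "semantic", "language-comprehension": "semantic",
    "language-transformation": "hybrid",
}


def score(category: str, pred: str, gold: str, embed: Embedder) -> float:
    """Route to the scorer for this category; unknown categories use strict."""
    return SCORERS[CATEGORY_TYPE.get(category, "strict")](pred, gold, embed)


def toy_embed(text: str) -> list[float]:
    """Letter-count vector. A stand-in only; a real run would use a sentence-embedding model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


print(score("pattern-matching", "Puppy.", "puppy", toy_embed))
print(score("math-arithmetic", "Puppy.", "puppy", toy_embed))
```

The scorers here return 0 or 1. Returning the raw similarity from the semantic scorer instead would be one way to get graded scores, at the cost of making category averages harder to compare.
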
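Scores are averaged within each category, then reported as two overall percentages: micro (the mean over all samples, so larger categories weigh more) and macro (the unweighted mean of the category averages, so every category counts equally).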
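To make the difference between the two overall numbers concrete, here is a small sketch of the aggregation. The function name `aggregate` and the input shape (a list of `(category, score)` pairs) are assumptions for the example, and the scores are invented.

```python
from collections import defaultdict


def aggregate(results: list[tuple[str, float]]) -> tuple[dict[str, float], float, float]:
    """Return (per-category averages, micro average, macro average).

    micro: mean over all samples.
    macro: mean of the per-category averages.
    """
    if not results:
        raise ValueError("no results to aggregate")
    by_category: dict[str, list[float]] = defaultdict(list)
    for category, value in results:
        by_category[category].append(value)
    per_category = {c: sum(v) / len(v) for c, v in by_category.items()}
    micro = sum(value for _, value in results) / len(results)
    macro = sum(per_category.values()) / len(per_category)
    return per_category, micro, macro


# Invented scores: three correct math items and one failed logic item.
demo_results = [("math", 1.0), ("math", 1.0), ("math", 1.0), ("logic", 0.0)]
per_category, micro, macro = aggregate(demo_results)
print(per_category, micro, macro)
```

With these invented scores the micro average is 0.75 and the macro average is 0.5: the single failed logic item costs far more under macro averaging because logic is a small category.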
Results on Qwen2.5-1.5B-Instruct
On this compact instruction-tuned model we observed clear patterns:
Strengths (75–85%)
- Commonsense simulation, causality, and reasoning
- Language transformation & comprehension
- Knowledge definitions
Weaknesses
- Pattern-matching (raw score low due to explanatory outputs)
- Symbolic logic (0–33%)
- Math pattern continuation
The biggest insight: many “failures” were format issues rather than reasoning errors. The model often understood the pattern but wrapped the answer in an explanation that exact matching rejected.
Key Takeaways
- Answer extraction is critical — strip explanations, use last sentence/word heuristics, or category-specific parsers.
- One-size-fits-all metrics hide truth. Per-category scoring reveals real strengths and weaknesses.
- Embeddings provide a cheap semantic check without relying on another LLM judge, though the similarity threshold (~0.88 here) should be tuned against hand-checked examples.
- Partial credit systems improve signal — reward conceptual correctness even when formatting is imperfect.
- Small models handle everyday reasoning surprisingly well but struggle with strict symbolic manipulation.
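
The answer-extraction takeaway can be sketched as a small heuristic. This is one possible version, not the parser used in the framework: it looks for an explicit "answer is" or "answer:" marker first and otherwise falls back to the last sentence of the last non-empty line. The function name `extract_answer`, the marker pattern, and the set of characters stripped are all assumptions.

```python
import re

# Matches "answer is", "answer:", "final answer is", and so on.
_MARKER = re.compile(r"\b(?:final answer|answer)\s*(?:is|:)\s*", re.IGNORECASE)
# Sentence boundary: punctuation followed by whitespace, so "3.14" stays intact.
_SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
_STRIP_CHARS = " .!?\"'*`"


def extract_answer(output: str) -> str:
    """Pull a short answer out of a verbose model response."""
    text = output.strip()
    if not text:
        return ""
    markers = list(_MARKER.finditer(text))
    if markers:
        # Take the first line, then the first sentence, after the last marker.
        tail = text[markers[-1].end():].strip()
        first_line = tail.splitlines()[0] if tail else ""
        candidate = _SENTENCE_SPLIT.split(first_line)[0]
    else:
        # No marker: fall back to the last sentence of the last non-empty line.
        last_line = [ln for ln in text.splitlines() if ln.strip()][-1]
        candidate = _SENTENCE_SPLIT.split(last_line.strip())[-1]
    return candidate.strip(_STRIP_CHARS)


print(extract_answer("A kitten is a young cat, so the answer is puppy."))
print(extract_answer("Let me think.\n\n**Answer:** 12"))
```

Running extraction before the category scorer means strict and flexible matching compare a short candidate against the gold answer instead of a whole paragraph. A heuristic like this will still miss some formats, so spot-checking its output per category is worthwhile.
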
Better Benchmarks Ahead
Robust LLM evaluation is as much an engineering challenge as a modeling one. Start with a clean dataset, invest early in a flexible evaluator, and iterate based on real runs.
Future improvements could include adversarial paraphrasing, multi-answer gold sets, confidence-weighted scoring, and interactive dashboards.
Custom benchmarks like this one give far more actionable insights than generic leaderboards. What evaluation tricks have you discovered in your own work?
Built with a hybrid Python framework combining direct inference and category-specific scoring logic.