June 2026 · AI Evaluation

Evaluating LLMs on Custom Benchmarks: Lessons from a Hybrid Evaluator

Large language models are impressive, but measuring their real capabilities takes more than simple string matching. We recently built a custom evaluation framework for a diverse “easy” benchmark covering pattern recognition, commonsense reasoning, logic, math, language, and knowledge. Here’s what we learned, and why per-category hybrid scoring matters.

The Benchmark

The dataset contains hundreds of straightforward questions across 15+ categories, grouped into six families:

Pattern-matching: Analogies, odd-one-out, sequence completion.
Commonsense: Simulation, causality, everyday reasoning.
Logic: Consistency, deduction, formal patterns.
Math: Arithmetic, reasoning, number patterns.
Language: Structure, transformation, comprehension.
Knowledge: Definitions and basic facts.

Each entry follows a simple JSON structure with category, difficulty, question, and answer. Many answers are semantic, making naive exact-match evaluation misleading.
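
To make the entry format concrete, here is a minimal sketch of loading and validating one record. The four field names come from the description above; the class name, the helper name, and the sample values (including the category label) are hypothetical and only for illustration.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("category", "difficulty", "question", "answer")


@dataclass(frozen=True)
class BenchmarkEntry:
    category: str
    difficulty: str
    question: str
    answer: str


def parse_entry(raw: str) -> BenchmarkEntry:
    """Parse one JSON record and fail loudly if a field is missing."""
    record = json.loads(raw)
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"entry is missing fields: {missing}")
    # Coerce to str so numeric gold answers compare cleanly later on.
    return BenchmarkEntry(**{field: str(record[field]) for field in REQUIRED_FIELDS})


# Hypothetical record: the category label and difficulty value are made up.
example = (
    '{"category": "pattern-analogy", "difficulty": "easy", '
    '"question": "cat is to kitten as dog is to ?", "answer": "puppy"}'
)
entry = parse_entry(example)
```

Validating at load time keeps malformed records from silently turning into zero scores later in the pipeline.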

Why Standard Tools Fall Short

Exact string matching severely underestimates performance on reasoning tasks. A model might correctly grasp “cat is to kitten as dog is to ?” but respond with an explanation or slight phrasing variation.

Key challenges:

Semantic equivalence vs. literal text
Verbose model outputs that bury the actual answer
Category-specific correctness criteria
Dataset duplicates (especially in logic)
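
The duplicate problem in the last item is the easiest of the four to catch mechanically. As a rough sketch (not necessarily how our framework does it), questions can be grouped by a normalized form so that copies differing only in case, spacing, or punctuation end up together. The function names and sample questions are hypothetical.

```python
import re
from collections import defaultdict


def normalize_question(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = text.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())


def find_duplicates(questions):
    """Return lists of indices whose questions normalize to the same text."""
    groups = defaultdict(list)
    for index, question in enumerate(questions):
        groups[normalize_question(question)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Made-up questions for illustration.
questions = [
    "If A implies B and A is true, is B true?",
    "if A implies B, and A is true is B true",
    "What is 2 + 2?",
]
duplicate_groups = find_duplicates(questions)  # [[0, 1]]
```

This only catches near-verbatim copies; paraphrased duplicates would need a similarity-based pass.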

The Hybrid Evaluation System

We implemented a category-aware router that sends each answer to one of four scorers, each returning a score in the 0–1 range:

Category Type	Scoring Method	Examples
Strict Exact	Normalized string equality	Math-arithmetic, Logic, Knowledge-basic
Flexible Exact	Lowercase + punctuation removal	Pattern-matching, Language-structure
Semantic	Embedding cosine similarity (threshold ~0.88)	Commonsense, Language-comprehension
Hybrid	Combination of above	Language-transformation
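
The routing in the table can be sketched roughly as follows. Several parts are assumptions rather than our exact implementation: "normalized" for strict matching is taken to mean whitespace trimming only, the hybrid scorer is assumed to take the better of the flexible and semantic scores, and the similarity function is a bag-of-words stand-in so the example runs without a model. A real run would plug an embedding-based cosine into the `similarity` parameter, and the 0.88 threshold is only meaningful for that kind of similarity.

```python
import math
import re
from collections import Counter


def normalize_strict(text):
    # Assumption: strict normalization only trims and collapses whitespace.
    return " ".join(text.split())


def normalize_flexible(text):
    # Lowercase and drop punctuation, as in the "Flexible Exact" row.
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def strict_exact(pred, gold):
    return 1.0 if normalize_strict(pred) == normalize_strict(gold) else 0.0


def flexible_exact(pred, gold):
    return 1.0 if normalize_flexible(pred) == normalize_flexible(gold) else 0.0


def bag_of_words_cosine(a, b):
    # Stand-in for embedding cosine similarity; NOT a real embedding model.
    va = Counter(normalize_flexible(a).split())
    vb = Counter(normalize_flexible(b).split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def semantic(pred, gold, similarity=bag_of_words_cosine, threshold=0.88):
    return 1.0 if similarity(pred, gold) >= threshold else 0.0


def hybrid(pred, gold):
    # Assumption: "combination" means the better of the two lenient scorers.
    return max(flexible_exact(pred, gold), semantic(pred, gold))


# Category keys follow the examples in the table, lowercased.
ROUTES = {
    "math-arithmetic": strict_exact,
    "logic": strict_exact,
    "knowledge-basic": strict_exact,
    "pattern-matching": flexible_exact,
    "language-structure": flexible_exact,
    "commonsense": semantic,
    "language-comprehension": semantic,
    "language-transformation": hybrid,
}


def score(category, pred, gold):
    """Route to the category's scorer; unknown categories fall back to strict."""
    scorer = ROUTES.get(category.lower(), strict_exact)
    return scorer(pred, gold)
```

Falling back to strict matching for unknown categories is a deliberately conservative choice: a mislabeled category then under-scores rather than over-scores.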

Scores are averaged per category, then reported overall as a micro average (every sample counts equally) and a macro average (every category counts equally), both as percentages.
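
The difference between the two overall numbers is easiest to see in code. This is a small sketch of the aggregation, with a hypothetical function name and made-up scores rather than our actual results.

```python
from collections import defaultdict


def aggregate(results):
    """results: iterable of (category, score) pairs with scores in [0, 1].

    Returns (per-category averages, micro percentage, macro percentage).
    """
    by_category = defaultdict(list)
    for category, score in results:
        by_category[category].append(score)
    if not by_category:
        raise ValueError("no results to aggregate")

    category_avg = {c: sum(s) / len(s) for c, s in by_category.items()}
    total_samples = sum(len(s) for s in by_category.values())
    # Micro: pool every sample, so large categories dominate.
    micro = sum(sum(s) for s in by_category.values()) / total_samples
    # Macro: average the category averages, so each category counts once.
    macro = sum(category_avg.values()) / len(category_avg)
    return category_avg, 100 * micro, 100 * macro


# Made-up scores: three logic samples, one commonsense sample.
sample_results = [
    ("logic", 0.0),
    ("logic", 0.0),
    ("logic", 1.0),
    ("commonsense", 1.0),
]
category_avg, micro_pct, macro_pct = aggregate(sample_results)
```

With these made-up scores the micro percentage is 50.0 (two of four samples correct), while the macro percentage is about 66.7, because the single commonsense sample carries as much weight as all three logic samples.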

Results on Qwen2.5-1.5B-Instruct

On this compact instruction-tuned model we observed clear patterns:

Strengths (75–85%)

Commonsense simulation, causality, and reasoning
Language transformation & comprehension
Knowledge definitions

Weaknesses

Pattern-matching (raw score low due to explanatory outputs)
Symbolic logic (0–33%)
Math pattern continuation

The biggest insight: many “failures” were actually format issues rather than reasoning errors. The model often understood the pattern but wrapped the answer in an explanation, so a literal comparison against the gold answer scored it as wrong.

Key Takeaways

Answer extraction is critical — strip explanations, use last sentence/word heuristics, or category-specific parsers.
One-size-fits-all metrics hide truth. Per-category scoring reveals real strengths and weaknesses.
Embedding similarity gives a cheap, reasonably reliable semantic judgment without relying on a second LLM as a judge.
Partial credit systems improve signal — reward conceptual correctness even when formatting is imperfect.
In our runs, a small model handled everyday reasoning surprisingly well but struggled with strict symbolic manipulation.
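
For the first takeaway, here is one way the extraction heuristics could look. It is a sketch under assumptions, not our exact parser: it first looks for an explicit "answer is" or "answer:" marker, and otherwise falls back to the last sentence of the last non-empty line. The function name and the marker pattern are hypothetical.

```python
import re

# Matches "answer is X", "final answer: X", and so on, up to the end of the line.
ANSWER_MARKER = re.compile(r"answer\s*(?:is\b|:)\s*(.+)", re.IGNORECASE)
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def extract_answer(output: str) -> str:
    """Pull a short answer out of a verbose model response."""
    text = output.strip()
    if not text:
        return ""

    matches = ANSWER_MARKER.findall(text)
    if matches:
        # Use the last marker and keep only the first sentence after it.
        candidate = SENTENCE_BREAK.split(matches[-1])[0]
    else:
        # Fallback heuristic: last sentence of the last non-empty line.
        last_line = [line for line in text.splitlines() if line.strip()][-1]
        candidate = SENTENCE_BREAK.split(last_line.strip())[-1]

    # Drop surrounding quotes and trailing sentence punctuation.
    return candidate.strip().lstrip("\"'").rstrip("\"'.!? ")


verbose = "A kitten is a young cat, so the answer is puppy."
short_answer = extract_answer(verbose)  # "puppy"
```

The extracted string can then be passed to whichever scorer the category calls for. Heuristics like these will still misfire on some outputs, so it is worth spot-checking extractions by hand on each new model.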

Better Benchmarks Ahead

Robust LLM evaluation is as much an engineering challenge as a modeling one. Start with a clean dataset, invest early in a flexible evaluator, and iterate based on real runs.

Future improvements could include adversarial paraphrasing, multi-answer gold sets, confidence-weighted scoring, and interactive dashboards.

Custom benchmarks like this one give far more actionable insights than generic leaderboards. What evaluation tricks have you discovered in your own work?

Built with a hybrid Python framework combining direct inference and category-specific scoring logic.