PEGASUS: evaluation-driven development for GenAI

 

Eric Ren,
Lead Data Scientist
Published on: 14 August 2025
6 min read

Success with Generative AI and Agentic AI hinges on three core areas: 

  • The right model for the job (planning, orchestration, reasoning)
  • Context engineering that feeds foundation models the right materials to reason with
  • Evaluation to understand how well a model performs on specific tasks

The bottleneck isn’t (just) the model. With increasingly powerful models from Google, Anthropic and OpenAI, the blocker to production is rarely model choice; it’s robust, reliable evaluation. We see promising results, slick demos and excited stakeholders, yet few Proofs of Concept (POCs) reach production, because evaluation is the make-or-break step for trust, compliance and business value.

Without it, engineers are flying blind (slow cycles, trial-and-error prompting), Model Risk and Audit can’t approve (delays), and business owners can’t quantify risk or benefits (stalled automation and value realisation). 

PEGASUS: our path to Evaluation-Driven Development (EDD)

PEGASUS (Performance Evaluation of Generative AI Systems and Utility Suite) is Lloyds Banking Group’s in-house evaluation package for standardising how we measure GenAI quality across use cases. It enables teams to design, automate and repeat evaluations as part of the development loop, shifting us from “demo-driven” to evaluation-driven development. 

Evaluation comes down to three key questions:

  • What to measure? Metric design
  • How to measure? Methodology (e.g., AI as Judge) with clear, Likert-style criteria
  • Where to measure? On representative datasets for scenarios like semantic search, Q&A and multi-turn dialogue

PEGASUS operationalises all three: it ships a portfolio of metrics built by Lloyds Banking Group, provides AI-as-Judge patterns with defined rating scales, and includes utilities for dataset preparation and adaptation so evaluations reflect real use-case contexts. 

What PEGASUS enables (in practice)

  • Context-aware evaluation. A new guidance parameter lets you inject business context into the evaluation prompt (see the sketch after this list). That context flows through 32 metrics built by Lloyds Banking Group – 10 Prompt, 5 RAG, 17 Summarisation (incl. 4 PDF-specific) – so scores reflect what our applications need to get right. 
  • Finer scoring granularity. Move beyond pass/fail or coarse buckets: PEGASUS supports a 1–10 scale with decimals, enabling nuanced comparisons across models, prompts and retrieval strategies. 
  • Generalised prompt metrics. Six task-agnostic metrics – Goal Alignment, Context Relevance, Information Adequacy, Instruction Clarity, Instruction Specificity and Overall Quality – create a shared language about prompt quality across teams. 
  • Built for Lloyds Banking Group’s stack. PEGASUS is a Python SDK, integrated into the GenAI Lab ecosystem and evolving towards seamless use via the CorteX API and modular installs, so teams can adopt just the bits they need. 
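
To make the guidance idea concrete, here is a minimal sketch of how a business-context string can be folded into an evaluator prompt before it is sent to a judge model. The template, function name and placeholder values are illustrative assumptions, not the PEGASUS SDK surface.

```python
# Illustrative only: shows the idea behind a 'guidance' parameter that injects
# business context into the evaluation prompt. Names are placeholders, not the
# actual PEGASUS API.
JUDGE_TEMPLATE = """You are evaluating a generated answer on a 1-10 scale (decimals allowed).

Business context to take into account:
{guidance}

Question: {question}
Answer to evaluate: {answer}

Return only the numeric score."""


def build_judge_prompt(question: str, answer: str, guidance: str = "") -> str:
    """Fold optional business context into the evaluator prompt."""
    context = guidance or "No additional business context provided."
    return JUDGE_TEMPLATE.format(guidance=context, question=question, answer=answer)


prompt = build_judge_prompt(
    question="<user question>",
    answer="<model answer>",
    guidance="Responses are read by customer-facing colleagues and must cite the source document.",
)
```

Passing richer guidance is what lets the same metric score the same output differently depending on the business context it is judged against.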

Who is PEGASUS for?

  • Data Scientists & AI Developers: for tuning, ablation and CI-style test gates
  • Model Validators: for rigorous, repeatable validation workflows
  • Business users: for actionable, decision-grade insights

By standardising metrics and methodology, PEGASUS helps technical teams compare models apples to apples, gives validators audit-ready artefacts, and equips product owners with numbers that matter to operations and risk. 

What to measure: metrics (exact vs. subjective)

Evaluation metrics fall into two families:

  • Exact (objective) metrics produce a judgment with minimal ambiguity, e.g., BLEU, ROUGE and METEOR (token-overlap based) or BERTScore and BLEURT (semantic similarity). These are not the emphasis of PEGASUS: we use them when appropriate, but they are already well defined and readily available in standard open-source NLP toolkits (see the short example after this list).
  • Subjective (judgment-based) metrics are where PEGASUS focuses, akin to essay grading. Scores can vary by rater, so clear rubrics and anchored scales are essential for fairness and repeatability. PEGASUS formalises this via explicit criteria, Likert-style anchors, optional business context (the guidance parameter), and support for chain-of-thought rationales that make the “why” behind a score auditable. 
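
As a concrete illustration of the first family, the snippet below computes ROUGE overlap with the open-source rouge-score package (pip install rouge-score). This is standard NLP tooling rather than PEGASUS functionality, and the example strings are made up.

```python
# Exact (objective) metric example using the open-source rouge-score package,
# not PEGASUS itself.
from rouge_score import rouge_scorer

reference = "The committee approved the new lending policy on Tuesday."
candidate = "On Tuesday the committee approved the new lending policy."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # signature is score(target, prediction)

for name, result in scores.items():
    # Each result carries precision, recall and F1 for the overlap measure.
    print(f"{name}: precision={result.precision:.2f} recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```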

Current metric families in PEGASUS (v1.4):

Below are the core metric families and representative metrics in the current release. 

  • Retrieval Augmented Generation: Answer Correctness, Answer Relevancy, Context Precision, Context Recall. These assess whether retrieved context is relevant and whether the answer is grounded in that context.
  • Summarisation (including PDF aware variants for multimodal sources): Faithfulness, Accuracy, Clarity, Coherence, Completeness, Conciseness. 
  • Prompt quality (task-agnostic): Goal Alignment, Context Relevance, Information Adequacy, Instruction Clarity, Instruction Specificity, Overall Quality. These provide a shared language for prompt quality across teams.
  • Ethics / Safety (where required): Bias, Toxicity, and Hallucination/Fact-risk signals, used in concert with our Responsible AI controls. 

Design intent: “What to measure” starts from common needs (RAG, summarisation, prompt quality, safety) and extends to use-case-specific demands via a bespoke namespace (e.g., NL2SQL code correctness or conversational smoothness for chatbots).
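
For example, a bespoke NL2SQL correctness check can be exact rather than judgment based: execute the generated and reference queries and compare result sets. The sketch below shows that pattern generically with SQLite; it is an illustration of the idea, not the PEGASUS implementation, and the table and queries are made up.

```python
# Generic execution-based NL2SQL correctness check (illustrative, not the PEGASUS metric):
# a generated query counts as correct if it returns the same rows as the reference query.
import sqlite3


def same_result_set(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    generated = set(conn.execute(generated_sql).fetchall())
    reference = set(conn.execute(reference_sql).fetchall())
    return generated == reference


# Tiny smoke test on an in-memory database with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 250.5)])
print(same_result_set(
    conn,
    "SELECT id FROM accounts WHERE balance > 150",
    "SELECT id FROM accounts WHERE balance > 200",
))  # True: both queries return only the row for account 2
```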

 


How to measure: methodology (AI as Judge, Likert rubrics)

AI as Judge / LLM as Judge (LaaJ)

Rather than relying solely on human panels (slow, costly, inconsistent and difficult to scale) or overlap metrics (often misaligned with human preferences), LaaJ uses an LLM to score outputs against explicit criteria using structured prompts. Two common modes are:

  • Pointwise (single-output) scoring: the judge assigns a score to one output given the input (and, optionally, a reference).
  • Pairwise (or listwise) comparison: the judge picks a “winner” among multiple outputs; robust for A/B or A/B/n tests.

Quality can be improved with chain-of-thought (CoT) judging, in-context exemplars, position swapping (to reduce ordering bias) and careful prompt engineering of the evaluator. It is also important to validate LaaJ results against human panels.
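
To make the pairwise mode and position swapping concrete, here is a minimal sketch: the judge is asked twice with the candidate order swapped, and a winner is declared only when both orderings agree. The call_llm callable stands in for whichever model client a team uses; the prompt wording and function names are assumptions for illustration, not part of PEGASUS.

```python
# Minimal pairwise LLM-as-Judge sketch with position swapping to reduce ordering bias.
# `call_llm` is a placeholder for any chat-completion client.
from typing import Callable

PAIRWISE_PROMPT = """You are comparing two answers to the same question.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Think step by step about relevance, correctness and clarity, then finish with
exactly one line: "Winner: A", "Winner: B" or "Winner: tie"."""


def pairwise_judge(call_llm: Callable[[str], str],
                   question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with swapped positions; only keep a verdict both orderings agree on."""

    def ask(a: str, b: str) -> str:
        reply = call_llm(PAIRWISE_PROMPT.format(question=question, answer_a=a, answer_b=b))
        verdict = reply.rsplit("Winner:", 1)[-1].strip().upper()
        if verdict.startswith("A"):
            return "A"
        if verdict.startswith("B"):
            return "B"
        return "TIE"

    first = ask(answer_1, answer_2)    # answer_1 shown in position A
    second = ask(answer_2, answer_1)   # positions swapped
    map_first = {"A": "answer_1", "B": "answer_2", "TIE": "tie"}
    map_second = {"A": "answer_2", "B": "answer_1", "TIE": "tie"}
    v1, v2 = map_first[first], map_second[second]
    return v1 if v1 == v2 else "tie"   # disagreement across orderings is treated as a tie
```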

Why Likert-style scales

Subjective judgments benefit from anchored rating scales (e.g., 5- or 7-point; we expose a 1–10 scale with decimals for sensitivity). Anchors provide consistent interpretation across raters (or judge prompts) and make scores analysable over time. PEGASUS implements anchored numeric scales with descriptors to limit rater drift and enable thresholding in a Continuous Integration feedback loop. 
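
Because the scores are numeric and anchored, they can gate a pipeline directly. A minimal sketch of such a Continuous Integration check, written as a pytest-style test; the scores and thresholds are placeholders, not PEGASUS defaults.

```python
# Illustrative CI gate over anchored 1-10 scores (e.g. run as part of a pytest suite).
from statistics import mean


def test_summarisation_coherence_gate():
    # Scores would come from an evaluation run; these values are placeholders.
    coherence_scores = [7.5, 8.0, 6.5, 9.0, 7.0]
    assert mean(coherence_scores) >= 7.0, "Mean coherence fell below the release threshold"
    assert min(coherence_scores) >= 5.0, "A summary scored below the 'Adequate' band"
```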

PEGASUS implements LaaJ-based scoring across the metric families described above, with evaluation rubrics encoded directly into the evaluator prompt. The framework integrates with evaluation approaches inspired by G-Eval, DeepEval and RAGAS-style checks, while remaining model-agnostic and compatible with both Vertex and open-source models via the platform. Below is a pointwise example: the “Summarisation Coherence” rubric, an illustrative judge prompt and a code demo.

Example: Summarisation Coherence (rubric + prompt + code)

Evaluation rubric (Likert style, illustrative)

  • 1–2 (Poor): Disjointed; sentences contradict or jump topics; transitions missing.
  • 3–4 (Weak): Occasional logical gaps; poor flow; weak topic continuity.
  • 5–6 (Adequate): Main ideas connected; some minor jumps; overall understandable.
  • 7–8 (Good): Clear logical progression; smooth transitions; minimal redundancy.
  • 9–10 (Excellent): Seamless, logically structured narrative; precise transitions; no contradictions.

This mirrors the intent of PEGASUS’s Summarisation Coherence metric while using a judge prompt with anchors to reduce ambiguity. 
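
The exact prompt and SDK call live inside the package; what follows is an illustrative pointwise judge prompt and scoring sketch in the same spirit, assuming a generic call_llm client. The prompt wording, function names and parsing logic are assumptions for illustration, not the PEGASUS implementation.

```python
# Illustrative pointwise LLM-as-Judge for Summarisation Coherence. The anchors mirror
# the rubric above; `call_llm` is a stand-in for any chat-completion client, and the
# prompt wording is not the exact PEGASUS prompt.
import re
from typing import Callable

COHERENCE_JUDGE_PROMPT = """You are an impartial evaluator. Rate the COHERENCE of the summary
below on a 1-10 scale (decimals allowed), using these anchors:
1-2 Poor: disjointed, contradictory, transitions missing.
3-4 Weak: occasional logical gaps, poor flow, weak topic continuity.
5-6 Adequate: main ideas connected, some minor jumps, overall understandable.
7-8 Good: clear logical progression, smooth transitions, minimal redundancy.
9-10 Excellent: seamless, logically structured, precise transitions, no contradictions.

Source document:
{source}

Summary to evaluate:
{summary}

Explain your reasoning step by step, then end with exactly one line:
Score: <number>"""


def score_coherence(call_llm: Callable[[str], str], source: str, summary: str) -> float:
    """Run the judge prompt and parse the final numeric score from the reply."""
    reply = call_llm(COHERENCE_JUDGE_PROMPT.format(source=source, summary=summary))
    matches = re.findall(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", reply)
    if not matches:
        raise ValueError(f"Judge reply contained no parsable score: {reply!r}")
    return float(matches[-1])  # take the last 'Score:' line in case the rationale mentions one
```

In practice, scores parsed this way are spot-checked against human ratings before being trusted in a gate, as noted above.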



Where to measure: datasets (real & synthetic)

To ensure evaluations reflect production reality, PEGASUS supports:

  • Expert-labelled, real datasets curated with business SMEs for priority use cases; and
  • Synthetic datasets to cover corner cases or scale breadth across patterns such as RAG, grounded Q&A and multi-turn conversational agents.

PEGASUS’s dataset utilities help teams curate representative datasets so we’re not optimising against toy prompts.
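
One common way to build the synthetic side is to have an LLM draft grounded question-answer pairs from real documents and keep only well-formed ones for later review. A minimal sketch of that idea follows; the prompt and call_llm client are placeholders rather than the actual PEGASUS dataset utilities.

```python
# Minimal synthetic Q&A generation sketch for RAG evaluation. The prompt and
# `call_llm` client are illustrative placeholders, not the PEGASUS utilities.
import json
from typing import Callable

QA_GEN_PROMPT = """From the passage below, write 3 question-answer pairs that can be answered
using ONLY the passage. Return a JSON list of objects with keys "question" and "answer".

Passage:
{passage}"""


def synthesise_qa(call_llm: Callable[[str], str], passage: str) -> list[dict]:
    """Draft grounded Q&A pairs from a source passage for later human or automated review."""
    reply = call_llm(QA_GEN_PROMPT.format(passage=passage))
    pairs = json.loads(reply)  # assumes the model returned valid JSON; otherwise this raises
    # Keep only well-formed pairs; anything else is routed back for review.
    return [p for p in pairs if isinstance(p, dict) and p.get("question") and p.get("answer")]
```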

Teams are already using PEGASUS to turn experimental wins into production-grade decisions. A recent example comes from Athena, a Lloyds Banking Group knowledge-management tool for customer-facing colleagues, where the team evaluated switching from one AI model (Claude) to another (Nova, hosted on AWS).

Instead of relying on gut feel, they used PEGASUS to score and compare model responses – ensuring the change was based on performance, not preference. PEGASUS also helps Athena maintain content quality during migrations from older systems like Fountain and SharePoint, and checks rewritten documents against the Athena style guide to catch issues like semantic drift. If content fails multiple checks, it’s flagged for human review. That’s evaluation driving engineering, not the other way round.

Why this matters for Lloyds Banking Group

  • Approvals at pace. Transparent, standardised evidence accelerates Model Risk and Audit decisions. 
  • Reliability under scrutiny. Quantified performance builds trust with stakeholders, unlocks automation, and supports safe scaling across the bank. 
  • Platform aligned. As our GenAI platform evolves (e.g. integration through CorteX), PEGASUS is becoming the common evaluation backbone across teams, patterns and agents. 

At Lloyds Banking Group we take Responsible AI seriously. By embedding Evaluation-Driven Development through PEGASUS, we’re setting the bar for safe, explainable and high-quality GenAI in financial services – helping the Group deliver trusted innovation at scale and, ultimately, helping Britain prosper. 

About the author

Eric Ren

Lead Data Scientist

Eric is a Lead Data & AI Scientist in Lloyds Banking Group's AI Centre of Excellence. As the Data Science Modelling Chapter Lead, he is responsible for developing and embedding best practice in modelling, elevating the team’s capabilities in data science and fostering innovation in machine learning. He leads AI development work and captures patterns and best practices to help other data science teams across the Group.

Prior to joining Lloyds, Eric was a research engineer at a national fusion research laboratory working on Tokamak plasma control. He has a PhD in Control and Systems Engineering from the University of Sheffield.
