PEGASUS: evaluation-driven development for GenAI

 

Eric Ren,
Lead Data Scientist
Published on: 14 August 2025
6 min read

Success with Generative AI and Agentic AI hinges on three core areas: 

  • The right model for the job (planning, orchestration, reasoning)
  • Context engineering that feeds foundation models the right materials to reason with
  • Evaluation to understand how well a model performs on specific tasks

The bottleneck isn’t (just) the model. With increasingly powerful models from Google, Anthropic and OpenAI, the blocker to production is rarely model choice; it’s robust, reliable evaluation. We see promising results, slick demos and excited stakeholders, yet few Proofs of Concept (POCs) reach production, because evaluation is the make-or-break step for trust, compliance and business value.

Without it, engineers are flying blind (slow cycles, trial-and-error prompting), Model Risk and Audit can’t approve (delays), and business owners can’t quantify risk or benefits (stalled automation and value realisation). 

PEGASUS: our path to Evaluation-Driven Development (EDD)

PEGASUS (Performance Evaluation of Generative AI Systems and Utility Suite) is Lloyds Banking Group’s in-house evaluation package for standardising how we measure GenAI quality across use cases. It enables teams to design, automate and repeat evaluations as part of the development loop, shifting us from “demo-driven” to evaluation-driven development. 

Evaluation comes down to three key questions:

  • What to measure? Metric design
  • How to measure? Methodology (e.g., AI as Judge) with clear, Likert-style criteria
  • Where to measure? On representative datasets for scenarios like semantic search, Q&A and multi-turn dialogue

PEGASUS operationalises all three: it ships a portfolio of metrics built by Lloyds Banking Group, provides AI-as-Judge patterns with defined rating scales, and includes utilities for dataset preparation and adaptation so evaluations reflect real use-case contexts. 

What PEGASUS enables (in practice)

  • Context-aware evaluation. A new guidance parameter lets you inject business context into the evaluation prompt (see the sketch after this list). That context flows through 32 metrics built by Lloyds Banking Group – 10 Prompt, 5 RAG, 17 Summarisation (incl. 4 PDF-specific) – so scores reflect what our applications need to get right. 
  • Finer scoring granularity. Move beyond pass/fail or coarse buckets: PEGASUS supports a 1–10 scale with decimals, enabling nuanced comparisons across models, prompts and retrieval strategies. 
  • Generalised prompt metrics. Six task-agnostic metrics – Goal Alignment, Context Relevance, Information Adequacy, Instruction Clarity, Instruction Specificity and Overall Quality – create a shared language about prompt quality across teams. 
  • Built for Lloyds Banking Group’s stack. PEGASUS is a Python SDK, integrated into the GenAI Lab ecosystem and evolving towards seamless use via the CorteX API and modular installs, so teams can adopt just the bits they need. 
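
To make the guidance idea concrete, here is a minimal sketch of how a business-context string can be folded into an evaluator prompt before it is sent to a judge model. The template, function name and placeholder values are illustrative assumptions, not the PEGASUS SDK surface.

```python
# Illustrative only: shows the idea behind a 'guidance' parameter that injects
# business context into the evaluation prompt. Names are placeholders, not the
# actual PEGASUS API.
JUDGE_TEMPLATE = """You are evaluating a generated answer on a 1-10 scale (decimals allowed).

Business context to take into account:
{guidance}

Question: {question}
Answer to evaluate: {answer}

Return only the numeric score."""


def build_judge_prompt(question: str, answer: str, guidance: str = "") -> str:
    """Fold optional business context into the evaluator prompt."""
    context = guidance or "No additional business context provided."
    return JUDGE_TEMPLATE.format(guidance=context, question=question, answer=answer)


prompt = build_judge_prompt(
    question="<user question>",
    answer="<model answer>",
    guidance="Responses are read by customer-facing colleagues and must cite the source document.",
)
```

Passing richer guidance is what lets the same metric score the same output differently depending on the business context it is judged against.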

Who is PEGASUS for?

  • Data Scientists & AI Developers: for tuning, ablation and CI-style test gates
  • Model Validators: for rigorous, repeatable validation workflows
  • Business users: for actionable, decision-grade insights

By standardising metrics and methodology, PEGASUS helps technical teams compare models apples to apples, gives validators audit-ready artefacts, and equips product owners with numbers that matter to operations and risk. 

What to measure: metrics (exact vs. subjective)

Evaluation metrics fall into two families:

  • Exact (objective) metrics produce a judgment with minimal ambiguity, e.g., BLEU, ROUGE and METEOR (token-overlap based) or BERTScore and BLEURT (semantic similarity). These are not the emphasis of PEGASUS: we use them when appropriate, but they are already well defined and readily available in standard open-source NLP toolkits (see the short example after this list).
  • Subjective (judgment-based) metrics are where PEGASUS focuses, akin to essay grading. Scores can vary by rater, so clear rubrics and anchored scales are essential for fairness and repeatability. PEGASUS formalises this via explicit criteria, Likert-style anchors, optional business context (the guidance parameter), and support for chain-of-thought rationales that make the “why” behind a score auditable. 
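
As a concrete illustration of the first family, the snippet below computes ROUGE overlap with the open-source rouge-score package (pip install rouge-score). This is standard NLP tooling rather than PEGASUS functionality, and the example strings are made up.

```python
# Exact (objective) metric example using the open-source rouge-score package,
# not PEGASUS itself.
from rouge_score import rouge_scorer

reference = "The committee approved the new lending policy on Tuesday."
candidate = "On Tuesday the committee approved the new lending policy."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)  # signature is score(target, prediction)

for name, result in scores.items():
    # Each result carries precision, recall and F1 for the overlap measure.
    print(f"{name}: precision={result.precision:.2f} recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```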

Current metric families in PEGASUS (v1.4):

Below are the core metric families and representative metrics in the current release. 

  • Retrieval Augmented Generation: Answer Correctness, Answer Relevancy, Context Precision, Context Recall. These assess whether retrieved context is relevant and whether the answer is grounded in that context.
  • Summarisation (including PDF aware variants for multimodal sources): Faithfulness, Accuracy, Clarity, Coherence, Completeness, Conciseness. 
  • Prompt quality (task-agnostic): Goal Alignment, Context Relevance, Information Adequacy, Instruction Clarity, Instruction Specificity, Overall Quality. These provide a shared language for prompt quality across teams.
  • Ethics / Safety (where required): Bias, Toxicity, and Hallucination/Fact-risk signals, used in concert with our Responsible AI controls. 

Design intent: “What to measure” starts from common needs (RAG, summarisation, prompt quality, safety) and extends to use-case-specific demands via a bespoke namespace (e.g., NL2SQL code correctness or conversational smoothness for chatbots).
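
For example, a bespoke NL2SQL correctness check can be exact rather than judgment based: execute the generated and reference queries and compare result sets. The sketch below shows that pattern generically with SQLite; it is an illustration of the idea, not the PEGASUS implementation, and the table and queries are made up.

```python
# Generic execution-based NL2SQL correctness check (illustrative, not the PEGASUS metric):
# a generated query counts as correct if it returns the same rows as the reference query.
import sqlite3


def same_result_set(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    generated = set(conn.execute(generated_sql).fetchall())
    reference = set(conn.execute(reference_sql).fetchall())
    return generated == reference


# Tiny smoke test on an in-memory database with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 250.5)])
print(same_result_set(
    conn,
    "SELECT id FROM accounts WHERE balance > 150",
    "SELECT id FROM accounts WHERE balance > 200",
))  # True: both queries return only the row for account 2
```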

 


How to measure: methodology (AI as Judge, Likert rubrics)

AI as Judge / LLM as Judge (LaaJ)

Rather than relying solely on human panels (slow, costly, inconsistent and difficult to scale) or overlap metrics (often misaligned with human preferences), LaaJ uses an LLM to score outputs against explicit criteria using structured prompts. Two common modes are:

  • Pointwise (single-output) scoring: the judge assigns a score to one output given the input (and, optionally, a reference).
  • Pairwise (or listwise) comparison: the judge picks a “winner” among multiple outputs; robust for A/B or A/B/n tests.

Quality can be improved with chain-of-thought (CoT) judging, in-context exemplars, position swapping (to reduce ordering bias) and careful prompt engineering of the evaluator. It is also important to validate LaaJ results against human panels.
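
To make the pairwise mode and position swapping concrete, here is a minimal sketch: the judge is asked twice with the candidate order swapped, and a winner is declared only when both orderings agree. The call_llm callable stands in for whichever model client a team uses; the prompt wording and function names are assumptions for illustration, not part of PEGASUS.

```python
# Minimal pairwise LLM-as-Judge sketch with position swapping to reduce ordering bias.
# `call_llm` is a placeholder for any chat-completion client.
from typing import Callable

PAIRWISE_PROMPT = """You are comparing two answers to the same question.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Think step by step about relevance, correctness and clarity, then finish with
exactly one line: "Winner: A", "Winner: B" or "Winner: tie"."""


def pairwise_judge(call_llm: Callable[[str], str],
                   question: str, answer_1: str, answer_2: str) -> str:
    """Judge twice with swapped positions; only keep a verdict both orderings agree on."""

    def ask(a: str, b: str) -> str:
        reply = call_llm(PAIRWISE_PROMPT.format(question=question, answer_a=a, answer_b=b))
        verdict = reply.rsplit("Winner:", 1)[-1].strip().upper()
        if verdict.startswith("A"):
            return "A"
        if verdict.startswith("B"):
            return "B"
        return "TIE"

    first = ask(answer_1, answer_2)    # answer_1 shown in position A
    second = ask(answer_2, answer_1)   # positions swapped
    map_first = {"A": "answer_1", "B": "answer_2", "TIE": "tie"}
    map_second = {"A": "answer_2", "B": "answer_1", "TIE": "tie"}
    v1, v2 = map_first[first], map_second[second]
    return v1 if v1 == v2 else "tie"   # disagreement across orderings is treated as a tie
```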

Why Likert-style scales

Subjective judgments benefit from anchored rating scales (e.g., 5- or 7-point; we expose a 1–10 scale with decimals for sensitivity). Anchors provide consistent interpretation across raters (or judge prompts) and make scores analysable over time. PEGASUS implements anchored numeric scales with descriptors to limit rater drift and enable thresholding in a Continuous Integration feedback loop. 
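
Because the scores are numeric and anchored, they can gate a pipeline directly. A minimal sketch of such a Continuous Integration check, written as a pytest-style test; the scores and thresholds are placeholders, not PEGASUS defaults.

```python
# Illustrative CI gate over anchored 1-10 scores (e.g. run as part of a pytest suite).
from statistics import mean


def test_summarisation_coherence_gate():
    # Scores would come from an evaluation run; these values are placeholders.
    coherence_scores = [7.5, 8.0, 6.5, 9.0, 7.0]
    assert mean(coherence_scores) >= 7.0, "Mean coherence fell below the release threshold"
    assert min(coherence_scores) >= 5.0, "A summary scored below the 'Adequate' band"
```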

PEGASUS implements LaaJ-based scoring across the metric families described above, with evaluation rubrics encoded directly into the evaluator prompt. The framework integrates with evaluation approaches inspired by G-Eval, DeepEval and RAGAS-style checks, while remaining model-agnostic and compatible with both Vertex and open-source models via the platform. Below is a pointwise example: the “Summarisation Coherence” rubric, an illustrative judge prompt and a code demo.

Example: Summarisation Coherence (rubric + prompt + code)

Evaluation rubric (Likert style, illustrative)

  • 1–2 (Poor): Disjointed; sentences contradict or jump topics; transitions missing.
  • 3–4 (Weak): Occasional logical gaps; poor flow; weak topic continuity.
  • 5–6 (Adequate): Main ideas connected; some minor jumps; overall understandable.
  • 7–8 (Good): Clear logical progression; smooth transitions; minimal redundancy.
  • 9–10 (Excellent): Seamless, logically structured narrative; precise transitions; no contradictions.

This mirrors the intent of PEGASUS’s Summarisation Coherence metric while using a judge prompt with anchors to reduce ambiguity. 
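
The exact prompt and SDK call live inside the package; what follows is an illustrative pointwise judge prompt and scoring sketch in the same spirit, assuming a generic call_llm client. The prompt wording, function names and parsing logic are assumptions for illustration, not the PEGASUS implementation.

```python
# Illustrative pointwise LLM-as-Judge for Summarisation Coherence. The anchors mirror
# the rubric above; `call_llm` is a stand-in for any chat-completion client, and the
# prompt wording is not the exact PEGASUS prompt.
import re
from typing import Callable

COHERENCE_JUDGE_PROMPT = """You are an impartial evaluator. Rate the COHERENCE of the summary
below on a 1-10 scale (decimals allowed), using these anchors:
1-2 Poor: disjointed, contradictory, transitions missing.
3-4 Weak: occasional logical gaps, poor flow, weak topic continuity.
5-6 Adequate: main ideas connected, some minor jumps, overall understandable.
7-8 Good: clear logical progression, smooth transitions, minimal redundancy.
9-10 Excellent: seamless, logically structured, precise transitions, no contradictions.

Source document:
{source}

Summary to evaluate:
{summary}

Explain your reasoning step by step, then end with exactly one line:
Score: <number>"""


def score_coherence(call_llm: Callable[[str], str], source: str, summary: str) -> float:
    """Run the judge prompt and parse the final numeric score from the reply."""
    reply = call_llm(COHERENCE_JUDGE_PROMPT.format(source=source, summary=summary))
    matches = re.findall(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", reply)
    if not matches:
        raise ValueError(f"Judge reply contained no parsable score: {reply!r}")
    return float(matches[-1])  # take the last 'Score:' line in case the rationale mentions one
```

In practice, scores parsed this way are spot-checked against human ratings before being trusted in a gate, as noted above.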



Where to measure: datasets (real & synthetic)

To ensure evaluations reflect production reality, PEGASUS supports:

  • Expert-labelled, real datasets curated with business SMEs for priority use cases; and
  • Synthetic datasets to cover corner cases or scale breadth across patterns such as RAG, grounded Q&A and multi-turn conversational agents.

PEGASUS’s dataset utilities help teams curate representative datasets so we’re not optimising against toy prompts.
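
One common way to build the synthetic side is to have an LLM draft grounded question-answer pairs from real documents and keep only well-formed ones for later review. A minimal sketch of that idea follows; the prompt and call_llm client are placeholders rather than the actual PEGASUS dataset utilities.

```python
# Minimal synthetic Q&A generation sketch for RAG evaluation. The prompt and
# `call_llm` client are illustrative placeholders, not the PEGASUS utilities.
import json
from typing import Callable

QA_GEN_PROMPT = """From the passage below, write 3 question-answer pairs that can be answered
using ONLY the passage. Return a JSON list of objects with keys "question" and "answer".

Passage:
{passage}"""


def synthesise_qa(call_llm: Callable[[str], str], passage: str) -> list[dict]:
    """Draft grounded Q&A pairs from a source passage for later human or automated review."""
    reply = call_llm(QA_GEN_PROMPT.format(passage=passage))
    pairs = json.loads(reply)  # assumes the model returned valid JSON; otherwise this raises
    # Keep only well-formed pairs; anything else is routed back for review.
    return [p for p in pairs if isinstance(p, dict) and p.get("question") and p.get("answer")]
```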

Teams are already using PEGASUS to turn experimental wins into production-grade decisions. A recent example comes from Athena, a Lloyds Banking Group knowledge-management tool for customer-facing colleagues, where the team evaluated switching from one AI model (Claude) to another (Nova, hosted on AWS).

Instead of relying on gut feel, they used PEGASUS to score and compare model responses – ensuring the change was based on performance, not preference. PEGASUS also helps Athena maintain content quality during migrations from older systems like Fountain and SharePoint, and checks rewritten documents against the Athena style guide to catch issues like semantic drift. If content fails multiple checks, it’s flagged for human review. That’s evaluation driving engineering, not the other way round.

Why this matters for Lloyds Banking Group

  • Approvals at pace. Transparent, standardised evidence accelerates Model Risk and Audit decisions. 
  • Reliability under scrutiny. Quantified performance builds trust with stakeholders, unlocks automation, and supports safe scaling across the bank. 
  • Platform aligned. As our GenAI platform evolves (e.g. integration through CorteX), PEGASUS is becoming the common evaluation backbone across teams, patterns and agents. 

At Lloyds Banking Group we take Responsible AI seriously. By embedding Evaluation-Driven Development through PEGASUS, we’re setting the bar for safe, explainable and high-quality GenAI in financial services – helping the Group deliver trusted innovation at scale and, ultimately, helping Britain prosper. 

About the author

Eric Ren

Lead Data Scientist

Eric is a Lead Data & AI Scientist in Lloyds Banking Group's AI Centre of Excellence. As the Data Science Modelling Chapter Lead, he is responsible for developing and embedding best practice in modelling, elevating the team’s capabilities in data science and fostering innovation in machine learning. He leads AI development work and captures patterns and best practices to help other data science teams across the Group.

Prior to joining Lloyds, Eric was a research engineer at a national fusion research laboratory working on Tokamak plasma control. He has a PhD in Control and Systems Engineering from the University of Sheffield.
