Generative AI and agentic AI success hinges on three core areas.
The bottleneck isn’t (just) the model; it’s performance evaluation. With increasingly powerful models from Google, Anthropic and OpenAI, the blocker to production is rarely model choice. It’s robust, reliable evaluation. We see promising results, slick demos and excited stakeholders, yet few Proofs of Concept (POCs) reach production because evaluation is the make or break step for trust, compliance and business value.
Without it, engineers are flying blind (slow cycles, trial and error prompting), model risk and audit can’t approve (delays), and business owners can’t quantify risk or benefits (stalled automation and value realisation).
PEGASUS (Performance Evaluation of Generative AI Systems and Utility Suite) is Lloyds Banking Group’s in-house evaluation package that standardises how we measure GenAI quality across use cases. It enables teams to design, automate and repeat evaluations as part of the development loop, shifting us from “demo driven” to evaluation driven development.
Evaluation has three key questions: what to measure, how to judge and score it, and what data to evaluate against. PEGASUS operationalises all three: it ships a portfolio of Lloyds Banking Group built metrics, provides AI as Judge patterns with defined rating scales, and includes utilities for dataset preparation and adaptation so evaluations reflect real use case contexts.
By standardising metrics and methodology, PEGASUS helps technical teams compare models apples to apples, gives validators audit ready artefacts, and equips product owners with numbers that matter to operations and risk.
Below are the core metric families and representative metrics in the current release.
Design intent: “What to measure” starts from common needs (RAG, summarisation, prompt quality, safety) and extends to use‑case‑specific demands via a bespoke namespace (e.g., NL2SQL code correctness or conversational smoothness for chatbots).
Rather than relying solely on human panels (slow, costly, inconsistent and difficult to scale) or overlap metrics (often misaligned with human preferences), LLM as a Judge (LaaJ) uses an LLM with structured prompts to score outputs against explicit criteria. Two common modes are pointwise scoring, where the judge rates a single output against a rubric, and pairwise comparison, where the judge picks the better of two candidate outputs.
Judging quality can be improved with chain of thought (CoT) judging, in-context exemplars, position swapping (to reduce ordering bias) and careful prompt engineering of the evaluator. It is also important to validate LaaJ results against human panels.
Subjective judgments benefit from anchored rating scales (e.g., 5- or 7-point; we expose 1–10 with decimals for sensitivity). Anchors provide consistent interpretation across raters (or judge prompts) and make scores analysable over time. PEGASUS implements anchored numeric scales with descriptors to limit rater drift and enable thresholding in Continuous Integration (CI) feedback loops.
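A minimal sketch of thresholding anchored scores in a CI gate, assuming illustrative metric names and thresholds (not PEGASUS’s actual configuration):

```python
# Illustrative CI gate: fail the pipeline when the mean anchored 1-10 score
# for a metric drops below its threshold. Names and thresholds are made up.
from statistics import mean

THRESHOLDS = {"summarisation_coherence": 7.0, "faithfulness": 8.5}


def ci_gate(scores: dict[str, list[float]]) -> list[str]:
    """Return a failure message for each metric whose mean score falls
    below its threshold; an empty list means the gate passes."""
    failures = []
    for metric, threshold in THRESHOLDS.items():
        observed = mean(scores.get(metric, [0.0]))
        if observed < threshold:
            failures.append(
                f"{metric}: mean {observed:.2f} < threshold {threshold}")
    return failures
```

Because the scale is anchored, the same threshold keeps its meaning across releases, which is what makes this kind of automated gate trustworthy.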
PEGASUS implements LaaJ-based scoring across the metric families described above, with evaluation rubrics encoded directly into the evaluator prompt. The framework integrates with evaluation approaches inspired by G-Eval, DeepEval and RAGAS-style checks, while remaining model-agnostic and compatible with both Vertex and open-source models via the platform. Below is a pointwise example of a “Summarisation Coherence” rubric, judge prompt and code demo:
Evaluation rubric (Likert style, illustrative)
This mirrors the intent of PEGASUS’s Summarisation Coherence metric while using a judge prompt with anchors to reduce ambiguity.
Take a look at the illustrative judge prompt (used by the evaluator under the hood).
To ensure evaluations reflect production reality, PEGASUS supports utilities for dataset preparation and adaptation. These help teams curate representative datasets so we’re not optimising against toy prompts.
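A minimal sketch of the kind of curation step such utilities enable, assuming production-style records with a prompt and a reference answer; the field names and filter rules are invented for illustration:

```python
# Illustrative dataset curation: filter production-style records into an
# evaluation set. Field names and thresholds are invented for illustration.
def curate(records: list[dict]) -> list[dict]:
    """Keep unique, non-trivial prompts that carry a reference answer."""
    seen: set[str] = set()
    dataset = []
    for rec in records:
        prompt = rec.get("prompt", "").strip()
        if len(prompt) < 20:           # drop toy prompts
            continue
        if not rec.get("reference"):   # need ground truth to score against
            continue
        if prompt in seen:             # drop verbatim duplicates
            continue
        seen.add(prompt)
        dataset.append({"prompt": prompt, "reference": rec["reference"]})
    return dataset
```

Even simple rules like these shift the evaluation set towards the distribution a model will actually face in production.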
Teams are already using PEGASUS to turn experimental wins into production-grade decisions. A recent example comes from Athena, a Lloyds Banking Group knowledge management tool for customer facing colleagues, where the team evaluated switching from one AI model (Claude) to another (Nova, hosted on AWS).
Instead of relying on gut feel, they used PEGASUS to score and compare model responses – ensuring the change was based on performance, not preference. PEGASUS also helps Athena maintain content quality during migrations from older systems like Fountain and SharePoint, and checks rewritten documents against the Athena style guide to catch issues like semantic drift. If content fails multiple checks, it’s flagged for human review. That’s evaluation driving engineering, not the other way round.
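The flag-for-review rule described above can be sketched as follows; the check names and the one-failure tolerance are illustrative, not Athena’s actual configuration:

```python
# Illustrative review-flagging rule: route a rewritten document to a human
# when it fails more than one automated check. Names/threshold are made up.
def needs_human_review(check_results: dict[str, bool],
                       max_failures: int = 1) -> bool:
    """check_results maps check name -> passed?; flag the document when
    the number of failed checks exceeds the tolerance."""
    failures = [name for name, passed in check_results.items() if not passed]
    return len(failures) > max_failures
```

Keeping the rule this simple makes the human-review trigger easy to audit: any flagged document can be traced back to the exact checks it failed.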
At Lloyds Banking Group we take Responsible AI seriously. By embedding Evaluation Driven Development through PEGASUS, we’re setting the bar for safe, explainable and high quality GenAI in financial services – helping the Group deliver trusted innovation at scale and, ultimately, help Britain prosper.
Lead Data Scientist
Eric is a Lead Data & AI Scientist in Lloyds Banking Group's AI Centre of Excellence. As the Data Science Modelling Chapter Lead, he is responsible for the development and implementation of best practice in modelling, elevating the team’s capabilities in data science and fostering innovation in machine learning. He is leading the AI development work and capturing patterns and best practices to help other data science teams across the group.
Prior to joining Lloyds, Eric was a research engineer at a national fusion research laboratory working on Tokamak plasma control. He has a PhD in Control and Systems Engineering from the University of Sheffield.