Dialogue with data: translating natural language into queries using AI

Matthew Mason, Azahar Machwe & Ansel Liu
Published on: 08 October 2025
9 min read

At a glance

  • Text-to-SQL empowers non-technical users to query databases using natural language, unlocking faster, more intuitive access to insights.
  • Lloyds Banking Group’s ‘Dialogue with Data’ experiment used GenAI to test conversational analytics in a secure sandbox, validating workflows with synthetic HR data.
  • A bespoke evaluation pipeline combining automated metrics and human review revealed that schema context and domain knowledge are critical for accuracy.
  • Introducing a semantic layer and schema enrichment boosted exact match accuracy to 86.1%, highlighting the importance of context-aware design.
  • The project has evolved into Generative Business Intelligence (GenBI), with ambitions to offer conversational data access as a reusable, cross-domain service.

Text-to-SQL (Structured Query Language) is a technology that enables users to query relational databases using natural language instead of traditional SQL syntax. It leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to translate human-readable questions into structured SQL queries. 
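For example, a question such as "Which five employees have the most unused holiday days?" might be translated into a query like SELECT employee_name, days_remaining FROM holidays ORDER BY days_remaining DESC LIMIT 5 (the table and column names here are purely illustrative).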

This approach is particularly valuable in large organisations, where empowering non-technical users to access data directly can reduce bottlenecks, improve agility, and support data-driven decision making. 

Over the past year, Lloyds Banking Group has made significant strides towards this vision. This article highlights the key technical achievements and milestones shaping the future of data interaction at the Group. 

Why Text-to-SQL 

The concept isn’t new; systems like LUNAR and CHAT-80 explored natural language interfaces in the 70s and 80s, validating the need for intuitive data access. However, rule-based methods and compute constraints limited their scalability. Today, advances in LLMs, schema-aware encoding, and prompt engineering bring us closer than ever to accurately generating SQL across diverse domains. 

Almost two years ago, we had a vision of improving our engineering efficiency and speeding up delivery cycles. We also wanted to democratise access to data and insights through an intuitive tool that eliminates technical barriers and unlocks greater value from our data assets. 

 


The evolution of Dialogue with Data  

Dialogue with Data (DwD) is Lloyds Banking Group’s experiment to explore text-to-SQL using GenAI technology, introducing Agentic concepts. It was developed and tested in our “Innovation Sandbox”, a Google Cloud Platform (GCP) environment isolated from production. This provided safe conditions, access to the latest models, and scalable compute for rapid, iterative learning.  

Through our experimentation framework, we selected a low-risk yet high-value Business Intelligence (BI) use case, using a synthetically generated Human Resources (HR) dataset that mirrors holiday and training records. This enabled us to validate end-to-end Text-to-SQL workflows (governance, guardrails, and UX) without exposing sensitive data.  

The experiment was divided into three phases: 

  • Phase one – Framework exploration and zero-shot LLM benchmarking 
  • Phase two – Experimentation sprints: functional demo and SQL generation 
  • Phase three – Introducing a semantic layer into the target state experiment architecture  

Phase one – framework and zero-shot LLM benchmarks 

Phase one benchmarked LLMs on their ability to generate accurate SQL from natural language prompts. The experiment surfaced limitations in off-the-shelf evaluation tools like Spider and Bird, prompting the creation of a bespoke Group-specific automated evaluation pipeline. The resulting framework supports nuanced assessment, including human-in-the-loop review, and laid the foundations for future phases.

Phase one was labelled ‘zero shot’ (see Figure 1). We provided the full schema within the instruction prompt (see Figure 2) and iterated through a set of 44 questions, curated from seven business users and ranging in difficulty from easy to extra hard.   

Figure 2: Prompt template for generating SQL statements from a database schema and user question, with placeholders for the schema, table creation statements, and the user's input, plus instructions to ensure correct SQL syntax and formatting.
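The template itself isn't reproduced in this text, but a minimal sketch of such a zero-shot prompt, with hypothetical placeholder names rather than the experiment's exact wording, might look like the following:

```python
# A minimal sketch of a zero-shot text-to-SQL prompt template, reconstructed
# from the figure description above. The placeholder names ({schema},
# {question}) and wording are illustrative, not the template used in DwD.
PROMPT_TEMPLATE = """You are an expert SQL analyst.

Database schema (CREATE TABLE statements):
{schema}

User question:
{question}

Write a single syntactically correct SQL query that answers the question.
Return only the SQL, with no explanation or markdown formatting.
"""

def build_prompt(schema_ddl: str, question: str) -> str:
    """Fill the template with the full DDL schema and the user's question."""
    return PROMPT_TEMPLATE.format(schema=schema_ddl, question=question)
```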

To ensure deterministic behaviour and reproducibility across model evaluations in all phases, all sampling-related parameters, including temperature, top-p, and top-k, were fixed at zero or their lowest permissible values. This suppressed, as far as possible, the inherent stochasticity of LLMs, keeping output generation as consistent as possible across repeated runs and facilitating fair comparative analysis.  
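As an illustration, here is how those parameters could be pinned using the Vertex AI SDK; the model name, project details, and exact settings are our assumptions, not the experiment's recorded configuration:

```python
# Illustrative only: pinning sampling parameters for reproducible runs via
# the Vertex AI SDK. Project, region, and model name are hypothetical.
import vertexai
from vertexai.generative_models import GenerativeModel, GenerationConfig

vertexai.init(project="your-project", location="europe-west2")

deterministic_config = GenerationConfig(
    temperature=0.0,  # remove sampling randomness
    top_p=0.0,        # lowest permissible nucleus-sampling mass
    top_k=1,          # always take the single most likely token
)

model = GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    "Write a SQL query that counts rows in the holidays table.",
    generation_config=deterministic_config,
)
print(response.text)
```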

The automated evaluation pipeline included metrics such as execution accuracy, syntax and semantic correctness, and time-based scoring (R-VES). Additional metrics such as Soft-F1 were used to account for minor variations in output structure, further strengthening the robustness of the analysis. 
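As a sketch, the two simplest of these checks, exact match and execution accuracy, could be implemented as below against the SQLite databases used at this stage; Soft-F1 and R-VES are omitted, and the helper names are ours:

```python
# Minimal sketch of exact-match and execution-accuracy checks for evaluating
# a predicted SQL query against a gold query on a SQLite database.
import sqlite3

def normalise(sql: str) -> str:
    """Crude normalisation for exact-match comparison (whitespace and case)."""
    return " ".join(sql.lower().split()).rstrip(";")

def exact_match(predicted: str, gold: str) -> bool:
    return normalise(predicted) == normalise(gold)

def execution_accuracy(predicted: str, gold: str, db_path: str) -> bool:
    """True if both queries run and return the same rows, order-insensitive."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(predicted).fetchall()
        gold_rows = conn.execute(gold).fetchall()
    except sqlite3.Error:
        return False  # a syntax or semantic failure counts as incorrect
    finally:
        conn.close()
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
```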

Whilst the automated evaluation pipeline performed well, it became clear that human evaluators played a crucial role. Programmatic metrics such as Soft-F1, Exact Match (EM), and accuracy do not always fully capture correctness. In practice, evaluation often required resolving ambiguity in cases where multiple permutations of valid SQL statements could be considered correct. Human evaluators were also essential in judging partial correctness, ensuring that results aligned with the intended meaning of the input query. 

The evaluation revealed that while state-of-the-art (SoA) models such as Claude 3.5 Sonnet, GPT-4o, and Gemini demonstrated strong performance on objective queries, particularly in the "easy" and "medium" categories, they struggled with more subjective or domain-specific questions. This insight was reinforced through targeted human evaluation, which underscored the importance of schema context and the injection of domain knowledge.

Figure: Bar chart comparing overall and difficulty-level accuracy scores (%) for models including Claude 3.5 Sonnet, GPT-4o, Gemini, Defog.ai (llama3-based), and Pipable AI (pipablesql).

Notably, specialised text-to-SQL open-source models such as Defog.ai and PipableSQL exhibited significantly lower accuracy across all difficulty levels, highlighting a clear performance gap between SoA and open-source alternatives.  

Another divergence between experimental setups and real-world deployment is the impact of differing Data Definition Language (DDL) environments on performance metrics. Initially, we aligned with the open-source frameworks used by Spider and Bird, which at the time relied on SQLite. This consistency supported the development of our own automated evaluation pipeline. However, deploying SQLite in a production-grade enterprise environment proved more complex, prompting a transition to PostgreSQL as an interim step. 

Following the PostgreSQL migration, model performance degraded. We attribute this, in part, to dialect bias: SQLite’s simpler SQL and its prevalence in benchmark datasets nudge models towards SQLite syntax. To enable migration from the Sandbox to other environments and improve scalability, we decided to transition to Google Gemini paired with BigQuery. 
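To make the dialect gap concrete with a hypothetical example (the table and column names are invented), the same "holidays booked per month" intent looks different in SQLite and BigQuery SQL:

```python
# A hypothetical illustration of dialect bias: the same question rendered
# in SQLite vs BigQuery SQL. Table and column names are invented.
SQLITE_SQL = """
SELECT strftime('%Y-%m', holiday_start) AS month, COUNT(*) AS bookings
FROM holidays
GROUP BY month;
"""

BIGQUERY_SQL = """
SELECT FORMAT_DATE('%Y-%m', holiday_start) AS month, COUNT(*) AS bookings
FROM `hr_dataset.holidays`
GROUP BY month;
"""
```

A model whose training data is dominated by SQLite-flavoured benchmarks will tend to emit the first form even when the target engine expects the second.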

Phase two – experimentation sprints: functional demo and SQL generation

To bring the experiment to life and enable user experience testing, we developed a working MVP (Minimum Viable Product) that served as both a functional prototype and a demonstration tool. This allowed us not only to showcase the potential of conversational analytics in a live environment, but also to validate integration pathways, assess usability and gather early feedback to inform future iterations. 

We manage conversations using a simple memory buffer, as we currently have no prior user history. At this stage, we’ve adopted a straightforward approach, leveraging Gemini 1.5 Flash to handle user intent (see the conversation management figure; a minimal sketch follows the list below): 

  • New query: run the SQL generation process 
  • Follow-up: create an augmented question using the previous question from the memory buffer 
  • Follow-up, non-query: the user requests a change (such as different aliases) that does not require new SQL generation 
  • Non-database question: re-prompt the user to clarify their intent against the DwD rules 
  • Example: for “What is Joe Bloggs’ job title?”, the previous context is included in the augmented question (see example)
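A minimal sketch of this routing loop, using the intent labels from the list above; the classifier prompt, buffer size, and callable names are our assumptions, not the experiment's code:

```python
# Sketch of intent routing over a simple memory buffer. `classify` and
# `generate_sql` stand in for LLM calls (e.g. to Gemini 1.5 Flash).
from collections import deque

INTENT_PROMPT = """Classify the user's message as one of:
NEW_QUERY, FOLLOW_UP, FOLLOW_UP_NON_QUERY, NON_DATABASE.
Previous question: {previous}
Message: {message}
Answer with the label only."""

memory = deque(maxlen=5)  # simple buffer of recent questions

def handle(message: str, classify, generate_sql) -> str:
    previous = memory[-1] if memory else "(none)"
    intent = classify(INTENT_PROMPT.format(previous=previous, message=message))
    if intent == "NEW_QUERY":
        memory.append(message)
        return generate_sql(message)
    if intent == "FOLLOW_UP":
        augmented = f"{previous} Follow-up: {message}"  # fold in prior context
        memory.append(augmented)
        return generate_sql(augmented)
    if intent == "FOLLOW_UP_NON_QUERY":
        return "Re-presenting the previous answer without new SQL generation."
    return "I can only answer questions about the HR dataset. Please rephrase."
```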

While the conversation management handles structured exchanges effectively and is simple to implement, it doesn’t yet account for the probabilistic and context-sensitive nature of real human conversation. Future iterations will need to incorporate more sophisticated intent modelling and contextual memory to manage conversational permutations. 

 

SQL generation experiments  

As part of the early experiment, we looked at how to improve the SQL generation component in isolation. For this, we were still ingesting the schema within a system prompt, though in some scenarios it was disseminated across separate LLM calls. 

Phase three – introducing a semantic layer into the target state experiment architecture

Despite progress, accuracy consistently hit an 80% EM ceiling due to limitations in semantic understanding. To address the underlying challenge, we introduced two critical enablers: robust schema management and a semantic layer. This resulted in an EM score of 86.1% on a validation set collated from different business users.

What began as a traditional DDL schema with basic descriptions has evolved to include synonyms/acronyms and nominal distinct column values to enrich semantic understanding. 
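As a hypothetical illustration of what one enriched schema entry might carry (the field names and values are ours, not the Group's actual schema):

```python
# Hypothetical enriched schema entry: beyond name and type, it carries a
# description, synonyms/acronyms, and nominal distinct values.
column_entry = {
    "table": "training_records",
    "column": "course_status",
    "type": "STRING",
    "description": "Completion state of a mandatory training course.",
    "synonyms": ["training status", "course state", "CBT status"],
    "distinct_values": ["Not Started", "In Progress", "Completed", "Expired"],
}
```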

 

Figure: Database schema used for SQL generation, showing table structures and relationships.

We began by compiling the schema into CSV and splitting it across the stages ready for our Retrieval-Augmented Generation (RAG) workflows. For schema pruning, we deliberately eschewed RAG and used the LLM in parametric (closed-book) mode to condense field descriptions, remove redundancy, and prioritise high-utility columns. This semantics-aware pass was more effective than keyword-driven pruning. 
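A sketch of what this closed-book pruning pass could look like, with an assumed CSV row format and prompt wording, and the LLM abstracted as a callable:

```python
# Sketch of the parametric (closed-book) pruning pass: the LLM condenses
# descriptions and keeps only high-utility columns. No retrieval is involved.
# Prompt wording and the CSV row format are assumptions.
PRUNE_PROMPT = """Given this schema (one column per line, as
table,column,type,description):

{schema_csv}

and the user question:

{question}

Return only the rows needed to answer the question, with each description
condensed to at most ten words. Keep the same CSV format."""

def prune_schema(schema_csv: str, question: str, llm) -> str:
    """`llm` is any callable mapping a prompt string to a completion string."""
    return llm(PRUNE_PROMPT.format(schema_csv=schema_csv, question=question))
```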

For mapping domain language to the schema, we relied on traditional Named Entity Recognition (NER)/tagging rather than the LLM. A rules- and dictionary-based NER layer handles discrete nominal columns, synonyms, and acronyms, and resolves hierarchies that span multiple columns (e.g., Layer_1 … Layer_x). This made it straightforward to assert, for example, that “Department A” belongs to Layer_3 hierarchies, and to provide that context to the generation phase, coupled with the pruned schema, as a first pass. 
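A minimal sketch of such a dictionary-based tagger, with hypothetical entries mirroring the Layer_1 … Layer_x example above:

```python
# Sketch of the rules/dictionary NER layer: a lookup from business terms to
# the schema columns and values they resolve to. Entries are hypothetical.
DOMAIN_DICTIONARY = {
    "department a": {"column": "Layer_3", "value": "Department A"},
    "annual leave": {"column": "holiday_type", "value": "Annual Leave"},
    "cbt": {"column": "course_type", "value": "Computer Based Training"},
}

def tag_entities(question: str) -> list[dict]:
    """Return schema context for every dictionary term found in the question."""
    found = []
    lowered = question.lower()
    for term, mapping in DOMAIN_DICTIONARY.items():
        if term in lowered:
            found.append({"term": term, **mapping})
    return found
```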

We also incorporated Gold SQL into the RAG pipeline. Using a single-prompt strategy with structured Chain-of-Thought (CoT) and Gold SQL exemplars, we observed strong gains in both accuracy and latency. Additionally, when a user request is a 1:1 match to a previously validated query, we can bypass generation and return the stored, validated Gold SQL. 
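A sketch of that exact-match bypass, with a hypothetical gold store and the few-shot fallback left as a callable:

```python
# Sketch of the Gold SQL shortcut: on a 1:1 match with a previously validated
# question, return the stored SQL and skip generation; otherwise fall back to
# generation with gold exemplars. Store contents are hypothetical.
GOLD_STORE = {
    "how many employees completed training this year?":
        "SELECT COUNT(DISTINCT employee_id) FROM training_records "
        "WHERE status = 'Completed' AND EXTRACT(YEAR FROM completed_at) = "
        "EXTRACT(YEAR FROM CURRENT_DATE());",
}

def answer(question: str, generate_with_exemplars) -> str:
    key = " ".join(question.lower().split())
    if key in GOLD_STORE:  # exact match: bypass the LLM entirely
        return GOLD_STORE[key]
    return generate_with_exemplars(question, GOLD_STORE)  # few-shot fallback
```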

The way the schema and semantic layer are consumed is arguably the most critical factor in designing a robust text-to-SQL system. This becomes even more important when working with large schemas, though even with relatively small schemas we observed auto-regressive bias during rapid prototyping, particularly when the schema was embedded directly within the SQL generation prompt. 

 

Figure: Target state architecture.

In conclusion

The performance of text-to-SQL in a real-world setting is significantly different from that in existing benchmarks, especially when faced with nuanced, domain-specific context.  

Accuracy is often perceived as an absolute measure, yet in practice, it is far from straightforward. While creating a benchmark dataset provides a controlled environment for evaluation, real-world conditions introduce noise, ambiguity, and variability that benchmarks rarely capture. 

A key challenge lies in defining ‘ground truth’. Labels are typically treated as objective, but they are created by humans, who bring inherent subjectivity and bias. This raises an important question: can we ever claim 100% accuracy when the very notion of correctness depends on human interpretation? 

These factors create a persistent gap between model performance in benchmark settings and operational environments. Recognising this gap is critical for designing robust systems and setting realistic expectations for stakeholders.  

Data democratisation is still the right ambition, but it remains elusive for a system to truly handle ambiguous questions, which often come in the form of true natural language, e.g. “How well are we doing?” Such questions sit outside any descriptive dashboard today, and even a semantic layer cannot resolve them.   

We believe the focus should be on enabling a human-in-the-loop system that learns from user experience, with transparency for users at its core.

Next steps 

We developed DwD with a small team of highly skilled engineers. What began as a text-to-SQL experiment has evolved into a broader industry concept now recognised as Generative Business Intelligence (GenBI). Building on the internal success of DwD, we are now working closely with our strategic partners and wider teams to explore how these learnings can be implemented and deliver value across the bank.

Our ambition is to offer conversational data access as a reusable, cross-domain service, enabling more intuitive, efficient, and secure engagement with data across Lloyds Banking Group. 

 




About the author: Matthew Mason

Lead Data & AI Scientist

Matthew Mason is a Lead Data & AI Scientist in Lloyds Banking Group’s Chief Operating Office. He leads the development and deployment of innovative solutions that enhance operational efficiency across the organisation. 

Prior to his current role, Matthew held a range of positions in engineering, analytics, and data science within the Group. His career began in the branch network over 16 years ago, marking the start of a journey that has evolved into a deep focus on data and technology.

About the author: Dr Azahar Machwe

Senior Enterprise Architect

Azahar is a subject matter expert in AI and an Enterprise Architect with a focus on emerging trends in the field. Innovation and experimentation with emerging technology is a critical part of his life.

He has worked with startups and FTSE 100 companies, delivering solutions based on technologies such as Conversational AI, 5G, and Cloud. He holds a Ph.D. in Artificial Intelligence, focusing on generative design.

About the author: Ansel Liu

Innovation Experiment Lead 

Ansel Liu is an Emerging Technology and Innovation Manager in the Chief Technology Office (CTO) at Lloyds Banking Group. He leads initiatives to assess the impact of emerging technologies on colleagues, customers, and clients, and drives experimentation to unlock new customer segments and business models.

Over the past four years, Ansel has focused on developing technology-led propositions and running experiments in areas such as BNPL, Data Products, and GenAI, with an emphasis on business and commercial banking use cases.

Prior to Lloyds Banking Group, Ansel worked in tech strategy consulting, corporate venture building and co-founded NOMAD, a PropTech startup within the LabTech Group.
