Text-to-SQL (Structured Query Language) is a technology that enables users to query relational databases using natural language instead of traditional SQL syntax. It leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to translate human-readable questions into structured SQL queries.
This approach is particularly valuable in large organisations, where empowering non-technical users to access data directly can reduce bottlenecks, improve agility, and support data-driven decision making.
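To make the idea concrete, here is a minimal, hypothetical illustration of the translation step: a natural-language question and the kind of SQL a text-to-SQL system might produce. The table and column names are invented for the example.

```python
# Hypothetical example of the text-to-SQL translation step.
# The table and columns below are invented purely for illustration.
natural_language_question = "How many days of holiday did each department book in 2024?"

generated_sql = """
SELECT department, SUM(days_booked) AS total_days
FROM holiday_records
WHERE EXTRACT(YEAR FROM start_date) = 2024
GROUP BY department
ORDER BY total_days DESC;
"""
# An LLM-based system produces the query above from the question,
# and the query is then executed against the relational database.
```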
Over the past year, Lloyds Banking Group has made significant strides towards this vision. This article highlights the key technical achievements and milestones shaping the future of data interaction at the Group.
The concept isn’t new; systems like LUNAR [1] and CHAT-80 [2] explored natural language interfaces in the 70s and 80s, validating the need for intuitive data access. However, rule-based methods and compute constraints limited their scalability. Today, advances in LLMs, schema-aware encoding, and prompt engineering bring us closer than ever to accurately generating SQL across diverse domains.
Almost two years ago we had a vision of improving our engineering efficiency and speeding up delivery cycles. We also wanted to democratise access to data and insights, through an intuitive tool that eliminates technical barriers and unlocks greater value from our data assets.
Dialogue with Data (DwD) is Lloyds Banking Group’s experiment to explore text-to-SQL using GenAI technology, introducing agentic concepts. It was developed and tested in our “Innovation Sandbox”, a Google Cloud Platform (GCP) environment isolated from production. This provided safe conditions, access to the latest models, and scalable compute for rapid, iterative learning.
Through our experimentation framework, we selected a low risk, yet high‑value Business Intelligence (BI) use case, using a synthetically generated Human Resource (HR) dataset that mirrors holiday and training records. This enabled us to validate end‑to‑end Text‑to‑SQL workflows (governance, guardrails, and UX) without exposing sensitive data.
The experiment was divided into three phases:
Phase one benchmarked LLMs on their ability to generate accurate SQL from natural language prompts. The experiment surfaced limitations in off-the-shelf evaluation tools like Spider and Bird, prompting the creation of a bespoke Group-specific automated evaluation pipeline. The resulting framework supports nuanced assessment, including human-in-the-loop review, and laid the foundations for future phases.
Phase one was labelled ‘zero shot’ (see Figure 1). We provided the full schema within the instruction prompt (see Figure 2) and iterated through a set of 44 questions curated from seven business users, ranging in difficulty from easy to extra hard.
Figure 2: Prompt template for generating SQL statements from a database schema and user question, with placeholders for the schema, table creation statements and user input, and instructions to ensure proper SQL syntax and formatting.
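A hedged sketch of what such a template can look like is shown below; the placeholder names and prompt wording are illustrative rather than the exact prompt used in the experiment.

```python
# Illustrative zero-shot prompt template along the lines described in Figure 2.
# Placeholder names are assumptions, not the exact production prompt.
PROMPT_TEMPLATE = """You are an expert SQL analyst.

Database schema (CREATE TABLE statements):
{create_table_statements}

Additional schema notes:
{schema_description}

User question:
{user_question}

Write a single syntactically correct SQL query that answers the question.
Return only the SQL, with no explanation or markdown formatting.
"""

def build_prompt(create_table_statements: str, schema_description: str, user_question: str) -> str:
    """Fill the placeholders to produce the final prompt sent to the model."""
    return PROMPT_TEMPLATE.format(
        create_table_statements=create_table_statements,
        schema_description=schema_description,
        user_question=user_question,
    )
```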
To ensure deterministic behaviour and reproducibility across model evaluations in all phases, all sampling-related parameters, including temperature [3], top-p [4] and top-k [5], were fixed at zero or their lowest permissible values. This suppressed the inherent stochasticity of the LLMs as far as possible, enabling output generation that was as consistent as possible across repeated runs and facilitating fair comparative analysis.
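As a sketch, pinning these parameters for a run might look like the snippet below, shown here with the Vertex AI Python SDK as an assumption for illustration; exact parameter names and permitted minimums vary by SDK and model version.

```python
# Sketch of fixing sampling parameters for repeatable evaluation runs
# (Vertex AI SDK assumed for illustration; names may differ by version).
from vertexai.generative_models import GenerationConfig, GenerativeModel

deterministic_config = GenerationConfig(
    temperature=0.0,  # remove sampling randomness
    top_p=0.0,        # restrict nucleus sampling as far as the API allows
    top_k=1,          # always take the single most probable token
)

model = GenerativeModel("gemini-1.5-pro")
prompt = "Given the schema below, write a SQL query that counts employees per department."
response = model.generate_content(prompt, generation_config=deterministic_config)
print(response.text)
```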
The automated evaluation pipeline included metrics such as execution accuracy, syntax and semantic correctness, and time-based scoring (R-VES [6]). Additional metrics like Soft-F1 [7] were used to account for minor variations in output structure, further strengthening the robustness of the analysis.
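For illustration, a simplified version of the execution-accuracy check is sketched below: both queries are run against the same database and their result sets compared order-insensitively. The actual pipeline layered syntax and semantic checks, R-VES and Soft-F1 on top of this; the function here is only a minimal sketch.

```python
# Minimal sketch of an execution-accuracy check: run the predicted and gold SQL
# against the same SQLite database and compare their result multisets.
import sqlite3
from collections import Counter

def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if both queries produce the same multiset of rows."""
    with sqlite3.connect(db_path) as conn:
        try:
            predicted_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # an invalid prediction counts as a failure
        gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(map(tuple, predicted_rows)) == Counter(map(tuple, gold_rows))
```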
Whilst the automated evaluation pipeline performed well, it became clear that human evaluators played a crucial role. Programmatic metrics such as Soft-F1, Exact Match (EM), and accuracy do not always fully capture correctness. In practice, evaluation often required resolving ambiguity in cases where multiple permutations of valid SQL statements could be considered correct. Human evaluators were also essential in judging partial correctness, ensuring that results aligned with the intended meaning of the input query.
The evaluation revealed that while state-of-the-art (SoA) models such as Claude 3.5 Sonnet, GPT-4.0 and Gemini demonstrated strong performance on objective queries, particularly in the "easy" and "medium" categories, they struggled with more subjective or domain-specific questions. This insight was reinforced through targeted human evaluation, which underscored the importance of schema context and the injection of domain knowledge.
Notably, specialised text-to-SQL open-source models such as Defog.ai and PipableSQL exhibited significantly lower accuracy across all difficulty levels, highlighting a clear performance gap between SoA and open-source alternatives.
Another divergence between experimental setups and real-world deployment is the impact of differing Data Definition Language (DDL) environments on performance metrics. Initially, we aligned with the open-source frameworks used by Spider and Bird, which at the time relied on SQLite. This consistency supported the development of our own automated evaluation pipeline. However, deploying SQLite in a production-grade enterprise environment proved more complex, prompting a transition to PostgreSQL as an interim step.
Following the PostgreSQL migration, model performance degraded. We attribute this, in part, to dialect bias driven by SQLite’s simpler SQL and its prevalence in benchmark datasets, which nudges models toward SQLite syntax. To enable migration from the Sandbox to other environments and improve scalability, we made the decision to transition to Google Gemini paired with BigQuery.
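With the move to BigQuery, executing a generated query becomes a thin wrapper around the BigQuery client. A minimal sketch, assuming the google-cloud-bigquery library and configured credentials, is shown below.

```python
# Minimal sketch of running model-generated SQL on BigQuery
# (assumes google-cloud-bigquery is installed and credentials are configured).
from google.cloud import bigquery

def run_generated_sql(sql: str, project_id: str) -> list[dict]:
    """Execute the generated query and return the rows as dictionaries."""
    client = bigquery.Client(project=project_id)
    query_job = client.query(sql)                      # submit the query
    return [dict(row) for row in query_job.result()]   # wait and materialise rows
```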
To bring the experiment to life and enable user experience testing, we developed a working MVP (Minimum Viable Product) that served as both a functional prototype and a demonstration tool. This allowed us not only to showcase the potential of conversational analytics in a live environment, but also to validate integration pathways, assess usability and gather early feedback to inform future iterations.
We manage conversations using a simple memory buffer, as we currently have no prior user history. At this stage, we’ve adopted a straightforward approach, leveraging Gemini 1.5 Flash to handle user intent (see the conversation management figure).
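A minimal sketch of this arrangement, assuming the Vertex AI SDK and illustrative intent labels, might look like the following.

```python
# Sketch of a rolling memory buffer plus an intent-classification call to
# Gemini 1.5 Flash before the text-to-SQL step. Intent labels are illustrative.
from collections import deque
from vertexai.generative_models import GenerativeModel

HISTORY = deque(maxlen=10)  # simple memory buffer of recent turns
intent_model = GenerativeModel("gemini-1.5-flash")

INTENT_PROMPT = """Given the conversation so far and the latest user message,
classify the intent as one of: new_question, follow_up, clarification, small_talk.

Conversation:
{history}

Latest message:
{message}

Answer with the label only."""

def classify_intent(message: str) -> str:
    """Ask the model for the intent label, then record the turn in the buffer."""
    history_text = "\n".join(f"{role}: {text}" for role, text in HISTORY)
    prompt = INTENT_PROMPT.format(history=history_text, message=message)
    response = intent_model.generate_content(prompt)
    HISTORY.append(("user", message))
    return response.text.strip()
```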
While the conversation management handles structured exchanges effectively and is simple to implement, it doesn’t yet account for the probabilistic and context-sensitive nature of real human conversation. Future iterations will need to incorporate more sophisticated intent modelling and contextual memory to manage conversational permutations.
As part of the early experiment, we looked at how to improve the SQL generation component in isolation. For this, we were still ingesting the schema within a system prompt, although in some scenarios it was split across separate LLM calls.
Despite progress, accuracy consistently hit an 80% EM ceiling due to limitations in semantic understanding. To address the underlying challenge, we introduced two critical enablers: robust schema management and a semantic layer, resulting in an EM score of 86.1% on a validation set collated from different business users.
What began as a traditional DDL schema with basic descriptions has evolved to include synonyms/acronyms and nominal distinct column values to enrich semantic understanding.
We began by compiling the schema into CSV and splitting it across the stages ready for our Retrieval-Augmented Generation (RAG) workflows. For schema pruning, we deliberately eschewed RAG and used the LLM in parametric (closed-book) mode to condense field descriptions, remove redundancy, and prioritise high-utility columns. This semantics-aware pass was more effective than keyword-driven pruning.
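As a rough illustration, the closed-book pruning pass can be framed as a single LLM call over the CSV-exported schema; the prompt wording below is an assumption, not the exact prompt used.

```python
# Sketch of the parametric (closed-book) schema-pruning pass: no retrieval,
# just one LLM call that condenses descriptions and keeps high-utility columns.
from vertexai.generative_models import GenerativeModel

PRUNE_PROMPT = """You will be given a database schema exported as CSV
(table, column, type, description) and a user question.

1. Remove columns that are clearly irrelevant to the question.
2. Condense the remaining descriptions to one short sentence each.
3. Return the pruned schema in the same CSV format.

Schema CSV:
{schema_csv}

User question:
{question}
"""

def prune_schema(schema_csv: str, question: str) -> str:
    """Return a condensed, question-relevant view of the schema."""
    model = GenerativeModel("gemini-1.5-pro")
    response = model.generate_content(
        PRUNE_PROMPT.format(schema_csv=schema_csv, question=question)
    )
    return response.text
```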
For mapping domain-specific language to the schema, we relied on traditional Named Entity Recognition (NER)/tagging rather than the LLM. A rules- and dictionary-based NER layer handles discrete nominal columns [8], synonyms, and acronyms, and resolves hierarchies that span multiple columns (e.g., Layer_1 … Layer_x). This made it straightforward to assert, for example, that “Department A” belongs to the Layer_3 hierarchy and to provide that context to the generation phase, coupled with the pruned schema, as a first pass.
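A simplified sketch of this dictionary-driven tagging is shown below; every dictionary entry is invented for illustration.

```python
# Sketch of a rules/dictionary-based NER layer: known synonyms, acronyms and
# nominal values are matched in the question and mapped to schema columns.
# All dictionary entries here are invented for illustration.
DICTIONARY = {
    "department a": {"column": "Layer_3", "value": "Department A"},
    "annual leave": {"column": "absence_type", "value": "Holiday"},
    "l&d":          {"column": "record_type", "value": "Training"},
}

def tag_entities(question: str) -> list[dict]:
    """Return schema mappings for every dictionary term found in the question."""
    q = question.lower()
    return [
        {"term": term, **mapping}
        for term, mapping in DICTIONARY.items()
        if term in q
    ]

# e.g. tag_entities("How many people in Department A booked annual leave?")
# yields mappings asserting that Department A sits in the Layer_3 hierarchy,
# which are passed to the generation step alongside the pruned schema.
```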
We also incorporated Gold SQL into the RAG pipeline. Using a single-prompt strategy with structured Chain-of-Thought (CoT) and Gold SQL exemplars, we observed strong gains in both accuracy and latency. Additionally, when a user request is a 1:1 match to a previously validated query, we can bypass generation and return the stored, validated Gold SQL.
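The 1:1 bypass can be as simple as a normalised lookup against the store of validated question/SQL pairs; the entries below are invented for illustration.

```python
# Sketch of the Gold SQL shortcut: an exact (normalised) question match returns
# the stored, validated query and skips generation. Entries are illustrative only.
GOLD_SQL_STORE = {
    "how many training days were completed last quarter?":
        "SELECT COUNT(*) FROM training_records WHERE completed_date >= DATE '2025-04-01';",
}

def normalise(question: str) -> str:
    """Lower-case and collapse whitespace so trivially different phrasings match."""
    return " ".join(question.lower().split())

def lookup_gold_sql(question: str) -> str | None:
    """Return the stored, validated SQL for a 1:1 question match, if any."""
    return GOLD_SQL_STORE.get(normalise(question))
```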
The way the schema and semantic layer are consumed is arguably the most critical factor in designing a robust text-to-SQL system. This becomes even more important when working with large schemas, though even with a relatively small schema we observed auto-regressive [9] bias during rapid prototyping, particularly when the schema was embedded directly within the SQL generation prompt.
The performance of text-to-SQL in a real-world setting is significantly different from that in existing benchmarks, especially when faced with nuanced, domain-specific context.
Accuracy is often perceived as an absolute measure, yet in practice, it is far from straightforward. While creating a benchmark dataset provides a controlled environment for evaluation, real-world conditions introduce noise, ambiguity, and variability that benchmarks rarely capture.
A key challenge lies in defining ‘ground truth’. Labels are typically treated as objective, but they are created by humans bringing inherent subjectivity and bias. This raises an important question: can we ever claim 100% accuracy when the very notion of correctness depends on human interpretation?
These factors create a persistent gap between model performance in benchmark settings and operational environments. Recognising this gap is critical for designing robust systems and setting realistic expectations for stakeholders.
Data democratisation is still the right ambition, but it remains elusive for a system to truly handle ambiguous questions, which often come in the form of true natural language, e.g. “How well are we doing?” Such questions sit outside any descriptive dashboard today, and even a semantic layer cannot resolve them.
We believe the focus should be on enabling a human-in-the-loop system that learns from user experience, with transparency for users at its core.
We developed DwD with a small team of highly skilled engineers. What began as a text-to-SQL experiment has evolved into a broader industry concept now recognised as Generative Business Intelligence (GenBI). Building on the internal success of DwD, we are now working closely with our strategic partners and wider teams to explore how these learnings can be implemented and deliver value across the bank.
Our ambition is to offer conversational data access as a reusable, cross-domain service enabling more intuitive, efficient and secure engagement with data across Lloyds Banking Group.
Lead Data & AI Scientist
Matthew Mason is a Lead Data & AI Scientist in Lloyds Banking Group’s Chief Operating Office. He leads the development and deployment of innovative solutions that enhance operational efficiency across the organisation.
Prior to his current role, Matthew held a range of positions in engineering, analytics, and data science within the Group. His career began in the branch network over 16 years ago, marking the start of a journey that has evolved into a deep focus on data and technology.
Senior Enterprise Architect
Azahar is a subject matter expert in AI and an Enterprise Architect with a focus on emerging trends in AI. Innovation and experimentation with emerging tech is a critical part of his life.
He has worked with startups and FTSE 100 companies delivering solutions based on technologies such as Conversational AI, 5G, and Cloud. He has a Ph.D. in Artificial Intelligence focused on generative design.
Innovation Experiment Lead
Ansel Liu is an Emerging Technology and Innovation Manager in the Chief Technology Office (CTO) at Lloyds Banking Group. He leads initiatives to assess the impact of emerging technologies on colleagues, customers, and clients, and drives experimentation to unlock new customer segments and business models.
Over the past four years, Ansel has focused on developing technology-led propositions and running experiments in areas such as BNPL, Data Products and GenAI, with an emphasis on business and commercial banking use cases.
Prior to Lloyds Banking Group, Ansel worked in tech strategy consulting, corporate venture building and co-founded NOMAD, a PropTech startup within the LabTech Group.