SQL for AI Engineers: Querying Vector Databases and LLM Outputs

The most common misconception about AI engineering is that it's a departure from traditional data skills. In practice, AI engineers write more SQL than most backend developers — just for different purposes: vector similarity search, embedding storage, LLM output extraction, and ML feature pipelines.

This post covers the SQL patterns that show up most in modern AI engineering work.

Vector Similarity Search with pgvector

PostgreSQL's pgvector extension adds a vector data type and similarity search operators. It's the most widely used approach for semantic search in production applications.

Setup

-- Enable the extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- Store document embeddings alongside metadata
CREATE TABLE document_embeddings (
    id         SERIAL PRIMARY KEY,
    content    TEXT NOT NULL,
    source_url TEXT,
    doc_type   TEXT,
    embedding  vector(1536),   -- OpenAI text-embedding-3-small dimension
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest-neighbor search at scale
CREATE INDEX ON document_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

Semantic Search (RAG Retrieval Pattern)

-- Find the 5 most semantically similar documents to a query embedding
-- $1 is the embedding vector generated by your application
SELECT
    id,
    content,
    source_url,
    doc_type,
    1 - (embedding <=> $1) AS similarity_score
FROM document_embeddings
WHERE doc_type = 'knowledge-base'
ORDER BY embedding <=> $1        -- <=> = cosine distance
LIMIT 5;

Operator reference:

<=> — cosine distance (most common for text embeddings)
<-> — Euclidean (L2) distance
<#> — negative inner product

Hybrid Search: Semantic + Keyword

Pure vector search misses exact keyword matches. Combining both gives better recall:

WITH semantic AS (
    SELECT id, content, 1 - (embedding <=> $1) AS sem_score
    FROM document_embeddings
    ORDER BY embedding <=> $1
    LIMIT 20
),
keyword AS (
    SELECT id, content,
           ts_rank(to_tsvector('english', content), to_tsquery('english', $2)) AS kw_score
    FROM document_embeddings
    WHERE to_tsvector('english', content) @@ to_tsquery('english', $2)
    LIMIT 20
)
SELECT
    COALESCE(s.id, k.id) AS id,
    COALESCE(s.content, k.content) AS content,
    COALESCE(s.sem_score, 0) * 0.7 + COALESCE(k.kw_score, 0) * 0.3 AS combined_score
FROM semantic s
FULL OUTER JOIN keyword k ON s.id = k.id
ORDER BY combined_score DESC
LIMIT 5;

Querying LLM Outputs Stored in Databases

LLM outputs are often stored as JSONB. Querying them efficiently requires JSONB operators.

Storing Structured LLM Responses

CREATE TABLE llm_responses (
    id          SERIAL PRIMARY KEY,
    prompt      TEXT,
    model       TEXT,
    response    JSONB,
    latency_ms  INTEGER,
    tokens_used INTEGER,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

Extracting Fields from JSONB

-- Extract specific fields from the response JSON
SELECT
    id,
    model,
    response->>'finish_reason' AS finish_reason,
    (response->'usage'->>'prompt_tokens')::INT AS prompt_tokens,
    (response->'usage'->>'completion_tokens')::INT AS completion_tokens,
    response->'choices'->0->'message'->>'content' AS response_text
FROM llm_responses
WHERE model = 'gpt-4o'
  AND created_at >= NOW() - INTERVAL '7 days';

Aggregating LLM Performance Metrics

-- Model comparison: latency and token usage per model per day
SELECT
    DATE_TRUNC('day', created_at) AS day,
    model,
    COUNT(*) AS requests,
    AVG(latency_ms) AS avg_latency_ms,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) AS p95_latency_ms,
    SUM(tokens_used) AS total_tokens,
    AVG(tokens_used) AS avg_tokens_per_request
FROM llm_responses
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY 1, 2
ORDER BY 1 DESC, avg_latency_ms;

ML Feature Engineering with SQL

Feature engineering — computing derived metrics from raw event data — is still predominantly done in SQL, even in modern ML stacks.

User Behavioral Features

-- Features for a churn prediction model
WITH user_events_30d AS (
    SELECT
        user_id,
        COUNT(*)                                            AS total_events,
        COUNT(DISTINCT DATE(event_time))                    AS active_days,
        COUNT(DISTINCT session_id)                          AS session_count,
        COUNT(*) FILTER (WHERE event_type = 'purchase')    AS purchase_count,
        SUM(revenue) FILTER (WHERE event_type = 'purchase') AS total_revenue,
        MAX(event_time)                                     AS last_active,
        MIN(event_time)                                     AS first_active
    FROM user_events
    WHERE event_time >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY user_id
)
SELECT
    user_id,
    total_events,
    active_days,
    session_count,
    purchase_count,
    COALESCE(total_revenue, 0)                                AS total_revenue,
    total_events / NULLIF(active_days, 0)                     AS events_per_active_day,
    EXTRACT(EPOCH FROM (NOW() - last_active)) / 86400         AS days_since_last_active,
    EXTRACT(EPOCH FROM (last_active - first_active)) / 86400  AS days_as_user
FROM user_events_30d;

Rolling Window Features

-- Rolling features for time-series models
SELECT
    user_id,
    event_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS revenue_7d,
    SUM(daily_revenue) OVER (
        PARTITION BY user_id
        ORDER BY event_date
        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
    ) AS revenue_30d,
    daily_revenue - LAG(daily_revenue) OVER (
        PARTITION BY user_id ORDER BY event_date
    ) AS revenue_dod,
    EXTRACT(DOW FROM event_date) IN (0, 6) AS is_weekend
FROM daily_user_revenue
ORDER BY user_id, event_date;

For more window function patterns, see the SQL Window Functions Guide.

Label Creation for Supervised Learning

-- Binary churn label: did the user stop engaging in the next 30 days?
WITH last_event AS (
    SELECT user_id, MAX(event_time) AS last_active
    FROM user_events
    GROUP BY user_id
)
SELECT
    u.user_id,
    l.last_active,
    CASE
        WHEN l.last_active < CURRENT_DATE - INTERVAL '30 days' THEN 1
        ELSE 0
    END AS churned
FROM users u
JOIN last_event l ON u.user_id = l.user_id
WHERE u.created_at < CURRENT_DATE - INTERVAL '60 days';  -- Only users old enough to churn

dbt: SQL as the Core of Modern Data Pipelines

dbt treats every transformation as a SQL SELECT statement. The modern ML data stack — feature stores, training data pipelines, analytics layers — is built on dbt models.

A dbt Feature Model

-- models/features/user_features.sql
{{ config(materialized='table') }}

WITH base AS (
    SELECT *
    FROM {{ ref('stg_user_events') }}
    WHERE event_time >= DATEADD('day', -30, CURRENT_DATE)
),
aggregated AS (
    SELECT
        user_id,
        COUNT(*) AS events_30d,
        COUNT(DISTINCT DATE(event_time)) AS active_days_30d,
        SUM(revenue) AS revenue_30d
    FROM base
    GROUP BY user_id
)
SELECT
    user_id,
    events_30d,
    active_days_30d,
    revenue_30d,
    events_30d / NULLIF(active_days_30d, 0) AS events_per_active_day
FROM aggregated

dbt handles dependency ordering, runs models in the right sequence, and tests data quality. Every AI/ML team working at scale uses it.

AI Observability Queries

AI applications generate traces, spans, and evaluation scores that need to be queried like any other time-series data.

-- Track model quality scores over time
SELECT
    DATE_TRUNC('day', evaluated_at) AS day,
    model_version,
    evaluator_type,
    AVG(score) AS avg_score,
    STDDEV(score) AS score_stddev,
    COUNT(*) AS sample_count,
    AVG(score) - LAG(AVG(score)) OVER (
        PARTITION BY model_version, evaluator_type
        ORDER BY DATE_TRUNC('day', evaluated_at)
    ) AS score_change_vs_yesterday
FROM llm_evaluations
WHERE evaluated_at >= NOW() - INTERVAL '30 days'
GROUP BY 1, 2, 3
ORDER BY 1 DESC, 2, 3;

Frequently Asked Questions

Do AI engineers need to know SQL?

Yes. AI engineers who work with data — which is most of them — write SQL constantly: for feature engineering, for querying vector databases, for analyzing model outputs, for building evaluation pipelines. The tools have changed (pgvector, BigQuery ML, dbt), but SQL is still the language of data.

What is pgvector and why does it matter?

pgvector is a PostgreSQL extension that adds a vector data type and similarity search operators. It lets you store embeddings alongside structured data in the same database and query them with cosine similarity or Euclidean distance. It powers the retrieval step in most RAG (Retrieval-Augmented Generation) applications and eliminates the need for a separate vector database in many use cases.

What SQL database is best for AI applications?

PostgreSQL with pgvector is the most common choice for applications that need both relational queries and vector search. For large-scale analytics and feature engineering, BigQuery and Snowflake are standard — both support SQL natively. For pure vector workloads at very large scale, dedicated vector databases (Pinecone, Weaviate, Qdrant) are used, though these have their own query interfaces rather than SQL.

Want to sharpen the SQL skills that matter for AI roles? The SQL Crash Course covers window functions, CTEs, and advanced query patterns in 270+ pages. For how SQL fits into the broader AI landscape, see Is SQL Still Relevant in the Age of LLMs?

Working on AI infrastructure and hitting SQL questions? Get in touch.