Complete the draft

Andrew Stryker
2026-05-03 18:19:07 -07:00
parent 82f719f734
commit c7396a5fd4

@@ -1,7 +1,7 @@
---
title: 'MCP for Your Data Warehouse'
-date: 2026-05-04T12:00:00
-draft: true
+date: 2026-05-03T18:00:00
+draft: false
tags:
- Data Warehouse
- MCP
@@ -18,9 +18,9 @@ Business users want data that drives decisions. They want to know what
happened, why it happened, and what to do about it --- without learning a data
model, mastering dimensional thinking, or writing and debugging SQL. That is
the gap a good analyst closes. Analysts use both _domain_ and _technical_
-knowledge to pull insights out of an organization's data warehouse. LLMs can
-offer a different path to solve for a business user by connecting the LLM to
-a data warehouse via an MCP.
+knowledge to pull insights out of an organization's data warehouse. LLMs
+offer a different path: connect the model to the warehouse via an MCP server,
+and the user gets answers without needing the analyst's technical knowledge.
The promise is real. A well-connected LLM translates plain questions into
queries, interprets results, and returns answers without requiring the user to
@@ -46,11 +46,11 @@ executes it:
```sql
SELECT count(DISTINCT user_id)
FROM mart_user_activity
-WHERE activity_week = CURRENT_DATE - 7
+WHERE activity_week = DATE_TRUNC('week', CURRENT_DATE)
AND user_id NOT IN (
SELECT user_id
FROM mart_user_activity
-WHERE activity_week = CURRENT_DATE - 14
+WHERE activity_week = DATE_TRUNC('week', CURRENT_DATE) - 7
)
```
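
The week filters only match rows when they land on a week boundary. A minimal Python sketch of that arithmetic, assuming `activity_week` stores the Monday that starts each week (PostgreSQL's `DATE_TRUNC('week', ...)` truncates to Monday; other engines may differ). `week_start` is a hypothetical helper for illustration, not part of any warehouse API:

```python
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day` (assumed week convention)."""
    return day - timedelta(days=day.weekday())


today = date(2026, 5, 3)  # a Sunday, chosen only for illustration

# A plain "today minus 7 days" filter is a week start only when today is a Monday.
old_filter = today - timedelta(days=7)
old_matches_week_start = old_filter == week_start(old_filter)

# Truncating to the week first always lands on a week boundary.
this_week = week_start(today)
previous_week = this_week - timedelta(days=7)
```

On any day other than a Monday, the untruncated filter compares `activity_week` to a date that is not a week start, so under this assumption it would match no rows.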
@@ -59,8 +59,8 @@ requests it. That is the appeal --- and the risk. Most users are not going to
inspect the query. Even if they did, they would be unlikely to spot an error.
Everything depends on the LLM querying the correct tables, the trustworthiness
of the tables, and valid interpretation of the result. An LLM might not notice
-that three days of data are missing from the monthly report -- an LLM does not
-natively understand your business domain.
+that three days of data are missing from the monthly report --- an LLM does
+not natively understand your business domain.
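
A gap like that is easier to catch with an explicit completeness check than by hoping the model notices. A minimal Python sketch, assuming the set of loaded dates can be pulled from the warehouse; `missing_days` and the sample dates are hypothetical:

```python
from datetime import date, timedelta


def missing_days(loaded: set[date], start: date, end: date) -> list[date]:
    """Return every date in [start, end] that has no loaded data."""
    span = (end - start).days + 1
    expected = (start + timedelta(days=offset) for offset in range(span))
    return [day for day in expected if day not in loaded]


# Hypothetical April load in which three days never arrived.
april = {date(2026, 4, d) for d in range(1, 31)} - {
    date(2026, 4, 11),
    date(2026, 4, 12),
    date(2026, 4, 13),
}
gaps = missing_days(april, date(2026, 4, 1), date(2026, 4, 30))
```

A non-empty `gaps` list is the kind of signal that could be surfaced alongside a report so the omission is visible to the user.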
---
@@ -74,7 +74,7 @@ What makes a warehouse trustworthy is layered correctness. Raw data enters the
pipeline unvalidated. Each transformation layer enforces a guarantee --- types
conform, entities are uniquely identified, cross-field relationships are
consistent --- so that by the time data reaches the analytical surface,
-a chain of checks has been applied. (See [this post][ontology] for more
+a chain of checks has been applied. (See [this post][ontology] for a more
comprehensive explanation of using a layered model to build a data warehouse.)
The analytical surface --- marts, reports, or however your warehouse organizes
its consumption layers --- is where these guarantees culminate.
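
The three guarantees named above (types conform, entities are uniquely identified, cross-field relationships are consistent) can each be expressed as a small check. A minimal Python sketch over in-memory rows; the order fields and function names are assumptions for illustration, and a real warehouse would more likely enforce these in its transformation tooling:

```python
from datetime import date

# Hypothetical order rows; field names are assumptions for illustration.
rows = [
    {"order_id": 1, "ordered_on": date(2026, 4, 1), "shipped_on": date(2026, 4, 3)},
    {"order_id": 2, "ordered_on": date(2026, 4, 2), "shipped_on": None},
    {"order_id": 2, "ordered_on": date(2026, 4, 5), "shipped_on": date(2026, 4, 4)},
]


def types_conform(row: dict) -> bool:
    """Layer 1: each field has the expected type."""
    return (
        isinstance(row["order_id"], int)
        and isinstance(row["ordered_on"], date)
        and (row["shipped_on"] is None or isinstance(row["shipped_on"], date))
    )


def duplicate_ids(rows: list[dict]) -> set[int]:
    """Layer 2: entity keys that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        (dupes if row["order_id"] in seen else seen).add(row["order_id"])
    return dupes


def dates_consistent(row: dict) -> bool:
    """Layer 3: an order cannot ship before it was placed."""
    return row["shipped_on"] is None or row["shipped_on"] >= row["ordered_on"]


type_failures = [r for r in rows if not types_conform(r)]
dupes = duplicate_ids(rows)
inconsistent = [r["order_id"] for r in rows if not dates_consistent(r)]
```

Here every row passes the type check, but order `2` fails both the uniqueness and the consistency layers, so it should not reach the analytical surface.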
@@ -86,8 +86,8 @@ LLM should see.
## Problems with LLMs
-At first glance, maybe the only step required is to restrict the LLM access to
-the trustworthy consumption layers? Not quite. LLMs have failure modes:
+At first glance, maybe the only step required is to restrict the LLM's access
+to the trustworthy consumption layers? Not quite. LLMs have failure modes:
**Inherited statistical defaults.** LLMs generate responses that reflect common
analytical practice, and common practice is often wrong for a given dataset.
@@ -133,9 +133,9 @@ analytical system built without discipline, and the mitigations are the same.
The failure modes above are known. Good design addresses them the way
engineering addresses any known failure: not by hoping the system behaves, but
by constraining it so that it cannot fail in those ways --- or fails visibly
-when it does. Each of the following tips targets a specific failure mode: what
-the LLM can query, what metadata it sees, what context it can consult, and
-whether it has verified answers to reach for before writing new SQL.
+when it does. The following design choices address those failure modes ---
+along with the operational concerns of query safety and data freshness that
+come with letting an LLM execute SQL against a production warehouse.
### Restrict access to the analytical surface
@@ -225,8 +225,8 @@ A markdown document at a well-known location --- queryable via MCP --- can
supply this context. Ask users to have the LLM consult the guide at the start
of a session.
-A usage section become load-bearing within a session. The LLM may not always
-follow it -- instrument logging to identify shortcomings. But it is
+The usage guide becomes load-bearing within a session. The LLM may not always
+follow it --- instrument logging to identify shortcomings. But it is
a practical first step, and writing the guide forces the team to articulate
institutional knowledge that is otherwise implicit.
@@ -259,12 +259,9 @@ Business users want to ask what happened, why, and what to do about it ---
without learning the plumbing. MCP makes that interaction possible. But the
interaction is only as trustworthy as the data behind it.
-The work described here --- restricting access to curated consumption layers,
-surfacing metadata and grain, validating queries, pre-computing meaningful
-measurements, making freshness visible, encoding domain knowledge, and
-building a verified query library --- is what turns a bare database connection
-into a measurement instrument that an LLM can use responsibly. None of it is
-new. All of it is load-bearing.
+The work described here is what turns a bare database connection into
+a measurement instrument that an LLM can use responsibly. None of it is new.
+All of it is load-bearing.
As Falconer and O'Keefe [have argued][falconer], AI will not save you from
your data modeling problems. This post takes that observation one step further: