Complete the draft
This commit is contained in:
@@ -1,7 +1,7 @@
|
||||
---
|
||||
title: 'MCP for Your Data Warehouse'
|
||||
date: 2026-05-04T12:00:00
|
||||
draft: true
|
||||
date: 2026-05-03T18:00:00
|
||||
draft: false
|
||||
tags:
|
||||
- Data Warehouse
|
||||
- MCP
|
||||
@@ -18,9 +18,9 @@ Business users want data that drives decisions. They want to know what
|
||||
happened, why it happened, and what to do about it --- without learning a data
|
||||
model, mastering dimensional thinking, or writing and debugging SQL. That is
|
||||
the gap a good analyst closes. Analysts use both _domain_ and _technical_
|
||||
knowledge to pull insights out of an organization's data warehouse. LLMs can
|
||||
offer a different path to solve for a business user by connecting the LLM to
|
||||
a data warehouse via an MCP.
|
||||
knowledge to pull insights out of an organization's data warehouse. LLMs
|
||||
offer a different path: connect the model to the warehouse via an MCP server,
|
||||
and the user gets answers without needing the analyst's technical knowledge.
|
||||
|
||||
The promise is real. A well-connected LLM translates plain questions into
|
||||
queries, interprets results, and returns answers without requiring the user to
|
||||
@@ -46,11 +46,11 @@ executes it:
|
||||
```sql
|
||||
SELECT count(DISTINCT user_id)
|
||||
FROM mart_user_activity
|
||||
WHERE activity_week = CURRENT_DATE - 7
|
||||
WHERE activity_week = DATE_TRUNC('week', CURRENT_DATE)
|
||||
AND user_id NOT IN (
|
||||
SELECT user_id
|
||||
FROM mart_user_activity
|
||||
WHERE activity_week = CURRENT_DATE - 14
|
||||
WHERE activity_week = DATE_TRUNC('week', CURRENT_DATE) - 7
|
||||
)
|
||||
```
|
||||
|
||||
@@ -59,8 +59,8 @@ requests it. That is the appeal --- and the risk. Most users are not going to
|
||||
inspect the query. Even if they did, they would be unlikely to spot an error.
|
||||
Everything depends on the LLM querying the correct tables, the trustworthiness
|
||||
of the tables, and valid interpretation of the result. An LLM might not notice
|
||||
that three days of data are missing from the monthly report -- an LLM does not
|
||||
natively understand your business domain.
|
||||
that three days of data are missing from the monthly report --- an LLM does
|
||||
not natively understand your business domain.
|
||||
|
||||
---
|
||||
|
||||
@@ -74,7 +74,7 @@ What makes a warehouse trustworthy is layered correctness. Raw data enters the
|
||||
pipeline unvalidated. Each transformation layer enforces a guarantee --- types
|
||||
conform, entities are uniquely identified, cross-field relationships are
|
||||
consistent --- so that by the time data reaches the analytical surface,
|
||||
a chain of checks has been applied. (See [this post][ontology] for more
|
||||
a chain of checks has been applied. (See [this post][ontology] for a more
|
||||
comprehensive explanation of using a layered model to build a data warehouse.)
|
||||
The analytical surface --- marts, reports, or however your warehouse organizes
|
||||
its consumption layers --- is where these guarantees culminate.
|
||||
@@ -86,8 +86,8 @@ LLM should see.
|
||||
|
||||
## Problems with LLMs
|
||||
|
||||
At first glance, maybe the only step required is to restrict the LLM access to
|
||||
the trustworthy consumption layers? Not quite. LLMs have failure modes:
|
||||
At first glance, maybe the only step required is to restrict the LLM's access
|
||||
to the trustworthy consumption layers? Not quite. LLMs have failure modes:
|
||||
|
||||
**Inherited statistical defaults.** LLMs generate responses that reflect common
|
||||
analytical practice, and common practice is often wrong for a given dataset.
|
||||
@@ -133,9 +133,9 @@ analytical system built without discipline, and the mitigations are the same.
|
||||
The failure modes above are known. Good design addresses them the way
|
||||
engineering addresses any known failure: not by hoping the system behaves, but
|
||||
by constraining it so that it cannot fail in those ways --- or fails visibly
|
||||
when it does. Each of the following tips targets a specific failure mode: what
|
||||
the LLM can query, what metadata it sees, what context it can consult, and
|
||||
whether it has verified answers to reach for before writing new SQL.
|
||||
when it does. The following design choices address those failure modes ---
|
||||
along with the operational concerns of query safety and data freshness that
|
||||
come with letting an LLM execute SQL against a production warehouse.
|
||||
|
||||
### Restrict access to the analytical surface
|
||||
|
||||
@@ -225,8 +225,8 @@ A markdown document at a well-known location --- queryable via MCP --- can
|
||||
supply this context. Ask users to have the LLM consult the guide at the start
|
||||
of a session.
|
||||
|
||||
A usage section become load-bearing within a session. The LLM may not always
|
||||
follow it -- instrument logging to identify shortcomings. But it is
|
||||
The usage guide becomes load-bearing within a session. The LLM may not always
|
||||
follow it --- instrument logging to identify shortcomings. But it is
|
||||
a practical first step, and writing the guide forces the team to articulate
|
||||
institutional knowledge that is otherwise implicit.
|
||||
|
||||
@@ -259,12 +259,9 @@ Business users want to ask what happened, why, and what to do about it ---
|
||||
without learning the plumbing. MCP makes that interaction possible. But the
|
||||
interaction is only as trustworthy as the data behind it.
|
||||
|
||||
The work described here --- restricting access to curated consumption layers,
|
||||
surfacing metadata and grain, validating queries, pre-computing meaningful
|
||||
measurements, making freshness visible, encoding domain knowledge, and
|
||||
building a verified query library --- is what turns a bare database connection
|
||||
into a measurement instrument that an LLM can use responsibly. None of it is
|
||||
new. All of it is load-bearing.
|
||||
The work described here is what turns a bare database connection into
|
||||
a measurement instrument that an LLM can use responsibly. None of it is new.
|
||||
All of it is load-bearing.
|
||||
|
||||
As Falconer and O'Keefe [have argued][falconer], AI will not save you from
|
||||
your data modeling problems. This post takes that observation one step further:
|
||||
|
||||
Reference in New Issue
Block a user