Open source · MIT licensed

A Data Quality Tool that tells you what and why.

Statistical drift detection, column-level lineage, and causal discovery — for dbt, warehouses, and data lakes. A Python library, CLI, and web app — all MIT licensed, not just the Python library as with other tools.

Built for ClickHouse and BigQuery first · Postgres · Snowflake · others — WIP, contributors welcome ↗

New · LLM Wiki semantic layer

Your Trello board is already a semantic layer. dqt extracts it.

Dump tickets, SQL, and BI reports into raw/. Point Claude Code at the vault — it synthesises dataset descriptions, metric definitions, and causal edges into wiki/. No manual YAML authoring.

Based on Karpathy's LLM Wiki pattern ↗
★ Star on GitHub →

32 detector algorithms

28 declarative checks

9+ warehouse engines

100B+ rows validated (and counting)

MIT · no vendor lock-in

The hour after the alert

Most DQ tools tell you a row count dropped.
They don't tell you why.

You set a threshold. It fires. Slack lights up. Now you're bouncing between dbt docs, the warehouse, and your BI tool — trying to figure out which upstream model changed, whether the spike in nulls explains the dashboard regression, and whether this is worth waking the on-call engineer for.

dqt was built for the part that comes after the alert. It reads your dbt manifest, parses your warehouse SQL into a column-level lineage graph, runs 30+ statistical detectors, and discovers causal relationships across your metrics — so the next time something moves, you already know what moved it.

Without dqt

orders.amount null_fraction ≥ 0.05 — threshold exceeded

Now what? Go dig through git log, dbt docs, warehouse history…

With dqt

orders.amount null_fraction = 12.4% (baseline 0.3%)

Causal trace: stg_payments → orders → revenue. Upstream model stg_payments introduced a schema break 6h ago. E-value = 3.2.

Four layers. One library.

Statistical detectors

Every column. Every run.

MAD, double-MAD, isolation forest, KS, STL residual z-scores, adjusted boxplot fences. Plus completeness, validity, freshness, schema-change, and SQL-assertion checks. Every detector returns the same (verdict, score, plain_english) shape.

mad_outlier_fraction · ks_pvalue · stl_residual_zscore · isolation_forest_fraction
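
That uniform shape is what makes the detectors composable. dqt's real types are not shown on this page, so here is a minimal sketch assuming a three-field NamedTuple; DetectorResult, the verdict labels, and the thresholds are illustrative, not the library's actual API:

import numpy as np
from typing import NamedTuple

class DetectorResult(NamedTuple):
    verdict: str          # e.g. "pass" | "warn" | "fail" (illustrative labels)
    score: float
    plain_english: str

def mad_outlier_fraction(values: np.ndarray, k: float = 3.0, warn_at: float = 0.01) -> DetectorResult:
    """Fraction of values more than k robust z-scores from the median."""
    median = np.median(values)
    mad = max(np.median(np.abs(values - median)) * 1.4826, 1e-9)  # consistency-scaled MAD
    frac = float(np.mean(np.abs(values - median) / mad > k))
    verdict = "pass" if frac < warn_at else "warn"
    return DetectorResult(verdict, frac, f"{frac:.2%} of values are MAD outliers")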

Column-level lineage

Parsed from your SQL.

dqt walks your dbt manifest and warehouse DDL with sqlglot to build a column-level dependency graph. From any incident, get an automatic blast radius — every downstream table and metric, ranked by exposure.
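
dqt's graph builder itself is internal, but the sqlglot primitive it builds on can be tried directly. A minimal sketch; the SQL, table names, and dialect are illustrative:

from sqlglot.lineage import lineage

# Trace which upstream columns feed amount_usd. dqt does this across the
# whole dbt manifest and warehouse DDL to assemble the dependency graph.
sql = """
SELECT p.amount * fx.rate AS amount_usd
FROM stg_payments AS p
JOIN fx_rates AS fx ON fx.currency = p.currency
"""

root = lineage("amount_usd", sql, dialect="clickhouse")
for node in root.walk():
    print(node.name)  # amount_usd first, then the contributing source columns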

LLM Wiki · Semantic layer

raw/ holds facts. wiki/ holds knowledge.

dqt uses Karpathy's LLM Wiki pattern. Dump your Trello tickets, SQL files, and BI reports into raw/. Point Claude Code at the vault. It synthesises wiki/ — dataset descriptions, metric definitions, causal edges — from the artifacts your team already has. The resulting YAML contracts are compatible with dbt's semantic_models.yml.

raw/tickets/ · raw/sql/ · raw/reports/ → wiki/metrics/ · wiki/lineage/

Causal discovery

Granger. PCMCI+. Transfer Entropy.

dqt runs causal discovery across your metric time series, prunes edges with stability selection, and proposes directed metric→metric relationships annotated with lag, confidence, and E-values. Every edge reviewed by a human before entering the production DAG.

The only open-source DQ tool that ships causal discovery.
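
Granger testing, the first of those three stages, is straightforward to sketch with statsmodels. dqt's PCMCI+, Transfer Entropy, and stability-selection stages are not reproduced here, and the two metric series below are synthetic:

import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(42)
spend = rng.normal(100.0, 10.0, 200)                          # synthetic daily marketing spend
orders = 0.5 * np.roll(spend, 2) + rng.normal(0.0, 5.0, 200)  # orders track spend with a 2-step lag

# Column order matters: the test asks whether column 2 (spend) Granger-causes column 1 (orders).
data = np.column_stack([orders, spend])
for lag, (tests, _) in grangercausalitytests(data, maxlag=4, verbose=False).items():
    f_stat, p_value = tests["ssr_ftest"][0], tests["ssr_ftest"][1]
    print(f"lag={lag}  F={f_stat:.1f}  p={p_value:.4f}")      # p collapses once lag >= 2

An edge like spend → orders at lag 2 is what dqt then prunes with stability selection and annotates with an E-value before a human reviews it.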

Karpathy's LLM Wiki pattern

Your data warehouse
already has documentation.
It's in your Trello board.

Every BI request your GTM team filed is a semantic definition waiting to be extracted. The ticket says what the metric means. The SQL says how it's computed. The report says what thresholds matter.

dqt uses Karpathy's LLM Wiki structure: raw/ for atomic source documents, wiki/ for synthesised knowledge. Point Claude Code at the vault and it writes the semantic layer for you — from the artifacts your team already has.

Read the full workflow guide →
1. Export Trello tickets + attachments
   SQL files, report HTMLs, metric definitions

2. Put them in raw/
   raw/tickets/ · raw/sql/ · raw/reports/ · raw/schema/

3. Point Claude Code at the vault
   cd vault && claude .

4. Claude Code synthesises wiki/
   datasets, metrics, lineage, causal edges — grounded in your actual data

5. dqt generates per-column docs + checks
   write_vault() · dqt run checks.yaml
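
Put together, the vault looks something like this. The directory names are the ones used on this page; the comments are illustrative:

vault/
  raw/
    tickets/    # exported Trello cards + attachments
    sql/        # queries behind BI reports
    reports/    # exported report HTMLs
    schema/     # warehouse DDL dumps
  wiki/
    metrics/    # synthesised metric definitions
    lineage/    # dataset descriptions + causal edges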

Recommended workflow

Use dqt with Claude Code.

Three plugins turn Claude Code into a grounded data-quality engineer that knows your warehouse, knows the dqt API, and can run checks from natural language.

Up-to-date dqt docs · Context7

Connects Claude Code to dqt's live documentation and source — every detector slug, the exact YAML schema, and the adapter protocol. No training-data lag.

  • Write checks from business rules
  • Pick the right detector for your data shape
  • Debug failures with current API knowledge

Agentic development skills

Gives Claude Code structured workflows for planning, executing, and reviewing multi-step tasks — essential for building out a full dqt check suite or semantic layer from scratch.

  • Plan + execute check suites step by step
  • TDD for detector configs
  • Subagent-driven semantic layer build

Live warehouse access

Each warehouse publishes its own MCP — e.g. mcp-clickhouse for ClickHouse, or the Postgres MCP server. Connect Claude Code to your warehouse and it can inspect live schemas, sample real distributions, and write dqt checks grounded in your actual data.

  • Schema introspection from live tables
  • Sample-driven detector recommendations
  • Auto-generate semantic.yaml from DDL
Claude Code · all three plugins active

you

Look at my orders table and write dqt checks for amount_usd. Flag outliers and distribution shift.

claude code

Querying your warehouse via MCP... orders.amount_usd: right-skewed, p99 = $4,820. Context7 confirms: use adjusted_boxplot_fraction (handles skew via medcouple) + ks_pvalue for drift.

checks:
  - table_name: orders
    column_name: amount_usd
    detector_slug: adjusted_boxplot_fraction
  - table_name: orders
    column_name: amount_usd
    detector_slug: ks_pvalue
    params: {alpha: 0.01}

Statistical & ML algorithms · 32

mad_outlier_fraction · double_mad_outlier_fraction · zscore_outlier_fraction · adjusted_boxplot_fraction · auto_outlier_fraction · isolation_forest_fraction · grubbs · generalized_esd · iqr_fence · ks_pvalue · wasserstein_1 · psi · kl_divergence · js_divergence · mmd · adwin · chi_square_drift · stl_residual_zscore · bocpd · cusum · page_hinkley · matrix_profile · holt_winters · prophet_anomaly · mahalanobis_distance · lof · one_class_svm · hbos · ecod · cramers_v · mutual_information · benford_law_fit

Declarative checks · 28

null_fraction · completeness · date_part_missing_fraction · uniqueness · composite_uniqueness · volume · numeric_mean · value_in_range · max_in_range · min_in_range · median_in_range · sum_in_range · stddev_in_range · cardinality_in_range · quantile_in_range · row_count_in_range · set_membership · set_exclusion · regex_match · string_length_range · string_case_violation · date_format · column_pair_comparison · referential_integrity_rate · monotonicity · freshness_seconds_behind · schema_change · sql_assertion_violation

Three lines to your first check.

Runs in notebooks. Runs in CI.
No server required.

from dqt import Check, Runner, MemoryStore

check = Check(
    schema_name="public",
    table_name="orders",
    column_name="amount",
    detector_slug="mad_outlier_fraction",
)

# adapter: any dqt warehouse adapter, e.g. your ClickHouse or BigQuery connection
result = Runner(MemoryStore()).run(check, adapter)

print(result.plain_english)
# → "0.82% of values are outliers — within the 1% warn threshold"

No server required. The optional FastAPI service and dashboard are there when you want them — and stay out of the way when you don't.

Where dqt sits.

We borrowed the best ideas.
Then shipped the parts they don't have.

Causal discovery isn't a nice-to-have — it's the difference between “orders are down” and “orders are down because the EU marketing-spend job missed its 06:00 run.”

How dqt compares with Great Expectations, Soda, Elementary, and Dataplex:

Open source (MIT): dqt ✓ · others partial at best
30+ statistical detectors: dqt ✓ · others limited at best
Column-level lineage: dqt ✓ · others partial at best
Causal discovery: dqt ✓ · others none
AI-grounded incident explainer: dqt ✓ · others partial at best
pip install, runs offline: dqt ✓ · others partial at best
No vendor lock-in: dqt ✓ · others partial at best

Drop it in next to the tools you already use.

dbt: reads manifest.json and semantic_models.yml directly (see the sketch after this list)
Airflow · Dagster · Prefect: runs as one Python task
Snowflake · BigQuery · Postgres · Databricks: adapter-based; bring your own connection
OpenLineage: ingests events from any non-dbt pipeline
DuckDB: embedded analytics engine for sample-level stats
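
For a sense of what reading the manifest directly means: dbt writes node-level dependencies to target/manifest.json, the table-level graph that dqt is described above as refining to column level. A minimal, dqt-independent peek; the path and printout are illustrative:

import json
from pathlib import Path

# dbt writes its compile artifacts to target/manifest.json. Every model node
# lists the upstream nodes it depends on.
manifest = json.loads(Path("target/manifest.json").read_text())
for name, node in manifest["nodes"].items():
    if node["resource_type"] == "model":
        print(name, "<-", node["depends_on"]["nodes"])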

Install it. Point it at your warehouse.
See your first incident in five minutes.

Open source · MIT licensed · Python 3.12+ · No telemetry · No signup · No credit card

About the author

Anton Barr is a data geek, getting things done since 1972 and vibe-coding at unreasonable hours. A student of 質 (shitsu): quality, substance, the inner nature of a thing. dqt is a personal project — the data quality tool he kept wishing existed.