install dodo. load the extension. run your .do file.
dodo is a DuckDB extension that reads legacy .do workflows and executes them on DuckDB, preserving familiar data-cleaning scripts while moving execution to a modern analytical engine.
install in 30 seconds
From any DuckDB shell — desktop, CLI, Python, R — the install procedure is the same. dodo is distributed through the DuckDB community extension repository.
sql// in any duckdb session
INSTALL dodo FROM community;
LOAD dodo;
a 60-second tour
The example below loads a CSV, filters rows, computes a column, aggregates by group, and prints the result — all in .do syntax. No SELECT, no FROM, no joins.
Semicolons at the end of each line are a DuckDB REPL convention — the REPL needs them to know when a statement ends. Inside .do files, each line is one statement and no semicolons are needed.
Run the same script as a single command from the shell:
Every use materializes into dodo._current and starts a fresh CTE chain. Every transformation appends a step; nothing executes until you hit a terminal command like list, count, or summarize. See how it works.
Two ways to get the dodo DuckDB extension: from the community extension repository (recommended), or from source. Looking for the standalone compiler? See dodoc installation.
from the duckdb extension repo
This is the path you want for almost any normal use. Works in the DuckDB CLI, the DuckDB UI, and every client library (Python, R, Java, Node, Rust).
sql// one time, persists across sessions
INSTALL dodo FROM community;
LOAD dodo;
After loading, the do, use, generate, collapse and other commands are registered as DuckDB statements. There's no separate REPL — you stay in the DuckDB shell.
from source
Build from source if you want to track the latest commit, or you need to run dodo on an architecture that isn't yet on the extension repo.
dodo requires DuckDB ≥ 0.10.2. Older builds don't carry the extension entrypoint dodo uses to register its statements. Run SELECT version(); if you're unsure.
standalone compiler (dodoc)
If you only need to translate .do files to SQL — without running them in DuckDB — the dodoc standalone compiler is a single binary with no dependencies. See the dodoc installation page for download links and build instructions.
dodo doesn't execute commands one at a time. Each transformation appends a step to a lazy CTE chain. Nothing runs until you ask for a result.
two modes
The decision happens at use time:
Materialized (default).use "file.csv", clear reads the file once into dodo._current. Subsequent commands build a CTE chain on top of that table.
Lazy.use "file.csv", lazy skips materialization. The file is re-read every time a terminal command executes — useful when the file is huge and you only inspect a few rows.
the cte chain
Each transformation appends a step. Nothing executes until a terminal command. The example below uses four commands; only list triggers a DuckDB execution.
.do · sql// what dodo does under the hood
use"firms.csv", clear→CREATE TABLE dodo._current AS SELECT*FROMread_csv('firms.csv')
_s0 AS (SELECT*FROM dodo._current)
keepif year >=2020→ _s1 AS (SELECT*FROM _s0 WHERE year >=2020)
generate profit = rev - cost
→ _s2 AS (SELECT*, rev - cost AS profit FROM _s1)
list→WITH _s0 AS (...), _s1 AS (...), _s2 AS (...)
SELECT*FROM _s2; -- executes here
Inspect the current chain at any point with SELECT * FROM dodo._chain;. Each row is one step, with its source command and the SQL it appends.
duckdb ui integration
Set SET dodo_live_view = true; and dodo creates a _dodo_data view after each command. The DuckDB UI data panel auto-refreshes when this view changes, so your data updates as you type.
sql
SET dodo_live_view =true;
SELECT*FROM dodo._history; -- every command in this session
why a chain, not a graph
The chain is strictly linear. That's not a constraint — it's the point. .do scripts read as a sequence of mutations, and undo is just "pop the last step." A DAG would let you fork the lineage but you'd also lose the undo contract that makes interactive sessions pleasant.
Load data from a file or an existing DuckDB table. Replaces the current dataset and starts a new CTE chain.
syntax
usesource [, clear | lazy | table]
source is either a quoted file path (CSV / Parquet / JSON) or a bare identifier referring to an existing DuckDB table. Pass clear to drop the previous dataset before loading.
Keep specific columns, or keep rows matching a condition. The same verb does both — dodo picks the behavior from the argument shape.
syntax
keepvarlistkeepifconditionkeepvarlistifcondition
examples
.do
// columns onlykeep id revenue year;
// rows only — boolean expression after `if`keepif year >=2020;
keepif!missing(revenue) & sector =="finance";
// both at oncekeep id revenue if year >=2020;
sql translation
keep id revenue if year >= 2020 appends a CTE that selects only those columns, filtered by the predicate:
sql
_sN AS (
SELECT id, revenue
FROM _s(N-1)
WHERE year >=2020
)
notes
To do the opposite — drop columns or filter out rows — see drop.
Comparisons use SQL semantics: NULL propagates. Use missing(x) to test for nulls explicitly.
String literals are double-quoted. Single quotes work too but double is canonical in .do syntax.
-- generate profit = revenue - cost
_sN AS (SELECT*, revenue - cost AS profit FROM _s(N-1))
-- generate high_rev = revenue > 1000 if year >= 2020
_sN AS (
SELECT*,
CASE WHEN year >=2020THEN revenue >1000ELSE NULL ENDAS high_rev
FROM _s(N-1)
)
notes
The new column must not already exist. Use replace to overwrite.
Expressions support arithmetic, string, date, and the full Stata-style function set — see expression translation.
generate is row-wise. For aggregate or windowed computations (mean by group, row numbers), use egen.
// mean revenue, total revenue, row count per sector × yearcollapse (mean) avg_rev = revenue
(sum) total = revenue
(count) n = id,
by(sector year);
// implicit names — keeps the original column namecollapse (mean) revenue profit, by(sector);
sql translation
sql
_sN AS (
SELECT
sector, year,
AVG(revenue) AS avg_rev,
SUM(revenue) AS total,
COUNT(id) AS n
FROM _s(N-1)
GROUP BY sector, year
)
notes
After collapse, the only columns that survive are the by() group keys and the aggregated newvars. Use egen instead if you want to keep all original rows alongside a group statistic.
An empty by() collapses the entire dataset to a single row.
Stat names map 1:1 to DuckDB aggregates; p25/p75/median compile to QUANTILE_CONT.
// one-to-one by composite keymerge 1:1 id year using"other_data.csv";
// many-to-one lookup — keep only matched rowsmerge m:1 sector using"sector_names.csv", keep(match);
// pull a single column over, don't add _merge indicatormerge 1:1 id using"extra.csv", keepusing(new_var) nogenerate;
By default merge creates a _merge indicator column with values 1 (master only), 2 (using only), or 3 (matched). Pass nogenerate to suppress it.
sql translation
sql
-- merge 1:1 id year using "other.csv"
_sN AS (
SELECT m.*, u.*,
CASEWHEN m.id IS NULLTHEN2WHEN u.id IS NULLTHEN1ELSE3ENDAS _merge
FROM _s(N-1) m
FULL OUTER JOINread_csv('other.csv') u
USING (id, year)
)
notes
Cardinality is enforced. 1:1 raises if the join produces duplicates on either side. This is a feature; it catches bad merges before they corrupt downstream steps.
keep(match) is an INNER JOIN. keep(master) is a LEFT join. The default is FULL OUTER with the _merge indicator.
using accepts a file path, a table name, or another CTE — the same resolution rules as use.
cardinality is not a hint
Stata users sometimes treat 1:1 as documentation. In dodo it's a contract — if either side has duplicate keys, the merge fails loudly with the offending key printed. To allow duplicates, use m:1, 1:m, or m:m.
Every expression you write inside generate, replace, or keep if compiles to SQL under the hood. Here's the full mapping.
numeric & logical
.do syntax
sql equivalent
log(x)
LN(x)
abs(x)
ABS(x)
round(x, 2)
ROUND(x, 2)
missing(x)
x IS NULL
cond(a, b, c)
CASE WHEN a THEN b ELSE c END
inrange(x, 1, 10)
x BETWEEN 1 AND 10
inlist(x, 1, 2, 3)
x IN (1, 2, 3)
string
.do syntax
sql equivalent
substr(s, 1, 3)
SUBSTRING(s, 1, 3)
strlen(s)
LENGTH(s)
strlower(s)
LOWER(s)
strupper(s)
UPPER(s)
strtrim(s)
TRIM(s)
real(s)
CAST(s AS DOUBLE)
row position
.do syntax
sql equivalent
_n
ROW_NUMBER() OVER (...)
_N
COUNT(*) OVER (...)
x[_n-1]
LAG(x, 1) OVER (...) · positional, no gap check
time-series operators
These require tsset or xtset to declare the panel/time structure first. The L./F./D. family is gap-aware — it returns NULL if the previous period is missing instead of silently slipping to the row before.
.do syntax
sql equivalent
L.x
gap-aware LAG(x) OVER (...)
F.x
gap-aware LEAD(x) OVER (...)
D.x
x - L.x
positional vs. gap-aware
x[_n-1] is the previous row. L.x is the previous period. If your data has missing years (or your panel is unbalanced), these return different values. Use L. by default unless you specifically want the row-before semantics.
Run a .do file end-to-end. No semicolons needed inside the file — newlines separate statements.
syntax
dopath
examples
sql
do"analysis/clean.do";
list; // inspect results after the script runs
The script uses standard Stata-style comment markers:
.doclean.do
* this is a line comment// this is also a comment/* this is a
block comment */use"data.csv", clearkeepif year >=2020// inline commentgenerate profit =///
revenue - cost // line continuation with ///
notes
Terminal commands inside .do files are skipped — inspect results interactively after the script finishes.
Working directory follows the calling session, not the script's location. Use absolute paths if you need stability.
Errors halt execution and surface the offending line number. The CTE chain rolls back to its pre-do state.
Compile .do files to SQL without installing DuckDB. A single binary, no dependencies, designed for CI/CD pipelines and SQL preview workflows.
what is dodoc?
dodoc is a standalone CLI that reads .do files and outputs the equivalent SQL — the same translation the dodo DuckDB extension performs, but without needing DuckDB at all.
It shares the same core parser as the dodo extension, so every command the extension understands, dodoc understands too.
why use it?
No DuckDB dependency — useful in environments where you can't install DuckDB (locked-down CI runners, lightweight containers).
Preview SQL before running — pipe the output to a file, inspect it, then feed it to DuckDB (or another database) only when you're ready.
CI/CD integration — compile .do files as a build step, validate the SQL, commit the output.
Pipe-friendly — reads from stdin, writes to stdout, composes with Unix tools.
relation to the dodo extension
two tools, one parser
The dodo extension runs inside DuckDB and executes the translated SQL immediately. dodoc runs outside DuckDB and only produces SQL text. Both share the same parser and produce identical SQL for the same input.
Read from stdin or file, output SQL to stdout or file. Pipe into DuckDB or save for later.
stdin to stdout
Pipe .do commands directly:
bash// pipe from echo
› echo 'use "data.csv", clear
keep if year >= 2020
generate profit = revenue - cost' | dodoc
compile a file
Pass the .do file as an argument:
bash// file to stdout
› dodoc analysis/clean.do
Write to a file with -o:
bash// file to file
› dodoc analysis/clean.do -o analysis/clean.sql
annotated output
The --annotate flag adds the original .do command as a SQL comment above each translated statement, making the output easier to read and debug:
bash// annotated SQL output
› dodoc --annotate analysis/clean.do
piping to duckdb
The most common pattern: compile, then execute. Pipe dodoc output directly into DuckDB:
bash// compile and run in one step
› dodoc script.do | duckdb
dodoc vs dodo extension
When piping to DuckDB this way, DuckDB does not need the dodo extension installed — it receives plain SQL. The extension is only needed when you type .do commands directly in the DuckDB REPL.
all flags
Flag
Description
-o, --output FILE
Write SQL to FILE instead of stdout
--annotate
Emit original .do command as a SQL comment before each statement
--terminal
Also emit SQL for terminal commands (list, count, etc.)