dodo dodo / docs v0.1.1
docs / getting started
old scripts. new pond.
getting started
install dodo. load the extension. run your .do file.

dodo is a DuckDB extension that reads legacy .do workflows and executes them on DuckDB, preserving familiar data-cleaning scripts while moving execution to a modern analytical engine.

install in 30 seconds

From any DuckDB shell — desktop, CLI, Python, R — the install procedure is the same. dodo is distributed through the DuckDB community extension repository.

sql // in any duckdb session
INSTALL dodo FROM community;
LOAD dodo;

a 60-second tour

The example below loads a CSV, filters rows, computes a column, aggregates by group, and prints the result — all in .do syntax. No SELECT, no FROM, no joins.

.do analysis/firms.do
use "firms.csv", clear;
keep if year >= 2020;
generate profit = revenue - cost;
collapse (mean) avg_profit = profit, by(sector);
list;
about semicolons

Semicolons at the end of each line are a DuckDB REPL convention — the REPL needs them to know when a statement ends. Inside .do files, each line is one statement and no semicolons are needed.

Run the same script as a single command from the shell:

sql // run the .do file end-to-end
do "analysis/firms.do";
list;
// output ┌────────────┬──────────────┐ sector avg_profit ├────────────┼──────────────┤ finance 4,182.40 mfg 1,704.22 retail 902.81 services 2,318.55 └────────────┴──────────────┘ 4 rows · ran in 38ms · cte chain: _s0 → _s1 → _s2 → _s3

jump to a command

The reference is grouped by what the command does. The four you'll touch every day:

where state lives

Every use materializes into dodo._current and starts a fresh CTE chain. Every transformation appends a step; nothing executes until you hit a terminal command like list, count, or summarize. See how it works.

next steps

  • Run an existing .do file with the do command.
  • If you wrote SQL by hand before, skim expression translation to learn the cheat-sheet between missing(x) and x IS NULL.
  • Hook the DuckDB UI into a live view so every step shows up in the data panel.
  • Don't want to install DuckDB? Use dodoc to compile .do files to SQL as a standalone CLI — then pipe the output into any database.
edit this page on github → last updated · 2026-02-14
docs / installation

Installation

Two ways to get the dodo DuckDB extension: from the community extension repository (recommended), or from source. Looking for the standalone compiler? See dodoc installation.

from the duckdb extension repo

This is the path you want for almost any normal use. Works in the DuckDB CLI, the DuckDB UI, and every client library (Python, R, Java, Node, Rust).

sql // one time, persists across sessions
INSTALL dodo FROM community;
LOAD dodo;

After loading, the do, use, generate, collapse and other commands are registered as DuckDB statements. There's no separate REPL — you stay in the DuckDB shell.

from source

Build from source if you want to track the latest commit, or you need to run dodo on an architecture that isn't yet on the extension repo.

bash // requires git, cmake, a c++17 compiler
 git clone --recurse-submodules https://github.com/codedthinking/dodo.git
 cd dodo && make release
 ./build/release/duckdb

The release target produces a DuckDB binary with dodo statically linked. Start it and the extension is already loaded — no INSTALL/LOAD required.

verify it loaded

Drop dodo_version() into any session and the extension reports back.

sql // sanity check
SELECT dodo_version(), dodo_build();
┌───────────────┬──────────────────────────┐ dodo_version dodo_build ├───────────────┼──────────────────────────┤ 0.1.0 2026-02-12 / 7a3b1c2 └───────────────┴──────────────────────────┘
duckdb version

dodo requires DuckDB ≥ 0.10.2. Older builds don't carry the extension entrypoint dodo uses to register its statements. Run SELECT version(); if you're unsure.

standalone compiler (dodoc)

If you only need to translate .do files to SQL — without running them in DuckDB — the dodoc standalone compiler is a single binary with no dependencies. See the dodoc installation page for download links and build instructions.

docs / how it works

How it works

dodo doesn't execute commands one at a time. Each transformation appends a step to a lazy CTE chain. Nothing runs until you ask for a result.

two modes

The decision happens at use time:

  • Materialized (default). use "file.csv", clear reads the file once into dodo._current. Subsequent commands build a CTE chain on top of that table.
  • Lazy. use "file.csv", lazy skips materialization. The file is re-read every time a terminal command executes — useful when the file is huge and you only inspect a few rows.

the cte chain

Each transformation appends a step. Nothing executes until a terminal command. The example below uses four commands; only list triggers a DuckDB execution.

.do · sql // what dodo does under the hood
use "firms.csv", clear
     CREATE TABLE dodo._current AS SELECT * FROM read_csv('firms.csv')
       _s0 AS (SELECT * FROM dodo._current)

keep if year >= 2020
     _s1 AS (SELECT * FROM _s0 WHERE year >= 2020)

generate profit = rev - cost
     _s2 AS (SELECT *, rev - cost AS profit FROM _s1)

list
     WITH _s0 AS (...), _s1 AS (...), _s2 AS (...)
       SELECT * FROM _s2;   -- executes here

Inspect the current chain at any point with SELECT * FROM dodo._chain;. Each row is one step, with its source command and the SQL it appends.

duckdb ui integration

Set SET dodo_live_view = true; and dodo creates a _dodo_data view after each command. The DuckDB UI data panel auto-refreshes when this view changes, so your data updates as you type.

sql
SET dodo_live_view = true;
SELECT * FROM dodo._history;   -- every command in this session
why a chain, not a graph

The chain is strictly linear. That's not a constraint — it's the point. .do scripts read as a sequence of mutations, and undo is just "pop the last step." A DAG would let you fork the lineage but you'd also lose the undo contract that makes interactive sessions pleasant.

docs / commands / io / use

use

Load data from a file or an existing DuckDB table. Replaces the current dataset and starts a new CTE chain.

syntax

use source [, clear | lazy | table]

source is either a quoted file path (CSV / Parquet / JSON) or a bare identifier referring to an existing DuckDB table. Pass clear to drop the previous dataset before loading.

examples

.do // from files
use "data/firms.csv", clear;
use "data/firms.parquet", clear;
use "s3://bucket/firms.parquet", clear;
.do // from an existing duckdb table
use firms_2024, clear;
use main.firms_2024, clear;
.do // lazy mode — no materialization
use "data/firms.csv", lazy;
head 10;     // re-reads the file each time

sql translation

The default (materialized) use "firms.csv", clear compiles to:

sql
DROP TABLE IF EXISTS dodo._current;
CREATE TABLE dodo._current AS
  SELECT * FROM read_csv('firms.csv', header = true);

notes

  • clear is required if the current dataset isn't empty — dodo refuses to silently shadow existing data.
  • File formats are detected by extension. Override with explicit options if needed.
  • Lazy mode pairs well with head and count; it pairs poorly with iterative generate chains, since each terminal command re-reads the source.
not yet supported

Stata .dta files are not a use source. They require a separate reader; see known limitations for the workaround.

docs / commands / transform / keep

keep

Keep specific columns, or keep rows matching a condition. The same verb does both — dodo picks the behavior from the argument shape.

syntax

keep varlist keep if condition keep varlist if condition

examples

.do
// columns only
keep id revenue year;

// rows only — boolean expression after `if`
keep if year >= 2020;
keep if !missing(revenue) & sector == "finance";

// both at once
keep id revenue if year >= 2020;

sql translation

keep id revenue if year >= 2020 appends a CTE that selects only those columns, filtered by the predicate:

sql
_sN AS (
  SELECT id, revenue
  FROM _s(N-1)
  WHERE year >= 2020
)

notes

  • To do the opposite — drop columns or filter out rows — see drop.
  • Comparisons use SQL semantics: NULL propagates. Use missing(x) to test for nulls explicitly.
  • String literals are double-quoted. Single quotes work too but double is canonical in .do syntax.
docs / commands / transform / generate

generate

Create a new column from an expression. Optionally guarded by if — rows that fail the condition get NULL for the new column.

syntax

generate newvar = expression [if condition]

examples

.do
generate profit = revenue - cost;
generate ln_rev = log(revenue);
generate high_rev = revenue > 1000 if year >= 2020;
generate sector_label = cond(sector == "fin", "Finance", sector);

sql translation

sql
-- generate profit = revenue - cost
_sN AS (SELECT *, revenue - cost AS profit FROM _s(N-1))

-- generate high_rev = revenue > 1000 if year >= 2020
_sN AS (
  SELECT *,
    CASE WHEN year >= 2020
         THEN revenue > 1000 ELSE NULL END AS high_rev
  FROM _s(N-1)
)

notes

  • The new column must not already exist. Use replace to overwrite.
  • Expressions support arithmetic, string, date, and the full Stata-style function set — see expression translation.
  • generate is row-wise. For aggregate or windowed computations (mean by group, row numbers), use egen.
docs / commands / transform / collapse

collapse

Aggregate the dataset, reducing rows to one per group. Multiple aggregators in one call, optional by().

syntax

collapse (stat) newvar = var [(stat) newvar = var] ... [, by(groupvars)]

Supported stats: mean, sum, count, min, max, sd, median, p25, p75, first, last.

examples

.do
// mean revenue, total revenue, row count per sector × year
collapse (mean) avg_rev = revenue
         (sum)  total   = revenue
         (count) n      = id,
         by(sector year);

// implicit names — keeps the original column name
collapse (mean) revenue profit, by(sector);

sql translation

sql
_sN AS (
  SELECT
    sector, year,
    AVG(revenue) AS avg_rev,
    SUM(revenue) AS total,
    COUNT(id)    AS n
  FROM _s(N-1)
  GROUP BY sector, year
)

notes

  • After collapse, the only columns that survive are the by() group keys and the aggregated newvars. Use egen instead if you want to keep all original rows alongside a group statistic.
  • An empty by() collapses the entire dataset to a single row.
  • Stat names map 1:1 to DuckDB aggregates; p25/p75/median compile to QUANTILE_CONT.
docs / commands / transform / merge

merge

Join two datasets. The cardinality (1:1, m:1, 1:m, m:m) is declared explicitly so dodo can verify it and bail loudly if your assumption is wrong.

syntax

merge cardinality keys using source [, keep(match | master | using)] [, keepusing(vars)] [, nogenerate]

examples

.do
// one-to-one by composite key
merge 1:1 id year using "other_data.csv";

// many-to-one lookup — keep only matched rows
merge m:1 sector using "sector_names.csv", keep(match);

// pull a single column over, don't add _merge indicator
merge 1:1 id using "extra.csv", keepusing(new_var) nogenerate;

By default merge creates a _merge indicator column with values 1 (master only), 2 (using only), or 3 (matched). Pass nogenerate to suppress it.

sql translation

sql
-- merge 1:1 id year using "other.csv"
_sN AS (
  SELECT m.*, u.*,
    CASE
      WHEN m.id IS NULL THEN 2
      WHEN u.id IS NULL THEN 1
      ELSE 3
    END AS _merge
  FROM _s(N-1) m
  FULL OUTER JOIN read_csv('other.csv') u
    USING (id, year)
)

notes

  • Cardinality is enforced. 1:1 raises if the join produces duplicates on either side. This is a feature; it catches bad merges before they corrupt downstream steps.
  • keep(match) is an INNER JOIN. keep(master) is a LEFT join. The default is FULL OUTER with the _merge indicator.
  • using accepts a file path, a table name, or another CTE — the same resolution rules as use.
cardinality is not a hint

Stata users sometimes treat 1:1 as documentation. In dodo it's a contract — if either side has duplicate keys, the merge fails loudly with the offending key printed. To allow duplicates, use m:1, 1:m, or m:m.

docs / guides / expression translation

Expression translation

Every expression you write inside generate, replace, or keep if compiles to SQL under the hood. Here's the full mapping.

numeric & logical

.do syntaxsql equivalent
log(x)LN(x)
abs(x)ABS(x)
round(x, 2)ROUND(x, 2)
missing(x)x IS NULL
cond(a, b, c)CASE WHEN a THEN b ELSE c END
inrange(x, 1, 10)x BETWEEN 1 AND 10
inlist(x, 1, 2, 3)x IN (1, 2, 3)

string

.do syntaxsql equivalent
substr(s, 1, 3)SUBSTRING(s, 1, 3)
strlen(s)LENGTH(s)
strlower(s)LOWER(s)
strupper(s)UPPER(s)
strtrim(s)TRIM(s)
real(s)CAST(s AS DOUBLE)

row position

.do syntaxsql equivalent
_nROW_NUMBER() OVER (...)
_NCOUNT(*) OVER (...)
x[_n-1]LAG(x, 1) OVER (...) · positional, no gap check

time-series operators

These require tsset or xtset to declare the panel/time structure first. The L./F./D. family is gap-aware — it returns NULL if the previous period is missing instead of silently slipping to the row before.

.do syntaxsql equivalent
L.xgap-aware LAG(x) OVER (...)
F.xgap-aware LEAD(x) OVER (...)
D.xx - L.x
positional vs. gap-aware

x[_n-1] is the previous row. L.x is the previous period. If your data has missing years (or your panel is unbalanced), these return different values. Use L. by default unless you specifically want the row-before semantics.

docs / commands / io / save

save

Write the current data to disk or to an in-memory DuckDB table.

syntax

save target [, replace] [, table]

examples

.do
save "output.csv", replace;
save "output.parquet", replace;
save my_table, replace table;     // in-memory duckdb table

notes

  • Format is inferred from the extension. .csv, .parquet, .json, .tsv are supported.
  • The table option writes to an in-memory DuckDB table rather than disk. This is the only way to reach the data from straight SQL.
  • save is a terminal command — it executes the chain.

append

Stack another dataset below the current one. Columns are matched by name; missing columns become NULL.

syntax

append using source

examples

.do
append using "more_firms.csv";
append using firms_2025;       // existing duckdb table

sql translation

sql
_sN AS (
  SELECT * FROM _s(N-1)
  UNION ALL BY NAME
  SELECT * FROM read_csv('more_firms.csv')
)

do

Run a .do file end-to-end. No semicolons needed inside the file — newlines separate statements.

syntax

do path

examples

sql
do "analysis/clean.do";
list;    // inspect results after the script runs

The script uses standard Stata-style comment markers:

.doclean.do
* this is a line comment
// this is also a comment
/* this is a
   block comment */

use "data.csv", clear

keep if year >= 2020    // inline comment
generate profit = ///
    revenue - cost          // line continuation with ///

notes

  • Terminal commands inside .do files are skipped — inspect results interactively after the script finishes.
  • Working directory follows the calling session, not the script's location. Use absolute paths if you need stability.
  • Errors halt execution and surface the offending line number. The CTE chain rolls back to its pre-do state.

clear

Drop the current dataset and all associated state — the CTE chain, the history table, registered tempfiles, panel declarations.

syntax

clear [all]

examples

.do
clear;           // drop current dataset + chain
clear all;       // also drop tempfiles, labels, panel declarations
docs/import delimited

import delimited

Read a CSV file. Alias for use with a CSV path — exists for Stata-script compatibility.

.do
import delimited "data/survey.csv", clear;
docs/export delimited

export delimited

Write the current data to a CSV file. Alias for save with a CSV path.

.do
export delimited using "output.csv", replace;
docs/drop

drop

Drop columns by name, or drop rows matching a condition. Inverse of keep.

drop varlist
drop if condition
.do
drop temp_var debug_flag;
drop if missing(revenue);
docs/replace

replace

Overwrite an existing column's values. Unlike generate, the target column must already exist.

replace var = expression [if condition]
.do
replace revenue = 0 if missing(revenue);
replace name = "Unknown" if missing(name);
docs/rename

rename

Rename a column. Idempotent — renaming to the existing name is a no-op.

rename old_name new_name
.do
rename old_name new_name;
docs/sort

sort

Sort rows by one or more columns. Default ascending.

sort varlist [, desc]
.do
sort year;
sort revenue, desc;
sort sector year;     // composite, both ascending
docs/order

order

Reorder columns. Listed columns move to the front; everything else keeps its relative order.

order varlist
.do
order year id name;
docs/egen

egen

Create a column using a window or aggregate function — optionally grouped. Unlike collapse, the original rows survive.

egen newvar = fn(expr) [, by(groupvars)]
.do
egen mean_rev = mean(revenue), by(sector);
egen row_num = seq(), by(sector);
egen total = sum(revenue);

Compiles to a WINDOW clause with the appropriate PARTITION BY.

docs/mvencode

mvencode

Replace missing values across one or more columns with a fill value.

mvencode varlist, mv(fill)
.do
mvencode revenue profit, mv(0);
docs/reshape

reshape

Pivot the dataset between long and wide formats. i() is the identifier column; j() is the column whose values become wide columns (or vice versa).

reshape long | wide stubname, i(id) j(period)
.do
// wide → long
reshape long revenue, i(id) j(year);

// long → wide
reshape wide revenue, i(id) j(year);

Compiles to PIVOT / UNPIVOT.

docs/duplicates drop

duplicates drop

Remove duplicate rows. With no arguments, considers all columns; with a varlist, only those.

duplicates drop [varlist]
.do
duplicates drop;              // all columns
duplicates drop id year;      // by specific columns
docs/expand

expand

Replicate rows. Either a constant count or a per-row count column.

expand n | count_col [, generate(indicator)]
.do
expand 3;                          // triple every row
expand count, generate(copy);     // variable expansion + indicator
docs / dodoc / overview

dodoc standalone compiler

Compile .do files to SQL without installing DuckDB. A single binary, no dependencies, designed for CI/CD pipelines and SQL preview workflows.

what is dodoc?

dodoc is a standalone CLI that reads .do files and outputs the equivalent SQL — the same translation the dodo DuckDB extension performs, but without needing DuckDB at all.

It shares the same core parser as the dodo extension, so every command the extension understands, dodoc understands too.

why use it?

  • No DuckDB dependency — useful in environments where you can't install DuckDB (locked-down CI runners, lightweight containers).
  • Preview SQL before running — pipe the output to a file, inspect it, then feed it to DuckDB (or another database) only when you're ready.
  • CI/CD integration — compile .do files as a build step, validate the SQL, commit the output.
  • Pipe-friendly — reads from stdin, writes to stdout, composes with Unix tools.

relation to the dodo extension

two tools, one parser

The dodo extension runs inside DuckDB and executes the translated SQL immediately. dodoc runs outside DuckDB and only produces SQL text. Both share the same parser and produce identical SQL for the same input.

docs / dodoc / installation

Install dodoc

Download a pre-built binary for your platform, or build from source.

download pre-built binaries

Pre-built binaries are available on the GitHub Releases page for three platforms:

After downloading, extract and install:

bash // macOS / Linux
 tar xzf dodoc-macos-arm64.tar.gz
 sudo install dodoc /usr/local/bin/

build from source

Requires git, make, and a C++17 compiler.

bash // clone and build
 git clone --recurse-submodules https://github.com/codedthinking/dodo.git
 cd dodo
 make dodoc
 sudo make dodoc-install

This installs the dodoc binary to /usr/local/bin/.

verify

bash // check it works
 dodoc --help
docs / dodoc / usage

Using dodoc

Read from stdin or file, output SQL to stdout or file. Pipe into DuckDB or save for later.

stdin to stdout

Pipe .do commands directly:

bash // pipe from echo
 echo 'use "data.csv", clear
keep if year >= 2020
generate profit = revenue - cost' | dodoc

compile a file

Pass the .do file as an argument:

bash // file to stdout
 dodoc analysis/clean.do

Write to a file with -o:

bash // file to file
 dodoc analysis/clean.do -o analysis/clean.sql

annotated output

The --annotate flag adds the original .do command as a SQL comment above each translated statement, making the output easier to read and debug:

bash // annotated SQL output
 dodoc --annotate analysis/clean.do

piping to duckdb

The most common pattern: compile, then execute. Pipe dodoc output directly into DuckDB:

bash // compile and run in one step
 dodoc script.do | duckdb
dodoc vs dodo extension

When piping to DuckDB this way, DuckDB does not need the dodo extension installed — it receives plain SQL. The extension is only needed when you type .do commands directly in the DuckDB REPL.

all flags

FlagDescription
-o, --output FILEWrite SQL to FILE instead of stdout
--annotateEmit original .do command as a SQL comment before each statement
--terminalAlso emit SQL for terminal commands (list, count, etc.)
-h, --helpShow help message