docs / getting started

old scripts. new pond.

getting started

install dodo. load the extension. run your .do file.

dodo is a DuckDB extension that reads legacy .do workflows and executes them on DuckDB, preserving familiar data-cleaning scripts while moving execution to a modern analytical engine.

why dodo

Open source and free

MIT license. Runs anywhere DuckDB runs — laptop, server, cloud, embedded. No vendor lock-in, no license fees. The compiler and extension will always be open source.

Speed

Your pipeline runs in seconds, not minutes. Fast enough to rerun your entire analysis every time something changes — a continuous-integration mindset for data work.

Scale

Work with billions of rows in Parquet, connect to databases, and let DuckDB's optimizer do the heavy lifting. Lazy loading means intermediate steps are never computed unnecessarily.

Iterative and reproducible (coming soon)

Named checkpoints, unlimited undo/redo, near-real-time feedback as you revise code. The entire pipeline can be recorded and replayed for reproducibility.

install in 30 seconds

From any DuckDB shell — desktop, CLI, Python, R — the install procedure is the same. dodo is distributed through the DuckDB community extension repository.

sql // in any duckdb session

INSTALL dodo FROM community;
LOAD dodo;

a 60-second tour

The example below loads a CSV, filters rows, computes a column, aggregates by group, and prints the result — all in .do syntax. No SELECT, no FROM, no joins.

.do // analysis/firms.do

use "firms.csv", clear;
keep if year >= 2020;
generate profit = revenue - cost;
collapse (mean) avg_profit = profit, by(sector);
list;

about semicolons

Semicolons at the end of each line are a DuckDB REPL convention — the REPL needs them to know when a statement ends. Inside .do files, each line is one statement and no semicolons are needed.

Run the same script as a single command from the shell:

sql // run the .do file end-to-end

do "analysis/firms.do";
list;

› // output

┌────────────┬──────────────┐
│ sector     │ avg_profit   │
├────────────┼──────────────┤
│ finance    │     4,182.40 │
│ mfg        │     1,704.22 │
│ retail     │       902.81 │
│ services   │     2,318.55 │
└────────────┴──────────────┘
4 rows · ran in 38ms · cte chain: _s0 → _s1 → _s2 → _s3

jump to a command

The reference is grouped by what the command does. The four you'll touch every day:

use

Load a file or table into the current dataset.

keep / drop

Filter rows or select columns — both with a single verb.

generate

Create a new column from an expression, optionally guarded by if.

collapse

Aggregate by group — reduce rows to one per group.

where state lives

Every use materializes into dodo._current and starts a fresh CTE chain. Every transformation appends a step; nothing executes until you hit a terminal command like list, count, or summarize. See how it works.

next steps

Run an existing .do file with the do command.
If you wrote SQL by hand before, skim expression translation for the mapping between .do functions and their SQL equivalents, such as missing(x) and x IS NULL.
Hook the DuckDB UI into a live view so every step shows up in the data panel.
Don't want to install DuckDB? Use dodoc to compile .do files to SQL as a standalone CLI — then pipe the output into any database.

last updated · 2026-02-14

docs / installation

Installation

Two ways to get the dodo DuckDB extension: from the community extension repository (recommended), or from source. Looking for the standalone compiler? See dodoc installation.

from the duckdb extension repo

This is the path you want for almost any normal use. Works in the DuckDB CLI, the DuckDB UI, and every client library (Python, R, Java, Node, Rust).

sql // one time, persists across sessions

INSTALL dodo FROM community;
LOAD dodo;

After loading, the do, use, generate, collapse and other commands are registered as DuckDB statements. There's no separate REPL — you stay in the DuckDB shell.

from source

Build from source if you want to track the latest commit, or you need to run dodo on an architecture that isn't yet on the extension repo.

bash // requires git, cmake, a c++17 compiler

› git clone --recurse-submodules https://github.com/codedthinking/dodo.git
› cd dodo && make release
› ./build/release/duckdb

The release target produces a DuckDB binary with dodo statically linked. Start it and the extension is already loaded — no INSTALL/LOAD required.

verify it loaded

Drop dodo_version() into any session and the extension reports back.

sql // sanity check

SELECT dodo_version(), dodo_build();

┌───────────────┬──────────────────────────┐
│ dodo_version  │ dodo_build               │
├───────────────┼──────────────────────────┤
│ 0.1.0         │ 2026-02-12 / 7a3b1c2     │
└───────────────┴──────────────────────────┘

duckdb version

dodo requires DuckDB ≥ 0.10.2. Older builds don't carry the extension entrypoint dodo uses to register its statements. Run SELECT version(); if you're unsure.
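
A client script can check this requirement before it tries to load the extension. The snippet below is a minimal sketch, assuming the string returned by SELECT version(); looks like v0.10.2, optionally followed by a suffix such as -dev123. meets_minimum is a hypothetical helper, not part of dodo or DuckDB.

```python
def meets_minimum(version_string, minimum=(0, 10, 2)):
    """Compare a DuckDB version string such as 'v0.10.2' with a minimum."""
    # Drop the leading 'v' and anything after a '-' (assumed dev-build suffix).
    core = version_string.strip().lstrip("v").split("-")[0]
    parts = tuple(int(p) for p in core.split(".")[:3])
    return parts >= minimum


for candidate in ["v0.9.2", "v0.10.1", "v0.10.2", "v1.1.3"]:
    print(candidate, meets_minimum(candidate))
```

Comparing tuples of integers avoids the string-comparison trap where "0.9.2" sorts after "0.10.2".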

standalone compiler (dodoc)

If you only need to translate .do files to SQL — without running them in DuckDB — the dodoc standalone compiler is a single binary with no dependencies. See the dodoc installation page for download links and build instructions.

docs / how it works

How it works

dodo doesn't execute commands one at a time. Each transformation appends a step to a lazy CTE chain. Nothing runs until you ask for a result.

two modes

The decision happens at use time:

Materialized (default). use "file.csv", clear reads the file once into dodo._current. Subsequent commands build a CTE chain on top of that table.
Lazy. use "file.csv", lazy skips materialization. The file is re-read every time a terminal command executes — useful when the file is huge and you only inspect a few rows.

the cte chain

Each transformation appends a step, and the chain does not execute until a terminal command. The example below uses four commands: use materializes the file once, and of the remaining three only list runs the chain in DuckDB.

.do · sql // what dodo does under the hood

use "firms.csv", clear
   →  CREATE TABLE dodo._current AS SELECT * FROM read_csv('firms.csv')
       _s0 AS (SELECT * FROM dodo._current)

keep if year >= 2020
   →  _s1 AS (SELECT * FROM _s0 WHERE year >= 2020)

generate profit = revenue - cost
   →  _s2 AS (SELECT *, revenue - cost AS profit FROM _s1)

list
   →  WITH _s0 AS (...), _s1 AS (...), _s2 AS (...)
       SELECT * FROM _s2;   -- executes here

Inspect the current chain at any point with SELECT * FROM dodo._chain;. Each row is one step, with its source command and the SQL it appends.
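
To make the lazy-chain idea concrete, here is a toy Python model of the bookkeeping. It is a sketch, not dodo's implementation: the Chain class and its append and render methods are hypothetical names, and only the _sN step naming and dodo._current come from the listing above.

```python
from dataclasses import dataclass, field


@dataclass
class Chain:
    """Toy model of a lazy CTE chain: steps accumulate, nothing runs."""
    source_table: str = "dodo._current"
    steps: list = field(default_factory=list)  # (command, select_sql) pairs

    def __post_init__(self):
        # _s0 always reads the materialized table.
        self.steps.append(("use", f"SELECT * FROM {self.source_table}"))

    def append(self, command, select_template):
        # "{prev}" in the template is replaced by the previous step's name.
        prev = f"_s{len(self.steps) - 1}"
        self.steps.append((command, select_template.format(prev=prev)))

    def render(self):
        # Only a terminal command would call this and hand the SQL to DuckDB.
        ctes = ",\n     ".join(
            f"_s{i} AS ({sql})" for i, (_, sql) in enumerate(self.steps)
        )
        return f"WITH {ctes}\nSELECT * FROM _s{len(self.steps) - 1};"


chain = Chain()
chain.append("keep if year >= 2020", "SELECT * FROM {prev} WHERE year >= 2020")
chain.append("generate profit = revenue - cost",
             "SELECT *, revenue - cost AS profit FROM {prev}")
query = chain.render()
print(query)
```

Appending is cheap because it only stores text. The cost is paid once, when render() produces a single query that DuckDB's optimizer can plan as a whole.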

duckdb ui integration

Run SET dodo_live_view = true; and dodo creates a _dodo_data view after each command. The DuckDB UI data panel auto-refreshes when this view changes, so your data updates as you type.

sql

SET dodo_live_view = true;
SELECT * FROM dodo._history;   -- every command in this session

why a chain, not a graph

The chain is strictly linear. That's not a constraint — it's the point. .do scripts read as a sequence of mutations, and undo is just "pop the last step." A DAG would let you fork the lineage but you'd also lose the undo contract that makes interactive sessions pleasant.

docs / commands / io / use

use

Load data from a file or an existing DuckDB table. Replaces the current dataset and starts a new CTE chain.

syntax

use source [, clear | lazy | table]

source is either a quoted file path (CSV / Parquet / JSON / .dta) or a bare identifier referring to an existing DuckDB table. Pass clear to drop the previous dataset before loading. Variable labels from .dta files are automatically imported.

examples

.do // from files

use "data/firms.csv", clear;
use "data/firms.parquet", clear;
use "data/firms.dta", clear;
use "s3://bucket/firms.parquet", clear;

.do // from an existing duckdb table

use firms_2024, clear;
use main.firms_2024, clear;

.do // lazy mode — no materialization

use "data/firms.csv", lazy;
head 10;     // re-reads the file each time

sql translation

The default (materialized) use "firms.csv", clear compiles to:

sql

DROP TABLE IF EXISTS dodo._current;
CREATE TABLE dodo._current AS
  SELECT * FROM read_csv('firms.csv', header = true);

notes

clear is required if the current dataset isn't empty — dodo refuses to silently shadow existing data.
File formats are detected by extension. Override with explicit options if needed.
Lazy mode pairs well with head and count; it pairs poorly with iterative generate chains, since each terminal command re-reads the source.

native .dta support

Stata .dta files (formats 117–121, Stata 13–18) are read natively — no external extensions needed. Variable labels are automatically imported into dodo metadata.

docs / commands / transform / keep

keep

Keep specific columns, or keep rows matching a condition. The same verb does both — dodo picks the behavior from the argument shape.

syntax

keep varlist
keep if condition
keep varlist if condition

examples

.do

// columns only
keep id revenue year;

// rows only — boolean expression after `if`
keep if year >= 2020;
keep if !missing(revenue) & sector == "finance";

// both at once
keep id revenue if year >= 2020;

sql translation

keep id revenue if year >= 2020 appends a CTE that selects only those columns, filtered by the predicate:

sql

_sN AS (
  SELECT id, revenue
  FROM _s(N-1)
  WHERE year >= 2020
)

notes

To do the opposite — drop columns or filter out rows — see drop.
Comparisons use SQL semantics: NULL propagates. Use missing(x) to test for nulls explicitly.
String literals are double-quoted. Single quotes work too but double is canonical in .do syntax.
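
The NULL note matters most for scripts ported from Stata, where a numeric missing value compares as larger than any number, so keep if revenue > 1000 keeps rows with missing revenue. Under SQL semantics the comparison yields NULL and the row is filtered out. The Python sketch below models that behavior with None standing in for NULL. sql_gt and keep_if are hypothetical helpers, and the combined condition assumes dodo accepts Stata's | operator for OR, which this page does not show.

```python
def sql_gt(value, threshold):
    """SQL-style `value > threshold`: None (NULL) in, None out."""
    if value is None or threshold is None:
        return None
    return value > threshold


def keep_if(rows, predicate):
    """Model of WHERE: a row survives only when the predicate is exactly True."""
    return [row for row in rows if predicate(row) is True]


rows = [
    {"id": 1, "revenue": 1500},
    {"id": 2, "revenue": 800},
    {"id": 3, "revenue": None},  # missing revenue
]

# keep if revenue > 1000
kept = keep_if(rows, lambda r: sql_gt(r["revenue"], 1000))
kept_ids = [r["id"] for r in kept]

# keep if revenue > 1000 | missing(revenue)
kept_with_missing = keep_if(
    rows,
    lambda r: sql_gt(r["revenue"], 1000) is True or r["revenue"] is None,
)
kept_with_missing_ids = [r["id"] for r in kept_with_missing]

print(kept_ids)               # [1]
print(kept_with_missing_ids)  # [1, 3]
```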

docs / commands / transform / generate

generate

Create a new column from an expression. Optionally guarded by if — rows that fail the condition get NULL for the new column.

syntax

generate newvar = expression [if condition]

examples

.do

generate profit = revenue - cost;
generate ln_rev = log(revenue);
generate high_rev = revenue > 1000 if year >= 2020;
generate sector_label = cond(sector == "fin", "Finance", sector);

sql translation

sql

-- generate profit = revenue - cost
_sN AS (SELECT *, revenue - cost AS profit FROM _s(N-1))

-- generate high_rev = revenue > 1000 if year >= 2020
_sN AS (
  SELECT *,
    CASE WHEN year >= 2020
         THEN revenue > 1000 ELSE NULL END AS high_rev
  FROM _s(N-1)
)

notes

The new column must not already exist. Use replace to overwrite.
Expressions support arithmetic, string, date, and the full Stata-style function set — see expression translation.
generate is row-wise. For aggregate or windowed computations (mean by group, row numbers), use egen.

docs / commands / transform / collapse

collapse

Aggregate the dataset, reducing rows to one per group. A single call can carry several aggregators, and by() is optional.

syntax

collapse (stat) newvar = var [(stat) newvar = var] ... [, by(groupvars)]

Supported stats: mean, sum, count, min, max, sd, median, p25, p75, first, last.

examples

.do

// mean revenue, total revenue, row count per sector × year
collapse (mean) avg_rev = revenue
         (sum)  total   = revenue
         (count) n      = id,
         by(sector year);

// implicit names — keeps the original column name
collapse (mean) revenue profit, by(sector);

sql translation

sql

_sN AS (
  SELECT
    sector, year,
    AVG(revenue) AS avg_rev,
    SUM(revenue) AS total,
    COUNT(id)    AS n
  FROM _s(N-1)
  GROUP BY sector, year
)

notes

After collapse, the only columns that survive are the by() group keys and the aggregated newvars. Use egen instead if you want to keep all original rows alongside a group statistic.
An empty by() collapses the entire dataset to a single row.
Stat names map 1:1 to DuckDB aggregates; p25/p75/median compile to QUANTILE_CONT.
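
The mapping in the last note can be pictured as a lookup table. The Python sketch below renders the SELECT for one collapse step. It is illustrative only: collapse_sql and STAT_SQL are hypothetical names, and apart from the QUANTILE_CONT entries (stated above) and AVG, SUM, and COUNT (shown in the translation), the aggregate names are assumptions about which DuckDB function each stat uses.

```python
# Hypothetical stat table. Only AVG/SUM/COUNT and the QUANTILE_CONT entries
# are taken from this page; the remaining DuckDB aggregates are assumed.
STAT_SQL = {
    "mean": "AVG({col})",
    "sum": "SUM({col})",
    "count": "COUNT({col})",
    "min": "MIN({col})",
    "max": "MAX({col})",
    "sd": "STDDEV_SAMP({col})",
    "median": "QUANTILE_CONT({col}, 0.5)",
    "p25": "QUANTILE_CONT({col}, 0.25)",
    "p75": "QUANTILE_CONT({col}, 0.75)",
    "first": "FIRST({col})",
    "last": "LAST({col})",
}


def collapse_sql(aggs, by, prev="_s0"):
    """Render the SELECT for one collapse step.

    aggs: list of (stat, newvar, var) triples; by: list of group columns.
    """
    select_items = list(by)
    for stat, newvar, var in aggs:
        if stat not in STAT_SQL:
            raise ValueError(f"unsupported stat: {stat}")
        select_items.append(f"{STAT_SQL[stat].format(col=var)} AS {newvar}")
    sql = f"SELECT {', '.join(select_items)} FROM {prev}"
    if by:  # an empty by() collapses to a single row, so no GROUP BY
        sql += f" GROUP BY {', '.join(by)}"
    return sql


grouped = collapse_sql(
    [("mean", "avg_rev", "revenue"), ("p75", "rev_p75", "revenue")],
    by=["sector", "year"],
    prev="_s2",
)
overall = collapse_sql([("count", "n", "id")], by=[])
print(grouped)
print(overall)
```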

docs / commands / transform / merge

merge

Join two datasets. The cardinality (1:1, m:1, 1:m, m:m) is declared explicitly so dodo can verify it and bail loudly if your assumption is wrong.

syntax

merge cardinality keys using source [, keep(match | master | using) keepusing(vars) nogenerate]

examples

.do

// one-to-one by composite key
merge 1:1 id year using "other_data.csv";

// many-to-one lookup — keep only matched rows
merge m:1 sector using "sector_names.csv", keep(match);

// pull a single column over, don't add _merge indicator
merge 1:1 id using "extra.csv", keepusing(new_var) nogenerate;

By default merge creates a _merge indicator column with values 1 (master only), 2 (using only), or 3 (matched). Pass nogenerate to suppress it.

sql translation

sql

-- merge 1:1 id year using "other.csv"
_sN AS (
  SELECT m.*, u.*,
    CASE
      WHEN m.id IS NULL THEN 2
      WHEN u.id IS NULL THEN 1
      ELSE 3
    END AS _merge
  FROM _s(N-1) m
  FULL OUTER JOIN read_csv('other.csv') u
    USING (id, year)
)

notes

Cardinality is enforced. 1:1 raises if the join produces duplicates on either side. This is a feature; it catches bad merges before they corrupt downstream steps.
keep(match) is an INNER JOIN. keep(master) is a LEFT join. The default is FULL OUTER with the _merge indicator.
using accepts a file path, a table name, or another CTE — the same resolution rules as use.

cardinality is not a hint

Stata users sometimes treat 1:1 as documentation. In dodo it's a contract — if either side has duplicate keys, the merge fails loudly with the offending key printed. To allow duplicates, use m:1, 1:m, or m:m.
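
The contract can be sketched in a few lines of Python: check both sides for duplicate keys first, then join and tag each row with the _merge indicator. This is a toy model over lists of dicts, not dodo's code. merge_1to1 and check_unique are hypothetical names, the error message format is made up, and unmatched rows simply lack the other side's columns rather than carrying NULLs.

```python
from collections import Counter


def check_unique(rows, keys, side):
    """Raise if any key tuple appears more than once on this side."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    dupes = [key for key, n in counts.items() if n > 1]
    if dupes:
        raise ValueError(f"merge 1:1 failed: duplicate key {dupes[0]} in {side}")


def merge_1to1(master, using, keys):
    """Full outer join on keys with a _merge indicator (1, 2, or 3)."""
    check_unique(master, keys, "master")
    check_unique(using, keys, "using")
    using_by_key = {tuple(r[k] for k in keys): r for r in using}
    seen = set()
    out = []
    for row in master:
        key = tuple(row[k] for k in keys)
        match = using_by_key.get(key)
        seen.add(key)
        if match is None:
            out.append({**row, "_merge": 1})           # master only
        else:
            out.append({**row, **match, "_merge": 3})  # matched
    for key, row in using_by_key.items():
        if key not in seen:
            out.append({**row, "_merge": 2})           # using only
    return out


master = [{"id": 1, "year": 2020, "revenue": 10},
          {"id": 2, "year": 2020, "revenue": 20}]
using = [{"id": 2, "year": 2020, "employees": 5},
         {"id": 3, "year": 2020, "employees": 7}]

merged = merge_1to1(master, using, ["id", "year"])
indicators = [row["_merge"] for row in merged]

try:
    merge_1to1(master + [master[0]], using, ["id", "year"])
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)

print(indicators)       # [1, 3, 2]
print(duplicate_error)  # merge 1:1 failed: duplicate key (1, 2020) in master
```

Running the uniqueness check before the join is what turns the cardinality into a contract: a bad assumption fails at the merge step instead of silently multiplying rows downstream.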

docs / guides / expression translation

Expression translation

Every expression you write inside generate, replace, or keep if compiles to SQL under the hood. Here's the full mapping.

numeric & logical

.do syntax	sql equivalent
`log(x)`	`LN(x)`
`abs(x)`	`ABS(x)`
`round(x, 2)`	`ROUND(x, 2)`
`missing(x)`	`x IS NULL`
`cond(a, b, c)`	`CASE WHEN a THEN b ELSE c END`
`inrange(x, 1, 10)`	`x BETWEEN 1 AND 10`
`inlist(x, 1, 2, 3)`	`x IN (1, 2, 3)`
`!expr`	`NOT expr`
`.` (bare dot)	`NULL` (missing value)
`missing(x, y, z)`	`(x IS NULL OR y IS NULL OR z IS NULL)`
`int(x)`	`CAST(x AS INTEGER)`

string

.do syntax	sql equivalent
`substr(s, 1, 3)`	`SUBSTRING(s, 1, 3)`
`strlen(s)`	`LENGTH(s)`
`strlower(s)`	`LOWER(s)`
`strupper(s)`	`UPPER(s)`
`strtrim(s)`	`TRIM(s)`
`real(s)`	`CAST(s AS DOUBLE)`
`substr(s, k, .)`	`SUBSTRING(s, k)` (to end of string)

row position

.do syntax	sql equivalent
`_n`	`ROW_NUMBER() OVER (...)`
`_N`	`COUNT(*) OVER (...)`
`x[_n-1]`	`LAG(x, 1) OVER (...)` · positional, no gap check

time-series operators

These require tsset or xtset to declare the panel/time structure first. The L./F./D. family is gap-aware — it returns NULL if the previous period is missing instead of silently slipping to the row before.

.do syntax	sql equivalent
`L.x`	gap-aware `LAG(x) OVER (...)`
`F.x`	gap-aware `LEAD(x) OVER (...)`
`D.x`	`x - L.x`

positional vs. gap-aware

x[_n-1] is the previous row. L.x is the previous period. If your data has missing years (or your panel is unbalanced), these return different values. Use L. by default unless you specifically want the row-before semantics.
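
A small worked example shows where the two diverge. The Python sketch below models a single firm whose data skips 2021. positional_lag and gap_aware_lag are hypothetical helpers that mimic the semantics described above, not the SQL dodo generates.

```python
def positional_lag(rows, var):
    """x[_n-1]: value from the previous row, whatever its period."""
    return [None if i == 0 else rows[i - 1][var] for i in range(len(rows))]


def gap_aware_lag(rows, var, timevar):
    """L.x: value from period t-1, or None when that period is absent."""
    by_period = {row[timevar]: row[var] for row in rows}
    return [by_period.get(row[timevar] - 1) for row in rows]


# One firm, sorted by year, with 2021 missing from the data.
rows = [
    {"year": 2019, "revenue": 100},
    {"year": 2020, "revenue": 110},
    {"year": 2022, "revenue": 130},
]

positional = positional_lag(rows, "revenue")
gap_aware = gap_aware_lag(rows, "revenue", "year")
print(positional)  # [None, 100, 110]
print(gap_aware)   # [None, 100, None]
```

For 2022 the positional lag returns the 2020 value, which would turn a two-year change into an apparent one-year change. The gap-aware lag returns None instead.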

docs / guides / running .do files

Running .do files

.do files use newlines instead of semicolons. Each line is one command. All standard comment styles work.

comments and continuation

.do // analysis/clean.do

* This is a line comment
// This is also a comment
/* This is a
   block comment */

use "data.csv", clear

keep if year >= 2020   // inline comment
generate profit = ///
    revenue - cost     // line continuation with ///
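
The comment and continuation rules above amount to a small preprocessing pass that turns raw file text into one command per entry. The Python sketch below is one way to write such a pass. It is not dodo's parser: preprocess and split_code are hypothetical names, and the block-comment step deliberately ignores quoting to stay short.

```python
import re


def split_code(line):
    """Return (code, continued): drop a // comment, detect a /// continuation.

    Slashes inside double-quoted strings (e.g. "s3://bucket") are left alone.
    """
    in_string = False
    for i, ch in enumerate(line):
        if ch == '"':
            in_string = not in_string
        elif not in_string and line.startswith("///", i):
            return line[:i].rstrip(), True
        elif not in_string and line.startswith("//", i):
            return line[:i].rstrip(), False
    return line.rstrip(), False


def preprocess(text):
    """Turn raw .do text into a list of one-line commands."""
    # Remove /* ... */ block comments (this sketch ignores quotes here).
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    commands, pending = [], ""
    for raw in text.splitlines():
        line = raw.strip()
        # A leading * is a comment only at the start of a command.
        if not pending and line.startswith("*"):
            continue
        code, continued = split_code(line)
        pending = f"{pending} {code}".strip()
        if continued:
            continue  # keep accumulating until a line without ///
        if pending:
            commands.append(pending)
        pending = ""
    if pending:
        commands.append(pending)
    return commands


SAMPLE = '''* This is a line comment
// This is also a comment
/* This is a
   block comment */

use "s3://bucket/firms.parquet", clear

keep if year >= 2020   // inline comment
generate profit = ///
    revenue - cost     // line continuation with ///
'''

commands = preprocess(SAMPLE)
for command in commands:
    print(command)
```

The quote tracking in split_code is the detail worth copying: without it, the // inside an s3:// path would be read as the start of a comment.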

running a script

From any DuckDB session:

sql

do "analysis/clean.do";
list;    -- inspect results after the script runs

terminal commands are skipped

Inside .do files, terminal commands like list, count, summarize, and tabulate are skipped. They are meant for interactive use — run them after the script finishes. Transformation and side-effect commands (keep, generate, save, export) execute normally.

standalone compilation

The dodoc compiler translates .do files to SQL without DuckDB. Pipe from stdin or pass a file:

bash

dodoc analysis/clean.do > clean.sql
cat analysis/clean.do | dodoc -o clean.sql

docs / guides / DuckDB UI integration

DuckDB UI integration

dodo can push live data into the DuckDB UI so you see results update as you type commands.

live view

Enable the live view and dodo creates a _dodo_data view after each transformation. The DuckDB UI data panel auto-refreshes when this view changes.

sql

SET dodo_live_view = true;
use "firms.csv", clear;
keep if year >= 2020;   -- _dodo_data updates automatically

history table

When data is materialized, dodo._history records every command and its undo status. Query it directly or view it in the UI sidebar.

sql

SELECT * FROM dodo._history;

inspecting generated SQL

Use show sql to see the full formatted query with -- [source] comments. Control formatting with SET dodo_format_sql and SET dodo_sql_comments.

docs / guides / migration guide

Migration guide

Moving an existing .do workflow to dodo. Most scripts work with minor adjustments.

what works

The core data-manipulation verbs — use, keep, drop, generate, replace, rename, sort, egen, collapse, reshape, merge, append — use the same syntax. Expression functions like log(), missing(), cond(), inrange(), and time-series operators (L., F., D.) translate directly. See expression translation for the full mapping.

semicolons

In the DuckDB REPL, every statement needs a trailing semicolon. Inside .do files, semicolons are optional — newlines delimit commands, just like the original.

data formats

dodo reads CSV, Parquet, JSON, and .dta files natively. The built-in .dta reader supports formats 117–121 (Stata 13–18), including variable labels, value labels, strL strings, and all numeric types. No external extensions required.

programming constructs

dodo supports local, global, and scalar macros, as well as foreach and forvalues loops, display, and assert. See the programming section for details.

Key difference: in the original, macro substitution is purely textual. In dodo, literal assignments work the same way, but expression assignments involving runtime values (like _N) use DuckDB's SET VARIABLE mechanism instead. Loop variables are scoped to the loop body and destroyed when the loop ends, unlike the original which keeps the last value.

Not supported: program define, if/else control flow (the command-level branching, not the if qualifier), mata matrix operations, and r()/e() stored results. The levelsof ... , local() option is also not available.

keyword conflicts

describe and summarize are also SQL keywords. dodo handles the conflict automatically when data is loaded — the commands route to dodo, not to DuckDB's DESCRIBE/SUMMARIZE. Use codebook as an unambiguous alias for describe if you prefer.

docs / guides / known limitations

Known limitations

dodo covers the most common data-manipulation commands. Here is what it does not do yet.

programming constructs

local, global, scalar, foreach, forvalues, display, and assert are all supported. See the programming section.

Not supported: program define (user-defined programs), if/else control flow (command-level branching), mata (matrix operations), r()/e() stored results from estimation commands, and the levelsof ... , local() option. These have no SQL equivalent. Use a host language for scripting and dodo for data steps.

.dta files

Native .dta support covers formats 117–121 (Stata 13–18). Format 115 (Stata 12) and older formats are not supported. All 27 missing value codes (. and the extended codes .a–.z) are mapped to a single NULL.

mvencode _all

mvencode _all, mv(0) is not yet supported. List column names explicitly instead.

reshape wide

reshape wide currently supports one value variable. Multiple value variables in a single reshape wide are not yet implemented. As a workaround, reshape one variable at a time and merge.

keyword conflicts

describe and summarize are SQL keywords. dodo intercepts them when data is loaded, but if no dataset is in memory, they fall through to DuckDB's native behavior. Use codebook instead of describe for an unambiguous alternative.

docs / commands / io / save

save

Write the current data to disk or to an in-memory DuckDB table.

syntax

save target [, replace table]

examples

.do

save "output.csv", replace;
save "output.parquet", replace;
save "output.dta", replace;        // native .dta writer
save my_table, replace table;     // in-memory duckdb table

notes

Format is inferred from the extension. .csv, .parquet, .json, .tsv, and .dta are supported. Variable labels are preserved in .dta output.
The table option writes to an in-memory DuckDB table rather than disk. This is the only way to reach the data from straight SQL.
save executes the chain, like a terminal command. Unlike display-only terminal commands such as list, it still runs inside .do files.

docs/commands/io/append

append

Stack another dataset below the current one. Columns are matched by name; missing columns become NULL.

syntax

append using source

examples

.do

append using "more_firms.csv";
append using firms_2025;       // existing duckdb table

sql translation

sql

_sN AS (
  SELECT * FROM _s(N-1)
  UNION ALL BY NAME
  SELECT * FROM read_csv('more_firms.csv')
)

docs/commands/io/do

do

Run a .do file end-to-end. No semicolons needed inside the file — newlines separate statements.

syntax

do path

examples

sql

do "analysis/clean.do";
list;    -- inspect results after the script runs

The script uses standard Stata-style comment markers:

.do // clean.do

* this is a line comment
// this is also a comment
/* this is a
   block comment */

use "data.csv", clear

keep if year >= 2020    // inline comment
generate profit = ///
    revenue - cost          // line continuation with ///

notes

Terminal commands inside .do files are skipped — inspect results interactively after the script finishes.
Working directory follows the calling session, not the script's location. Use absolute paths if you need stability.
Errors halt execution and surface the offending line number. The CTE chain rolls back to its pre-do state.

docs/commands/io/clear

clear

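Drop the current dataset and associated state: the CTE chain, the history table, registered tempfiles, panel declarations. Plain clear drops only the dataset and its chain; clear all also drops tempfiles, labels, and panel declarations.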

syntax

clear [all]

examples

.do

clear;           // drop current dataset + chain
clear all;       // also drop tempfiles, labels, panel declarations

docs/import delimited

import delimited

Read a CSV file. Alias for use with a CSV path — exists for Stata-script compatibility.

.do

import delimited "data/survey.csv", clear;

docs/export delimited

export delimited

Write the current data to a CSV file. Alias for save with a CSV path.

.do

export delimited using "output.csv", replace;

docs/drop

drop

Drop columns by name, or drop rows matching a condition. Inverse of keep.

drop varlist
drop if condition

.do

drop temp_var debug_flag;
drop if missing(revenue);

docs/replace

replace

Overwrite an existing column's values. Unlike generate, the target column must already exist.

replace var = expression [if condition]

.do

replace revenue = 0 if missing(revenue);
replace name = "Unknown" if missing(name);

docs/rename

rename

Rename a column. Idempotent — renaming to the existing name is a no-op.

rename old_name new_name

.do

rename old_name new_name;

docs/sort

sort

Sort rows by one or more columns. Default ascending.

sort varlist [, desc]

.do

sort year;
sort revenue, desc;
sort sector year;     // composite, both ascending

docs/order

order

Reorder columns. Listed columns move to the front; everything else keeps its relative order.

order varlist

.do

order year id name;

docs/egen

egen

Create a column using a window or aggregate function — optionally grouped. Unlike collapse, the original rows survive.

egen newvar = fn(expr) [, by(groupvars)]

.do

egen mean_rev = mean(revenue), by(sector);
egen row_num = seq(), by(sector);
egen total = sum(revenue);

Compiles to a window function whose OVER clause carries the appropriate PARTITION BY.

docs/mvencode

mvencode

Replace missing values across one or more columns with a fill value.

mvencode varlist, mv(fill)

.do

mvencode revenue profit, mv(0);

docs/reshape

reshape

Pivot the dataset between long and wide formats. i() is the identifier column; j() is the column whose values become wide columns (or vice versa).

reshape long | wide stubname, i(id) j(period)

.do

// wide → long
reshape long revenue, i(id) j(year);

// long → wide
reshape wide revenue, i(id) j(year);

Compiles to PIVOT / UNPIVOT.

docs/duplicates drop

duplicates drop

Remove duplicate rows. With no arguments, considers all columns; with a varlist, only those.

duplicates drop [varlist]

.do

duplicates drop;              // all columns
duplicates drop id year;      // by specific columns

docs/expand

expand

Replicate rows. Either a constant count or a per-row count column.

expand n | count_col [, generate(indicator)]

.do

expand 3;                          // triple every row
expand count, generate(copy);     // variable expansion + indicator

docs/list

list

Show data. With no arguments, displays all rows and columns.

list [varlist] [if condition]

.do

list;
list id revenue if year >= 2020;

docs/count

count

Count the number of rows, optionally restricted by a condition.

count [if condition]

.do

count;
count if revenue > 1000;

docs/head / tail

head / tail

Show the first or last N rows. Defaults to 10 if N is omitted.

head [n]
tail [n]

.do

head 10;
tail 5;

docs/describe

describe

Show column names and types. Alias: codebook.

describe
codebook

.do

describe;
codebook;

docs/summarize

summarize

Compute summary statistics: N, mean, sd, min, p25, p50, p75, max.

summarize varname [if condition]

.do

summarize revenue;
summarize revenue if year >= 2020;

docs/tabulate

tabulate

Frequency table for one variable, or cross-tabulation for two.

tabulate var1 [var2] [if condition]

.do

tabulate sector;
tabulate sector year;

docs/history

history

Show the list of commands executed in the current session. Also queryable as SELECT * FROM dodo._history.

history

.do

history;

docs/undo / redo

undo / redo

Roll back or re-apply transformations. A new transformation after undo clears the redo stack.

undo [n]
redo [n]

.do

use "firms.csv", clear;
keep if year >= 2020;
generate profit = revenue - cost;
undo;        // removes generate
redo;        // restores generate
undo 2;      // rolls back two steps
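
Because the chain is linear, undo and redo reduce to two stacks. The Python sketch below models the rule that a new transformation after undo clears the redo stack. History is a hypothetical class for illustration and leaves out the initial use step; it says nothing about how dodo stores its history table.

```python
class History:
    """Toy undo/redo model over a linear list of steps."""

    def __init__(self):
        self.steps = []       # applied transformations, oldest first
        self.redo_stack = []  # undone transformations, most recent last

    def apply(self, command):
        self.steps.append(command)
        self.redo_stack.clear()  # a new step invalidates anything undone

    def undo(self, n=1):
        for _ in range(min(n, len(self.steps))):
            self.redo_stack.append(self.steps.pop())

    def redo(self, n=1):
        for _ in range(min(n, len(self.redo_stack))):
            self.steps.append(self.redo_stack.pop())


h = History()
h.apply("keep if year >= 2020")
h.apply("generate profit = revenue - cost")
h.undo()                      # removes generate
after_undo = list(h.steps)
h.redo()                      # restores generate
h.undo(2)                     # rolls back two steps
after_undo_2 = list(h.steps)
h.apply("drop if missing(revenue)")
h.redo()                      # nothing to redo: the stack was cleared
final_steps = list(h.steps)

print(after_undo)    # ['keep if year >= 2020']
print(after_undo_2)  # []
print(final_steps)   # ['drop if missing(revenue)']
```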

docs/preserve / restore

preserve / restore

Save a checkpoint of the current dataset and restore it later.

preserve
restore

.do

preserve;
drop if revenue < 100;
list;
restore;
list;

docs/tempfile

tempfile

Register a name for temporary in-memory storage. Use with save and use to stash and retrieve intermediate datasets.

tempfile name

.do

tempfile cleaned;
save cleaned, replace;
// ... other work ...
use cleaned, clear;

docs/xtset / tsset

xtset / tsset

Declare panel or time-series structure to enable L., F., and D. lag/lead/difference operators.

xtset panelvar timevar
tsset timevar

.do

xtset firm_id year;
generate lag_rev = L.revenue;
generate diff = D.revenue;

docs/label

label

Attach metadata to columns and values.

label variable varname "text"
label define name val "text" ...
label values var labelname

.do

label variable revenue "Annual revenue (USD)";
label define sector_lbl 1 "Manufacturing" 2 "Services";
label values sector sector_lbl;

docs/bysort / by

bysort / by

Run a command within groups defined by one or more variables.

bysort vars [(sort_vars)]: command
by vars: command

.do

bysort sector (year): generate row = _n;
bysort firm_id (year): generate cum_rev = sum(revenue);

docs/show sql

show sql

Display the generated SQL for the current CTE chain. The output is formatted with indentation and -- [source] comments mapping each step back to the original command.

show sql

.do

use "firms.csv", clear;
keep if year >= 2020;
generate profit = revenue - cost;
show sql;

Control formatting with SET dodo_format_sql = true|false and comments with SET dodo_sql_comments = true|false.

docs/commands/programming/local

local

Define a local macro that can be substituted into later commands with `name' syntax.

syntax

local name value ...
local name = expr
local ++name
local --name

Without the = sign, the rest of the line is stored as literal text. With =, the right-hand side is evaluated as an expression.

expansion

Use `name' (backtick + name + single-quote) to substitute a local macro. The substitution happens at compile time before the command is parsed.

.do

* literal assignment — stores text
local outcome revenue
generate log_y = log(`outcome')

* expression assignment — evaluates the expression
local threshold = 100 * 1.5
keep if revenue > `threshold'

* increment / decrement
local counter = 0
local ++counter
* counter is now 1

identifier vs. value position

A macro in identifier position (column name, command name) is spliced directly into the command text. A macro in value position (inside an expression after =) is substituted as a literal value.

.do

* identifier position — `var' becomes a column name
local var wage
summarize `var'

* value position — `cutoff' becomes a number in the expression
local cutoff = 50000
keep if `var' > `cutoff'

how variable substitution works in dodo

Literal assignments (local x hello) store text and substitute it directly into commands — same behavior as the original.

Expression assignments (local x = 5) try compile-time evaluation first. If the expression resolves to a literal (e.g., 100 * 1.5), the result is stored and substituted the same way as a literal assignment.

Runtime references: when an expression contains values that can only be known at query time (like _N), dodo uses DuckDB's SET VARIABLE and getvariable() mechanism. The value is evaluated by DuckDB at runtime, not by the compiler.

Loop scoping: loop variables (the index in foreach / forvalues) are scoped to the loop body and destroyed when the loop ends. In the original, the loop variable keeps the last value after the loop finishes.

Namespace independence: the same name can exist in the local, global, and scalar namespaces independently. They do not shadow each other.
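
Compile-time substitution of literal macros is plain text replacement, which a short Python sketch can show. expand_macros is a hypothetical helper covering only `name' locals and $name / ${name} globals; it does not model the runtime SET VARIABLE path, and it raises KeyError on an undefined name, which is a choice made for this sketch rather than documented dodo behavior.

```python
import re


def expand_macros(command, local_macros, global_macros):
    """Textually substitute `name' locals and $name / ${name} globals."""

    def local_sub(match):
        return str(local_macros[match.group(1)])

    def global_sub(match):
        # Group 1 is the ${name} form, group 2 the bare $name form.
        name = match.group(1) or match.group(2)
        return str(global_macros[name])

    command = re.sub(r"`(\w+)'", local_sub, command)
    command = re.sub(r"\$\{(\w+)\}|\$(\w+)", global_sub, command)
    return command


local_macros = {"outcome": "revenue", "var": "wage", "cutoff": 50000}
global_macros = {"base_year": 2015, "var": "hours"}

expanded = [
    expand_macros("generate log_y = log(`outcome')", local_macros, global_macros),
    expand_macros("keep if `var' > `cutoff'", local_macros, global_macros),
    expand_macros("keep if year >= ${base_year}", local_macros, global_macros),
    # Same name in both namespaces: each syntax reads its own table.
    expand_macros("keep `var' $var", local_macros, global_macros),
]
for line in expanded:
    print(line)
```

Keeping the two dictionaries separate is what gives the namespace independence described above: `var' and $var never collide.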

docs/commands/programming/global

global

Define a global macro, accessible with $name or ${name} syntax.

syntax

global name value ...
global name = expr

Without =, the rest of the line is stored as literal text. With =, the expression is evaluated. Expansion uses $name or ${name} (the braces form avoids ambiguity when the macro name is followed by other text).

examples

.do

global controls age education experience
keep id $controls   // expands to: keep id age education experience

global base_year = 2015
keep if year >= ${base_year}

locals vs. globals

In dodo, both local and global macros are resolved at compile time. The distinction is syntactic: locals use `name', globals use $name. The same substitution and scoping rules from the local page apply.

docs/commands/programming/scalar

scalar

Store a named value — text, integer, or float — that can be referenced by its bare name in expressions.

syntax

scalar [define] name = expr
scalar list
scalar drop name

examples

.do

scalar define cutoff = 1000
keep if revenue > cutoff

scalar pi = 3.14159
generate area = pi * radius^2

scalar list       // show all scalars
scalar drop cutoff

notes

Scalars are referenced by their bare name in expressions — no backticks or dollar signs.
The define keyword is optional; scalar x = 5 and scalar define x = 5 are equivalent.
Scalars can hold text strings, integers, or floating-point values.
Scalars live in their own namespace, independent of local and global macros.

docs/commands/programming/foreach

foreach

Loop over a list, repeating the body once per element. The loop variable is substituted with `var' syntax.

syntax

foreach var in list { body }
foreach var of local macname { body }
foreach var of global macname { body }
foreach var of numlist spec { body }

examples

.do

* loop over a literal list
foreach v in wage hours bonus {
    generate log_`v' = log(`v')
}

* loop over items stored in a local
local outcomes revenue profit margin
foreach y of local outcomes {
    replace `y' = 0 if missing(`y')
}

* loop over a numlist
foreach yr of numlist 2010/2020 {
    generate d`yr' = (year == `yr')
}

notes

Each iteration adds CTE steps to the query — the data is transformed once per loop pass.
The loop variable (`v', `y', etc.) is scoped to the loop body and destroyed when the loop ends.
The opening brace { must appear on the same line as foreach. The closing brace } must be on its own line.
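
Since the loop variable is substituted as text, a loop can be thought of as its body copied once per element. A rough Python sketch of that idea follows. The helper unroll_foreach is hypothetical and not part of dodo, and it handles only a single loop variable with no nesting.

```python
def unroll_foreach(var: str, items: list[str], body: list[str]) -> list[str]:
    """Return the body lines repeated once per item, with `var' substituted."""
    placeholder = f"`{var}'"  # backtick-quote form used for loop variables
    unrolled = []
    for item in items:
        for line in body:
            unrolled.append(line.replace(placeholder, item))
    return unrolled

body = ["generate log_`v' = log(`v')"]
for line in unroll_foreach("v", ["wage", "hours", "bonus"], body):
    print(line)
# generate log_wage = log(wage)
# generate log_hours = log(hours)
# generate log_bonus = log(bonus)
```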

forvalues

Loop over a range of numbers.

syntax

forvalues var = #1/#2 { body }
forvalues var = #1(#d)#2 { body }

The first form increments by 1 from #1 through #2, inclusive. The second form increments by #d.
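
The two range forms can be sketched as a function that lists the values the loop variable takes. This is a Python illustration under stated assumptions, not dodo's parser: expand_range is a hypothetical name, only non-negative integers with a positive step are handled, and the end point is treated as inclusive.

```python
import re

def expand_range(spec: str) -> list[int]:
    """Expand '#1/#2' or '#1(#d)#2' into the list of loop values."""
    stepped = re.fullmatch(r"(\d+)\((\d+)\)(\d+)", spec)
    simple = re.fullmatch(r"(\d+)/(\d+)", spec)
    if stepped:
        start, step, stop = map(int, stepped.groups())
    elif simple:
        start, stop = map(int, simple.groups())
        step = 1
    else:
        raise ValueError(f"unrecognized range: {spec!r}")
    if step == 0:
        raise ValueError("step must be positive")
    # Assumption: #2 itself is included when the step lands on it.
    return list(range(start, stop + 1, step))

print(expand_range("1/5"))          # [1, 2, 3, 4, 5]
print(expand_range("2000(5)2020"))  # [2000, 2005, 2010, 2015, 2020]
```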

examples

.do

* simple range
forvalues i = 1/5 {
    generate lag`i' = x[_n - `i']
}

* with step size
forvalues yr = 2000(5)2020 {
    generate d`yr' = (year == `yr')
}

notes

Same scoping as foreach: the loop variable is destroyed when the loop ends.
The opening brace { must appear on the same line as forvalues. The closing brace } must be on its own line.

display

Print a value or expression to the console. Returns a single-row result.

syntax

display "text"
display expr

examples

.do

display "hello, world"
display 2 + 2
display sqrt(144)

notes

display is a terminal command — it executes the expression immediately and returns one row.
Macro references are expanded before evaluation: display `x' prints the value of local x.

assert

Check that a condition holds for every row. Throws an error if any row violates it.

syntax

assert condition

examples

.do

assert revenue >= 0
assert !missing(firm_id)
assert year >= 2000 & year <= 2025

notes

If any row fails the condition, dodo raises an error and reports the number of violations.
assert does not modify the data — it is a check only.
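
One way to picture the check is as a count of rows for which the condition is false. The sketch below uses Python's built-in sqlite3 module as a stand-in database. It is not the SQL that dodo generates, and the table firms and the helper count_violations are made up for the illustration.

```python
import sqlite3

def count_violations(conn: sqlite3.Connection, table: str, condition: str) -> int:
    """Count rows of a table for which the condition evaluates to false."""
    # Names are interpolated directly: fine for a sketch, unsafe for untrusted input.
    # Rows where the condition is NULL are not counted by this query.
    query = f"SELECT COUNT(*) FROM {table} WHERE NOT ({condition})"
    return conn.execute(query).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE firms (firm_id INTEGER, revenue REAL)")
conn.executemany("INSERT INTO firms VALUES (?, ?)",
                 [(1, 120.0), (2, -5.0), (3, 40.0)])

violations = count_violations(conn, "firms", "revenue >= 0")
if violations:
    print(f"assertion failed: {violations} row(s) violate revenue >= 0")
```

The data is only read, never changed, which matches the check-only behavior described in the notes.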

levelsof

Return the distinct values of a variable.

syntax

levelsof varname

examples

.do

levelsof sector
levelsof year

notes

levelsof is a terminal command that returns the distinct values of the specified variable.
The local() option from the original is not supported. Use levelsof for inspection only.
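
Conceptually, listing the levels of a variable resembles a SELECT DISTINCT query. The Python sketch below uses the built-in sqlite3 module as a stand-in and is not dodo's generated SQL. The helper levels_of is hypothetical, and both the sorted order and the exclusion of missing values are assumptions of the example.

```python
import sqlite3

def levels_of(conn: sqlite3.Connection, table: str, column: str) -> list:
    """Return the distinct non-missing values of a column, sorted."""
    # Names are interpolated directly: fine for a sketch, unsafe for untrusted input.
    query = (f"SELECT DISTINCT {column} FROM {table} "
             f"WHERE {column} IS NOT NULL ORDER BY {column}")
    return [row[0] for row in conn.execute(query)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE firms (sector TEXT)")
conn.executemany("INSERT INTO firms VALUES (?)",
                 [("retail",), ("finance",), ("retail",), (None,), ("mfg",)])

print(levels_of(conn, "firms", "sector"))  # ['finance', 'mfg', 'retail']
```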

compress

Accepted for compatibility. No-op — DuckDB handles storage types automatically.

syntax

compress

notes

In the original, compress shrinks variable storage types to the smallest type that fits the data. DuckDB manages storage automatically, so this command is silently accepted and does nothing.

dodoc standalone compiler

Compile .do files to SQL without installing DuckDB. A single binary, no dependencies, designed for CI/CD pipelines and SQL preview workflows.

what is dodoc?

dodoc is a standalone CLI that reads .do files and outputs the equivalent SQL — the same translation the dodo DuckDB extension performs, but without needing DuckDB at all.

It shares the same core parser as the dodo extension, so every command the extension understands, dodoc understands too.

why use it?

No DuckDB dependency — useful in environments where you can't install DuckDB (locked-down CI runners, lightweight containers).
Preview SQL before running — pipe the output to a file, inspect it, then feed it to DuckDB (or another database) only when you're ready.
CI/CD integration — compile .do files as a build step, validate the SQL, commit the output.
Pipe-friendly — reads from stdin, writes to stdout, composes with Unix tools.

relation to the dodo extension

two tools, one parser

The dodo extension runs inside DuckDB and executes the translated SQL immediately. dodoc runs outside DuckDB and only produces SQL text. Both share the same parser and produce identical SQL for the same input.

Install dodoc

One command to install. Choose your preferred method.

install script (macOS / Linux)

The fastest way — detects your platform and downloads the right binary:

bash // one-liner install

› curl -fsSL https://getdodo.dev/install.sh | sh

Installs to /usr/local/bin, or to ~/.local/bin when sudo is not available. Works on macOS (arm64 and x86_64) and Linux (x86_64 and arm64).

homebrew (macOS / Linux)

bash // via homebrew

› brew install codedthinking/tap/dodoc

download pre-built binaries

Pre-built binaries are available on the GitHub Releases page for five platforms:

macOS arm64

Apple Silicon (M1/M2/M3/M4). Download dodoc-macos-arm64.tar.gz.

macOS x86_64

Intel Macs. Download dodoc-macos-x86_64.tar.gz.

Linux x86_64

64-bit Linux (statically linked). Download dodoc-linux-x86_64.tar.gz.

Linux arm64

ARM64 Linux (statically linked). Download dodoc-linux-arm64.tar.gz.

Windows x86_64

64-bit Windows. Download dodoc-windows-x86_64.zip.

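After downloading, extract the archive for your platform and install the binary. The commands below use the macOS arm64 archive; substitute the file name you downloaded: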

bash // macOS / Linux

› tar xzf dodoc-macos-arm64.tar.gz
› sudo install dodoc /usr/local/bin/

build from source

Requires git, make, and a C++17 compiler.

bash // clone and build

› git clone --recurse-submodules https://github.com/codedthinking/dodo.git
› cd dodo
› make dodoc
› sudo make dodoc-install

This installs the dodoc binary to /usr/local/bin/.

verify

bash // check it works

› dodoc --help

Using dodoc

Read from stdin or file, output SQL to stdout or file. Pipe into DuckDB or save for later.

stdin to stdout

Pipe .do commands directly:

bash // pipe from echo

› echo 'use "data.csv", clear
keep if year >= 2020
generate profit = revenue - cost' | dodoc

compile a file

Pass the .do file as an argument:

bash // file to stdout

› dodoc analysis/clean.do

Write to a file with -o:

bash // file to file

› dodoc analysis/clean.do -o analysis/clean.sql

annotated output

The --annotate flag adds the original .do command as a SQL comment above each translated statement, making the output easier to read and debug:

bash // annotated SQL output

› dodoc --annotate analysis/clean.do

piping to duckdb

The most common pattern: compile, then execute. Pipe dodoc output directly into DuckDB:

bash // compile and run in one step

› dodoc script.do | duckdb

dodoc vs dodo extension

When piping to DuckDB this way, DuckDB does not need the dodo extension installed, because it receives plain SQL. The extension is only needed when you run .do commands inside a DuckDB session.

all flags

Flag	Description
`-o, --output FILE`	Write SQL to FILE instead of stdout
`--annotate`	Emit original .do command as a SQL comment before each statement
`--terminal`	Also emit SQL for terminal commands (list, count, etc.)
`-h, --help`	Show help message
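
For build scripts, the flags above can be assembled programmatically. The following Python sketch is one possible wrapper, not an official API: dodoc_command and compile_do are hypothetical helpers, the flags are assumed to combine freely, and dodoc is assumed to signal failure with a nonzero exit status.

```python
import subprocess
from typing import Optional

def dodoc_command(source: str, output: Optional[str] = None,
                  annotate: bool = False, terminal: bool = False) -> list[str]:
    """Build the argument list for one dodoc invocation."""
    args = ["dodoc"]
    if annotate:
        args.append("--annotate")  # original .do command as a SQL comment
    if terminal:
        args.append("--terminal")  # also emit SQL for list, count, etc.
    args.append(source)
    if output is not None:
        args += ["-o", output]     # write SQL to a file instead of stdout
    return args

def compile_do(source: str, **flags) -> str:
    """Run dodoc and return the SQL written to stdout."""
    # Assumption: a nonzero exit status means the compile failed.
    result = subprocess.run(dodoc_command(source, **flags),
                            capture_output=True, text=True, check=True)
    return result.stdout

print(dodoc_command("analysis/clean.do", output="analysis/clean.sql"))
# ['dodoc', 'analysis/clean.do', '-o', 'analysis/clean.sql']
```

When an output file is passed, the SQL goes to that file, so compile_do is only useful without the output argument.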