Berlin V1-V5 reproducibility

End-to-end reproduction guidance for the paper’s 5 model variants on the synthetic 96-zone Berlin scenario.

Reproducibility tiers

Tier	What it runs	Wall-clock	Prerequisites
Tier 1	`pip install agent-urban-planning` + `import agent_urban_planning`	<30 s	Python 3.9+
Tier 2	`python examples/01_quickstart_two_zone.py` (5-variant smoke test)	<10 s	Tier 1
Tier 3a	V1 baseline + shock	~3 hr	Tier 2 + bundled Berlin data (in git, not PyPI)
Tier 3b	V2 / V3 baseline + shock	~3 hr each	Tier 3a
Tier 3c	V4 baseline + shock	~3 hr	Tier 3a + LLM credits (~$5)
Tier 4 (cache replay)	V5 baseline + shock from bundled cache	~5 min	Tier 3a + bundled `data/berlin/llm_cache_v5/`
Tier 4 (live)	V5 baseline + shock with live LLM calls	~10 hr	Tier 3a + LLM credits ($30-50)
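
As a rough planning aid, the wall-clock and cost figures in the table can be added up for a full sequential run. The sketch below only restates the table's approximate numbers; the names `TIERS` and `total_budget` are illustrative and not part of the package, and it assumes the tiers are run one after another rather than in parallel.

```python
# Approximate figures copied from the tier table: (hours, cost low, cost high) in USD.
# Tier 1 and Tier 2 take seconds and are left out.
TIERS = {
    "3a_v1": (3.0, 0, 0),
    "3b_v2": (3.0, 0, 0),
    "3b_v3": (3.0, 0, 0),
    "3c_v4": (3.0, 5, 5),
    "4_v5_replay": (5 / 60, 0, 0),
    "4_v5_live": (10.0, 30, 50),
}


def total_budget(tier_names):
    """Sum hours and the low/high cost bounds for tiers run sequentially."""
    hours = sum(TIERS[name][0] for name in tier_names)
    cost_low = sum(TIERS[name][1] for name in tier_names)
    cost_high = sum(TIERS[name][2] for name in tier_names)
    return hours, cost_low, cost_high


common = ["3a_v1", "3b_v2", "3b_v3", "3c_v4"]
replay_path = total_budget(common + ["4_v5_replay"])
live_path = total_budget(common + ["4_v5_live"])
print(f"V5 cache replay: ~{replay_path[0]:.1f} hr, ${replay_path[1]}-{replay_path[2]}")
print(f"V5 live:         ~{live_path[0]:.1f} hr, ${live_path[1]}-{live_path[2]}")
```

Under these figures, reproducing all five variants with cache replay takes roughly 12 hours and about $5 (V4 only), while the live V5 path takes roughly 22 hours and $35-55.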

Step-by-step

Tier 1: Install

pip install agent-urban-planning
python -c "import agent_urban_planning as aup; print(aup.__version__)"
# → 0.1.0
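
The same check can be scripted so that it reports a missing install instead of raising an ImportError. This is a minimal sketch using only the standard library; the helper names `installed_version` and `python_is_supported` are made up for illustration.

```python
import sys
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def python_is_supported(minimum=(3, 9), current=None):
    """Check an interpreter version against the Python 3.9+ prerequisite."""
    current = tuple(sys.version_info[:2]) if current is None else current
    return current >= minimum


print("Python 3.9+:", python_is_supported())
print("agent-urban-planning:", installed_version("agent-urban-planning"))
```

A `None` in the second line means the package is not installed in the active environment.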

Tier 2: Smoke test (no data needed)

python examples/01_quickstart_two_zone.py

The script should report that all 5 paper variants were instantiated successfully.

Tier 3+ requires a git clone

The bundled Berlin Ortsteile NPZ files are git-only (excluded from the PyPI sdist). Tier 3 and Tier 4 therefore require a source checkout:

git clone https://github.com/MASE-eLab/agent-urban-planning.git
cd agent-urban-planning
pip install -e ".[llm,plot,berlin]"
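
Before starting a multi-hour run, it can help to confirm that the checkout actually contains the bundled data. The sketch below checks only the directories named in this guide, since the individual NPZ file names are not listed here; `REQUIRED_DIRS` and `missing_dirs` are illustrative names, not package API, and `data/berlin` is inferred from the cache path given for Tier 4.

```python
from pathlib import Path

# Directories referenced in this guide. The NPZ file names themselves are
# not listed here, so only directory presence is checked.
REQUIRED_DIRS = ["data/berlin", "data/berlin/llm_cache_v5"]


def missing_dirs(repo_root, required=REQUIRED_DIRS):
    """Return the required directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in required if not (root / rel).is_dir()]


# Run from the repository root.
missing = missing_dirs(".")
if missing:
    print("Missing bundled data:", ", ".join(missing))
else:
    print("Bundled Berlin data directories found.")
```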

Tier 3a: V1 (no LLM)

python examples/02_berlin_replication/run_v1_softmax.py

Outputs:

output/berlin_v1_softmax/per_zone.csv
output/berlin_v1_softmax_shock_east_west/per_zone.csv

Expected match against the dev repo’s V1 output: within a numerical tolerance of 1e-3 (V1 is deterministic at seed 42).

Tier 3b: V2, V3 (no LLM)

python examples/02_berlin_replication/run_v2_argmax_frechet.py
python examples/02_berlin_replication/run_v3_argmax_normal.py

Each run takes ~3 hr. V2 and V3 are stochastic but seeded, so their results are reproducible at seed 42.
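
To illustrate what "stochastic but seeded" means in practice, the toy example below draws Fréchet taste shocks from a seeded generator and lets each agent pick the option with the highest shocked value. It is not the package's implementation: the option values, the shape parameter `theta`, and the function names are all made up, and the real V2/V3 runs may use a different random number generator.

```python
import math
import random


def frechet_draw(rng, theta=5.0):
    """Draw a unit-scale Frechet shock by inverting F(x) = exp(-x**(-theta))."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, which would break log()
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / theta)


def simulate_choices(values, n_agents, seed, theta=5.0):
    """Each agent picks the option with the largest value * shock (argmax)."""
    rng = random.Random(seed)
    choices = []
    for _ in range(n_agents):
        shocked = [v * frechet_draw(rng, theta) for v in values]
        choices.append(max(range(len(values)), key=shocked.__getitem__))
    return choices


values = [1.0, 1.2, 0.9]  # made-up systematic values for three options
run_a = simulate_choices(values, n_agents=1000, seed=42)
run_b = simulate_choices(values, n_agents=1000, seed=42)
print("identical at seed 42:", run_a == run_b)
```

Because both runs start from the same seed, they consume the same sequence of random numbers and produce the same choices; changing the seed changes the individual draws.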

Tier 3c: V4 (LLM elicitation)

python examples/02_berlin_replication/run_v4_hybrid.py --llm-provider codex-cli

Requires an authenticated codex CLI. Cost: ~$5 in API credits.

Tier 4 (cache replay): V5 without LLM credits

python examples/02_berlin_replication/run_v5_score_all.py --no-llm

Replays the bundled cache at data/berlin/llm_cache_v5/, so no LLM credits are needed. Wall-clock: ~5 min.
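
The general idea behind cache replay can be sketched as a lookup keyed by a hash of the request, which fails loudly on a miss when no live provider is available. The class below is purely hypothetical: the on-disk format of data/berlin/llm_cache_v5/ is not described in this guide, and `ReplayCache`, its key scheme, and the `live_call` hook are assumptions made for illustration.

```python
import hashlib
import json


class ReplayCache:
    """Hypothetical prompt-keyed response cache (not the package's format)."""

    def __init__(self, entries=None):
        self._entries = dict(entries or {})

    @staticmethod
    def key(prompt, model):
        """Derive a stable key from the model name and the prompt text."""
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def put(self, prompt, model, response):
        self._entries[self.key(prompt, model)] = response

    def get(self, prompt, model, live_call=None):
        """Return the cached response; call the provider only if one is given."""
        k = self.key(prompt, model)
        if k in self._entries:
            return self._entries[k]
        if live_call is None:
            # Replay-only mode: a miss is an error, never a silent API call.
            raise KeyError(f"cache miss in replay-only mode: {k[:12]}")
        response = live_call(prompt)
        self._entries[k] = response
        return response


cache = ReplayCache()
cache.put("Rank zone 17 for this household.", "example-model", "0.62")
print(cache.get("Rank zone 17 for this household.", "example-model"))
```

In a design like this, a replay run is deterministic because every response comes from stored entries, which is consistent with the cache-replay tier being listed as cache-deterministic.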

Tier 4 (live): V5 with live LLM

python examples/02_berlin_replication/run_v5_score_all.py --llm-provider codex-cli

Cost: $30-50. Wall-clock: ~10 hr. Reproduces baseline + shock from scratch.

After all variants complete

Each output/{variant}/per_zone.csv and output/{variant}_shock_east_west/per_zone.csv is a 96-row CSV with columns zone_id, Q_sim, HR_sim, HM_sim, wage_sim, Q_obs, HR_obs, HM_obs, wage_obs. Use these directly for cross-variant comparisons:

import numpy as np
import pandas as pd

# Baseline and shock outputs for V1 (one row per zone, same zone order)
v1 = pd.read_csv("output/berlin_v1_softmax/per_zone.csv")
v1_shock = pd.read_csv("output/berlin_v1_softmax_shock_east_west/per_zone.csv")

# Per-zone log change in Q induced by the shock
dlog_Q = np.log(v1_shock.Q_sim / v1.Q_sim)
print(dlog_Q.describe())
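
Before comparing variants, it may be worth confirming that each output file has the shape described above. The following sketch uses only the standard library and checks the row count and column names; `check_per_zone` is an illustrative helper, not part of the package, and extra columns (for example a saved index) are tolerated.

```python
import csv

EXPECTED_COLUMNS = [
    "zone_id", "Q_sim", "HR_sim", "HM_sim", "wage_sim",
    "Q_obs", "HR_obs", "HM_obs", "wage_obs",
]
EXPECTED_ROWS = 96


def check_per_zone(path):
    """Return a list of problems found in a per_zone.csv (empty if none)."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        header = reader.fieldnames or []
        n_rows = sum(1 for _ in reader)
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    if missing:
        problems.append(f"missing columns: {missing}")
    if n_rows != EXPECTED_ROWS:
        problems.append(f"expected {EXPECTED_ROWS} rows, found {n_rows}")
    return problems


# Example call, once a run has finished:
# print(check_per_zone("output/berlin_v1_softmax/per_zone.csv"))
```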

The paper’s headline cross-variant moments table and choropleths are in figures/comparison_moments.csv, figures/berlin_dlogQ.png, figures/berlin_dlogW.png (bundled in the repo).

Numerical reproducibility expectations

Variant	Seed-determinism	Tolerance vs dev repo
V1 (Baseline-softmax)	Fully deterministic	exact
V2 (Baseline-ABM argmax)	Seeded stochastic	<1e-3 numerical
V3 (Normal-ABM argmax)	Seeded stochastic	<1e-3 numerical
V4 (Hybrid-ABM)	Seeded + LLM cache	<1e-2 (LLM elicitation noise)
V5 (LLM-ABM, cache replay)	Cache-deterministic	exact
V5 (LLM-ABM, live)	Provider + temperature dependent	qualitative match only
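
The table can be turned into an automated check along the following lines. This is a sketch under one stated assumption: the table does not say whether its tolerances are absolute or relative, so the code reads them as relative deviations per value. The dictionary keys and the function name `within_tolerance` are illustrative.

```python
import math

# Tolerances from the table above. 0.0 means "exact"; None marks the live V5
# row, for which only a qualitative match is expected.
TOLERANCE = {
    "V1": 0.0,
    "V2": 1e-3,
    "V3": 1e-3,
    "V4": 1e-2,
    "V5_replay": 0.0,
    "V5_live": None,
}


def within_tolerance(variant, simulated, reference):
    """Compare two equal-length value lists against the variant's tolerance.

    Assumption: tolerances are relative deviations per value.
    Returns None when no numerical criterion applies to the variant.
    """
    tol = TOLERANCE[variant]
    if tol is None:
        return None
    if len(simulated) != len(reference):
        raise ValueError("value lists differ in length")
    return all(
        math.isclose(s, r, rel_tol=tol, abs_tol=0.0)
        for s, r in zip(simulated, reference)
    )


print(within_tolerance("V2", [100.0, 50.0], [100.05, 50.0]))
```

With a tolerance of 0.0 the comparison reduces to exact equality, which corresponds to the "exact" rows.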

Troubleshooting

“Bundled Berlin data missing”

You installed from PyPI, which does not ship the Berlin data. Clone the repository and install from the checkout with pip install -e ".[llm,plot,berlin]".

LLM provider not configured

# Authenticate codex-cli
codex login

# Or use Anthropic SDK
export ANTHROPIC_API_KEY=...
python ... --llm-provider anthropic
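
A small preflight check can catch a missing provider setup before a long run starts. The sketch below only looks for the `codex` executable on the PATH and for a non-empty `ANTHROPIC_API_KEY`; it cannot tell whether the stored credentials are still valid. `provider_ready` is an illustrative helper, not part of the package.

```python
import os
import shutil


def provider_ready(provider, env=None, which=shutil.which):
    """Best-effort check that a provider's prerequisites appear to be present.

    Detects only a missing binary or environment variable; it does not
    verify that the credentials themselves are accepted by the provider.
    """
    env = os.environ if env is None else env
    if provider == "codex-cli":
        return which("codex") is not None
    if provider == "anthropic":
        return bool(env.get("ANTHROPIC_API_KEY"))
    raise ValueError(f"unknown provider: {provider}")


for name in ("codex-cli", "anthropic"):
    print(name, "ready" if provider_ready(name) else "not configured")
```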

Numerical divergence

Verify seed=42 (the paper default).
For V5: use the bundled cache (--no-llm mode) or the same provider + temperature as the paper (codex-cli, temperature=0).
Live LLM runs, and runs at a different seed, will not be bit-identical to the paper; for live V5, expect a qualitative match only.
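
When a run does diverge, it can be useful to see which zones differ most rather than only an aggregate. The sketch below ranks zones by relative deviation between two per_zone.csv files, using only the standard library; `worst_zones` is an illustrative helper, and matching rows on `zone_id` is an assumption about how the two files should be aligned.

```python
import csv


def worst_zones(path_a, path_b, column="Q_sim", top=5):
    """Rank zones by relative deviation in one column between two CSV files."""

    def load(path):
        with open(path, newline="") as fh:
            return {row["zone_id"]: float(row[column]) for row in csv.DictReader(fh)}

    a, b = load(path_a), load(path_b)
    deviations = []
    for zone in sorted(a.keys() & b.keys()):  # only zones present in both files
        scale = max(abs(a[zone]), abs(b[zone]))
        rel = abs(a[zone] - b[zone]) / scale if scale else 0.0
        deviations.append((rel, zone))
    deviations.sort(reverse=True)  # largest deviation first
    return [(zone, rel) for rel, zone in deviations[:top]]


# Example call, comparing a fresh run against a reference copy (hypothetical path):
# print(worst_zones("output/berlin_v1_softmax/per_zone.csv", "reference/per_zone.csv"))
```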