Berlin V1-V5 reproducibility¶
End-to-end reproduction guidance for the paper’s five model variants (V1-V5) on the synthetic 96-zone Berlin scenario.
Reproducibility tiers¶
| Tier | What | Wall-clock | Prerequisites |
|---|---|---|---|
| Tier 1 | Install | <30 s | Python 3.9+ |
| Tier 2 | Smoke test (no data needed) | <10 s | Tier 1 |
| Tier 3a | V1 baseline + shock | ~3 hr | Tier 2 + bundled Berlin data (in git, not PyPI) |
| Tier 3b | V2 / V3 baseline + shock | ~3 hr each | Tier 3a |
| Tier 3c | V4 baseline + shock | ~3 hr | Tier 3a + LLM credits (~$5) |
| Tier 4 (cache replay) | V5 baseline + shock from bundled cache | ~5 min | Tier 3a + bundled cache (data/berlin/llm_cache_v5/) |
| Tier 4 (live) | V5 baseline + shock with live LLM calls | ~10 hr | Tier 3a + LLM credits ($30-50) |
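For planning purposes, the tier estimates can be added up. The sketch below assumes every Tier 3 and Tier 4 (live) run is executed sequentially on one machine; the figures are transcribed from the tiers listed above and the dictionary layout is only illustrative.

```python
# (wall-clock hours, min USD, max USD) per run, transcribed from the tier list.
# Assumption: runs are executed one after another, not in parallel.
RUNS = {
    "Tier 3a V1": (3, 0, 0),
    "Tier 3b V2": (3, 0, 0),
    "Tier 3b V3": (3, 0, 0),
    "Tier 3c V4": (3, 5, 5),
    "Tier 4 V5 (live)": (10, 30, 50),
}

hours = sum(h for h, _, _ in RUNS.values())
low = sum(lo for _, lo, _ in RUNS.values())
high = sum(hi for _, _, hi in RUNS.values())

print(f"{hours} hr sequential, ${low}-{high} in LLM credits")
# → 22 hr sequential, $35-55 in LLM credits
```

Swapping the live V5 run for cache replay brings this down to roughly 12 hr and ~$5.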
Step-by-step¶
Tier 1: Install¶
pip install agent-urban-planning
python -c "import agent_urban_planning as aup; print(aup.__version__)"
# → 0.1.0
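If the import fails, a small preflight script can narrow down the cause. This is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not part of the package.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)  # Tier 1 prerequisite


def check_environment(dist_name="agent-urban-planning", python_version=None):
    """Return a list of problems; an empty list means Tier 1 looks fine."""
    problems = []
    # python_version is overridable only to make the check easy to test.
    current = tuple(python_version or sys.version_info[:2])
    if current < MIN_PYTHON:
        problems.append(f"Python {current[0]}.{current[1]} found, 3.9+ required")
    try:
        metadata.version(dist_name)  # looks up the installed distribution
    except metadata.PackageNotFoundError:
        problems.append(f"{dist_name} is not installed")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["Tier 1 prerequisites satisfied"]:
        print(line)
```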
Tier 2: Smoke test (no data needed)¶
python examples/01_quickstart_two_zone.py
The script should report that all 5 paper variants were instantiated successfully.
Tier 3+ requires git clone¶
The bundled Berlin Ortsteile NPZ files are git-only (excluded from PyPI sdist). Tier 3 and Tier 4 require:
git clone https://github.com/MASE-eLab/agent-urban-planning.git
cd agent-urban-planning
pip install -e ".[llm,plot,berlin]"
Tier 3a: V1 (no LLM)¶
python examples/02_berlin_replication/run_v1_softmax.py
Outputs:
output/berlin_v1_softmax/per_zone.csv
output/berlin_v1_shock_east_west/per_zone.csv
The results should match the dev repo’s V1 output to within 1e-3 (V1 is deterministic at seed 42).
Tier 3b: V2, V3 (no LLM)¶
python examples/02_berlin_replication/run_v2_argmax_frechet.py
python examples/02_berlin_replication/run_v3_argmax_normal.py
Each run takes ~3 hr. V2 and V3 are stochastic but seeded, so repeated runs at seed 42 give the same results.
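To illustrate why a seeded stochastic run is repeatable, here is a toy argmax choice with Fréchet taste shocks drawn from a seeded generator. This is a generic sketch, not the package's code: the utilities, the shape parameter, and the function names are all hypothetical.

```python
import math
import random


def frechet_draw(rng, shape):
    """Inverse-CDF draw from a unit-scale Frechet: F(x) = exp(-x ** -shape)."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0; log(0) is undefined
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / shape)


def simulate_choices(seed, n_agents=1000, base_utility=(1.0, 1.2, 0.9), shape=6.0):
    """Each agent picks the option with the highest shocked utility."""
    rng = random.Random(seed)  # all randomness flows from this one seed
    counts = [0] * len(base_utility)
    for _ in range(n_agents):
        shocked = [u * frechet_draw(rng, shape) for u in base_utility]
        counts[shocked.index(max(shocked))] += 1
    return counts


# Two runs with the same seed give identical choice counts.
print(simulate_choices(42) == simulate_choices(42))
```

The same idea carries over to any simulator that routes every random draw through a single seeded generator.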
Tier 3c: V4 (LLM elicitation)¶
python examples/02_berlin_replication/run_v4_hybrid.py --llm-provider codex-cli
Requires an authenticated codex CLI. Cost: ~$5 in API credits.
Tier 4 (cache replay): V5 without LLM credits¶
python examples/02_berlin_replication/run_v5_score_all.py --no-llm
Replays the bundled cache at data/berlin/llm_cache_v5/, so no LLM credits are needed. Wall-clock: ~5 min.
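Cache replay is deterministic because every LLM response is looked up rather than regenerated. The sketch below shows the general pattern only; the on-disk format of the bundled cache is not described here, and `cache_key` and `ReplayCache` are hypothetical names.

```python
import hashlib
import json


def cache_key(prompt, model, temperature):
    """Stable key: identical request parameters always hash to the same string."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ReplayCache:
    """In-memory stand-in for a response cache with a strict replay mode."""

    def __init__(self, entries=None, allow_live=False):
        self.entries = dict(entries or {})
        self.allow_live = allow_live

    def get(self, prompt, model, temperature, live_call=None):
        key = cache_key(prompt, model, temperature)
        if key in self.entries:
            return self.entries[key]  # replay: no provider call
        if not self.allow_live or live_call is None:
            raise KeyError(f"cache miss in replay mode: {key[:12]}")
        response = live_call(prompt)  # live mode: call once, then store
        self.entries[key] = response
        return response
```

In strict replay mode a cache miss is an error rather than a silent live call, which is what keeps a replayed run from drifting.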
Tier 4 (live): V5 with live LLM¶
python examples/02_berlin_replication/run_v5_score_all.py --llm-provider codex-cli
Cost: $30-50. Wall-clock: ~10 hr. Reproduces baseline + shock from scratch.
After all variants complete¶
Each output/{variant}/per_zone.csv and
output/{variant}_shock_east_west/per_zone.csv is a 96-row CSV with
columns zone_id, Q_sim, HR_sim, HM_sim, wage_sim, Q_obs, HR_obs, HM_obs, wage_obs. Use these directly for cross-variant comparisons:
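import numpy as np
import pandas as pd

# Index by zone_id so the division aligns rows by zone, not by file order
v1 = pd.read_csv("output/berlin_v1_softmax/per_zone.csv", index_col="zone_id")
v1_shock = pd.read_csv("output/berlin_v1_softmax_shock_east_west/per_zone.csv", index_col="zone_id")
# Per-zone log change in Q between the shock and baseline runs
dlog_Q = np.log(v1_shock["Q_sim"] / v1["Q_sim"])
print(dlog_Q.describe())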
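The same log-change computation can be sketched without pandas. The example below assumes only the `zone_id` and `Q_sim` columns listed above, matches rows by `zone_id`, and uses made-up in-memory data in place of the real CSV files; `read_column` and `dlog` are hypothetical helpers.

```python
import csv
import io
import math
import statistics


def read_column(csv_file, column):
    """Map zone_id -> float(column) from an open per_zone.csv-style file."""
    return {row["zone_id"]: float(row[column]) for row in csv.DictReader(csv_file)}


def dlog(baseline, shock):
    """Log change shock/baseline per zone, for zones present in both runs."""
    return {z: math.log(shock[z] / baseline[z]) for z in baseline if z in shock}


# Made-up stand-ins for a baseline and a shock file; note the different row order.
baseline_csv = io.StringIO("zone_id,Q_sim\n1,10.0\n2,20.0\n3,5.0\n")
shock_csv = io.StringIO("zone_id,Q_sim\n3,5.5\n1,10.0\n2,18.0\n")

changes = dlog(read_column(baseline_csv, "Q_sim"), read_column(shock_csv, "Q_sim"))
print("mean dlog Q:", statistics.mean(changes.values()))
```

With real output, replace the `io.StringIO` objects with files opened via `open(path, newline="")`.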
The paper’s headline cross-variant moments table and choropleths are
in figures/comparison_moments.csv, figures/berlin_dlogQ.png,
figures/berlin_dlogW.png (bundled in the repo).
Numerical reproducibility expectations¶
| Variant | Seed-determinism | Tolerance vs dev repo |
|---|---|---|
| V1 (Baseline-softmax) | Fully deterministic | exact |
| V2 (Baseline-ABM argmax) | Seeded stochastic | <1e-3 numerical |
| V3 (Normal-ABM argmax) | Seeded stochastic | <1e-3 numerical |
| V4 (Hybrid-ABM) | Seeded + LLM cache | <1e-2 (LLM elicitation noise) |
| V5 (LLM-ABM, cache replay) | Cache-deterministic | exact |
| V5 (LLM-ABM, live) | Provider + temperature dependent | qualitative match only |
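These expectations can be turned into an automated check. The sketch below assumes the tolerances are absolute differences on a single output column, which the table does not state explicitly; the function names and variant keys are hypothetical. Live V5 is omitted because only a qualitative match is expected.

```python
# Tolerances transcribed from the table above; 0.0 stands for "exact".
# Assumption: tolerances are absolute, not relative.
TOLERANCE = {"V1": 0.0, "V2": 1e-3, "V3": 1e-3, "V4": 1e-2, "V5-replay": 0.0}


def max_abs_diff(ours, reference):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(ours) != len(reference):
        raise ValueError("sequences must cover the same zones")
    return max(abs(a - b) for a, b in zip(ours, reference))


def matches(variant, ours, reference):
    """True if a run reproduces the reference within the variant's tolerance."""
    tol = TOLERANCE[variant]
    diff = max_abs_diff(ours, reference)
    # "exact" means no difference at all; otherwise the bound is strict (<).
    return diff == 0.0 if tol == 0.0 else diff < tol


print(matches("V2", [1.0, 2.0], [1.0005, 2.0]))  # within 1e-3
print(matches("V1", [1.0, 2.0], [1.0005, 2.0]))  # V1 must be exact
```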
Troubleshooting¶
“Bundled Berlin data missing”¶
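You installed from PyPI with pip install, which does not include the Berlin data. Clone the repo with git clone and install it in editable mode, as described under “Tier 3+ requires git clone”.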
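A quick way to confirm that the data is present is to look for NPZ files in the checkout. This sketch assumes the data lives under `data/berlin/` relative to the repo root (the directory that holds the bundled V5 cache); the exact file layout is not documented here and `find_bundled_data` is a hypothetical helper.

```python
from pathlib import Path


def find_bundled_data(repo_root="."):
    """Return the sorted .npz files under <repo_root>/data/berlin, or [] if absent."""
    data_dir = Path(repo_root) / "data" / "berlin"  # assumed location
    if not data_dir.is_dir():
        return []
    return sorted(data_dir.rglob("*.npz"))


if __name__ == "__main__":
    files = find_bundled_data(".")
    if files:
        print(f"found {len(files)} NPZ file(s) under data/berlin")
    else:
        print("no bundled Berlin data found; run this from a git clone of the repo")
```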
LLM provider not configured¶
# Verify codex-cli auth
codex login
# Or use Anthropic SDK
export ANTHROPIC_API_KEY=...
python ... --llm-provider anthropic
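Before starting a multi-hour run, it can help to check that a provider is at least reachable. The sketch below only tests whether a `codex` executable is on the PATH and whether `ANTHROPIC_API_KEY` is set; it cannot confirm that the codex login is still valid. `provider_status` is a hypothetical helper.

```python
import os
import shutil


def provider_status(env=None):
    """Report which providers look configured; env defaults to os.environ."""
    env = os.environ if env is None else env
    return {
        # Being on the PATH does not prove that `codex login` succeeded.
        "codex-cli": shutil.which("codex") is not None,
        "anthropic": bool(env.get("ANTHROPIC_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in provider_status().items():
        print(f"{name}: {'looks configured' if ok else 'not configured'}")
```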
Numerical divergence¶
Verify seed=42 (the paper default).
For V5: use the bundled cache (--no-llm mode) or the same provider + temperature as the paper (codex-cli, temperature=0).
Live LLM runs at different seeds will not be bit-identical to the paper.
See also¶
Shock analysis methodology — methodology for the East-West Express τ shock
Berlin V1-V5 replication — task-oriented walkthrough