Introduction
13‑week plan (2 × 75‑min sessions per week)
Week 1 – Setup, Colab, Git/GitHub
- Lec A: Local Python + VS Code; Colab basics (GPU, Drive mount, persistence limits), repo cloning in Colab, requirements.txt, seeds.
- Lec B: Git essentials, branching, PRs, code review etiquette, .gitignore, Git‑LFS do’s/don’ts (quota pitfalls).
- Deliverable: Team repo with a Colab notebook that runs and logs environment info; one PR merged.
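A minimal sketch of what "logs environment info" and "seeds" could look like in the first notebook cell. The function names are hypothetical, and the Colab check is only a heuristic (it assumes the `google.colab` module is already loaded in a Colab kernel).

```python
import json
import platform
import random
import sys


def set_seeds(seed: int = 42) -> None:
    """Seed the standard-library RNG.

    NumPy and PyTorch keep their own generators; if the project uses them,
    seed those as well (np.random.seed, torch.manual_seed).
    """
    random.seed(seed)


def environment_info() -> dict:
    """Collect a few facts worth logging at the top of every run."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "executable": sys.executable,
        # Heuristic only: assumes Colab has already imported google.colab.
        "in_colab": "google.colab" in sys.modules,
    }


set_seeds(42)
info = environment_info()
print(json.dumps(info, indent=2))
```

Printing the dictionary as JSON keeps the log easy to diff between a local run and a Colab run.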
Week 2 – Reproducible reporting (Quarto) + RStudio cameo
- Lec A: Quarto for Python: parameters, caching, citations; publish to GitHub Pages.
- Lec B (15–25 min cameo): RStudio + Quarto rendering (so they can read R‑centric docs later), then back to Python.
- Deliverable: Parameterized EDA report (symbol, date range as params).
Week 3 – Unix for data work + automation
- Lec A: Shell basics (pipes, redirects), grep/sed/awk, find/xargs, regex.
- Lec B: Shell scripts, simple Makefile/justfile targets; rsync, quick SSH/tmux tour.
- Deliverable: make get-data and make report run end‑to‑end.
Week 4 – SQL I (schemas, joins)
- Lec A: SQLite in repo; schema design for OHLCV + metadata; SELECT/JOIN/GROUP BY.
- Lec B: Window functions; indices; pandas.read_sql pipelines.
- Deliverable: SQL notebook producing a tidy table ready for modeling.
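As an illustration of the Week 4 topics, here is a small sketch of an OHLCV schema with a metadata table, a JOIN, and a LAG window function. The table names, column names, and rows are invented for the example, and window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.executescript(
    """
    CREATE TABLE symbols (
        symbol TEXT PRIMARY KEY,
        name   TEXT NOT NULL
    );
    CREATE TABLE prices (
        symbol TEXT NOT NULL REFERENCES symbols(symbol),
        date   TEXT NOT NULL,  -- ISO 8601, so text order matches date order
        open REAL, high REAL, low REAL, close REAL,
        volume INTEGER,
        PRIMARY KEY (symbol, date)
    );
    """
)
conn.execute("INSERT INTO symbols VALUES ('AAA', 'Example Corp')")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("AAA", "2024-01-02", 10.0, 10.5, 9.8, 10.2, 1000),
        ("AAA", "2024-01-03", 10.2, 10.9, 10.1, 10.8, 1200),
        ("AAA", "2024-01-04", 10.8, 11.0, 10.4, 10.5, 900),
    ],
)

# Join prices to metadata and attach the previous close per symbol.
query = """
SELECT p.symbol, s.name, p.date, p.close,
       LAG(p.close) OVER (PARTITION BY p.symbol ORDER BY p.date) AS prev_close
FROM prices AS p
JOIN symbols AS s ON s.symbol = p.symbol
ORDER BY p.symbol, p.date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The same query string and connection can be passed to pandas.read_sql to get a DataFrame instead of tuples.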
Week 5 – pandas for time series
- Lec A: Cleaning, types, missing values, merges; groupby, pivot; Parquet I/O.
- Lec B: Time‑series ops: resampling, rolling windows, shifting/lagging, calendar effects.
- Deliverable: Cleaned Parquet dataset + feature snapshot.
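A short sketch of the time‑series operations from Lec B, assuming pandas is installed. The prices are made‑up toy values and the column names are hypothetical.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=6, freq="D")
prices = pd.DataFrame(
    {"close": [100.0, 102.0, 101.0, 105.0, 107.0, 106.0]}, index=dates
)

features = pd.DataFrame(index=prices.index)
# Shifting: one-day simple return from the previous close.
features["ret_1d"] = prices["close"] / prices["close"].shift(1) - 1
# Rolling window: 3-day moving average (NaN until the window is full).
features["ma_3"] = prices["close"].rolling(window=3).mean()
# Lagging: yesterday's return, known at prediction time.
features["ret_1d_lag1"] = features["ret_1d"].shift(1)

# Label: the next day's return (negative shift looks forward, so use it only as a target).
target = features["ret_1d"].shift(-1).rename("target_next_ret")

# Resampling: last close in each 2-day bin.
two_day_close = prices["close"].resample("2D").last()

print(features.join(target))
print(two_day_close)
```

Keeping the forward‑looking shift in a separate target series makes it harder to use it as a feature by accident.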
Week 6 – APIs & Web scraping (ethics + caching)
- Lec A: HTTP basics, requests, pagination, auth, retries, backoff; don’t hard‑code keys (python‑dotenv).
- Lec B: BeautifulSoup, CSS selectors, robots.txt, throttling; cache raw pulls; persist to SQL/Parquet.
- Deliverable: One external data source ingested with caching & schema checks.
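One possible shape for "retries, backoff" and "cache raw pulls", sketched with the standard library only so it runs without a network. The helper names are hypothetical and the fetch function is a simulated stand‑in; with requests, the caught exception would more likely be requests.exceptions.RequestException.

```python
import json
import tempfile
import time
from pathlib import Path


def fetch_with_retries(fetch, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(), retrying on ConnectionError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...


def cached_pull(cache_path, fetch):
    """Return the cached raw payload if present; otherwise fetch and store it."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    payload = fetch()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(payload))
    return payload


# Demo with a simulated flaky source (fails twice, then succeeds).
calls = {"n": 0}
delays = []


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return {"symbol": "AAA", "close": 10.2}


cache_file = Path(tempfile.mkdtemp()) / "raw" / "aaa.json"
# delays.append replaces time.sleep so the demo records waits instead of sleeping.
first = cached_pull(
    cache_file, lambda: fetch_with_retries(flaky_fetch, sleep=delays.append)
)
second = cached_pull(cache_file, flaky_fetch)  # served from disk, no new call
print(first, delays, calls["n"])
```

Injecting the sleep function keeps tests fast and makes the backoff schedule easy to assert on.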
Week 7 – Quality: tests, lint, minimal CI
- Lec A: pytest (2–3 meaningful tests), data validation (light Pandera or custom checks), logging, type hints.
- Lec B: Pre‑commit (black, ruff, nbstripout), GitHub Actions to run tests + lint on PRs (fast jobs only).
- Deliverable: CI badge green; include a test that fails when leakage or a schema violation is introduced, to demonstrate that the guard works.
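A sketch of the "custom checks" route to a schema guard, written so pytest would collect the two test functions. The required columns and the validation rules are assumptions for illustration, not a prescribed schema.

```python
# Hypothetical schema: column name -> expected Python type.
REQUIRED_COLUMNS = {"date": str, "symbol": str, "close": float, "volume": int}


def validate_rows(rows):
    """Return a list of human-readable schema errors (empty list means valid)."""
    errors = []
    for i, row in enumerate(rows):
        for column, expected_type in REQUIRED_COLUMNS.items():
            if column not in row:
                errors.append(f"row {i}: missing column '{column}'")
            elif not isinstance(row[column], expected_type):
                errors.append(
                    f"row {i}: '{column}' should be {expected_type.__name__}"
                )
        # Simple range check on top of the type check.
        if isinstance(row.get("close"), float) and row["close"] <= 0:
            errors.append(f"row {i}: close must be positive")
    return errors


def test_valid_rows_pass():
    rows = [{"date": "2024-01-02", "symbol": "AAA", "close": 10.2, "volume": 1000}]
    assert validate_rows(rows) == []


def test_schema_guard_catches_bad_rows():
    rows = [{"date": "2024-01-02", "symbol": "AAA", "close": -1.0}]
    errors = validate_rows(rows)
    assert "row 0: missing column 'volume'" in errors
    assert "row 0: close must be positive" in errors
```

Returning a list of messages instead of raising on the first problem gives a fuller report in CI logs.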
Week 8 – Time‑series baselines & backtesting
- Lec A: Problem framing; horizon, step size; MAE/sMAPE/MASE; rolling‑origin evaluation.
- Lec B: Baselines: naive/seasonal‑naive; quick ARIMA/Prophet or sklearn regressor with lags.
- Deliverable: Baseline model card + backtest plot in Quarto.
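The Week 8 metrics and rolling‑origin evaluation can be sketched in plain Python as follows. The sMAPE variant shown is one common definition (in percent), the series is toy data, and the function names are hypothetical.

```python
def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Symmetric MAPE in percent; terms with a zero denominator count as 0."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        total += abs(a - f) / denom if denom else 0.0
    return 100 * total / len(actual)


def mase(actual, forecast, train, season=1):
    """MAE scaled by the in-sample MAE of the (seasonal-)naive forecast."""
    scale = sum(
        abs(train[i] - train[i - season]) for i in range(season, len(train))
    ) / (len(train) - season)
    return mae(actual, forecast) / scale


def rolling_origin_naive(series, min_train, horizon=1):
    """Yield (train, actual, forecast) per origin, using a last-value forecast."""
    for origin in range(min_train, len(series) - horizon + 1):
        train = series[:origin]
        actual = series[origin:origin + horizon]
        forecast = [train[-1]] * horizon  # naive baseline: repeat last value
        yield train, actual, forecast


series = [10.0, 12.0, 11.0, 13.0, 14.0, 13.0]
folds = list(rolling_origin_naive(series, min_train=3))
fold_mae = [mae(actual, forecast) for _, actual, forecast in folds]
last_train, last_actual, last_forecast = folds[-1]
print("MAE per origin:", fold_mae)
print("MASE (last origin):", mase(last_actual, last_forecast, last_train))
```

A MASE below 1 means the forecast beat the in‑sample naive benchmark on that fold.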
Week 9 – Finance‑specific evaluation & leakage control
- Lec A: Feature timing & label definition (t+1 returns, multi‑step horizons), survivorship bias, look‑ahead traps, data snooping.
- Lec B: Walk‑forward / expanding window, embargoed splits, drift detection; error analysis by regime (volatility bins, bull/bear).
- Deliverable: A robust evaluation plan + revised splits; leakage test added to pytest.
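A simplified sketch of expanding‑window walk‑forward splits with a gap, plus the kind of leakage test the deliverable asks for. Here "embargo" is taken to mean a number of observations dropped between the end of training and the start of testing; that is an assumption for the example, and stricter purging schemes exist. All names are hypothetical.

```python
def walk_forward_splits(n_samples, min_train, test_size, embargo=0):
    """Yield (train_idx, test_idx) with an expanding training window.

    The `embargo` observations right before each test block are excluded
    from training, so labels that overlap the test period cannot leak in.
    """
    start = min_train + embargo  # first test index
    while start + test_size <= n_samples:
        train_idx = list(range(0, start - embargo))
        test_idx = list(range(start, start + test_size))
        yield train_idx, test_idx
        start += test_size  # slide forward by one test block


def test_no_leakage_between_train_and_test():
    embargo = 2
    splits = list(
        walk_forward_splits(20, min_train=8, test_size=4, embargo=embargo)
    )
    assert splits, "expected at least one split"
    for train_idx, test_idx in splits:
        # Training must end strictly before the test block, with the gap respected.
        assert max(train_idx) + embargo < min(test_idx)
        assert not set(train_idx) & set(test_idx)


splits = list(walk_forward_splits(20, min_train=8, test_size=4, embargo=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Writing the check against indices keeps the test fast enough for the "fast jobs only" CI from Week 7.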
Week 10 – PyTorch fundamentals
- Lec A: Tensors, autograd, datasets/dataloaders for windows; training loop, early stopping; GPU in Colab; mixed precision.
- Lec B: A small LSTM/TCN baseline for forecasting; monitoring loss/metrics; save best weights.
- Deliverable: PyTorch baseline surpasses classical baseline on at least one metric.
Week 11 – Transformers for sequences (tiny GPT)
- Lec A: Attention from scratch; tiny char‑level GPT (embeddings, positions, single head → multi‑head), sanity‑check overfitting on toy data.
- Lec B: Adapt to time series: window embedding, causal masking, regression head; ablation (context length, heads, dropout) within Colab budget.
- Deliverable: Transformer results + one ablation figure; notes on compute/time.
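To make "causal masking" concrete, here is a dependency‑free sketch of single‑head scaled dot‑product attention in which position t only sees positions up to t. It is an illustration rather than the PyTorch code used in class: there are no learned projections, and the inputs are toy vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head attention with a causal mask.

    Returns (outputs, weights); weights[t][s] is 0.0 for every s > t.
    """
    d_k = len(keys[0])
    n = len(keys)
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Causal mask: only keys 0..t are visible to position t.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        out = [
            sum(w[s] * values[s][j] for s in range(t + 1))
            for j in range(len(values[0]))
        ]
        weights.append(w + [0.0] * (n - t - 1))  # pad masked positions with 0
        outputs.append(out)
    return outputs, weights


# Toy sequence of three 2-d vectors, used as queries, keys, and values alike.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

The printed weight matrix is lower‑triangular, which is the property to sanity‑check before adapting the model to time‑series windows.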
Week 12 – Productivity at scale (lightweight)
- Lec A: Packaging a small library (src/ layout, pyproject.toml), simple CLI (Typer) for batch inference; config via YAML.
- Lec B: Optional FastAPI endpoint demo (local only) + reproducibility audit (fresh‑clone run).
- Deliverable: Tagged release v1.0-rc; CLI can score a held‑out period and write a report.
Week 13 – Communication & showcase
- Lec A: Poster + abstract workshop; tell the error‑analysis story; figure polish; README & model card.
- Lec B: In‑class presentations + final feedback; plan for continuing to the Spring symposium (next‑steps backlog).
- Deliverable: Poster draft, 250‑word abstract, and a reproducible repo ready to extend.
Project spine
- Milestones: W1 repo & env → W3 automated data pipeline → W6 external data → W7 CI green → W8 baselines → W9 robust eval plan → W10 PyTorch baseline → W11 tiny Transformer → W12 release candidate → W13 poster & talk.
- Tracking (minimal): log experiments to a simple CSV (results/experiments.csv) and keep a Quarto “lab notebook.”
- Data strategy: keep raw data out of Git (use make get-data); store processed Parquet under 100 MB if you must commit; otherwise regenerate. Use Git‑LFS only for small, immutable artifacts to avoid quota pain.
- Secrets: .env with python‑dotenv + .env in .gitignore. For Colab, use environment variables or a JSON in Drive (not committed).
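The minimal tracking idea, appending one row per run to a CSV, might look like the sketch below. The column set and the helper name are assumptions, the metric values are placeholders rather than results, and the demo writes to a temporary directory instead of the real results/experiments.csv.

```python
import csv
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical column layout for the experiment log.
FIELDS = ["timestamp", "model", "params", "metric", "value"]


def log_experiment(path, model, params, metric, value):
    """Append one experiment row, writing the header if the file is new."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(
            {
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "model": model,
                "params": json.dumps(params, sort_keys=True),  # one JSON cell
                "metric": metric,
                "value": value,
            }
        )


# Demo with placeholder numbers, written under a temp directory.
log_path = Path(tempfile.mkdtemp()) / "results" / "experiments.csv"
log_experiment(log_path, "seasonal_naive", {"season": 5}, "MAE", 1.25)
log_experiment(log_path, "lstm", {"hidden": 32, "lr": 0.001}, "MAE", 1.10)

with log_path.open(newline="") as handle:
    logged = list(csv.DictReader(handle))
print(logged)
```

Storing parameters as a single JSON cell keeps the CSV schema stable when different models have different hyperparameters.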