Introduction


13‑week plan (2 × 75‑min lectures per week)

Week 1 – Setup, Colab, Git/GitHub

  • Lec A: Local Python + VS Code; Colab basics (GPU, Drive mount, persistence limits), repo cloning in Colab, requirements.txt, seeds.
  • Lec B: Git essentials, branching, PRs, code review etiquette, .gitignore, Git‑LFS do’s/don’ts (quota pitfalls).
  • Deliverable: Team repo with a Colab notebook that runs and logs environment info; one PR merged.
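
As a rough illustration of the "logs environment info" and "seeds" items above, a notebook's first cell could look like the sketch below. The function name `collect_env_info` and the chosen fields are assumptions made for this example, not a course requirement. It seeds only Python's built-in RNG; NumPy and PyTorch keep their own generators and would need their own seeding calls.

```python
import json
import platform
import random
import sys


def collect_env_info(seed: int = 42) -> dict:
    """Seed Python's RNG and return basic environment facts for logging.

    Hypothetical helper: the field names are illustrative only.
    """
    random.seed(seed)
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "executable": sys.executable,
        "seed": seed,
    }


env_info = collect_env_info()
print(json.dumps(env_info, indent=2))
```

Printing the dictionary as JSON keeps the log easy to diff between a local run and a Colab run.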

Week 2 – Reproducible reporting (Quarto) + RStudio cameo

  • Lec A: Quarto for Python: parameters, caching, citations; publish to GitHub Pages.
  • Lec B (15–25 min cameo): RStudio + Quarto rendering (so students can read R‑centric docs later), then back to Python.
  • Deliverable: Parameterized EDA report (symbol, date range as params).

Week 3 – Unix for data work + automation

  • Lec A: Shell basics (pipes, redirects), grep/sed/awk, find/xargs, regex.
  • Lec B: Shell scripts, simple Makefile/justfile targets; rsync, quick SSH/tmux tour.
  • Deliverable: make get-data and make report run end‑to‑end.

Week 4 – SQL I (schemas, joins)

  • Lec A: SQLite in repo; schema design for OHLCV + metadata; SELECT/JOIN/GROUP BY.
  • Lec B: Window functions; indexes; pandas.read_sql pipelines.
  • Deliverable: SQL notebook producing a tidy table ready for modeling.
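
One possible shape for the Week 4 material, sketched with Python's built-in sqlite3 module: a two-table schema (OHLCV plus symbol metadata), a JOIN, and a LAG window function. Table names, column names, and the sample rows are made up for illustration. Window functions need SQLite 3.25 or newer, which is an assumption about the installed build.

```python
import sqlite3

# In-memory database for the sketch; a course repo would point at a file instead.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE symbols (symbol TEXT PRIMARY KEY, sector TEXT);
    CREATE TABLE ohlcv (
        symbol TEXT REFERENCES symbols(symbol),
        date   TEXT,
        open REAL, high REAL, low REAL, close REAL,
        volume INTEGER,
        PRIMARY KEY (symbol, date)
    );
    """
)
conn.execute("INSERT INTO symbols VALUES (?, ?)", ("AAA", "tech"))
conn.executemany(
    "INSERT INTO ohlcv VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("AAA", "2024-01-02", 9.8, 10.2, 9.7, 10.0, 100),
        ("AAA", "2024-01-03", 10.1, 11.2, 10.0, 11.0, 120),
        ("AAA", "2024-01-04", 11.0, 12.3, 10.9, 12.1, 90),
    ],
)

# JOIN metadata onto prices and pull the previous close per symbol.
query = """
SELECT o.symbol, s.sector, o.date, o.close,
       LAG(o.close) OVER (PARTITION BY o.symbol ORDER BY o.date) AS prev_close
FROM ohlcv AS o
JOIN symbols AS s ON s.symbol = o.symbol
ORDER BY o.symbol, o.date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The same `query` string and `conn` object can be handed to pandas.read_sql to get a DataFrame for the modeling step.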

Week 5 – pandas for time series

  • Lec A: Cleaning, dtypes, missing values, merges; groupby, pivot; Parquet I/O.
  • Lec B: Time‑series ops: resampling, rolling windows, shifting/lagging, calendar effects.
  • Deliverable: Cleaned Parquet dataset + feature snapshot.
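
The time‑series operations listed for Lec B can be sketched on a toy daily series as follows. The helper name `add_lag_features` and the column names are hypothetical. The sketch shifts every derived feature by one step so that the row for day t only uses data through day t-1, which is one common convention and not the only valid one.

```python
import pandas as pd


def add_lag_features(prices: pd.Series, window: int = 3) -> pd.DataFrame:
    """Build simple lagged features from a close-price series (illustrative only)."""
    out = pd.DataFrame({"close": prices})
    # One-day simple return.
    out["ret_1d"] = out["close"] / out["close"].shift(1) - 1
    # Lag the return so the feature at t is known before t.
    out["ret_lag1"] = out["ret_1d"].shift(1)
    # Rolling mean of the close, also lagged by one step.
    out["roll_mean"] = out["close"].rolling(window).mean().shift(1)
    return out


idx = pd.date_range("2024-01-01", periods=10, freq="D")
prices = pd.Series([float(v) for v in range(100, 110)], index=idx)

features = add_lag_features(prices)
# Downsample to weekly frequency, keeping the last observation of each week.
weekly_last = prices.resample("W").last()
print(features.head())
print(weekly_last)
```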

Week 6 – APIs & Web scraping (ethics + caching)

  • Lec A: HTTP basics, requests, pagination, auth, retries, backoff; don’t hard‑code keys (python‑dotenv).
  • Lec B: BeautifulSoup, CSS selectors, robots.txt, throttling; cache raw pulls; persist to SQL/Parquet.
  • Deliverable: One external data source ingested with caching & schema checks.
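
The retry, backoff, and "cache raw pulls" ideas above might be combined roughly as in this sketch. `fetch_with_cache` is a hypothetical helper, and the flaky fetch function stands in for a real HTTP call (for example a wrapper around requests.get) so the example runs offline.

```python
import json
import tempfile
import time
from pathlib import Path


def fetch_with_cache(key, fetch, cache_dir, retries=3, base_delay=0.5,
                     retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Return the cached payload for `key`, or call `fetch` with exponential backoff."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        # Cache hit: never touch the network again for this key.
        return json.loads(cache_file.read_text())
    for attempt in range(retries):
        try:
            payload = fetch()
            break
        except retry_on:
            if attempt == retries - 1:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            sleep(base_delay * 2 ** attempt)
    cache_file.write_text(json.dumps(payload))
    return payload


calls = {"n": 0}


def flaky_fetch():
    """Simulated API call that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return {"symbol": "AAA", "close": 10.0}


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_with_cache("AAA_2024", flaky_fetch, tmp, sleep=lambda s: None)
    second = fetch_with_cache("AAA_2024", flaky_fetch, tmp, sleep=lambda s: None)

print(first, calls["n"])
```

Passing `sleep` as a parameter keeps the demo and any unit tests fast; production code would leave the default.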

Week 7 – Quality: tests, lint, minimal CI

  • Lec A: pytest (2–3 meaningful tests), data validation (light Pandera or custom checks), logging, type hints.
  • Lec B: Pre‑commit (black, ruff, nbstripout), GitHub Actions to run tests + lint on PRs (fast jobs only).
  • Deliverable: CI badge green; the suite includes at least one test that fails when leakage or a schema violation is introduced, demonstrating the guard.
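
A schema guard of the "custom checks" kind mentioned in Lec A could be as small as the sketch below, written so pytest would collect the two `test_` functions. The function `validate_ohlcv` and its rules are assumptions for illustration; Pandera would express similar checks declaratively.

```python
REQUIRED = {"date", "open", "high", "low", "close", "volume"}


def validate_ohlcv(rows):
    """Return a list of human-readable schema problems (empty list means OK)."""
    errors = []
    for i, row in enumerate(rows):
        missing = REQUIRED - set(row)
        if missing:
            errors.append(f"row {i}: missing {sorted(missing)}")
            continue
        body_low = min(row["open"], row["close"])
        body_high = max(row["open"], row["close"])
        if not (row["low"] <= body_low and body_high <= row["high"]):
            errors.append(f"row {i}: open/close outside low/high range")
        if row["volume"] < 0:
            errors.append(f"row {i}: negative volume")
    return errors


def test_valid_row_passes():
    good = {"date": "2024-01-02", "open": 9.8, "high": 10.2,
            "low": 9.7, "close": 10.0, "volume": 100}
    assert validate_ohlcv([good]) == []


def test_bad_rows_are_flagged():
    bad_range = {"date": "2024-01-03", "open": 9.8, "high": 9.9,
                 "low": 9.7, "close": 10.0, "volume": 100}
    no_volume = {"date": "2024-01-04", "open": 1.0, "high": 1.0,
                 "low": 1.0, "close": 1.0}
    errors = validate_ohlcv([bad_range, no_volume])
    assert len(errors) == 2
```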

Week 8 – Time‑series baselines & backtesting

  • Lec A: Problem framing; horizon, step size; MAE/sMAPE/MASE; rolling‑origin evaluation.
  • Lec B: Baselines: naive/seasonal‑naive; quick ARIMA/Prophet or sklearn regressor with lags.
  • Deliverable: Baseline model card + backtest plot in Quarto.
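
To make the metrics and rolling‑origin evaluation from this week concrete, here is a dependency‑free sketch with a naive (last value) forecaster. Function names and the toy series are illustrative. The sMAPE variant shown (percent scale, zero‑denominator terms counted as 0) is one of several definitions in use.

```python
def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)


def smape(actual, pred):
    """Symmetric MAPE in percent; terms with a zero denominator count as 0."""
    total = 0.0
    for a, p in zip(actual, pred):
        denom = abs(a) + abs(p)
        total += 0.0 if denom == 0 else 2 * abs(a - p) / denom
    return 100 * total / len(actual)


def mase(actual, pred, train, season=1):
    """MAE scaled by the in-sample (seasonal) naive forecast error on `train`."""
    diffs = [abs(train[i] - train[i - season]) for i in range(season, len(train))]
    return mae(actual, pred) / (sum(diffs) / len(diffs))


def rolling_origin_naive(series, min_train, horizon=1):
    """Expanding-window folds: forecast `horizon` steps with the last training value."""
    folds = []
    for origin in range(min_train, len(series) - horizon + 1):
        train = series[:origin]
        actual = series[origin:origin + horizon]
        forecast = [train[-1]] * horizon
        folds.append((origin, actual, forecast))
    return folds


series = [10, 12, 11, 13, 14, 13]
folds = rolling_origin_naive(series, min_train=3)
actuals = [a for _, actual, _ in folds for a in actual]
forecasts = [f for _, _, forecast in folds for f in forecast]
overall_mae = mae(actuals, forecasts)
print(folds, overall_mae)
```

Any later model can be dropped into the same loop by replacing the line that builds `forecast`, which keeps the comparison with the baseline fair.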

Week 9 – Finance‑specific evaluation & leakage control

  • Lec A: Feature timing & label definition (t+1 returns, multi‑step horizons), survivorship bias, look‑ahead traps, data snooping.
  • Lec B: Walk‑forward / expanding window, embargoed splits, drift detection; error analysis by regime (volatility bins, bull/bear).
  • Deliverable: A robust evaluation plan + revised splits; leakage test added to pytest.
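
Walk‑forward splits with an embargo, plus the kind of leakage test the deliverable asks for, could be sketched like this. `walk_forward_splits` is a hypothetical helper that works on integer positions; the embargo drops the last few training samples before each test block, which matters when labels span several steps and would otherwise overlap the test period.

```python
def walk_forward_splits(n_samples, n_folds, test_size, embargo=0):
    """Expanding-window splits; `embargo` samples are dropped between train and test."""
    splits = []
    for fold in range(n_folds):
        test_start = n_samples - (n_folds - fold) * test_size
        train_end = test_start - embargo
        if train_end <= 0:
            raise ValueError("not enough samples for the requested folds")
        train_idx = list(range(train_end))
        test_idx = list(range(test_start, test_start + test_size))
        splits.append((train_idx, test_idx))
    return splits


def test_no_leakage():
    """Every training index must sit at least `embargo` steps before the test block."""
    embargo = 5
    for train_idx, test_idx in walk_forward_splits(100, n_folds=3, test_size=10,
                                                   embargo=embargo):
        assert max(train_idx) + embargo < min(test_idx)


splits = walk_forward_splits(100, n_folds=3, test_size=10, embargo=5)
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```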

Week 10 – PyTorch fundamentals

  • Lec A: Tensors, autograd, datasets/dataloaders for windows; training loop, early stopping; GPU in Colab; mixed precision.
  • Lec B: A small LSTM/TCN baseline for forecasting; monitoring loss/metrics; save best weights.
  • Deliverable: PyTorch baseline surpasses classical baseline on at least one metric.
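
Two pieces of the Week 10 training setup do not depend on the framework: cutting a series into (context, target) windows and deciding when to stop. The sketch below shows both in plain Python under assumed names (`make_windows`, `EarlyStopping`). In a PyTorch version the windowing would typically live inside a Dataset and the "new best" branch is where the model's state_dict would be saved.

```python
def make_windows(series, context, horizon=1):
    """Slide over `series`, returning (input window, target window) pairs."""
    pairs = []
    for start in range(len(series) - context - horizon + 1):
        x = series[start:start + context]
        y = series[start + context:start + context + horizon]
        pairs.append((x, y))
    return pairs


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, val_loss, epoch):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            # New best: this is where the best weights would be saved.
            self.best = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


windows = make_windows([1, 2, 3, 4, 5], context=3)

# Made-up validation losses standing in for a real training loop.
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss, epoch):
        stopped_at = epoch
        break
print(windows, stopped_at, stopper.best_epoch)
```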

Week 11 – Transformers for sequences (tiny GPT)

  • Lec A: Attention from scratch; tiny char‑level GPT (embeddings, positions, single head → multi‑head), sanity‑check overfitting on toy data.
  • Lec B: Adapt to time series: window embedding, causal masking, regression head; ablation (context length, heads, dropout) within Colab budget.
  • Deliverable: Transformer results + one ablation figure; notes on compute/time.
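
The causal masking idea from this week can be seen in a few lines of plain Python before moving to tensors. This is a deliberately simplified single head: there are no learned projections, and queries, keys, and values are all the raw inputs, which is an assumption made to keep the sketch short. Position t attends only to positions 0..t.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of equal-length vectors, one per time step.
    Returns (outputs, attention weights); masked weights are 0.0.
    """
    d = len(q[0])
    n = len(q)
    outputs, weights = [], []
    for t in range(n):
        # Only score positions s <= t, which is the causal mask.
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        weights.append(w + [0.0] * (n - t - 1))
        outputs.append(
            [sum(w[s] * v[s][j] for s in range(t + 1)) for j in range(len(v[0]))]
        )
    return outputs, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_self_attention(x, x, x)
for row in attn:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1 and has zeros to the right of the diagonal, which is a quick sanity check to repeat on the tensor version.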

Week 12 – Productivity at scale (lightweight)

  • Lec A: Packaging a small library (src/ layout, pyproject.toml), simple CLI (Typer) for batch inference; config via YAML.
  • Lec B: Optional FastAPI endpoint demo (local only) + reproducibility audit (fresh‑clone run).
  • Deliverable: Tagged release v1.0-rc, CLI can score a held‑out period and write a report.

Week 13 – Communication & showcase

  • Lec A: Poster + abstract workshop; tell the error‑analysis story; figure polish; README & model card.
  • Lec B: In‑class presentations + final feedback; plan for continuing to the Spring symposium (next‑steps backlog).
  • Deliverable: Poster draft, 250‑word abstract, and a reproducible repo ready to extend.

Project spine

  • Milestones: W1 repo & env → W3 automated data pipeline → W6 external data → W7 CI green → W8 baselines → W9 robust eval plan → W10 PyTorch baseline → W11 tiny Transformer → W12 release candidate → W13 poster & talk.
  • Tracking (minimal): log experiments to a simple CSV (results/experiments.csv) and keep a Quarto “lab notebook.”
  • Data strategy: keep raw data out of Git (use make get-data); if you must commit processed Parquet, keep each file under 100 MB (GitHub's per‑file limit); otherwise regenerate it. Use Git‑LFS only for small, immutable artifacts to avoid quota pain.
  • Secrets: load keys from a .env file with python‑dotenv and list .env in .gitignore. For Colab, use environment variables or a JSON file in Drive (not committed).
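
The minimal tracking described under "Tracking" above could be implemented along these lines. The column set and the `log_experiment` helper are assumptions for illustration; the demo writes to a temporary directory instead of the real results/experiments.csv.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["timestamp", "run_name", "model", "metric", "value"]


def log_experiment(path, run_name, model, metric, value):
    """Append one result row to a CSV, writing the header on first use."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "run_name": run_name,
            "model": model,
            "metric": metric,
            "value": value,
        })


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "results" / "experiments.csv"
    log_experiment(csv_path, "run-001", "seasonal_naive", "mae", 1.25)
    log_experiment(csv_path, "run-002", "lstm", "mae", 1.10)
    with csv_path.open(newline="") as f:
        logged = list(csv.DictReader(f))

print(logged)
```

An append‑only CSV is easy to diff in a PR and to load into the Quarto lab notebook.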