1  Session 1 — Dev environment & Colab workflow


1.1.1 Learning goals

By the end of class, students can:

  1. Mount Google Drive in Colab and work in a persistent course folder.
  2. Clone a GitHub repo into Drive (or create a project folder if no repo yet).
  3. Create and install from a soft‑pinned requirements.txt.
  4. Verify environment info (Python, OS, library versions) and GPU availability.
  5. Use a reproducibility seed pattern (NumPy + PyTorch) and validate it.
  6. Save a simple system check report to the repo.

1.2 Agenda (75 min)

  • (5 min) Course framing: how we’ll work this semester
  • (12 min) Slides & demo: Colab + Drive persistence; project folders; soft vs hard pins
  • (8 min) Slides & demo: reproducibility basics (seeds, RNG, deterministic ops)
  • (35 min) In‑class lab (Colab): mount Drive → clone/create project → requirements → environment check → reproducibility check → write report
  • (10 min) Wrap‑up, troubleshooting, and homework briefing

1.3 Main Points

Why Colab + Drive

  • Colab gives you GPUs and a clean Python environment every session.
  • The runtime is ephemeral. Anything under /content disappears.
  • Mount Drive and work under /content/drive/MyDrive/... to persist code and outputs.
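
To make the ephemeral versus persistent distinction concrete, here is a minimal sketch of a helper that reports whether a path sits under the Drive mount. The helper name `is_persistent` and the `DRIVE_ROOT` constant are illustrative assumptions based on the paths above, and the check is purely lexical (it does not touch the file system).

```python
from pathlib import PurePosixPath

# Assumed mount root, taken from the path used in this session.
DRIVE_ROOT = PurePosixPath("/content/drive/MyDrive")

def is_persistent(path: str, drive_root: PurePosixPath = DRIVE_ROOT) -> bool:
    """Return True if `path` is the Drive root or lies below it (hypothetical helper)."""
    p = PurePosixPath(path)
    return p == drive_root or drive_root in p.parents

print(is_persistent("/content/drive/MyDrive/dspt25/unified-stocks"))  # True
print(is_persistent("/content/sample_data"))                          # False
```

Anything for which this returns False lives only in the current runtime and will be gone after a restart.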

Project layout (today’s minimal)

project/
  reports/
  notebooks/
  data/
  requirements.txt
  system_check.ipynb

(We’ll add src/, tests, CI in later sessions.)

Pins: soft vs hard

  • Soft pins (e.g., pandas>=2.2,<3.0) accept any version within a range, which keeps you compatible across machines.
  • Hard pins (exact versions) are for releases. Today we’ll use soft pins, then freeze to requirements-lock.txt in homework.
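
The difference between the two kinds of pin can be illustrated with a toy version check. This is a simplified sketch, not pip's actual logic: real version matching follows PEP 440 and is handled by pip itself. The function names below are hypothetical, and only plain numeric versions such as 2.2.1 are supported.

```python
def parse_version(text: str, width: int = 4) -> tuple:
    """Turn '2.2' into (2, 2, 0, 0). Simplified: numeric segments only."""
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))

def satisfies_soft_pin(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper, like 'pkg>=lower,<upper'."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)

def satisfies_hard_pin(version: str, pinned: str) -> bool:
    """True only for the exact pinned version, like 'pkg==pinned'."""
    return parse_version(version) == parse_version(pinned)

versions = ["2.1.4", "2.2.0", "2.3.1", "3.0.0"]
for v in versions:
    print(v,
          "| soft (>=2.2,<3.0):", satisfies_soft_pin(v, "2.2", "3.0"),
          "| hard (==2.2.0):", satisfies_hard_pin(v, "2.2.0"))
```

The soft pin accepts both 2.2.0 and 2.3.1, while the hard pin accepts only 2.2.0. That is why soft pins suit day-to-day development and hard pins suit a frozen snapshot.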

Reproducibility basics

  • Fix seeds for random, NumPy, PyTorch (and CUDA if present).
  • Disable nondeterministic cuDNN behavior for repeatability in simple models.
  • Beware: some ops remain nondeterministic on GPU; we’ll use simple ones.
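
The core idea behind seeding can be shown with the standard library alone. The sketch below is only an illustration (the helper name `draws` is made up for this example); the lab applies the same pattern to NumPy and PyTorch.

```python
import random

def draws(seed: int, n: int = 5) -> list:
    """Reseed the global generator, then draw n floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draws(42)
second = draws(42)   # same seed: the sequence repeats
other = draws(7)     # different seed: a different sequence

print("same seed, same numbers:     ", first == second)
print("different seed, same numbers:", first == other)
```

Seeding fixes the random number stream, but it does not make every GPU operation deterministic. The cuDNN and deterministic-algorithm settings listed above cover that separately.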

Minimal Git today

  • If you already have a repo: clone it into Drive.
  • If not: create a folder; later you can upload the notebook via GitHub web UI.
  • Full Git workflow (branch/PR/CI) starts next session.

1.4 In‑class Lab (35 min)

Instructor tip: Put these as sequential Colab cells. Students should run them top‑to‑bottom. Replace placeholders like YOUR_USERNAME / YOUR_REPO before class if you already created a starter repo. If not, tell them to use the “no‑repo” path in Step 2B.

1.4.1 1) Mount Google Drive and create a course folder

# Colab cell
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

COURSE_DIR = "/content/drive/MyDrive/dspt25"  # change if you prefer another path
PROJECT_NAME = "unified-stocks"               # course project folder/repo name

Save this notebook as system_check.ipynb.

# Colab cell: make directories and cd into project folder
import os, pathlib
base = pathlib.Path(COURSE_DIR)
proj = base / PROJECT_NAME  # pathlib overloads / to join path components
for p in [base, proj, proj/"reports", proj/"notebooks", proj/"data"]:
    p.mkdir(parents=True, exist_ok=True)

os.chdir(proj)
print("Working in:", os.getcwd())

1.4.2 2) (Optional) If you already have a GitHub repo, clone it into Drive

Pick A or B (not both).

A. Clone an existing repo (recommended if you created a starter repo)

# Colab cell: clone via HTTPS (public repos work as-is; for a private repo, upload files via the web UI later instead of pushing from Colab).
# Clone only once. If you run this notebook again, skip this cell.
REPO_URL = "https://github.com/YOUR_ORG_OR_USERNAME/YOUR_REPO.git"  # <- change me
import subprocess, os
os.chdir(base)  # clone next to your project folder
subprocess.run(["git", "clone", REPO_URL], check=True)  # check=True raises an error instead of failing silently
# Optionally, use the cloned repo as the working directory (uncomment the lines below to do this):
# REPO_NAME = REPO_URL.split("/")[-1].replace(".git","")
# os.chdir(base/REPO_NAME)
# print("Working in:", os.getcwd())
os.chdir(proj)  # change back to the project directory
print("Working in:", os.getcwd())

B. No repo yet? Stay with the folder we created. You’ll upload files via GitHub web UI after class.
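
Because git clone fails when the target folder already exists, a rerun of the notebook needs some way to skip the clone. One possible guard is sketched below; `repo_name_from_url` and `needs_clone` are hypothetical helper names, and the temporary directory only simulates a course folder.

```python
import tempfile
from pathlib import Path

def repo_name_from_url(repo_url: str) -> str:
    """'https://github.com/org/repo.git' -> 'repo'."""
    name = repo_url.rstrip("/").split("/")[-1]
    return name[:-4] if name.endswith(".git") else name

def needs_clone(base_dir, repo_url: str) -> bool:
    """True if base_dir has no cloned copy yet (no <repo>/.git folder)."""
    return not (Path(base_dir) / repo_name_from_url(repo_url) / ".git").exists()

url = "https://github.com/YOUR_ORG_OR_USERNAME/YOUR_REPO.git"  # placeholder from the lab
with tempfile.TemporaryDirectory() as tmp:
    before = needs_clone(tmp, url)
    # Simulate a finished clone by creating the .git folder.
    (Path(tmp) / repo_name_from_url(url) / ".git").mkdir(parents=True)
    after = needs_clone(tmp, url)

print("needs clone before:", before, "| after:", after)
```

In the lab cell, wrapping the subprocess.run call in `if needs_clone(base, REPO_URL):` would make it safe to rerun.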

1.4.3 3) Create a soft‑pinned requirements.txt and install

# Colab cell: write a soft-pinned requirements.txt
req = """\
pandas>=2.2,<3.0
numpy>=2.0.0,<3.0
pyarrow>=15,<17
matplotlib>=3.8,<4.0
scikit-learn>=1.6,<2.0
yfinance>=0.2,<0.3
python-dotenv>=1.0,<2.0
"""
with open("requirements.txt", "w") as f:
    f.write(req)
print(open("requirements.txt").read())

# Colab cell: install (quietly). Torch is usually preinstalled in Colab; we'll check separately.
!pip install -q -r requirements.txt

# Colab cell: PyTorch check. If it is missing (rare in Colab), install it from PyPI as a fallback.
try:
    import torch
    print("PyTorch:", torch.__version__)
except ImportError:
    print("PyTorch not found; installing from PyPI as a fallback...")
    !pip install -q torch
    import torch
    print("PyTorch:", torch.__version__)

1.4.4 4) Environment report (Python/OS/lib versions, GPU availability)

# Colab cell: environment info + GPU check
import os, sys, platform, json, time
import pandas as pd
import numpy as np

env = {
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
    "python": sys.version,
    "os": platform.platform(),
    "pandas": pd.__version__,
    "numpy": np.__version__,
}

try:
    import torch
    env["torch"] = torch.__version__
    env["cuda_available"] = bool(torch.cuda.is_available())
    env["cuda_device"] = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
except Exception as e:
    env["torch"] = "not importable"
    env["cuda_available"] = False
    env["cuda_device"] = "CPU"

print(env)
os.makedirs("reports", exist_ok=True)
with open("reports/environment.json","w") as f:
    json.dump(env, f, indent=2)

1.4.5 5) Reproducibility seed utility + quick validation

# Colab cell: reproducibility helpers
import random
import numpy as np

def set_seed(seed: int = 42, deterministic_torch: bool = True):
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        if deterministic_torch:
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
            try:
                torch.use_deterministic_algorithms(True)
            except Exception:
                pass
    except Exception:
        pass

def sample_rng_fingerprint(n=5, seed=42):
    set_seed(seed)
    a = np.random.rand(n).round(6).tolist()
    try:
        import torch
        b = torch.rand(n).tolist()
        b = [round(x,6) for x in b]
    except Exception:
        b = ["torch-missing"]*n
    return {"numpy": a, "torch": b}

f1 = sample_rng_fingerprint(n=6, seed=123)
f2 = sample_rng_fingerprint(n=6, seed=123)
print("Fingerprint #1:", f1)
print("Fingerprint #2:", f2)
print("Match:", f1 == f2)

with open("reports/seed_fingerprint.json","w") as f:
    json.dump({"f1": f1, "f2": f2, "match": f1==f2}, f, indent=2)

1.4.6 6) Create (or verify) tickers_25.csv for the course

# Colab cell: create stock list if it doesn't exist yet
import pandas as pd, os
tickers = [
    "AAPL","MSFT","AMZN","GOOGL","META","NVDA","TSLA","JPM","JNJ","V",
    "PG","HD","BAC","XOM","CVX","PFE","KO","DIS","NFLX","INTC",
    "CSCO","ORCL","T","VZ","WMT"
]
path = "tickers_25.csv"
if not os.path.exists(path):
    pd.DataFrame({"ticker": tickers}).to_csv(path, index=False)
pd.read_csv(path).head()

1.4.7 7) (Optional) Prove GPU works by allocating a small tensor

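# Colab cell: tiny GPU smoke test (falls back to CPU if CUDA is unavailable)
import torch

# set_seed() in Step 5 turned on deterministic algorithms. With that flag set, matrix
# multiplication on CUDA can raise a RuntimeError, so switch it off for this smoke test.
torch.use_deterministic_algorithms(False)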

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)
y = x @ x.T
print("Device:", device, "| y shape:", y.shape, "| mean:", y.float().mean().item())

1.4.8 8) Save a short Markdown environment report

# Colab cell: write a small Markdown summary for humans
from textwrap import dedent
summary = dedent(f"""
# System Check

- Timestamp: {env['timestamp']}
- Python: `{env['python']}`
- OS: `{env['os']}`
- pandas: `{env['pandas']}` | numpy: `{env['numpy']}` | torch: `{env['torch']}`
- CUDA available: `{env['cuda_available']}` | Device: `{env['cuda_device']}`

## RNG Fingerprint
- Match on repeated seeds: `{f1 == f2}`
- numpy: `{f1['numpy']}`
- torch: `{f1['torch']}`
""").strip()

open("reports/system_check.md","w").write(summary)
print(summary)

1.4.9 9) Save the notebook as system_check.ipynb

To save it automatically, you can use the following code:

# Colab cell: save this notebook as system_check.ipynb
from google.colab import _message
notebook_name = "system_check.ipynb"
# Create the 'notebooks' subdirectory path
out_dir = proj / "notebooks"
out_path = out_dir / notebook_name

# Make sure the folder exists
out_dir.mkdir(parents=True, exist_ok=True)

# Get the CURRENT notebook JSON from Colab
resp = _message.blocking_request('get_ipynb', timeout_sec=10)
nb = resp.get('ipynb') if isinstance(resp, dict) else None

# Basic sanity check: ensure there are cells
if not nb or not isinstance(nb, dict) or not nb.get('cells'):
    raise RuntimeError("Could not capture the current notebook contents (no cells returned). "
                       "Try running this cell again after a quick edit, or use File → Save a copy in Drive once.")

# Write to Drive
with open(out_path, 'w', encoding='utf-8') as f:
    json.dump(nb, f, ensure_ascii=False, indent=2)

print("Saved notebook to:", out_path)

What to submit after class (if you already have a GitHub repo): For today, students may upload system_check.ipynb, reports/environment.json, and reports/system_check.md via the GitHub web UI (Add file → Upload files). We’ll do proper pushes/PRs next session.


1.5 Troubleshooting notes (share in class)

  • Drive won’t mount: Refresh the Colab tab, run the mount cell again, re‑authorize Google permissions.
  • pip install hangs: Rerun; if it persists, restart runtime (Runtime → Restart session) and re‑run from the top.
  • PyTorch mismatch: If Colab has Torch preinstalled, don’t upgrade it. If you installed a CPU wheel by mistake and want GPU later, it’s usually easiest to restart runtime.
  • Path confusion: Print os.getcwd() often; ensure you’re inside your project folder under /content/drive/MyDrive/....
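
For the path-confusion case, a small check can replace eyeballing the output of os.getcwd(). The sketch below is one possible approach; the helper name `inside_project` and the example paths are illustrative only.

```python
import os
from pathlib import Path

def inside_project(project_dir, cwd=None) -> bool:
    """True if cwd (default: the current directory) is project_dir or below it."""
    current = Path(cwd if cwd is not None else os.getcwd()).resolve()
    project = Path(project_dir).resolve()
    return current == project or project in current.parents

print("Current directory:", os.getcwd())
print(inside_project("/a/b", cwd="/a/b/reports"))  # inside the project
print(inside_project("/a/b", cwd="/a"))            # outside the project
```

In the lab notebook, `inside_project(proj)` should return True after the os.chdir(proj) call in Step 1.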

1.6 Homework (due before Session 2)

Goal: Produce a reproducible system snapshot and a seed‑verified mini experiment, then upload to your repo (via GitHub web UI if you’re not comfortable pushing yet).

1.6.1 Part A — Freeze your environment

  1. From the same Colab runtime (after installing), create a lock file:

    # Colab cell: freeze exact versions
    !pip freeze > requirements-lock.txt
    print("Wrote requirements-lock.txt with exact versions")
    !head -n 20 requirements-lock.txt
  2. Add a note to README.md explaining the difference between:

    • requirements.txt (soft pins for development) and
    • requirements-lock.txt (exact versions used today).
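
To see what the lock file records, it helps to parse it into a name-to-version mapping. The following is a rough sketch under simplifying assumptions: it only understands lines of the form name==version, the helper name `parse_lock` is made up, and the versions in `sample_lock` are illustrative rather than taken from a real Colab runtime.

```python
def parse_lock(text: str) -> dict:
    """Map package name -> exact version for lines of the form 'name==version'."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip comments and other entries (e.g. 'pkg @ file://...')
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins

# Illustrative content only; a real pip freeze output is much longer.
sample_lock = """\
numpy==2.0.2
pandas==2.2.2
some-local-pkg @ file:///tmp/build
"""

pins = parse_lock(sample_lock)
print(pins)
```

Each entry in the result is a single exact version, in contrast to the ranges written in requirements.txt.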

1.6.2 Part B — Reproducibility mini‑experiment

Create notebooks/reproducibility_demo.ipynb with the following cells (students copy/paste):

1) Setup & data generation

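import os
# On CUDA, torch.use_deterministic_algorithms(True) needs this cuBLAS setting; set it before any GPU work.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
import numpy as np, torch, random, json, time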

def set_seed(seed=123):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    try:
        torch.use_deterministic_algorithms(True)
    except Exception:
        pass

def make_toy(n=512, d=10, noise=0.1, seed=123):
    set_seed(seed)
    X = torch.randn(n, d)
    true_w = torch.randn(d, 1)
    y = X @ true_w + noise * torch.randn(n, 1)
    return X, y, true_w

device = "cuda" if torch.cuda.is_available() else "cpu"
X, y, true_w = make_toy()
X, y = X.to(device), y.to(device)

2) Minimal training loop (linear model)

def train_once(lr=0.05, steps=300, seed=123):
    set_seed(seed)
    model = torch.nn.Linear(X.shape[1], 1, bias=False).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    losses=[]
    for t in range(steps):
        opt.zero_grad(set_to_none=True)
        yhat = model(X)
        loss = loss_fn(yhat, y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return model.weight.detach().cpu().numpy(), losses[-1]

w1, final_loss1 = train_once(seed=2025)
w2, final_loss2 = train_once(seed=2025)

print("Final loss 1:", round(final_loss1, 6))
print("Final loss 2:", round(final_loss2, 6))
print("Weights equal:", np.allclose(w1, w2, atol=1e-7))

3) Save results JSON

os.makedirs("reports", exist_ok=True)
result = {
    "device": device,
    "final_loss1": float(final_loss1),
    "final_loss2": float(final_loss2),
    "weights_equal": bool(np.allclose(w1, w2, atol=1e-7)),
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
}
with open("reports/reproducibility_results.json","w") as f:
    json.dump(result, f, indent=2)
result

Expected outcome: the two runs with the same seed should produce the same final loss and identical weights (within tolerance). If on GPU, deterministic settings should keep this stable for this simple model.

1.6.3 Part C — Add a .env.example

Create a placeholder for API keys we’ll use later:

env_example = """\
# Example environment variables (do NOT commit a real .env with secrets)
ALPHA_VANTAGE_KEY=
FRED_API_KEY=
"""
open(".env.example", "w").write(env_example)
print(open(".env.example").read())
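
When the keys are needed later, code should read them from the environment rather than from the notebook. A minimal sketch of that pattern follows; the helper name `get_api_key` and the `DEMO_KEY` variable are assumptions for illustration. With python-dotenv (already in requirements.txt), calling load_dotenv() first would copy values from a local .env file into os.environ.

```python
import os

def get_api_key(name: str) -> str:
    """Read a key from the environment and fail loudly if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Copy .env.example to .env and fill it in, "
            "or export the variable before running."
        )
    return value

# Demo with a fake value so the example runs anywhere; never hard-code real keys.
os.environ["DEMO_KEY"] = "not-a-real-key"
print(get_api_key("DEMO_KEY"))
```

Failing with a clear message is a deliberate choice here: an empty key otherwise tends to surface later as a confusing API error.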

1.6.4 Part D — Upload to GitHub

Until we set up pushes/PRs next class, use the GitHub web UI:

  • Upload: system_check.ipynb, reports/environment.json, reports/system_check.md, requirements.txt, requirements-lock.txt, notebooks/reproducibility_demo.ipynb, reports/reproducibility_results.json, .env.example.
  • If you already cloned a repo in class and are comfortable pushing, you may push from your laptop instead. Do not paste tokens into notebooks.

1.6.5 Grading (pass/revise)

  • requirements.txt present; requirements-lock.txt present and non‑empty.
  • system_check.ipynb runs and writes reports/system_check.md + environment.json.
  • reproducibility_demo.ipynb demonstrates identical results across repeated runs with same seed and writes reports/reproducibility_results.json.
  • .env.example present with placeholders.
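
Before uploading, the file-presence part of this checklist can be checked mechanically. The sketch below is one possible self-check, not the grading script: the helper name `missing_deliverables` is hypothetical, the file list is copied from Part D, and the temporary directory stands in for a project folder.

```python
import tempfile
from pathlib import Path

# File list copied from Part D of the homework.
REQUIRED_FILES = [
    "system_check.ipynb",
    "reports/environment.json",
    "reports/system_check.md",
    "requirements.txt",
    "requirements-lock.txt",
    "notebooks/reproducibility_demo.ipynb",
    "reports/reproducibility_results.json",
    ".env.example",
]

def missing_deliverables(repo_root) -> list:
    """Return required files that are absent or empty under repo_root."""
    root = Path(repo_root)
    return [
        rel for rel in REQUIRED_FILES
        if not (root / rel).is_file() or (root / rel).stat().st_size == 0
    ]

# Demo: a folder that contains only requirements.txt.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "requirements.txt").write_text("pandas>=2.2,<3.0\n")
    missing = missing_deliverables(tmp)

print("Missing or empty:", missing)
```

Running `missing_deliverables(proj)` in the course notebook should return an empty list once all parts are done. It does not check that the notebooks actually run.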

1.7 Key points

  • “Colab is ephemeral; persist to Drive.”
  • “Soft pins now; freeze later.”
  • “Seeds are necessary but not sufficient—watch for nondeterministic ops.”
  • “Never store secrets (API keys) in the repo; use .env and keep a .env.example.”

That’s it for Session 1. In Session 2 we’ll set up Git basics and Git‑LFS and move from uploading via web UI to branch/PR workflows.