1 Session 1 — Dev environment & Colab workflow
1.1.1 Learning goals
By the end of class, students can:
- Mount Google Drive in Colab and work in a persistent course folder.
- Clone a GitHub repo into Drive (or create a project folder if no repo yet).
- Create and install from a soft‑pinned requirements.txt.
- Verify environment info (Python, OS, library versions) and GPU availability.
- Use a reproducibility seed pattern (NumPy + PyTorch) and validate it.
- Save a simple system check report to the repo.
1.2 Agenda (75 min)
- (5 min) Course framing: how we’ll work this semester
- (12 min) Slides & demo: Colab + Drive persistence; project folders; soft vs hard pins
- (8 min) Slides & demo: reproducibility basics (seeds, RNG, deterministic ops)
- (35 min) In‑class lab (Colab): mount Drive → clone/create project → requirements → environment check → reproducibility check → write report
- (10 min) Wrap‑up, troubleshooting, and homework briefing
1.3 Main Points
Why Colab + Drive
- Colab gives you GPUs and a clean Python every session.
- The runtime is ephemeral. Anything under /content (outside the Drive mount) disappears when the runtime is recycled.
- Mount Drive and work under /content/drive/MyDrive/... to persist code and outputs.
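One way to sanity-check where a file will land is to test whether its path sits under the Drive mount point. The sketch below is a pure path check with no I/O. It assumes the mount point is /content/drive, as in the bullets above. The helper name `is_persistent` is hypothetical and not part of Colab.

```python
from pathlib import PurePosixPath

DRIVE_MOUNT = PurePosixPath("/content/drive")  # assumed mount point

def is_persistent(path: str) -> bool:
    """Return True if `path` is the Drive mount or lies inside it (no I/O)."""
    p = PurePosixPath(path)
    return p == DRIVE_MOUNT or DRIVE_MOUNT in p.parents

print(is_persistent("/content/drive/MyDrive/project/data.csv"))  # True
print(is_persistent("/content/sample_data/tmp.csv"))             # False
```

A check like this could guard any cell that writes reports, so output does not silently go to the ephemeral disk.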
Project layout (today’s minimal)
project/
  reports/
  notebooks/
  data/
  requirements.txt
  system_check.ipynb
(We’ll add src/, tests, CI in later sessions.)
Pins: soft vs hard
- Soft pins (e.g., pandas>=2.2,<3.0) keep you compatible across machines.
- Hard pins (exact versions) are for releases. Today we’ll use soft pins, then freeze to requirements-lock.txt in homework.
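To make the difference concrete, the sketch below checks a version string against a soft pin (a range) and a hard pin (an exact match). It is deliberately simplified: it handles only dotted numeric versions and is not a full implementation of pip's version rules. The function names are hypothetical.

```python
def parse_version(v: str) -> tuple:
    """Turn '2.2.3' into (2, 2, 3). Simplified: numeric segments only."""
    return tuple(int(part) for part in v.split("."))

def satisfies_soft_pin(version: str, lower: str, upper: str) -> bool:
    """True if lower <= version < upper, mirroring a pin like '>=2.2,<3.0'."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)

def satisfies_hard_pin(version: str, pinned: str) -> bool:
    """True only for the exact pinned version, mirroring 'pkg==2.2.3'."""
    return parse_version(version) == parse_version(pinned)

print(satisfies_soft_pin("2.2.3", "2.2", "3.0"))  # True: inside the range
print(satisfies_soft_pin("3.0.0", "2.2", "3.0"))  # False: upper bound is exclusive
print(satisfies_hard_pin("2.2.4", "2.2.3"))       # False: any drift breaks a hard pin
```

The soft pin accepts every patch release in the 2.x line, while the hard pin accepts exactly one version. That is why soft pins suit day-to-day development and hard pins suit a frozen snapshot.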
Reproducibility basics
- Fix seeds for random, NumPy, PyTorch (and CUDA if present).
- Disable nondeterministic cuDNN behavior for repeatability in simple models.
- Beware: some ops remain nondeterministic on GPU; we’ll use simple ones.
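One common reason seeds alone do not guarantee identical results is that floating-point addition is not associative: summing the same numbers in a different order can change the last bits. Parallel reductions on a GPU may accumulate in a varying order, so this effect can surface there. The sketch below shows the arithmetic on the CPU with plain Python floats. It illustrates the mechanism and is not a GPU test.

```python
# Same three numbers, two groupings.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)      # False
print(abs(a - b))  # tiny, but not zero
```

The difference is on the order of 1e-16 here, which is harmless for one sum but can compound over many training steps.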
Minimal Git today
- If you already have a repo: clone it into Google Drive.
- If not: create a folder; later you can upload the notebook via GitHub web UI.
- Full Git workflow (branch/PR/CI) starts next session.
1.4 In‑class Lab (35 min)
Instructor tip: Put these as sequential Colab cells. Students should run them top‑to‑bottom. Replace placeholders like YOUR_USERNAME/YOUR_REPO before class if you already created a starter repo. If not, tell them to use the “no‑repo” path (option B in Step 2).
1.4.1 1) Mount Google Drive and create a course folder
# Colab cell
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

COURSE_DIR = "/content/drive/MyDrive/dspt25"  # change if you prefer another path
PROJECT_NAME = "unified-stocks"  # course project folder/repo name
Save the notebook as system_check.ipynb.
# Colab cell: make directories and cd into project folder
import os, pathlib
base = pathlib.Path(COURSE_DIR)
proj = base / PROJECT_NAME  # / is overloaded to join path segments
for p in [base, proj, proj/"reports", proj/"notebooks", proj/"data"]:
    p.mkdir(parents=True, exist_ok=True)

os.chdir(proj)
print("Working in:", os.getcwd())
1.4.2 2) (Optional) If you already have a GitHub repo, clone it into Drive
Pick A or B (not both).
A. Clone an existing repo (recommended if you created a starter repo)
# Colab cell: clone via HTTPS (public or your private; for private, you can upload later instead of pushing from Colab).
# Clone ONLY once. If you run this notebook again, skip this cell.
REPO_URL = "https://github.com/YOUR_ORG_OR_USERNAME/YOUR_REPO.git"  # <- change me
import subprocess, os
# clone next to your project folder
os.chdir(base)
subprocess.run(["git", "clone", REPO_URL], check=True)  # check=True raises on error instead of failing silently
# Optionally, use that cloned repo as the working directory (uncomment the lines below to do this):
# REPO_NAME = REPO_URL.split("/")[-1].replace(".git","")
# os.chdir(base/REPO_NAME)
# print("Working in:", os.getcwd())
# change back to proj dir
os.chdir(proj)
print("Working in:", os.getcwd())
B. No repo yet? Stay with the folder we created. You’ll upload files via GitHub web UI after class.
1.4.3 3) Create a soft‑pinned requirements.txt and install
# Colab cell: write a soft-pinned requirements.txt
req = """\
pandas>=2.2,<3.0
numpy>=2.0.0,<3.0
pyarrow>=15,<17
matplotlib>=3.8,<4.0
scikit-learn>=1.6,<2.0
yfinance>=0.2,<0.3
python-dotenv>=1.0,<2.0
"""
open("requirements.txt","w").write(req)
print(open("requirements.txt").read())
# Colab cell: install (quietly). Torch is usually preinstalled in Colab; we'll check separately.
!pip install -q -r requirements.txt
# Colab cell: PyTorch check. If not available (rare in Colab), install it from PyPI as a fallback.
try:
    import torch
    print("PyTorch:", torch.__version__)
except ImportError:
    print("PyTorch not found; installing from PyPI as fallback...")
    !pip install -q torch
    import torch
    print("PyTorch:", torch.__version__)
1.4.4 4) Environment report (Python/OS/lib versions, GPU availability)
# Colab cell: environment info + GPU check
import os, sys, platform, json, time
import pandas as pd
import numpy as np

env = {
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
    "python": sys.version,
    "os": platform.platform(),
    "pandas": pd.__version__,
    "numpy": np.__version__,
}
try:
    import torch
    env["torch"] = torch.__version__
    env["cuda_available"] = bool(torch.cuda.is_available())
    env["cuda_device"] = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
except Exception:
    env["torch"] = "not importable"
    env["cuda_available"] = False
    env["cuda_device"] = "CPU"

print(env)
os.makedirs("reports", exist_ok=True)
with open("reports/environment.json","w") as f:
    json.dump(env, f, indent=2)
1.4.5 5) Reproducibility seed utility + quick validation
# Colab cell: reproducibility helpers
import json
import random
import numpy as np

def set_seed(seed: int = 42, deterministic_torch: bool = True):
    # Seed every RNG we use; torch is optional.
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        if deterministic_torch:
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False
            try:
                torch.use_deterministic_algorithms(True)
            except Exception:
                pass
    except Exception:
        pass

def sample_rng_fingerprint(n=5, seed=42):
    # Draw a few numbers right after seeding; equal seeds should give equal draws.
    set_seed(seed)
    a = np.random.rand(n).round(6).tolist()
    try:
        import torch
        b = torch.rand(n).tolist()
        b = [round(x,6) for x in b]
    except Exception:
        b = ["torch-missing"]*n
    return {"numpy": a, "torch": b}

f1 = sample_rng_fingerprint(n=6, seed=123)
f2 = sample_rng_fingerprint(n=6, seed=123)
print("Fingerprint #1:", f1)
print("Fingerprint #2:", f2)
print("Match:", f1 == f2)
with open("reports/seed_fingerprint.json","w") as f:
    json.dump({"f1": f1, "f2": f2, "match": f1==f2}, f, indent=2)
1.4.6 6) Create (or verify) tickers_25.csv for the course
# Colab cell: create stock list if it doesn't exist yet
import pandas as pd, os
tickers = [
    "AAPL","MSFT","AMZN","GOOGL","META","NVDA","TSLA","JPM","JNJ","V",
    "PG","HD","BAC","XOM","CVX","PFE","KO","DIS","NFLX","INTC",
    "CSCO","ORCL","T","VZ","WMT"
]
path = "tickers_25.csv"
if not os.path.exists(path):
    pd.DataFrame({"ticker": tickers}).to_csv(path, index=False)
pd.read_csv(path).head()
1.4.7 7) (Optional) Prove GPU works by allocating a small tensor
# Colab cell: tiny GPU smoke test (falls back to CPU if CUDA is unavailable)
import torch
# set_seed() above turned on deterministic algorithms. On CUDA, a matrix multiply in that
# mode can raise a RuntimeError unless CUBLAS_WORKSPACE_CONFIG is set. If you hit that
# error, uncomment the next line to switch deterministic mode off for this test:
# torch.use_deterministic_algorithms(False)
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1000, 1000, device=device)
y = x @ x.T
print("Device:", device, "| y shape:", y.shape, "| mean:", y.float().mean().item())
1.4.8 8) Save a short Markdown environment report
# Colab cell: write a small Markdown summary for humans
from textwrap import dedent
summary = dedent(f"""
    # System Check
    - Timestamp: {env['timestamp']}
    - Python: `{env['python']}`
    - OS: `{env['os']}`
    - pandas: `{env['pandas']}` | numpy: `{env['numpy']}` | torch: `{env['torch']}`
    - CUDA available: `{env['cuda_available']}` | Device: `{env['cuda_device']}`
    ## RNG Fingerprint
    - Match on repeated seeds: `{f1 == f2}`
    - numpy: `{f1['numpy']}`
    - torch: `{f1['torch']}`
""").strip()
open("reports/system_check.md","w").write(summary)
print(summary)
1.4.9 Save the file as system_check.ipynb
To do it automatically, you can use the following code:
# Colab cell: save this notebook as system_check.ipynb
import json
from google.colab import _message  # underscore-prefixed (private) module; may change without notice

notebook_name = "system_check.ipynb"
# Create the 'notebooks' subdirectory path
out_dir = proj / "notebooks"
out_path = out_dir / notebook_name

# Make sure the folder exists
out_dir.mkdir(parents=True, exist_ok=True)

# Get the CURRENT notebook JSON from Colab
resp = _message.blocking_request('get_ipynb', timeout_sec=10)
nb = resp.get('ipynb') if isinstance(resp, dict) else None

# Basic sanity check: ensure there are cells
if not nb or not isinstance(nb, dict) or not nb.get('cells'):
    raise RuntimeError("Could not capture the current notebook contents (no cells returned). "
                       "Try running this cell again after a quick edit, or use File → Save a copy in Drive once.")

# Write to Drive
with open(out_path, 'w', encoding='utf-8') as f:
    json.dump(nb, f, ensure_ascii=False, indent=2)

print("Saved notebook to:", out_path)
What to submit after class (if you already have a GitHub repo): For today, students may upload system_check.ipynb, reports/environment.json, and reports/system_check.md via the GitHub web UI (Add file → Upload files). We’ll do proper pushes/PRs next session.
1.6 Homework (due before Session 2)
Goal: Produce a reproducible system snapshot and a seed‑verified mini experiment, then upload to your repo (via GitHub web UI if you’re not comfortable pushing yet).
1.6.1 Part A — Freeze your environment
From the same Colab runtime (after installing), create a lock file:
# Colab cell: freeze exact versions
!pip freeze > requirements-lock.txt
print("Wrote requirements-lock.txt with exact versions")
!head -n 20 requirements-lock.txt
Add a note to README.md explaining the difference between:
- requirements.txt (soft pins for development) and
- requirements-lock.txt (exact versions used today).
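To see what the lock file holds, it can help to read it back as a name-to-version mapping. The sketch below parses `name==version` lines and skips everything else. The sample text and its version numbers are illustrative only, not real `pip freeze` output, and the helper name `parse_lock` is hypothetical.

```python
def parse_lock(text: str) -> dict:
    """Map package name -> exact version for 'name==version' lines; skip others."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # blank lines, comments, or entries such as 'pkg @ file://...'
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins

sample = """\
# illustrative lock contents, not real output
numpy==2.0.2
pandas==2.2.2
somepkg @ file:///tmp/somepkg
"""
print(parse_lock(sample))  # {'numpy': '2.0.2', 'pandas': '2.2.2'}
```

Every entry is a single exact version, in contrast to the ranges in requirements.txt. That contrast is the point the README note should make.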
1.6.2 Part B — Reproducibility mini‑experiment
Create notebooks/reproducibility_demo.ipynb with the following cells (students copy/paste):
1) Setup & data generation
import numpy as np, torch, random, json, os, time

# Deterministic cuBLAS matmuls on CUDA need this workspace setting;
# set it before the first CUDA operation runs.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

def set_seed(seed=123):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    try:
        torch.use_deterministic_algorithms(True)
    except Exception:
        pass

def make_toy(n=512, d=10, noise=0.1, seed=123):
    # Linear data: y = X @ true_w + Gaussian noise
    set_seed(seed)
    X = torch.randn(n, d)
    true_w = torch.randn(d, 1)
    y = X @ true_w + noise * torch.randn(n, 1)
    return X, y, true_w

device = "cuda" if torch.cuda.is_available() else "cpu"
X, y, true_w = make_toy()
X, y = X.to(device), y.to(device)
2) Minimal training loop (linear model)
def train_once(lr=0.05, steps=300, seed=123):
    set_seed(seed)
    model = torch.nn.Linear(X.shape[1], 1, bias=False).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    losses = []
    for t in range(steps):
        opt.zero_grad(set_to_none=True)
        yhat = model(X)
        loss = loss_fn(yhat, y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return model.weight.detach().cpu().numpy(), losses[-1]

w1, final_loss1 = train_once(seed=2025)
w2, final_loss2 = train_once(seed=2025)

print("Final loss 1:", round(final_loss1, 6))
print("Final loss 2:", round(final_loss2, 6))
print("Weights equal:", np.allclose(w1, w2, atol=1e-7))
3) Save results JSON
os.makedirs("reports", exist_ok=True)
result = {
    "device": device,
    "final_loss1": float(final_loss1),
    "final_loss2": float(final_loss2),
    "weights_equal": bool(np.allclose(w1, w2, atol=1e-7)),
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
}
with open("reports/reproducibility_results.json","w") as f:
    json.dump(result, f, indent=2)
result
Expected outcome: the two runs with the same seed should produce the same final loss and identical weights (within tolerance). If on GPU, deterministic settings should keep this stable for this simple model.
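A note on "within tolerance": `np.allclose` treats two values as close when `|a - b| <= atol + rtol * |b|`, and its default `rtol` is 1e-5. Passing only `atol=1e-7` therefore still allows a relative difference. The sketch below re-implements that documented formula for scalars in plain Python so the effect is visible without NumPy. The weights are hypothetical numbers, not results from the experiment.

```python
def is_close(a: float, b: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Scalar version of the closeness test used by np.allclose."""
    return abs(a - b) <= atol + rtol * abs(b)

w_run1, w_run2 = 1.0, 1.000001  # hypothetical weights from two runs

print(is_close(w_run1, w_run2, atol=1e-7))            # True: default rtol dominates
print(is_close(w_run1, w_run2, rtol=0.0, atol=1e-7))  # False: purely absolute check
print(w_run1 == w_run2)                               # False: not bitwise identical
```

If the goal is to claim bitwise-identical weights, a stricter check such as `np.array_equal(w1, w2)` is one option. `np.allclose` answers the looser question of whether the runs agree to within a tolerance.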
1.6.3 Part C — Add a .env.example
Create a placeholder for API keys we’ll use later:
env_example = """\
# Example environment variables (do NOT commit a real .env with secrets)
ALPHA_VANTAGE_KEY=
FRED_API_KEY=
"""
open(".env.example", "w").write(env_example)
print(open(".env.example").read())
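When the keys are needed later, code can read them from the process environment instead of hard-coding them. The sketch below is a fail-fast reader, assuming the variables have already been loaded into the environment (for example by python-dotenv's `load_dotenv()`, which is in requirements.txt). The helper name `require_env` and the demo value are hypothetical.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset/empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; copy .env.example to .env and fill it in.")
    return value

os.environ["ALPHA_VANTAGE_KEY"] = "demo-value"  # placeholder for illustration only
print(require_env("ALPHA_VANTAGE_KEY"))  # demo-value
```

Raising early with a pointer to .env.example tends to be easier to debug than a failed API call several cells later.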
1.6.4 Part D — Upload to GitHub
Until we set up pushes/PRs next class, use the GitHub web UI:
- Upload: system_check.ipynb, reports/environment.json, reports/system_check.md, requirements.txt, requirements-lock.txt, notebooks/reproducibility_demo.ipynb, reports/reproducibility_results.json, .env.example.
- If you already cloned a repo in class and are comfortable pushing, you may push from your laptop instead. Do not paste tokens into notebooks.
1.6.5 Grading (pass/revise)
- requirements.txt present; requirements-lock.txt present and non‑empty.
- system_check.ipynb runs and writes reports/system_check.md + environment.json.
- reproducibility_demo.ipynb demonstrates identical results across repeated runs with same seed and writes reports/reproducibility_results.json.
- .env.example present with placeholders.
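Before uploading, a quick self-check can list anything missing. The sketch below is not the official grader. It only looks for the files named in the Part D upload list and flags any that are absent or empty. The paths are assumptions about your repo layout, and `missing_deliverables` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# Paths taken from the Part D upload list; adjust if your layout differs.
REQUIRED = [
    "requirements.txt",
    "requirements-lock.txt",
    "system_check.ipynb",
    "reports/environment.json",
    "reports/system_check.md",
    "notebooks/reproducibility_demo.ipynb",
    "reports/reproducibility_results.json",
    ".env.example",
]

def missing_deliverables(root) -> list:
    """Return the required paths that are absent or empty under `root`."""
    root = Path(root)
    missing = []
    for rel in REQUIRED:
        f = root / rel
        if not f.is_file() or f.stat().st_size == 0:
            missing.append(rel)
    return missing

# Demo on a throwaway directory: create every file except the lock file.
with tempfile.TemporaryDirectory() as tmp:
    for rel in REQUIRED:
        if rel == "requirements-lock.txt":
            continue
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("placeholder\n")
    demo_missing = missing_deliverables(tmp)
    print(demo_missing)  # ['requirements-lock.txt']
```

In Colab, calling `missing_deliverables(".")` from the project folder should return an empty list once everything is in place.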
1.7 Key points
- “Colab is ephemeral; persist to Drive.”
- “Soft pins now; freeze later.”
- “Seeds are necessary but not sufficient—watch for nondeterministic ops.”
- “Never store secrets (API keys) in the repo; use .env and keep a .env.example.”
That’s it for Session 1. In Session 2 we’ll set up Git basics and Git‑LFS and move from uploading via web UI to branch/PR workflows.