6 Session 6 — Make/Automation + rsync + ssh/tmux (survey)
Assumptions: You’re using the same Drive‑mounted repo from prior sessions (e.g., unified-stocks-teamX). No trading advice; this is educational.
6.1 Session 6 — Make/Automation + rsync + ssh/tmux (75 min)
6.1.1 Learning goals
Students will be able to:
- Explain how Make turns scripts into a reproducible pipeline via targets, dependencies, and incremental builds.
- Create and use a Makefile with helpful defaults, variables, and a
help
target. - Use rsync to back up project artifacts and understand
--delete
and exclude patterns. - Understand the ssh key flow and a tmux workflow for long‑running jobs (survey).
6.2 Agenda (75 min)
- (12 min) Slides: Why automation? How Make models dependencies & incremental builds; best practices
- (10 min) Slides: rsync fundamentals; ssh keys & config; tmux workflow (survey)
- (33 min) In‑class lab (Colab): create scripts → author Makefile → run make all → rsync backup
- (10 min) Wrap‑up & troubleshooting
- (10 min) Buffer
6.3 Slides
6.3.1 Why Make for DS pipelines?
- Encodes your workflow as targets that depend on files or other targets.
- Incremental: only rebuilds what changed.
- Plays nicely with CI (make all from a clean clone).
- Stable across OSes; no new runtime to learn.
Core syntax
target: dependencies
<TAB> recipe commands
- Use variables: PY := python, QUARTO := quarto.
- Use .PHONY for meta‑targets that don’t produce files (see the toy sketch after this list).
- Prefer deterministic outputs: fixed seeds, pinned versions, stable paths.
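To make targets, dependencies, and incremental rebuilds concrete, here is a throwaway sketch (the /tmp/make-demo directory and file names are made up for illustration). It writes a two‑rule Makefile with printf so the required TAB before each recipe line is explicit, then runs make three times:

mkdir -p /tmp/make-demo && cd /tmp/make-demo
echo "hello" > input.txt

# Recipe lines must start with a TAB; printf '\t' writes one explicitly.
{
  printf '.PHONY: all clean\n'
  printf 'all: upper.txt\n'
  printf 'upper.txt: input.txt\n'
  printf '\ttr a-z A-Z < input.txt > upper.txt\n'
  printf 'clean:\n'
  printf '\trm -f upper.txt\n'
} > Makefile

make all          # builds upper.txt from input.txt
make all          # nothing to do: upper.txt is up to date
touch input.txt   # simulate editing the dependency
make all          # only the affected target is rebuilt

The same idea scales to the lab’s project Makefile: data files are targets, scripts are dependencies, and editing one script rebuilds only the downstream steps.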
6.3.2 rsync basics
- rsync -avh SRC/ DST/ → syncs directory trees, preserving metadata.
- --delete makes DST exactly match SRC (removes files not in SRC).
- --exclude to skip folders (--exclude 'raw/').
- Remote with SSH: rsync -avz -e ssh SRC/ user@host:/path/ (concrete examples follow this list).
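A few invocations as a sketch; the source/destination paths and user@host below are placeholders to adapt:

rsync -avh data/processed/ backups/processed/                              # local snapshot; trailing slashes matter
rsync -avh --exclude 'raw/' --exclude '.git/' project/ backups/project/    # skip bulky folders
rsync -avz -e ssh data/processed/ user@host:/scratch/you/processed/        # push over SSH (placeholder host/path)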
6.3.3 ssh keys & tmux (survey)
- Keys: ssh-keygen -t ed25519 -C "you@school.edu"; add the public key to servers/GitHub; keep the private key private (a short sketch follows this list).
- ~/.ssh/config holds named hosts; ssh myhpc uses that stanza.
- tmux: start tmux new -s train; detach (Ctrl-b d); list (tmux ls); reattach (tmux attach -t train). Keeps jobs alive on remote shells.
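As a sketch of the key flow for a server you can already reach with a password (the hostname matches the example config in 6.4.9):

ssh-keygen -t ed25519 -C "you@school.edu"           # once per machine; accept the default path
ssh-copy-id your_netid@login.hpc.university.edu     # appends your PUBLIC key to the server's authorized_keys
ssh your_netid@login.hpc.university.edu             # should now authenticate with the key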
6.4 In‑class lab (33 min)
We’ll create three tiny scripts and a Makefile that ties them together:
- scripts/get_prices.py → data/raw/prices.csv (Yahoo via yfinance, with synthetic fallback)
- scripts/build_features.py → data/processed/features.parquet
- scripts/backup.sh → rsync your artifacts to backups/<timestamp>/
- Makefile → make all runs end‑to‑end; make report renders Quarto; make backup syncs artifacts.
6.4.1 0) Mount Drive and set repo path
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

REPO_OWNER = "YOUR_GITHUB_USERNAME_OR_ORG"   # <- change
REPO_NAME = "unified-stocks-teamX"           # <- change
BASE_DIR = "/content/drive/MyDrive/dspt25"
REPO_DIR = f"{BASE_DIR}/{REPO_NAME}"

import pathlib, os, subprocess
pathlib.Path(BASE_DIR).mkdir(parents=True, exist_ok=True)
if not pathlib.Path(REPO_DIR).exists():
    raise SystemExit("Repo not found in Drive. Clone it first (see Session 2/3).")
os.chdir(REPO_DIR)
print("Working dir:", os.getcwd())
6.4.2 1) Quick tool checks (Make, rsync, Quarto)
import subprocess, shutil

def check(cmd):
    try:
        out = subprocess.check_output(cmd, text=True)
        print(cmd[0], "OK")
    except Exception as e:
        print(cmd[0], "NOT FOUND")

check(["make", "--version"])
check(["rsync", "--version"])
check(["quarto", "--version"])
If Quarto is missing, re-run the installer from Session 3 before make report.
6.4.3 2) Script: scripts/get_prices.py
from pathlib import Path
Path("scripts").mkdir(exist_ok=True)

get_py = r"""#!/usr/bin/env python
import argparse, sys, time
from pathlib import Path
import pandas as pd, numpy as np
def fetch_yf(ticker, start, end):
import yfinance as yf
df = yf.download(ticker, start=start, end=end, auto_adjust=True, progress=False)
if df is None or df.empty:
raise RuntimeError("empty")
df = df.rename(columns=str.lower)[["close","volume"]]
df.index.name = "date"
df = df.reset_index()
df["ticker"] = ticker
return df[["ticker","date","close","volume"]]
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--tickers", default="tickers_25.csv")
ap.add_argument("--start", default="2020-01-01")
ap.add_argument("--end", default="")
ap.add_argument("--out", default="data/raw/prices.csv")
args = ap.parse_args()
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
tickers = pd.read_csv(args.tickers)["ticker"].dropna().unique().tolist()
rows = []
for t in tickers:
try:
df = fetch_yf(t, args.start, args.end or None)
except Exception:
# synthetic fallback
idx = pd.bdate_range(args.start, args.end or pd.Timestamp.today().date())
rng = np.random.default_rng(42 + hash(t)%1000)
r = rng.normal(0, 0.01, len(idx))
price = 100*np.exp(np.cumsum(r))
vol = rng.integers(1e5, 5e6, len(idx))
df = pd.DataFrame({"ticker": t, "date": idx, "close": price, "volume": vol})
df["date"] = pd.to_datetime(df["date"]).dt.date
df["adj_close"] = df["close"]
df = df.drop(columns=["close"])
df["log_return"] = np.log(df["adj_close"]).diff().fillna(0.0)
rows.append(df)
allp = pd.concat(rows, ignore_index=True)
allp = allp[["ticker","date","adj_close","volume","log_return"]]
allp.to_csv(out, index=False)
print("Wrote", out, "rows:", len(allp))
if __name__ == "__main__":
sys.exit(main())
"""
open("scripts/get_prices.py","w").write(get_py)
import os, stat
"scripts/get_prices.py", os.stat("scripts/get_prices.py").st_mode | stat.S_IEXEC)
os.chmod(print("Created scripts/get_prices.py")
6.4.4 3) Script: scripts/build_features.py
= r"""#!/usr/bin/env python
feat_py import argparse
from pathlib import Path
import pandas as pd, numpy as np
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--input", default="data/raw/prices.csv")
ap.add_argument("--out", default="data/processed/features.parquet")
ap.add_argument("--roll", type=int, default=20)
args = ap.parse_args()
df = pd.read_csv(args.input, parse_dates=["date"])
df = df.sort_values(["ticker","date"])
# groupwise lags
df["r_1d"] = df["log_return"]
for k in (1,2,3):
df[f"lag{k}"] = df.groupby("ticker")["r_1d"].shift(k)
df["roll_mean"] = (df.groupby("ticker")["r_1d"]
.rolling(args.roll, min_periods=args.roll//2).mean()
.reset_index(level=0, drop=True))
df["roll_std"] = (df.groupby("ticker")["r_1d"]
.rolling(args.roll, min_periods=args.roll//2).std()
.reset_index(level=0, drop=True))
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
# Save compactly
df.to_parquet(out, index=False)
print("Wrote", out, "rows:", len(df))
if __name__ == "__main__":
main()
"""
open("scripts/build_features.py","w").write(feat_py)
import os, stat
"scripts/build_features.py", os.stat("scripts/build_features.py").st_mode | stat.S_IEXEC)
os.chmod(print("Created scripts/build_features.py")
6.4.5 4) Script: scripts/backup.sh (rsync)
= r"""#!/usr/bin/env bash
backup_sh # Sync selected artifacts to backups/<timestamp> using rsync.
# Usage: scripts/backup.sh [DEST_ROOT]
set -euo pipefail
ROOT="${1:-backups}"
STAMP="$(date +%Y%m%d-%H%M%S)"
DEST="${ROOT}/run-${STAMP}"
mkdir -p "$DEST"
# What to back up (adjust as needed)
INCLUDE=("data/processed" "reports" "docs")
for src in "${INCLUDE[@]}"; do
if [[ -d "$src" ]]; then
echo "Syncing $src -> $DEST/$src"
rsync -avh --delete --exclude 'raw/' --exclude 'interim/' "$src"/ "$DEST/$src"/
fi
done
echo "Backup complete at $DEST"
"""
open("scripts/backup.sh","w").write(backup_sh)
import os, stat
"scripts/backup.sh", os.stat("scripts/backup.sh").st_mode | stat.S_IEXEC)
os.chmod(print("Created scripts/backup.sh")
6.4.6 5) Makefile (robust, with variables and help)
= r"""# Makefile — unified-stocks
makefile SHELL := /bin/bash
.SHELLFLAGS := -eu -o pipefail -c
PY := python
QUARTO := quarto
START ?= 2020-01-01
END ?= 2025-08-01
ROLL ?= 30
DATA_RAW := data/raw/prices.csv
FEATS := data/processed/features.parquet
REPORT := docs/reports/eda.html
# Default target
.DEFAULT_GOAL := help
.PHONY: help all clean clobber qa report backup
help: ## Show help for each target
@awk 'BEGIN {FS = ":.*##"; printf "Available targets:\n"} /^[a-zA-Z0-9_\-]+:.*##/ {printf " \033[36m%-18s\033[0m %s\n", $$1, $$2}' $(MAKEFILE_LIST)
all: $(DATA_RAW) $(FEATS) report backup ## Run the full pipeline and back up artifacts
$(DATA_RAW): scripts/get_prices.py tickers_25.csv
$(PY) scripts/get_prices.py --tickers tickers_25.csv --start $(START) --end $(END) --out $(DATA_RAW)
$(FEATS): scripts/build_features.py $(DATA_RAW) scripts/qa_csv.sh
# Basic QA first
scripts/qa_csv.sh $(DATA_RAW)
$(PY) scripts/build_features.py --input $(DATA_RAW) --out $(FEATS) --roll $(ROLL)
report: $(REPORT) ## Render Quarto EDA to docs/
$(REPORT): reports/eda.qmd _quarto.yml docs/style.css
$(QUARTO) render reports/eda.qmd -P symbol:AAPL -P start_date:$(START) -P end_date:$(END) -P rolling:$(ROLL) --output-dir docs/
@test -f $(REPORT) || (echo "Report not generated." && exit 1)
backup: ## Rsync selected artifacts to backups/<timestamp>/
./scripts/backup.sh
clean: ## Remove intermediate artifacts (safe)
rm -rf data/interim
rm -rf data/processed/*.parquet || true
clobber: clean ## Remove generated reports and backups (dangerous)
rm -rf docs/reports || true
rm -rf backups || true
"""
open("Makefile","w").write(makefile)
print(open("Makefile").read())
Note: The Makefile expects scripts/qa_csv.sh from Session 5. If a student missed it, set scripts/qa_csv.sh to a no‑op or remove that dependency temporarily (a minimal placeholder is sketched below).
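For instance, a minimal placeholder QA script could be written like this (an assumption, not the Session 5 version; it only checks that the CSV exists and is non‑empty):

%%bash
set -euo pipefail
mkdir -p scripts
cat > scripts/qa_csv.sh << 'EOF'
#!/usr/bin/env bash
# Placeholder QA: fail if the CSV is missing or has fewer than 2 lines (header + data).
set -euo pipefail
f="${1:?usage: qa_csv.sh FILE.csv}"
test -f "$f" || { echo "QA: $f not found" >&2; exit 1; }
lines=$(wc -l < "$f")
[ "$lines" -ge 2 ] || { echo "QA: $f looks empty" >&2; exit 1; }
echo "QA OK: $f ($lines lines)"
EOF
chmod +x scripts/qa_csv.sh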
6.4.7 6) Try the pipeline
import subprocess, os, textwrap, sys
print(subprocess.check_output(["make", "help"], text=True))
# Fetch raw, build features, render report, back up artifacts
import subprocess
print(subprocess.check_output(["make", "all"], text=True))
Confirm (or run the quick check below):
- data/raw/prices.csv exists
- data/processed/features.parquet exists
- docs/reports/eda.html renders
- backups/run-<timestamp>/ contains synced folders
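A quick scripted check (folder names assume the defaults above; adjust if yours differ):

%%bash
set -euo pipefail
test -f data/raw/prices.csv && echo "raw prices: OK"
test -f data/processed/features.parquet && echo "features: OK"
test -f docs/reports/eda.html && echo "report: OK"
ls -d backups/run-* | tail -n 1   # newest backup snapshot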
6.4.8 7) (Optional) just command‑runner
Optional: If just is available on your system, create a justfile that mirrors common Make targets. On Colab, installation may or may not be available; this is for reference only.
%%bash
set -e
if ! command -v just >/dev/null 2>&1; then
echo "just not found; skipping optional step."
exit 0
fi
cat > justfile << 'EOF'
# justfile — optional convenience recipes
set shell := ["bash", "-eu", "-o", "pipefail", "-c"]
start := "2020-01-01"
end := "2025-08-01"
roll := "30"
help:
    @echo "Recipes: get-data, features, report, all, backup"
get-data:
    python scripts/get_prices.py --tickers tickers_25.csv --start {{start}} --end {{end}} --out data/raw/prices.csv
features:
    bash -lc 'scripts/qa_csv.sh data/raw/prices.csv'
    python scripts/build_features.py --input data/raw/prices.csv --out data/processed/features.parquet --roll {{roll}}
report:
    quarto render reports/eda.qmd -P symbol:AAPL -P start_date:{{start}} -P end_date:{{end}} -P rolling:{{roll}} --output-dir docs/
all: get-data features report
backup:
    ./scripts/backup.sh
EOF
echo "Wrote justfile (optional)."
6.4.9 8) ssh & tmux quickstarts (survey, run locally, not in Colab)
ssh key generation (local terminal):
ssh-keygen -t ed25519 -C "you@school.edu"
# Press enter to accept default path (~/.ssh/id_ed25519), set a passphrase (recommended)
cat ~/.ssh/id_ed25519.pub # copy this PUBLIC key where needed (GitHub/servers)
SSH config (~/.ssh/config, local):
Host github
HostName github.com
User git
IdentityFile ~/.ssh/id_ed25519
AddKeysToAgent yes
IdentitiesOnly yes
Host myhpc
HostName login.hpc.university.edu
User your_netid
IdentityFile ~/.ssh/id_ed25519
Test GitHub SSH (local):
ssh -T git@github.com
tmux essentials (remote or local):
tmux new -s train # start session "train"
# ... run your long job ...
# detach: press Ctrl-b then d
tmux ls # list sessions
tmux attach -t train # reattach
tmux kill-session -t train # end session
6.5 Wrap‑up (10 min)
- Make codifies your pipeline; the file graph serves as your dependency DAG.
- Incremental builds save time: edit one script → only downstream targets rebuild.
- rsync is your friend for backups/snapshots; be deliberate with --delete.
- ssh/tmux: you don’t need them in Colab, but you will on servers/HPC.
6.6 Homework (due before Session 7)
Goal: Extend your automation with a tiny baseline training & evaluation step and polish your Makefile.
6.6.1 Part A — Add a minimal baseline trainer
Create scripts/train_baseline.py that learns a linear regression on lagged returns (toy baseline) and writes metrics.
= r"""#!/usr/bin/env python
train_py import argparse, json
from pathlib import Path
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--features", default="data/processed/features.parquet")
ap.add_argument("--out-metrics", default="reports/baseline_metrics.json")
args = ap.parse_args()
df = pd.read_parquet(args.features)
# Train/test split by date (last 20% for test)
df = df.dropna(subset=["lag1","lag2","lag3","r_1d"])
n = len(df)
split = int(n*0.8)
Xtr = df[["lag1","lag2","lag3"]].iloc[:split].values
ytr = df["r_1d"].iloc[:split].values
Xte = df[["lag1","lag2","lag3"]].iloc[split:].values
yte = df["r_1d"].iloc[split:].values
model = LinearRegression().fit(Xtr, ytr)
pred = model.predict(Xte)
mae = float(mean_absolute_error(yte, pred))
Path("reports").mkdir(exist_ok=True)
with open(args.out_metrics, "w") as f:
json.dump({"model":"linear(lag1,lag2,lag3)","test_mae":mae,"n_test":len(yte)}, f, indent=2)
print("Wrote", args.out_metrics, "MAE:", mae)
if __name__ == "__main__":
main()
"""
open("scripts/train_baseline.py","w").write(train_py)
import os, stat
"scripts/train_baseline.py", os.stat("scripts/train_baseline.py").st_mode | stat.S_IEXEC)
os.chmod(print("Created scripts/train_baseline.py")
6.6.2 Part B — Extend your Makefile with train and all
Append these to your Makefile:
# --- add after FEATS definition, near other targets ---
TRAIN_METRICS := reports/baseline_metrics.json
.PHONY: train
train: $(TRAIN_METRICS) ## Train toy baseline and write metrics
$(TRAIN_METRICS): scripts/train_baseline.py $(FEATS)
$(PY) scripts/train_baseline.py --features $(FEATS) --out-metrics $(TRAIN_METRICS)
# Update 'all' to include 'train'
# all: $(DATA_RAW) $(FEATS) report backup # OLD
# Replace with:
# all: $(DATA_RAW) $(FEATS) report train backup
Run:
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
make train
cat reports/baseline_metrics.json
6.6.3 Part C — Add a help description to every target and verify make help
Ensure each target in your Makefile has a ## comment. Run:
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
make help
6.6.4 Part D — (Optional) Track small models/metrics with Git‑LFS
If you decide to save model artifacts (e.g., models/baseline.pkl), track them:
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
git lfs track "models/*.pkl"
git add .gitattributes
git commit -m "chore: track small model files via LFS"
(You can extend train_baseline.py to save models/baseline.pkl using joblib.)
6.6.5 Part E — Commit & push
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
git add scripts/*.py scripts/backup.sh Makefile reports/baseline_metrics.json
git status
git commit -m "feat: automated pipeline with Make (data->features->report->train) and rsync backup"
from getpass import getpass
import subprocess
= getpass("GitHub token (not stored): ")
token = "YOUR_GITHUB_USERNAME_OR_ORG"
REPO_OWNER = "unified-stocks-teamX"
REPO_NAME = f"https://{token}@github.com/{REPO_OWNER}/{REPO_NAME}.git"
push_url "git", "push", push_url, "HEAD:main"], check=True)
subprocess.run([del token
6.6.6 Grading (pass/revise)
- make all runs from a fresh clone (with minimal edits for tokens/Quarto install) and produces: data/raw/prices.csv, data/processed/features.parquet, docs/reports/eda.html, reports/baseline_metrics.json, and a backups/run-*/ snapshot.
- Makefile has helpful help output and variables (START, END, ROLL).
- scripts/backup.sh uses rsync -avh --delete and excludes raw/ & interim/.
- (Optional) LFS tracking updated for models.
6.7 Key points:
- Make is a thin layer over shell commands—it doesn’t replace Python; it orchestrates it.
- Keep targets idempotent: running twice shouldn’t break; only rebuild on changes.
- Use rsync with care: --delete is powerful; double‑check DEST paths (see the dry‑run example below).
- ssh/tmux: you’ll want these the first time you run a long model on a remote machine.