6 Session 6 — Make/Automation + rsync + ssh/tmux (survey)
Assumptions: You’re using the same Drive‑mounted repo from prior sessions (e.g., unified-stocks-teamX). No trading advice; this is educational.
6.1 Session 6 — Make/Automation + rsync + ssh/tmux (75 min)
6.1.1 Learning goals
Students will be able to:
- Explain how Make turns scripts into a reproducible pipeline via targets, dependencies, and incremental builds.
- Create and use a Makefile with helpful defaults, variables, and a
help
target. - Use rsync to back up project artifacts and understand
--delete
and exclude patterns. - Understand the ssh key flow and a tmux workflow for long‑running jobs (survey).
6.2 Agenda (75 min)
- (12 min) Slides: Why automation? How Make models dependencies & incremental builds; best practices
- (10 min) Slides: rsync fundamentals; ssh keys & config; tmux workflow (survey)
- (33 min) In‑class lab (Colab): create scripts → author Makefile → run make all → rsync backup
- (10 min) Wrap‑up & troubleshooting
- (10 min) Buffer
6.3 Slides
6.3.1 Why Make for DS pipelines?
- Encodes your workflow as targets that depend on files or other targets.
- Incremental: only rebuilds what changed.
- Plays nicely with CI (make all from a clean clone).
- Stable across OSes; no new runtime to learn.
Core syntax
target: dependencies
<TAB> recipe commands
- Use variables: PY := python, QUARTO := quarto.
- Use .PHONY for meta‑targets that don’t produce files (see the toy sketch after this list).
- Prefer deterministic outputs: fixed seeds, pinned versions, stable paths.
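To make targets, dependencies, and incremental rebuilds concrete, here is a throwaway sketch (the /tmp/make-demo directory and file names are made up for illustration). It writes a two‑rule Makefile with printf so the required TAB before each recipe line is explicit, then runs make three times:

mkdir -p /tmp/make-demo && cd /tmp/make-demo
echo "hello" > input.txt

# Recipe lines must start with a TAB; printf '\t' writes one explicitly.
{
  printf '.PHONY: all clean\n'
  printf 'all: upper.txt\n'
  printf 'upper.txt: input.txt\n'
  printf '\ttr a-z A-Z < input.txt > upper.txt\n'
  printf 'clean:\n'
  printf '\trm -f upper.txt\n'
} > Makefile

make all          # builds upper.txt from input.txt
make all          # nothing to do: upper.txt is up to date
touch input.txt   # simulate editing the dependency
make all          # only the affected target is rebuilt

The same idea scales to the lab’s project Makefile: data files are targets, scripts are dependencies, and editing one script rebuilds only the downstream steps.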
6.3.2 rsync basics
- rsync -avh SRC/ DST/ → syncs directory trees, preserving metadata.
- --delete makes DST exactly match SRC (removes files not in SRC).
- --exclude to skip folders (--exclude 'raw/').
- Remote with SSH: rsync -avz -e ssh SRC/ user@host:/path/ (concrete examples follow this list).
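A few invocations as a sketch; the source/destination paths and user@host below are placeholders to adapt:

rsync -avh data/processed/ backups/processed/                              # local snapshot; trailing slashes matter
rsync -avh --exclude 'raw/' --exclude '.git/' project/ backups/project/    # skip bulky folders
rsync -avz -e ssh data/processed/ user@host:/scratch/you/processed/        # push over SSH (placeholder host/path)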
6.3.3 ssh keys & tmux (survey)
- Keys: ssh-keygen -t ed25519 -C "you@school.edu"; add the public key to servers/GitHub; keep the private key private (a short sketch follows this list).
- ~/.ssh/config holds named hosts; ssh myhpc uses that stanza.
- tmux: start tmux new -s train; detach (Ctrl-b d); list (tmux ls); reattach (tmux attach -t train). Keeps jobs alive on remote shells.
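As a sketch of the key flow for a server you can already reach with a password (the hostname matches the example config in 6.4.9):

ssh-keygen -t ed25519 -C "you@school.edu"           # once per machine; accept the default path
ssh-copy-id your_netid@login.hpc.university.edu     # appends your PUBLIC key to the server's authorized_keys
ssh your_netid@login.hpc.university.edu             # should now authenticate with the key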
6.4 In‑class lab (33 min)
We’ll create three tiny scripts and a Makefile that ties them together:
- scripts/get_prices.py → data/raw/prices.csv (Yahoo via yfinance, with synthetic fallback)
- scripts/build_features.py → data/processed/features.parquet
- scripts/backup.sh → rsync your artifacts to backups/<timestamp>/
- Makefile → make all runs end‑to‑end; make report renders Quarto; make backup syncs artifacts.
6.4.1 0) Mount Drive and set repo path
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

REPO_OWNER = "YOUR_GITHUB_USERNAME_OR_ORG"   # <- change
REPO_NAME = "unified-stocks-teamX"           # <- change
BASE_DIR = "/content/drive/MyDrive/dspt25"
REPO_DIR = f"{BASE_DIR}/{REPO_NAME}"

import pathlib, os, subprocess
pathlib.Path(BASE_DIR).mkdir(parents=True, exist_ok=True)
if not pathlib.Path(REPO_DIR).exists():
    raise SystemExit("Repo not found in Drive. Clone it first (see Session 2/3).")
os.chdir(REPO_DIR)
print("Working dir:", os.getcwd())
6.4.2 1) Quick tool checks (Make, rsync, Quarto)
import subprocess, shutil

def check(cmd):
    try:
        out = subprocess.check_output(cmd, text=True)
        print(cmd[0], "OK")
    except Exception as e:
        print(cmd[0], "NOT FOUND")

check(["make", "--version"])
check(["rsync", "--version"])
check(["quarto", "--version"])
If Quarto is missing, re-run the installer from Session 3 before make report.
6.4.3 2) Script: scripts/get_prices.py
from pathlib import Path
Path("scripts").mkdir(exist_ok=True)

get_py = r"""#!/usr/bin/env python
import argparse, sys, time
from pathlib import Path
import pandas as pd, numpy as np
def fetch_yf(ticker, start, end):
import yfinance as yf
df = yf.download(ticker, start=start, end=end, auto_adjust=True, progress=False)
if df is None or df.empty:
raise RuntimeError("empty")
df = df.rename(columns=str.lower)[["close","volume"]]
df.index.name = "date"
df = df.reset_index()
df["ticker"] = ticker
return df[["ticker","date","close","volume"]]
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--tickers", default="tickers_25.csv")
ap.add_argument("--start", default="2020-01-01")
ap.add_argument("--end", default="")
ap.add_argument("--out", default="data/raw/prices.csv")
args = ap.parse_args()
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
tickers = pd.read_csv(args.tickers)["ticker"].dropna().unique().tolist()
rows = []
for t in tickers:
try:
df = fetch_yf(t, args.start, args.end or None)
except Exception:
# synthetic fallback
idx = pd.bdate_range(args.start, args.end or pd.Timestamp.today().date())
rng = np.random.default_rng(42 + hash(t)%1000)
r = rng.normal(0, 0.01, len(idx))
price = 100*np.exp(np.cumsum(r))
vol = rng.integers(1e5, 5e6, len(idx))
df = pd.DataFrame({"ticker": t, "date": idx, "close": price, "volume": vol})
df["date"] = pd.to_datetime(df["date"]).dt.date
df["adj_close"] = df["close"]
df = df.drop(columns=["close"])
df["log_return"] = np.log(df["adj_close"]).diff().fillna(0.0)
rows.append(df)
allp = pd.concat(rows, ignore_index=True)
allp = allp[["ticker","date","adj_close","volume","log_return"]]
allp.to_csv(out, index=False)
print("Wrote", out, "rows:", len(allp))
if __name__ == "__main__":
sys.exit(main())
"""
open("scripts/get_prices.py","w").write(get_py)
import os, stat
"scripts/get_prices.py", os.stat("scripts/get_prices.py").st_mode | stat.S_IEXEC)
os.chmod(print("Created scripts/get_prices.py")
6.4.4 3) Script: scripts/build_features.py
= r"""#!/usr/bin/env python
feat_py import argparse
from pathlib import Path
import pandas as pd, numpy as np
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--input", default="data/raw/prices.csv")
ap.add_argument("--out", default="data/processed/features.parquet")
ap.add_argument("--roll", type=int, default=20)
args = ap.parse_args()
df = pd.read_csv(args.input, parse_dates=["date"])
df = df.sort_values(["ticker","date"])
# groupwise lags
df["r_1d"] = df["log_return"]
for k in (1,2,3):
df[f"lag{k}"] = df.groupby("ticker")["r_1d"].shift(k)
df["roll_mean"] = (df.groupby("ticker")["r_1d"]
.rolling(args.roll, min_periods=args.roll//2).mean()
.reset_index(level=0, drop=True))
df["roll_std"] = (df.groupby("ticker")["r_1d"]
.rolling(args.roll, min_periods=args.roll//2).std()
.reset_index(level=0, drop=True))
out = Path(args.out)
out.parent.mkdir(parents=True, exist_ok=True)
# Save compactly
df.to_parquet(out, index=False)
print("Wrote", out, "rows:", len(df))
if __name__ == "__main__":
main()
"""
open("scripts/build_features.py","w").write(feat_py)
import os, stat
"scripts/build_features.py", os.stat("scripts/build_features.py").st_mode | stat.S_IEXEC)
os.chmod(print("Created scripts/build_features.py")
6.4.5 4) Script: scripts/backup.sh (rsync)
= r"""#!/usr/bin/env bash
backup_sh # Sync selected artifacts to backups/<timestamp> using rsync.
# Usage: scripts/backup.sh [DEST_ROOT]
set -euo pipefail
ROOT="${1:-backups}"
STAMP="$(date +%Y%m%d-%H%M%S)"
DEST="${ROOT}/run-${STAMP}"
mkdir -p "$DEST"
# What to back up (adjust as needed)
INCLUDE=("data/processed" "reports" "docs")
for src in "${INCLUDE[@]}"; do
if [[ -d "$src" ]]; then
echo "Syncing $src -> $DEST/$src"
rsync -avh --delete --exclude 'raw/' --exclude 'interim/' "$src"/ "$DEST/$src"/
fi
done
echo "Backup complete at $DEST"
"""
open("scripts/backup.sh","w").write(backup_sh)
import os, stat
"scripts/backup.sh", os.stat("scripts/backup.sh").st_mode | stat.S_IEXEC)
os.chmod(print("Created scripts/backup.sh")
6.4.6 5) Makefile (robust, with variables and help)
= r"""# Makefile — unified-stocks
makefile SHELL := /bin/bash
.SHELLFLAGS := -eu -o pipefail -c
PY := python
QUARTO := quarto
START ?= 2020-01-01
END ?= 2025-08-01
ROLL ?= 30
DATA_RAW := data/raw/prices.csv
FEATS := data/processed/features.parquet
REPORT := docs/reports/eda.html
# Default target
.DEFAULT_GOAL := help
.PHONY: help all clean clobber qa report backup
help: ## Show help for each target
@awk 'BEGIN {FS = ":.*##"; printf "Available targets:\n"} /^[a-zA-Z0-9_\-]+:.*##/ {printf " \033[36m%-18s\033[0m %s\n", $$1, $$2}' $(MAKEFILE_LIST)
all: $(DATA_RAW) $(FEATS) report backup ## Run the full pipeline and back up artifacts
$(DATA_RAW): scripts/get_prices.py tickers_25.csv
$(PY) scripts/get_prices.py --tickers tickers_25.csv --start $(START) --end $(END) --out $(DATA_RAW)
$(FEATS): scripts/build_features.py $(DATA_RAW) scripts/qa_csv.sh
# Basic QA first
scripts/qa_csv.sh $(DATA_RAW)
$(PY) scripts/build_features.py --input $(DATA_RAW) --out $(FEATS) --roll $(ROLL)
report: $(REPORT) ## Render Quarto EDA to docs/
$(REPORT): reports/eda.qmd _quarto.yml docs/style.css
$(QUARTO) render reports/eda.qmd -P symbol:AAPL -P start_date:$(START) -P end_date:$(END) -P rolling:$(ROLL) --output-dir docs/
@test -f $(REPORT) || (echo "Report not generated." && exit 1)
backup: ## Rsync selected artifacts to backups/<timestamp>/
./scripts/backup.sh
clean: ## Remove intermediate artifacts (safe)
rm -rf data/interim
rm -rf data/processed/*.parquet || true
clobber: clean ## Remove generated reports and backups (dangerous)
rm -rf docs/reports || true
rm -rf backups || true
"""
open("Makefile","w").write(makefile)
print(open("Makefile").read())
Note: The Makefile expects scripts/qa_csv.sh from Session 5. If a student missed it, set scripts/qa_csv.sh to a no‑op or remove that dependency temporarily (a minimal placeholder is sketched below).
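For instance, a minimal placeholder QA script could be written like this (an assumption, not the Session 5 version; it only checks that the CSV exists and is non‑empty):

%%bash
set -euo pipefail
mkdir -p scripts
cat > scripts/qa_csv.sh << 'EOF'
#!/usr/bin/env bash
# Placeholder QA: fail if the CSV is missing or has fewer than 2 lines (header + data).
set -euo pipefail
f="${1:?usage: qa_csv.sh FILE.csv}"
test -f "$f" || { echo "QA: $f not found" >&2; exit 1; }
lines=$(wc -l < "$f")
[ "$lines" -ge 2 ] || { echo "QA: $f looks empty" >&2; exit 1; }
echo "QA OK: $f ($lines lines)"
EOF
chmod +x scripts/qa_csv.sh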
6.4.7 6) Try the pipeline
import subprocess, os, textwrap, sys
print(subprocess.check_output(["make", "help"], text=True))
# Fetch raw, build features, render report, back up artifacts
import subprocess
print(subprocess.check_output(["make", "all"], text=True))
Confirm (or run the quick check below):
- data/raw/prices.csv exists
- data/processed/features.parquet exists
- docs/reports/eda.html renders
- backups/run-<timestamp>/ contains synced folders
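A quick scripted check (folder names assume the defaults above; adjust if yours differ):

%%bash
set -euo pipefail
test -f data/raw/prices.csv && echo "raw prices: OK"
test -f data/processed/features.parquet && echo "features: OK"
test -f docs/reports/eda.html && echo "report: OK"
ls -d backups/run-* | tail -n 1   # newest backup snapshot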
6.4.8 7) (Optional) just command‑runner
Optional: If just is available on your system, create a justfile that mirrors common Make targets. On Colab, installation may or may not be available; this is for reference only.
%%bash
set -e
if ! command -v just >/dev/null 2>&1; then
echo "just not found; skipping optional step."
exit 0
fi
cat > justfile << 'EOF'
# justfile — optional convenience recipes
set shell := ["bash", "-eu", "-o", "pipefail", "-c"]
start := "2020-01-01"
end := "2025-08-01"
roll := "30"
help:
    @echo "Recipes: get-data, features, report, all, backup"
get-data:
    python scripts/get_prices.py --tickers tickers_25.csv --start {{start}} --end {{end}} --out data/raw/prices.csv
features:
    bash -lc 'scripts/qa_csv.sh data/raw/prices.csv'
    python scripts/build_features.py --input data/raw/prices.csv --out data/processed/features.parquet --roll {{roll}}
report:
    quarto render reports/eda.qmd -P symbol:AAPL -P start_date:{{start}} -P end_date:{{end}} -P rolling:{{roll}} --output-dir docs/
all: get-data features report
backup:
    ./scripts/backup.sh
EOF
echo "Wrote justfile (optional)."
6.4.9 8) ssh & tmux quickstarts (survey, run locally, not in Colab)
ssh key generation (local terminal):
ssh-keygen -t ed25519 -C "you@school.edu"
# Press enter to accept default path (~/.ssh/id_ed25519), set a passphrase (recommended)
cat ~/.ssh/id_ed25519.pub # copy this PUBLIC key where needed (GitHub/servers)
SSH config (~/.ssh/config, local):
Host github
HostName github.com
User git
IdentityFile ~/.ssh/id_ed25519
AddKeysToAgent yes
IdentitiesOnly yes
Host myhpc
HostName login.hpc.university.edu
User your_netid
IdentityFile ~/.ssh/id_ed25519
Test GitHub SSH (local):
ssh -T git@github.com
tmux essentials (remote or local):
tmux new -s train # start session "train"
# ... run your long job ...
# detach: press Ctrl-b then d
tmux ls # list sessions
tmux attach -t train # reattach
tmux kill-session -t train # end session
6.5 Wrap‑up (10 min)
- Make codifies your pipeline; the file graph serves as your dependency DAG.
- Incremental builds save time: edit one script → only downstream targets rebuild.
- rsync is your friend for backups/snapshots; be deliberate with --delete.
- ssh/tmux: you don’t need them in Colab, but you will on servers/HPC.
6.6 Homework (due before Session 7)
Goal: Extend your automation with a tiny baseline training & evaluation step and polish your Makefile.
6.6.1 Part A — Add a minimal baseline trainer
Create scripts/train_baseline.py that learns a linear regression on lagged returns (toy baseline) and writes metrics.
= r"""#!/usr/bin/env python
train_py import argparse, json
from pathlib import Path
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--features", default="data/processed/features.parquet")
ap.add_argument("--out-metrics", default="reports/baseline_metrics.json")
args = ap.parse_args()
df = pd.read_parquet(args.features)
# Train/test split by date (last 20% for test)
df = df.dropna(subset=["lag1","lag2","lag3","r_1d"])
n = len(df)
split = int(n*0.8)
Xtr = df[["lag1","lag2","lag3"]].iloc[:split].values
ytr = df["r_1d"].iloc[:split].values
Xte = df[["lag1","lag2","lag3"]].iloc[split:].values
yte = df["r_1d"].iloc[split:].values
model = LinearRegression().fit(Xtr, ytr)
pred = model.predict(Xte)
mae = float(mean_absolute_error(yte, pred))
Path("reports").mkdir(exist_ok=True)
with open(args.out_metrics, "w") as f:
json.dump({"model":"linear(lag1,lag2,lag3)","test_mae":mae,"n_test":len(yte)}, f, indent=2)
print("Wrote", args.out_metrics, "MAE:", mae)
if __name__ == "__main__":
main()
"""
open("scripts/train_baseline.py","w").write(train_py)
import os, stat
"scripts/train_baseline.py", os.stat("scripts/train_baseline.py").st_mode | stat.S_IEXEC)
os.chmod(print("Created scripts/train_baseline.py")
6.6.2 Part B — Extend your Makefile with train and all
Append these to your Makefile:
# --- add after FEATS definition, near other targets ---
TRAIN_METRICS := reports/baseline_metrics.json
.PHONY: train
train: $(TRAIN_METRICS) ## Train toy baseline and write metrics
$(TRAIN_METRICS): scripts/train_baseline.py $(FEATS)
$(PY) scripts/train_baseline.py --features $(FEATS) --out-metrics $(TRAIN_METRICS)
# Update 'all' to include 'train'
# all: $(DATA_RAW) $(FEATS) report backup # OLD
# Replace with:
# all: $(DATA_RAW) $(FEATS) report train backup
Run:
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
make train
cat reports/baseline_metrics.json
6.6.3 Part C — Add a help description to every target and verify make help
Ensure each target in your Makefile has a ## comment. Run:
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
make help
6.6.4 Part D — (Optional) Track small models/metrics with Git‑LFS
If you decide to save model artifacts (e.g., models/baseline.pkl), track them:
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
git lfs track "models/*.pkl"
git add .gitattributes
git commit -m "chore: track small model files via LFS"
(You can extend train_baseline.py to save models/baseline.pkl using joblib.)
6.6.5 Part E — Commit & push
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
git add scripts/*.py scripts/backup.sh Makefile reports/baseline_metrics.json
git status
git commit -m "feat: automated pipeline with Make (data->features->report->train) and rsync backup"
from getpass import getpass
import subprocess
= getpass("GitHub token (not stored): ")
token = "YOUR_GITHUB_USERNAME_OR_ORG"
REPO_OWNER = "unified-stocks-teamX"
REPO_NAME = f"https://{token}@github.com/{REPO_OWNER}/{REPO_NAME}.git"
push_url "git", "push", push_url, "HEAD:main"], check=True)
subprocess.run([del token
6.6.6 Grading (pass/revise)
- make all runs from a fresh clone (with minimal edits for tokens/Quarto install) and produces: data/raw/prices.csv, data/processed/features.parquet, docs/reports/eda.html, reports/baseline_metrics.json, and a backups/run-*/ snapshot.
- Makefile has helpful help output and variables (START, END, ROLL).
- scripts/backup.sh uses rsync -avh --delete and excludes raw/ & interim/.
- (Optional) LFS tracking updated for models.
6.7 Key points:
- Make is a thin layer over shell commands—it doesn’t replace Python; it orchestrates it.
- Keep targets idempotent: running twice shouldn’t break; only rebuild on changes.
- Use rsync with care: --delete is powerful; double‑check DEST paths (see the dry‑run example below).
- ssh/tmux: you’ll want these the first time you run a long model on a remote machine.