5  Session 5 — Unix/Shell Essentials for Data Work

Scope today: Filesystem navigation, pipes/redirects, grep/sed/awk, sort|uniq|wc|cut|tr|head|tail|tee, find|xargs, regex basics, and a small shell QA (Quality Assurance) script for CSV health checks.


5.0.1 Learning goals

By the end of class, students can:

  1. Navigate and manipulate files safely from the shell (relative vs absolute paths, quoting).
  2. Use pipes and redirection to build composable mini‑pipelines.
  3. Filter and transform text/CSV data with grep, sed, awk, and friends.
  4. Find files with find and operate on them with xargs / -exec safely.
  5. Write a small, defensive Bash script (set -euo pipefail) that performs data QA checks and returns a non‑zero exit code on failure.

5.1 Agenda (75 min)

  • (8 min) Why shell for data science; mental model of pipelines
  • (12 min) Core commands & patterns: pipes/redirects, quoting, regex, grep/sed/awk
  • (35 min) In‑class lab (Colab): filesystem → CSV pipelines → find/xargs → QA shell script
  • (10 min) Wrap‑up, troubleshooting, and homework briefing
  • (10 min) Buffer for slow installs / student issues

5.2 Slides

Why shell?

  • Fast iteration for data plumbing (ETL glue: Extract-Transform-Load) and repeatable ops.
  • Works on any POSIX (Portable Operating System Interface, the IEEE standard for Unix-like systems) host (your laptop, Colab VM, servers).
  • Lets you compose small tools with pipes: producer | filter | summarize > report.txt.

Mental model

  • Stream text through commands. Each command reads STDIN, writes STDOUT; | connects them.
  • Redirection: > (truncate to file), >> (append), < (read from file), 2> (errors).
  • Exit code: 0 success; non‑zero = error. Use && (only if success) and || (if failure).
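
A tiny throwaway illustration of the model (uses /tmp, not part of the lab):

printf 'AAPL\nMSFT\nNVDA\n' | grep 'A' | wc -l > /tmp/count.txt   # producer | filter | summarize > file
cat /tmp/count.txt                                                # 2 (AAPL and NVDA contain "A")
grep -q 'ZZZ' /tmp/count.txt && echo "found" || echo "not found"  # the exit code drives && / ||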

Quoting

  • "double quotes" expand variables and backslashes;
  • 'single quotes' are literal (best for regex/cut/sed patterns);
  • Always quote paths that might contain spaces: "$FILE".
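
A quick demo of the difference (hypothetical throwaway file, created and removed on the spot):

FILE="stock data.csv"   # hypothetical name containing a space
touch "$FILE"           # create an empty demo file
echo "$FILE"            # double quotes: $FILE expands; the space stays inside one argument
echo '$FILE'            # single quotes: the literal text $FILE, no expansion
wc -l "$FILE"           # quoted path -> one argument; unquoted, the shell would split it into two
rm "$FILE"              # clean up the demo file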

Regex quick guide

  • ^ start, $ end, . any char, * 0+, + 1+, ? 0/1, [A-Z] class, (foo|bar) alt.
  • Use grep -E (Extended Regex) for + and |. Plain grep uses Basic Regular Expressions (BRE).
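
Two one-liners to illustrate (hypothetical ticker list fed via printf):

printf 'AAPL\nNVDA\nMSFT\n' | grep -E '^(NVDA|MSFT)$'   # alternation (|) needs grep -E
printf 'AAPL\nNVDA\nMSFT\n' | grep '^A'                 # anchors work fine in plain (basic) grep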

CSV caution

  • Unix tools are line‑oriented. They’re fine for simple CSVs (no embedded commas/quotes).
  • For tricky CSVs, prefer Python/pandas. Today’s examples are simple CSVs.
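
A one-line illustration of the pitfall (hypothetical row with a quoted, comma-containing field):

echo 'ACME,"1,000",2020-01-02' | cut -d, -f2   # prints "1 -- cut splits on every comma, even inside quotes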

5.3 In‑class lab (35 min)

Instructor tip: run these as separate Colab cells. Cells labeled “Bash” use %%bash. Cells labeled “Python” are only used to generate a small synthetic CSV we can play with offline (no API keys needed).

5.3.1 0) Mount Drive, set repo paths, and cd into the repo

Python (Colab)

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

REPO_OWNER = "YOUR_GITHUB_USERNAME_OR_ORG"   # <- change for your class
REPO_NAME  = "unified-stocks-teamX"          # <- change
BASE_DIR   = "/content/drive/MyDrive/dspt25"
REPO_DIR   = f"{BASE_DIR}/{REPO_NAME}"

import pathlib, os, subprocess
pathlib.Path(BASE_DIR).mkdir(parents=True, exist_ok=True)
if not pathlib.Path(REPO_DIR).exists():
    print("Repo not found in Drive; please clone it in Session 2/3 or adjust REPO_DIR.")
else:
    os.chdir(REPO_DIR)
    print("Working dir:", os.getcwd())

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/STAT4160"
pwd
ls -la

5.3.2 1) Create a small synthetic prices CSV to work with (safe & reproducible)

Python

# Generates data/raw/prices.csv with columns: ticker,date,adj_close,volume,log_return
import pandas as pd, numpy as np, os
from pathlib import Path

Path("data/raw").mkdir(parents=True, exist_ok=True)
tickers = pd.read_csv("tickers_25.csv")["ticker"].tolist() if os.path.exists("tickers_25.csv") else [
    "AAPL","MSFT","AMZN","GOOGL","META","NVDA","TSLA","JPM","JNJ","V",
    "PG","HD","BAC","XOM","CVX","PFE","KO","DIS","NFLX","INTC","CSCO","ORCL","T","VZ","WMT"
]
dates = pd.bdate_range("2020-01-01", periods=180)  # ~ 9 months
rng = np.random.default_rng(7)

frames=[]
for t in tickers:
    r = rng.normal(0, 0.01, len(dates))
    price = 100*np.exp(np.cumsum(r))
    vol = rng.integers(100_000, 5_000_000, len(dates))
    df = pd.DataFrame({"ticker": t, "date": dates, "adj_close": price, "volume": vol})
    df["log_return"] = np.log(df["adj_close"]).diff().fillna(0)
    frames.append(df)

out = pd.concat(frames, ignore_index=True)
out.to_csv("data/raw/prices.csv", index=False)
out.head()

5.3.3 2) Pipes & redirects warm‑up

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# How many lines? (including header)
wc -l data/raw/prices.csv | tee reports/prices_wc.txt

# First 5 lines, save to a sample
head -n 5 data/raw/prices.csv | tee data/raw/prices_sample.csv

# Show ticker column only (field 1), excluding header
cut -d, -f1 data/raw/prices.csv | tail -n +2 | head -n 10
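
tr is in today's scope but not used above; here is a small optional cell, assuming the sample file from the previous step exists:

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# tr translates characters one-for-one; turn the comma-separated sample into tab-separated output
head -n 3 data/raw/prices_sample.csv | tr ',' '\t'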

5.3.4 3) grep filters (basic and extended) + regex

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# All rows for NVDA OR MSFT (extended regex with alternation)
grep -E '^(NVDA|MSFT),' data/raw/prices.csv | head -n 3

# Rows where ticker starts with a vowel (A, E, I, O, U)
grep -E '^(A|E|I|O|U)[A-Z]*,' data/raw/prices.csv | head -n 3

# Count rows per ticker quickly (just for demo)
cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq -c | head
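
Two more grep flags worth a quick look (optional cell, same file as above):

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# -c counts matching lines; -v inverts the match (here: everything except NVDA rows)
grep -c '^NVDA,' data/raw/prices.csv
grep -v '^NVDA,' data/raw/prices.csv | tail -n +2 | head -n 3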

5.3.5 4) sed transformations (search/replace; in‑place edits)

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# Make a copy so we don't touch the raw file
mkdir -p data/interim
cp data/raw/prices.csv data/interim/prices.csv
cp data/interim/prices.csv data/interim/prices_copy.csv

# Replace ISO date dashes with slashes (2020-01-02 -> 2020/01/02) in-place
sed -i '1!s/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\1\/\2\/\3/g' data/interim/prices_copy.csv

# Normalize ticker (first column) to lowercase with awk's tolower(), keeping the header line intact
head -n 1 data/interim/prices_copy.csv > data/interim/prices_lower.csv
tail -n +2 data/interim/prices_copy.csv | awk -F, 'BEGIN{OFS=","}{ $1=tolower($1); print }' >> data/interim/prices_lower.csv

head -n 3 data/interim/prices_lower.csv
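
sed can also slice line ranges; a small optional cell, assuming the copy from above exists:

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# -n suppresses default printing; 2,4p prints only lines 2-4
sed -n '2,4p' data/interim/prices_copy.csv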

5.3.6 5) awk for CSV summarization

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# Compute mean log_return per ticker (skip header). -F, sets comma as field separator.
awk -F, 'NR>1 { sum[$1]+=$5; n[$1]++ } END { OFS=","; print "ticker","mean_log_return"; for (t in sum) print t, sum[t]/n[t] }' data/raw/prices.csv \
| sort -t, -k2,2nr | head

# Top 5 dates with highest absolute log_return for NVDA
awk -F, 'NR>1 && $1=="NVDA" { print $2, $5 }' data/raw/prices.csv \
| awk '{ if ($2<0) s=-$2; else s=$2; print $1","$2","s }' \
| sort -t, -k3,3nr | head -n 5

  • NR>1 skips the header.
  • sum[$1] and n[$1] are associative arrays keyed by ticker.
  • We sort numerically on column 2 (the mean) with -k2,2n; append r (-k2,2nr) for descending order.
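
One more awk filtering pattern you can try (optional cell, same columns as above):

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# Print days where the absolute log_return exceeds 3% (column 5)
awk -F, 'NR>1 { r=$5; if (r<0) r=-r; if (r>0.03) print $1, $2, $5 }' data/raw/prices.csv | head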

5.3.7 6) sort | uniq deduping and comm to compare lists

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# Unique tickers actually present in the file
cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq > data/interim/tickers_in_data.txt

# Compare to our canonical list from tickers_25.csv
cut -d, -f1 tickers_25.csv | tail -n +2 | sort > data/interim/tickers_25.txt

echo "Only in data:"; comm -23 data/interim/tickers_in_data.txt data/interim/tickers_25.txt | sed 's/^/  /'
echo "Only in canonical:"; comm -13 data/interim/tickers_in_data.txt data/interim/tickers_25.txt | sed 's/^/  /'

5.3.8 7) find and xargs (safe, null‑terminated)

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# Show all CSVs under data/, printing sizes
find data -type f -name "*.csv" -printf "%p,%s bytes\n" | sort | head

# Count lines in each CSV (null-safe for weird filenames)
find data -type f -name "*.csv" -print0 | xargs -0 -I{} sh -c 'printf "%s," "$1"; wc -l < "$1"' sh {}

# Gzip-compress any CSV larger than ~1 MB (demo threshold)
find data -type f -name "*.csv" -size +1000k -print0 | xargs -0 -I{} gzip -kf "{}"  # -k keeps the original

Pattern: prefer -print0 | xargs -0 to safely handle spaces/newlines in filenames.
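
find can also run the command itself; a common alternative sketch using -exec ... {} + (which batches arguments much like xargs):

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# Run wc -l once over all matching CSVs (find appends the filenames at the +)
find data -type f -name "*.csv" -exec wc -l {} +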

5.3.9 8) Build a defensive CSV QA script and run it

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
mkdir -p scripts

cat > scripts/qa_csv.sh << 'EOF'
#!/usr/bin/env bash
# Simple CSV health check
# Usage: scripts/qa_csv.sh path/to/file.csv required_columns_csv
set -euo pipefail
IFS=$'\n\t'

FILE="${1:-}"
REQUIRED="${2:-ticker,date,adj_close,volume,log_return}"

err() { echo "ERROR: $*" >&2; exit 1; }
[[ -z "$FILE" ]] && err "No CSV file provided."
[[ ! -f "$FILE" ]] && err "File not found: $FILE"

# 1) Non-empty and header present
LINES=$(wc -l < "$FILE" || true)
[[ "${LINES:-0}" -lt 2 ]] && err "File has <2 lines (missing data?): $FILE"

HEADER=$(head -n 1 "$FILE")
# 2) All required columns present
IFS=',' read -r -a req <<< "$REQUIRED"
for col in "${req[@]}"; do
  echo "$HEADER" | grep -q -E "(^|,)${col}(,|$)" || err "Missing required column: $col"
done

# 3) No obvious NA/blank values in required numeric cols (basic check)
NUMERIC="adj_close,volume,log_return"
IFS=',' read -r -a nums <<< "$NUMERIC"
for col in "${nums[@]}"; do
  # find column index
  idx=$(awk -F, -v COL="$col" 'NR==1{for(i=1;i<=NF;i++) if($i==COL) print i}' "$FILE")
  [[ -z "${idx:-}" ]] && err "Column not found: $col"
  # check any blank values from row 2 onward
  bad=$(awk -F, -v I="$idx" 'NR>1 && ($I=="" || $I=="NA") {c++} END{print c+0}' "$FILE")
  [[ "$bad" -gt 0 ]] && err "Found $bad blank/NA in column: $col"
done

echo "OK: $FILE passed basic CSV QA ($LINES lines)."
EOF

chmod +x scripts/qa_csv.sh

Run the QA script

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
scripts/qa_csv.sh data/raw/prices.csv

Intentionally break it (optional): open data/raw/prices.csv, blank out a value, and re‑run to watch the script fail with a non‑zero exit code.
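
One way to observe the exit code explicitly (small optional cell; set -e is omitted on purpose so a failure does not abort the cell):

%%bash
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# Branch on the script's exit status instead of aborting
if scripts/qa_csv.sh data/raw/prices.csv; then
  echo "QA passed (exit code 0)"
else
  echo "QA failed (exit code $?)"
fi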


5.4 Wrap‑up (10 min)

  • Shell is about composable building blocks. Learn 15 commands deeply; combine them fluently.
  • Prefer null‑safe find … -print0 | xargs -0 patterns; always quote variables: "$FILE".
  • For complex CSV logic, fall back to Python; but shell shines for quick filters and QA.
  • We’ll hook these into Make next session so one command runs your whole pipeline.

5.5 Homework (due before Session 6)

Goal: Practice and codify shell workflows into your project: (1) a data stats pipeline, (2) a per‑ticker split utility, (3) a Makefile target, and (4) a short shell‑only EDA text report.

5.5.1 Part A — Data stats pipeline (one‑liners saved to files)

Bash (Colab)

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
mkdir -p reports data/interim

# 1) Count lines and unique tickers
{ 
  echo "Lines (incl header): $(wc -l < data/raw/prices.csv)";
  echo "Unique tickers: $(cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq | wc -l)";
} | tee reports/data_counts.txt

# 2) Top-10 days by absolute log_return across all tickers
tail -n +2 data/raw/prices.csv \
| awk -F, '{a=$5; if(a<0) a=-a; print $1","$2","$5","a}' \
| sort -t, -k4,4nr | head -n 10 \
| tee reports/top10_abs_moves.csv

# 3) Mean log_return per ticker (CSV)
awk -F, 'NR>1 { s[$1]+=$5; n[$1]++ } END { OFS=","; print "ticker,mean_log_return"; for(t in s) print t, s[t]/n[t] }' \
  data/raw/prices.csv | sort -t, -k2,2nr | tee reports/mean_return_by_ticker.csv

5.5.2 Part B — Split per‑ticker CSVs into data/interim/ticker=XYZ/ directories

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
mkdir -p data/interim

# Extract header once
HEADER=$(head -n 1 data/raw/prices.csv)

# Create per-ticker files with header + rows (null-safe not necessary here)
cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq | while read -r T; do
  mkdir -p "data/interim/ticker=${T}"
  {
    echo "$HEADER"
    awk -F, -v TK="$T" 'NR>1 && $1==TK' data/raw/prices.csv
  } > "data/interim/ticker=${T}/prices_${T}.csv"
done

# Verify one example
ls -la data/interim/ticker=AAPL | head

5.5.3 Part C — Add Makefile targets for QA and per‑ticker split

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

# Append (or create) Makefile targets. Recipe lines must start with a real TAB,
# so we use printf (which expands \t) instead of plain echo.
{
  printf '\nqa:\n'
  printf '\tscripts/qa_csv.sh data/raw/prices.csv\n'
  printf '\nsplit-by-ticker:\n'
  printf '\tcut -d, -f1 data/raw/prices.csv | tail -n +2 | sort -u | while read -r T; do mkdir -p "data/interim/ticker=$$T"; { head -n 1 data/raw/prices.csv; awk -F, -v TK="$$T" '\''NR>1 && $$1==TK'\'' data/raw/prices.csv; } > "data/interim/ticker=$$T/prices_$$T.csv"; done\n'
} >> Makefile

cat Makefile
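
For reference, the end of your Makefile should now look roughly like the sketch below (indentation is shown as spaces here for display, but the file must contain real tab characters; the long split-by-ticker recipe is abbreviated):

qa:
        scripts/qa_csv.sh data/raw/prices.csv

split-by-ticker:
        cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort -u | while read -r T; do ... done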

Run the targets

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
make qa
make split-by-ticker

5.5.4 Part D — Shell‑only mini EDA report

Create reports/mini_eda.txt with three sections: counts, top moves, mean returns.

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"

{
  echo "# Mini EDA (shell-only)"
  echo "Generated: $(date)"
  echo
  echo "## Counts"
  echo "Lines (incl header): $(wc -l < data/raw/prices.csv)"
  echo "Unique tickers: $(cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq | wc -l)"
  echo
  echo "## Top 5 absolute daily moves"
  tail -n +2 data/raw/prices.csv \
  | awk -F, '{a=$5; if(a<0) a=-a; print $1","$2","$5","a}' \
  | sort -t, -k4,4nr | head -n 5
  echo
  echo "## Mean log_return by ticker (top 10)"
  awk -F, 'NR>1 { s[$1]+=$5; n[$1]++ } END { for(t in s) printf "%s,%.6f\n", t, s[t]/n[t] }' \
    data/raw/prices.csv | sort -t, -k2,2nr | head -n 10
} | tee reports/mini_eda.txt

5.5.5 Part E — Commit & push your changes (use your short‑lived token as in Session 2)

Bash

%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
git add scripts/qa_csv.sh data/raw/prices.csv data/interim reports/*.txt reports/*.csv Makefile
git status
git commit -m "feat(shell): data QA script, per-ticker split, shell mini-EDA, Make targets"

Python (getpass)

from getpass import getpass
import subprocess
token = getpass("GitHub token (not stored): ")
REPO_OWNER = "YOUR_GITHUB_USERNAME_OR_ORG"
REPO_NAME  = "unified-stocks-teamX"
push_url = f"https://{token}@github.com/{REPO_OWNER}/{REPO_NAME}.git"
# Push current HEAD to main (adjust if you prefer a branch + PR)
subprocess.run(["git", "push", push_url, "HEAD:main"], check=True)
del token

5.5.6 Grading (pass/revise)

  • scripts/qa_csv.sh exists, is executable, fails on a malformed CSV, and passes on a clean one.
  • reports/data_counts.txt, reports/top10_abs_moves.csv, reports/mean_return_by_ticker.csv, and reports/mini_eda.txt generated.
  • make qa and make split-by-ticker run successfully.
  • Per‑ticker CSVs created under data/interim/ticker=XYZ/.

5.6 Key points

  • Quote variables and paths. Prefer -print0 | xargs -0 with find.
  • Fail fast in scripts (set -euo pipefail) and return non‑zero exit codes for CI (Continuous Integration).
  • Shell is for plumbing—it complements, not replaces, Python.

Next session (6): “Make/just, rsync, ssh/tmux (survey)” and we’ll wire make get-data and make report into a reproducible one‑command pipeline.