5 Session 5 — Unix/Shell Essentials for Data Work
Scope today: Filesystem navigation, pipes/redirects, grep/sed/awk, sort|uniq|wc|cut|tr|head|tail|tee, find|xargs, regex basics, and a small shell QA (Quality Assurance) script for CSV health checks.
5.0.1 Learning goals
By the end of class, students can:
- Navigate and manipulate files safely from the shell (relative vs absolute paths, quoting).
- Use pipes and redirection to build composable mini‑pipelines.
- Filter and transform text/CSV data with grep, sed, awk, and friends.
- Find files with find and operate on them with xargs/-exec safely.
- Write a small, defensive Bash script (set -euo pipefail) that performs data QA checks and returns a non‑zero exit code on failure.
5.1 Agenda (75 min)
- (8 min) Why shell for data science; mental model of pipelines
- (12 min) Core commands & patterns: pipes/redirects, quoting, regex, grep/sed/awk
- (35 min) In‑class lab (Colab): filesystem → CSV pipelines → find/xargs → QA shell script
- (10 min) Wrap‑up, troubleshooting, and homework briefing
- (10 min) Buffer for slow installs / student issues
5.2 Slides
Why shell?
- Fast iteration for data plumbing (ETL glue: Extract-Transform-Load) and repeatable ops.
- Works on any POSIX (Portable Operating System Interface, the IEEE standard for Unix-like systems) host (your laptop, Colab VM, servers).
- Lets you compose small tools with pipes: producer | filter | summarize > report.txt.
Mental model
- Stream text through commands. Each command reads STDIN and writes STDOUT; | connects them.
- Redirection: > (truncate to file), >> (append), < (read from file), 2> (errors).
- Exit code: 0 = success; non‑zero = error. Use && (run the next command only on success) and || (only on failure). See the short example below.
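A quick throwaway demo of redirection and exit codes (file names are made up for illustration):
Bash
# throwaway demo of redirection and exit-code chaining
printf 'b\na\nc\n' > letters.txt                 # > truncates/creates the file
sort < letters.txt >> letters_sorted.txt         # < reads STDIN from a file; >> appends
sort no_such_file.txt 2> errors.log || echo "sort failed (exit code $?)"
sort letters.txt > /dev/null && echo "sort succeeded"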
Quoting
"double quotes"
expand variables and backslashes;'single quotes'
are literal (best for regex/cut/sed patterns);- Always quote paths that might contain spaces:
"$FILE"
.
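A small sketch of the difference (the path here is hypothetical):
Bash
mkdir -p "My Data" && echo "NVDA,2020-01-02,240.1" > "My Data/prices 2020.csv"   # hypothetical file with a space in its path
FILE="My Data/prices 2020.csv"
echo "$FILE"      # double quotes: the variable expands; the space stays inside one argument
echo '$FILE'      # single quotes: prints the literal text $FILE
grep -E '^(NVDA|MSFT),' "$FILE"   # single-quote the regex, double-quote the path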
Regex quick guide
- ^ start, $ end, . any char, * 0+, + 1+, ? 0/1, [A-Z] class, (foo|bar) alternation.
- Use grep -E (Extended Regex) for + and |; plain grep uses Basic Regular Expressions (BRE). Examples below.
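Two disposable examples you can paste into any shell:
Bash
printf 'foo\nfoobar\nbar\nbaz\n' | grep -E '^(foo|bar)$'   # anchors + alternation: prints foo and bar only
printf 'color\ncolour\n' | grep -E 'colou?r'               # ? makes the u optional: prints both lines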
CSV caution
- Unix tools are line‑oriented. They’re fine for simple CSVs (no embedded commas/quotes).
- For tricky CSVs, prefer Python/pandas. Today’s examples are simple CSVs.
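One throwaway line shows why: a quoted field with an embedded comma confuses a naive comma split.
Bash
echo '"Acme, Inc",2020-01-02,100' | cut -d, -f2   # prints  Inc" (not the date) because cut splits on every comma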
5.3 In‑class lab (35 min)
Instructor tip: run these as separate Colab cells. Cells labeled “Bash” use %%bash. Cells labeled “Python” are only used to generate a small synthetic CSV we can play with offline (no API keys needed).
5.3.1 0) Mount Drive, set repo paths, and cd into the repo
Python (Colab)
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

REPO_OWNER = "YOUR_GITHUB_USERNAME_OR_ORG"  # <- change for your class
REPO_NAME = "unified-stocks-teamX"          # <- change
BASE_DIR = "/content/drive/MyDrive/dspt25"
REPO_DIR = f"{BASE_DIR}/{REPO_NAME}"

import pathlib, os, subprocess
pathlib.Path(BASE_DIR).mkdir(parents=True, exist_ok=True)
if not pathlib.Path(REPO_DIR).exists():
    print("Repo not found in Drive; please clone it in Session 2/3 or adjust REPO_DIR.")
else:
    os.chdir(REPO_DIR)
    print("Working dir:", os.getcwd())
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/STAT4160"
pwd
ls -la
5.3.2 1) Create a small synthetic prices CSV to work with (safe & reproducible)
Python
# Generates data/raw/prices.csv with columns: ticker,date,adj_close,volume,log_return
import pandas as pd, numpy as np, os
from pathlib import Path

Path("data/raw").mkdir(parents=True, exist_ok=True)
tickers = pd.read_csv("tickers_25.csv")["ticker"].tolist() if os.path.exists("tickers_25.csv") else [
    "AAPL","MSFT","AMZN","GOOGL","META","NVDA","TSLA","JPM","JNJ","V",
    "PG","HD","BAC","XOM","CVX","PFE","KO","DIS","NFLX","INTC","CSCO","ORCL","T","VZ","WMT"
]
dates = pd.bdate_range("2020-01-01", periods=180)  # ~9 months of business days
rng = np.random.default_rng(7)

frames = []
for t in tickers:
    r = rng.normal(0, 0.01, len(dates))
    price = 100*np.exp(np.cumsum(r))
    vol = rng.integers(100_000, 5_000_000, len(dates))  # integer bounds (Generator.integers rejects floats)
    df = pd.DataFrame({"ticker": t, "date": dates, "adj_close": price, "volume": vol})
    df["log_return"] = np.log(df["adj_close"]).diff().fillna(0)
    frames.append(df)

out = pd.concat(frames, ignore_index=True)
out.to_csv("data/raw/prices.csv", index=False)
out.head()
5.3.3 2) Pipes & redirects warm‑up
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
mkdir -p reports  # make sure the output directory exists before tee writes into it
# How many lines? (including header)
wc -l data/raw/prices.csv | tee reports/prices_wc.txt
# First 5 lines, save to a sample
head -n 5 data/raw/prices.csv | tee data/raw/prices_sample.csv
# Show ticker column only (field 1), excluding header
cut -d, -f1 data/raw/prices.csv | tail -n +2 | head -n 10
5.3.4 3) grep filters (basic and extended) + regex
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
# All rows for NVDA OR MSFT (extended regex with alternation)
grep -E '^(NVDA|MSFT),' data/raw/prices.csv | head -n 3
# Rows where ticker starts with a vowel (A, E, I, O, U)
grep -E '^(A|E|I|O|U)[A-Z]*,' data/raw/prices.csv | head -n 3
# Count rows per ticker quickly (just for demo)
cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq -c | head
5.3.5 4) sed transformations (search/replace; in‑place edits)
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
# Make a copy so we don't touch the raw file
mkdir -p data/interim
cp data/raw/prices.csv data/interim/
cp data/interim/prices.csv data/interim/prices_copy.csv
# Replace ISO date dashes with slashes (2020-01-02 -> 2020/01/02) in-place
sed -i '1!s/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\1\/\2\/\3/g' data/interim/prices_copy.csv
# Normalize ticker to lowercase (first column) with awk's tolower(), keeping the header via head/tail (hybrid sed + awk cell)
head -n 1 data/interim/prices_copy.csv > data/interim/prices_lower.csv
tail -n +2 data/interim/prices_copy.csv | awk -F, 'BEGIN{OFS=","}{ $1=tolower($1); print }' >> data/interim/prices_lower.csv
head -n 3 data/interim/prices_lower.csv
5.3.6 5) awk for CSV summarization
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
# Compute mean log_return per ticker (skip header). -F, sets comma as field separator.
awk -F, 'NR>1 { sum[$1]+=$5; n[$1]++ } END { OFS=","; print "ticker","mean_log_return"; for (t in sum) print t, sum[t]/n[t] }' data/raw/prices.csv \
| sort -t, -k2,2nr | head
# Top 5 dates with highest absolute log_return for NVDA
awk -F, 'NR>1 && $1=="NVDA" { print $2, $5 }' data/raw/prices.csv \
| awk '{ if ($2<0) s=-$2; else s=$2; print $1","$2","s }' \
| sort -t, -k3,3nr | head -n 5
- NR>1 skips the header. sum[$1] and n[$1] are associative arrays keyed by ticker.
- We sort numerically on column 2 (the mean) with -k2,2n, or -k2,2nr for descending order.
5.3.7 6) sort | uniq deduping and comm to compare lists
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
# Unique tickers actually present in the file
cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq > data/interim/tickers_in_data.txt
# Compare to our canonical list from tickers_25.csv
cut -d, -f1 tickers_25.csv | tail -n +2 | sort > data/interim/tickers_25.txt
echo "Only in data:"; comm -23 data/interim/tickers_in_data.txt data/interim/tickers_25.txt | sed 's/^/ /'
echo "Only in canonical:"; comm -13 data/interim/tickers_in_data.txt data/interim/tickers_25.txt | sed 's/^/ /'
5.3.8 7) find and xargs (safe, null‑terminated)
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
# Show all CSVs under data/, printing sizes
find data -type f -name "*.csv" -printf "%p,%s bytes\n" | sort | head
# Count lines in each CSV (null-safe for weird filenames)
find data -type f -name "*.csv" -print0 | xargs -0 -I{} sh -c 'echo -n "{},"; wc -l < "{}"'
# Gzip-compress any CSV larger than ~1MB (demo threshold: 1e6 bytes)
find data -type f -name "*.csv" -size +1000k -print0 | xargs -0 -I{} gzip -kf "{}" # -k keeps original
Pattern: prefer -print0 | xargs -0 to safely handle spaces/newlines in filenames (see the demo below).
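If you want to see why, here is a disposable demo (the file name and /tmp location are contrived):
Bash
%%bash
# deliberately no 'set -e' so the failing case can be shown
mkdir -p /tmp/xargs_demo && cd /tmp/xargs_demo
touch "weird name.csv"
echo "--- default xargs (word splitting breaks the name):"
find . -name "*.csv" | xargs wc -l || true
echo "--- null-terminated (safe):"
find . -name "*.csv" -print0 | xargs -0 wc -l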
5.3.9 8) Build a defensive CSV QA script and run it
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
mkdir -p scripts
cat > scripts/qa_csv.sh << 'EOF'
#!/usr/bin/env bash
# Simple CSV health check
# Usage: scripts/qa_csv.sh path/to/file.csv required_columns_csv
set -euo pipefail
IFS=$'\n\t'
FILE="${1:-}"
REQUIRED="${2:-ticker,date,adj_close,volume,log_return}"
err() { echo "ERROR: $*" >&2; exit 1; }
[[ -z "$FILE" ]] && err "No CSV file provided."
[[ ! -f "$FILE" ]] && err "File not found: $FILE"
# 1) Non-empty and header present
LINES=$(wc -l < "$FILE" || true)
[[ "${LINES:-0}" -lt 2 ]] && err "File has <2 lines (missing data?): $FILE"
HEADER=$(head -n 1 "$FILE")
# 2) All required columns present
IFS=',' read -r -a req <<< "$REQUIRED"
for col in "${req[@]}"; do
echo "$HEADER" | grep -q -E "(^|,)${col}(,|$)" || err "Missing required column: $col"
done
# 3) No obvious NA/blank values in required numeric cols (basic check)
NUMERIC="adj_close,volume,log_return"
IFS=',' read -r -a nums <<< "$NUMERIC"
for col in "${nums[@]}"; do
# find column index
idx=$(awk -F, -v COL="$col" 'NR==1{for(i=1;i<=NF;i++) if($i==COL) print i}' "$FILE")
[[ -z "${idx:-}" ]] && err "Column not found: $col"
# check any blank values from row 2 onward
bad=$(awk -F, -v I="$idx" 'NR>1 && ($I=="" || $I=="NA") {c++} END{print c+0}' "$FILE")
[[ "$bad" -gt 0 ]] && err "Found $bad blank/NA in column: $col"
done
echo "OK: $FILE passed basic CSV QA ($LINES lines)."
EOF
chmod +x scripts/qa_csv.sh
Run the QA script
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
scripts/qa_csv.sh data/raw/prices.csv
Intentionally break it (optional): open data/raw/prices.csv, blank out a value, and re‑run to watch it fail with a non‑zero exit code.
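A sketch that does this on a throwaway copy instead of the raw file (the row and column chosen are arbitrary):
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
# blank out adj_close on one data row of a copy, then run QA and show the non-zero exit code
cp data/raw/prices.csv data/interim/prices_broken.csv
awk -F, 'BEGIN{OFS=","} NR==5{$3=""} {print}' data/interim/prices_broken.csv > data/interim/tmp.csv
mv data/interim/tmp.csv data/interim/prices_broken.csv
scripts/qa_csv.sh data/interim/prices_broken.csv || echo "QA failed as expected (exit code $?)"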
5.4 Wrap‑up (10 min)
- Shell is about composable building blocks. Learn 15 commands deeply; combine them fluently.
- Prefer null‑safe find … -print0 | xargs -0 patterns; always quote variables: "$FILE".
- For complex CSV logic, fall back to Python; but shell shines for quick filters and QA.
- We’ll hook these into Make next session so one command runs your whole pipeline.
5.5 Homework (due before Session 6)
Goal: Practice and codify shell workflows into your project: (1) a data stats pipeline, (2) a per‑ticker split utility, (3) a Makefile target, and (4) a short shell‑only EDA text report.
5.5.1 Part A — Data stats pipeline (one‑liners saved to files)
Bash (Colab)
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
mkdir -p reports data/interim
# 1) Count lines and unique tickers
{
echo "Lines (incl header): $(wc -l < data/raw/prices.csv)";
echo "Unique tickers: $(cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq | wc -l)";
} | tee reports/data_counts.txt
# 2) Top-10 days by absolute log_return across all tickers
tail -n +2 data/raw/prices.csv \
| awk -F, '{a=$5; if(a<0) a=-a; print $1","$2","$5","a}' \
| sort -t, -k4,4nr | head -n 10 \
| tee reports/top10_abs_moves.csv || true  # '|| true': head exits early, so sort can get SIGPIPE under pipefail
# 3) Mean log_return per ticker (CSV)
# print the header first, then pipe only the data rows through sort (input file goes right after the awk program)
{
echo "ticker,mean_log_return"
awk -F, 'NR>1 { s[$1]+=$5; n[$1]++ } END { OFS=","; for(t in s) print t, s[t]/n[t] }' data/raw/prices.csv \
| sort -t, -k2,2nr
} | tee reports/mean_return_by_ticker.csv
5.5.2 Part B — Split per‑ticker CSVs into data/interim/ticker=XYZ/ directories
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
mkdir -p data/interim
# Extract header once
HEADER=$(head -n 1 data/raw/prices.csv)
# Create per-ticker files with header + rows (null-safe not necessary here)
cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq | while read -r T; do
mkdir -p "data/interim/ticker=${T}"
{
echo "$HEADER"
awk -F, -v TK="$T" 'NR>1 && $1==TK' data/raw/prices.csv  # data rows only; the header was already written above
} > "data/interim/ticker=${T}/prices_${T}.csv"
done
# Verify one example
ls -la data/interim/ticker=AAPL | head
5.5.3 Part C — Add Makefile targets for QA and per‑ticker split
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
# Append or create a Makefile
# use printf (not echo) so the leading tab of each recipe line is a real tab; $$ becomes $ when make runs the recipe
{
printf '\n%s\n' 'qa:'
printf '\t%s\n' 'scripts/qa_csv.sh data/raw/prices.csv'
printf '\n%s\n' 'split-by-ticker:'
printf '\t%s\n' 'bash -c '\''HEADER=$$(head -n 1 data/raw/prices.csv); cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq | while read -r T; do mkdir -p "data/interim/ticker=$$T"; { echo "$$HEADER"; awk -F, -v TK="$$T" "NR>1 && \$$1==TK" data/raw/prices.csv; } > "data/interim/ticker=$$T/prices_$$T.csv"; done'\'''
} >> Makefile
cat Makefile
Run the targets
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
make qa
make split-by-ticker
5.5.4 Part D — Shell‑only mini EDA report
Create reports/mini_eda.txt with three sections: counts, top moves, mean returns.
Bash
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
{
echo "# Mini EDA (shell-only)"
echo "Generated: $(date)"
echo
echo "## Counts"
echo "Lines (incl header): $(wc -l < data/raw/prices.csv)"
echo "Unique tickers: $(cut -d, -f1 data/raw/prices.csv | tail -n +2 | sort | uniq | wc -l)"
echo
echo "## Top 5 absolute daily moves"
tail -n +2 data/raw/prices.csv \
| awk -F, '{a=$5; if(a<0) a=-a; print $1","$2","$5","a}' \
| sort -t, -k4,4nr | head -n 5 || true  # guard: head exits early, so sort can get SIGPIPE under pipefail
echo
echo "## Mean log_return by ticker (top 10)"
awk -F, 'NR>1 { s[$1]+=$5; n[$1]++ } END { for(t in s) printf "%s,%.6f\n", t, s[t]/n[t] }' data/raw/prices.csv \
| sort -t, -k2,2nr | head -n 10
} | tee reports/mini_eda.txt
5.5.5 Part E — Commit & push your changes (use your short‑lived token as in Session 2)
Bash + Python (getpass)
%%bash
set -euo pipefail
cd "/content/drive/MyDrive/dspt25/unified-stocks-teamX"
git add scripts/qa_csv.sh data/raw/prices.csv data/interim reports/*.txt reports/*.csv Makefile
git status
git commit -m "feat(shell): data QA script, per-ticker split, shell mini-EDA, Make targets"
Python
from getpass import getpass
import os, subprocess

token = getpass("GitHub token (not stored): ")
REPO_OWNER = "YOUR_GITHUB_USERNAME_OR_ORG"
REPO_NAME = "unified-stocks-teamX"
push_url = f"https://{token}@github.com/{REPO_OWNER}/{REPO_NAME}.git"
# Push current HEAD to main (adjust if you prefer a branch + PR)
subprocess.run(["git", "push", push_url, "HEAD:main"], check=True)
del token
5.5.6 Grading (pass/revise)
- scripts/qa_csv.sh exists, is executable, fails on malformed CSV, and passes on clean CSV.
- reports/data_counts.txt, reports/top10_abs_moves.csv, reports/mean_return_by_ticker.csv, and reports/mini_eda.txt generated.
- make qa and make split-by-ticker run successfully.
- Per‑ticker CSVs created under data/interim/ticker=XYZ/.
5.6 Key points
- Quote variables and paths. Prefer -print0 | xargs -0 with find.
- Fail fast in scripts (set -euo pipefail) and return non‑zero exit codes for CI (Continuous Integration).
- Shell is for plumbing; it complements, not replaces, Python.
Next session (6): “Make/just, rsync, ssh/tmux (survey)” and we’ll wire make get-data and make report into a reproducible one‑command pipeline.