2 Session 2 — Git essentials & Git‑LFS
Security note: Today we’ll push from Colab using a short‑lived GitHub personal access token (PAT) entered interactively. Never hard‑code or commit tokens.
2.1 Session 2 — Git essentials & Git‑LFS (75 min)
2.1.1 Learning goals
By the end of class, students can:
- Explain Git’s mental model: working directory → staging → commit; branches and remotes.
- Create a feature branch, commit changes, and push to GitHub from Colab safely.
- Use .gitignore to avoid committing generated artifacts and secrets.
- Install and configure Git‑LFS, track large/binary files, and verify tracking.
- Open a pull request (PR) and follow review etiquette.
2.2 Agenda (75 minutes)
- (8 min) Recap & goals; overview of today’s workflow
- (12 min) Slides: Git mental model; branches; remotes; commit hygiene
- (10 min) Slides: .gitignore must‑haves; Git‑LFS (when/why); LFS quotas & pitfalls
- (35 min) In‑class lab: clone → config → branch → .gitignore → LFS → sample Parquet → push → PR
- (10 min) Wrap‑up; troubleshooting; homework briefing
2.3 Slides
2.3.1 Git mental model
- Working directory (your files) → git add → staging → git commit → local history
- Remote: GitHub hosts a copy. git push publishes commits; git pull brings others’ changes.
- Branch: a movable pointer to a chain of commits. Default is main. Create feature branches for each change.
In Git, a branch is essentially just a movable pointer to a commit.
2.4 1. The simple definition
- A branch has a name (e.g., main, feature/login).
- That name points to a specific commit in your repository.
- As you make new commits on that branch, the pointer moves forward.
2.5 2. Visual example
Let’s say your repo looks like this:
A --- B --- C ← main
Here:
- main is the branch name.
- It points to commit C.
If you make a new branch:
git branch feature
Now you have:
A --- B --- C ← main, feature
If you check out feature and make a commit:
A --- B --- C ← main
             \
              D ← feature
- feature moves forward to D (new commit).
- main stays at C.
2.6 3. HEAD and active branch
- HEAD is your current position: it points to the branch you’re working on.
- When you commit, Git moves that branch forward.
2.7 4. Why branches matter
- Let you work on new features, bug fixes, or experiments without touching the main codebase.
- Cheap to create and delete — Git branching is just updating a tiny file.
- Enable parallel development.
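The pointer model described in the sections above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (commit ids are single letters, and the dictionaries `commits` and `branches` are illustrative names), not how Git stores objects internally.

```python
# Toy model: commits know their parent; branches are just names pointing at commits.
commits = {"A": None, "B": "A", "C": "B"}   # commit id -> parent id
branches = {"main": "C"}                     # branch name -> commit id
head = "main"                                # HEAD names the active branch


def create_branch(name):
    """Like `git branch <name>`: add a new pointer at the current commit."""
    branches[name] = branches[head]


def commit(new_id):
    """Record a commit whose parent is the current tip, then advance only the active branch."""
    commits[new_id] = branches[head]
    branches[head] = new_id


create_branch("feature")   # main and feature both point at C
head = "feature"           # like `git checkout feature`
commit("D")                # feature moves to D; main stays at C

print(branches)            # {'main': 'C', 'feature': 'D'}
print(commits["D"])        # C
```

Creating the branch only added one dictionary entry, which mirrors why branching in Git is cheap.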
2.9 In git checkout -b, the -b means “create a new branch” before checking it out.
2.9.1 Without -b
git checkout branchname
- Switches to an existing branch.
- Fails if the branch does not exist.
2.9.2 With -b
git checkout -b branchname
- Tells Git: “make a branch called branchname pointing to the current commit, and then switch to it.”
- Fails if the branch already exists.
2.9.3 Example
If you’re on main:
git checkout -b feature-x
Steps Git takes:
- Create a new branch pointer feature-x → same commit as main.
- Move HEAD to feature-x (you’re now “on” that branch).
💡 In newer Git versions, the same idea is expressed with:
git switch -c feature-x # -c means create
-b in checkout and -c in switch both mean create.
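The two failure cases above (switching to a missing branch, creating one that already exists) can be illustrated with a small Python function. It is a simplified sketch: the function name `checkout` and its arguments are hypothetical, and it ignores extra behaviors of the real command such as guessing remote branches.

```python
def checkout(branches, head, name, create=False):
    """Return the new HEAD branch name; add a pointer to `branches` when creating."""
    if create:
        if name in branches:
            raise ValueError(f"a branch named '{name}' already exists")
        branches[name] = branches[head]   # new pointer at the current commit
    elif name not in branches:
        raise ValueError(f"branch '{name}' does not exist")
    return name                           # HEAD now names this branch


branches = {"main": "C"}
head = checkout(branches, "main", "feature-x", create=True)
print(head, branches)
```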
2.9.4 Branch & PR etiquette
One feature/change per branch (small, reviewable diffs).
Commit messages: imperative mood, short subject line (≤ 72 chars), details in body if needed:
feat: add git-lfs tracking for parquet
docs: add README section on setup
chore: ignore raw data directory
PR description: what/why, testing notes, checklist. Tag your teammate for review.
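A quick way to practice this convention is a small checker for the subject line. The sketch below assumes a "type: summary" format with a hypothetical list of allowed types (only feat, docs, and chore appear in the examples above; fix is added as an assumption).

```python
import re

# Hypothetical rule set: "<type>: <summary>", subject at most 72 characters.
ALLOWED_TYPES = ("feat", "fix", "docs", "chore")   # assumed list; extend as your team agrees
SUBJECT_RE = re.compile(r"^(%s): \S.*$" % "|".join(ALLOWED_TYPES))


def check_subject(message, max_len=72):
    """Return a list of problems found in the first line of a commit message."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    problems = []
    if len(subject) > max_len:
        problems.append(f"subject is {len(subject)} chars (limit {max_len})")
    if not SUBJECT_RE.match(subject):
        problems.append("subject should look like 'type: summary'")
    if subject.endswith("."):
        problems.append("drop the trailing period")
    return problems


print(check_subject("feat: add git-lfs tracking for parquet"))  # []
print(check_subject("Updated stuff."))
```

The second call reports two problems: the missing type prefix and the trailing period.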
2.9.5 .gitignore must‑haves
- Secrets: .env, API keys (never commit).
- Large/derived artifacts: raw/interim data, logs, cache, compiled assets.
- Notebooks’ checkpoints: .ipynb_checkpoints/.
- OS/editor cruft: .DS_Store (for Mac), Thumbs.db (for Windows), .vscode/.
2.9.6 Git‑LFS
- Git‑LFS = Large File Storage. Keeps pointers in Git; binaries in LFS storage.
- Track only what’s necessary to version (e.g., small processed Parquet samples, posters/PDFs, small models).
- Do not LFS huge raw data you can re‑download (make get-data).
- Quotas apply on Git hosting, so be selective.
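To make "pointers in Git, binaries in LFS storage" concrete, the sketch below builds the small text pointer that stands in for a large file. The three-line layout follows the Git‑LFS pointer format; the function name `lfs_pointer` and the zero-filled stand-in file are illustrative assumptions.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the pointer text that Git stores in place of an LFS-tracked file."""
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


binary = b"\x00" * 5_000_000           # stand-in for a 5 MB binary file
pointer = lfs_pointer(binary)
print(pointer)
print(len(pointer.encode()), "bytes in Git instead of", len(binary))
```

The repository history only carries the short pointer; the real bytes are uploaded to LFS storage, which is where the quota applies.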
2.9.7 Safe pushes from Colab
- Use a fine‑grained PAT limited to a single repo with Contents: Read/Write + Pull requests: Read/Write.
- Enter token via getpass (not stored). Push using a temporary URL (token not saved in git config).
- After push, clear cell output.
2.10 In‑class lab (35 min)
Instructor tip: Students should have created a repo on GitHub before this lab (e.g., unified-stocks-teamX). If not, give them 3 minutes to do so and add their partner as a collaborator.
We’ll:
- Mount Drive & clone the repo.
- Configure Git identity.
- Create a feature branch.
- Add .gitignore.
- Install and configure Git‑LFS.
- Track Parquet & DB files; generate a sample Parquet.
- Commit & push from Colab using a short‑lived PAT.
- Open a PR (via web UI, optional API snippet included).
2.10.1 0) Mount Google Drive and set variables
# Colab cell
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Adjust these two for YOUR repo
REPO_OWNER = "YOUR_GITHUB_USERNAME_OR_ORG"
REPO_NAME = "unified-stocks-teamX"  # e.g., unified-stocks-team1

BASE_DIR = "/content/drive/MyDrive/dspt25"
CLONE_DIR = f"{BASE_DIR}/{REPO_NAME}"
REPO_URL = f"https://github.com/{REPO_OWNER}/{REPO_NAME}.git"

import os, pathlib
pathlib.Path(BASE_DIR).mkdir(parents=True, exist_ok=True)
2.10.2 1) Clone the repo (or pull latest if already cloned)
import os, subprocess, shutil, pathlib

if not pathlib.Path(CLONE_DIR).exists():
    !git clone {REPO_URL} {CLONE_DIR}
else:
    # If the folder exists, just ensure it's a git repo and pull latest
    os.chdir(CLONE_DIR)
    !git status
    !git pull --ff-only  # fast-forward only, to avoid diverged branches

os.chdir(CLONE_DIR)
print("Working dir:", os.getcwd())
2.10.3 2) Configure Git identity (local to this repo)
# Replace with your name and school email
!git config user.name "Your Name"
!git config user.email "you@example.edu"
!git config --get user.name
!git config --get user.email
2.10.4 3) Create and switch to a feature branch
BRANCH = "setup/git-lfs"
!git checkout -b {BRANCH}
!git branch --show-current
2.10.5 4) Add a robust .gitignore
gitignore = """\
# Byte-compiled / cache
__pycache__/
*.py[cod]
# Jupyter checkpoints
.ipynb_checkpoints/
# OS/editor files
.DS_Store
Thumbs.db
.vscode/
# Environments & secrets
.env
.env.*
.venv/
*.pem
*.key
# Data (raw & interim never committed)
data/raw/
data/interim/
# Logs & caches
logs/
.cache/
"""
open(".gitignore", "w").write(gitignore)
print(open(".gitignore").read())
2.10.6 5) Install and initialize Git‑LFS (Colab)
# Install git-lfs on the Colab VM (one-time per runtime); apt-get is the command-line front end of APT, the Advanced Package Tool
!apt-get -y update >/dev/null # refresh the list of available packages from the repositories
!apt-get -y install git-lfs >/dev/null
!git lfs install
!git lfs version
2.10.7 6) Track Parquet/DB/PDF/model binaries with LFS
# Add .gitattributes entries via git lfs track
!git lfs track "data/processed/*.parquet"
!git lfs track "data/*.db"
!git lfs track "models/*.pt"
!git lfs track "reports/*.pdf"
# Show what LFS is tracking and verify .gitattributes created
!git lfs track
print("\n.gitattributes:")
print(open(".gitattributes").read())
Why not LFS for raw? Raw data should be re‑downloadable with make get-data later; don’t burn LFS quota.
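Each git lfs track call appends a line to .gitattributes that routes matching paths through the LFS filter. The sketch below reads such text and lists the LFS patterns; the helper name `lfs_patterns` and the sample text (including the extra `*.txt text` line) are illustrative assumptions.

```python
def lfs_patterns(gitattributes_text):
    """Return the path patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        pattern, *attrs = line.split()
        if "filter=lfs" in attrs:
            patterns.append(pattern)
    return patterns


sample = """\
data/processed/*.parquet filter=lfs diff=lfs merge=lfs -text
data/*.db filter=lfs diff=lfs merge=lfs -text
*.txt text
"""
print(lfs_patterns(sample))  # ['data/processed/*.parquet', 'data/*.db']
```

In the lab repo you could pass `open(".gitattributes").read()` to the same function to confirm that data/raw/ is not among the tracked patterns.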
2.10.8 7) Create a small Parquet file to test LFS
import pandas as pd, numpy as np, os, pathlib
pathlib.Path("data/processed").mkdir(parents=True, exist_ok=True)

tickers = pd.read_csv("tickers_25.csv")["ticker"].tolist() if os.path.exists("tickers_25.csv") else [
    "AAPL","MSFT","AMZN","GOOGL","META","NVDA","TSLA","JPM","JNJ","V",
    "PG","HD","BAC","XOM","CVX","PFE","KO","DIS","NFLX","INTC","CSCO","ORCL","T","VZ","WMT"
]
# 1000 business days x up to 25 tickers ~ 25k rows; typically well under 1 MB as Parquet
dates = pd.bdate_range("2018-01-01", periods=1000)
df = (pd.MultiIndex.from_product([tickers, dates], names=["ticker","date"])
        .to_frame(index=False))
rng = np.random.default_rng(42)
df["r_1d"] = rng.normal(0, 0.01, size=len(df))  # synthetic daily returns
df.to_parquet("data/processed/sample_returns.parquet", index=False)
df.head()
2.10.9 8) Stage and commit changes
!git add .gitignore .gitattributes data/processed/sample_returns.parquet
!git status
!git commit -m "feat: add .gitignore and git-lfs tracking; add sample Parquet"
!git log --oneline -n 2 # limit to the most recent 2 commits
If you see the error “error: cannot run .git/hooks/post-commit: No such file or directory”, the post-commit hook is missing, is not executable, or starts with an invalid interpreter line.
Troubleshooting the post-commit hook error
- See what Git is trying to run:
ls -l .git/hooks/post-commit
If you see -rw-r--r--, it’s not executable.
- Make it executable:
chmod +x .git/hooks/post-commit
- Ensure it has a valid shebang (first line). Print the first line and confirm it is one of the following:
head -n 1 .git/hooks/post-commit
#!/bin/sh
# or
#!/usr/bin/env bash
# or (if it’s Node)
#!/usr/bin/env node
Save if you needed to fix that.
- Test the hook manually
.git/hooks/post-commit
# or explicitly with the interpreter you expect, e.g.:
bash .git/hooks/post-commit
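The same checks can be scripted from a notebook cell. The sketch below is one possible approach, assuming a POSIX-style file system; `diagnose_hook` and `make_executable` are hypothetical helper names, and the demo runs on a throwaway file rather than a real hook.

```python
import os
import stat
import tempfile

EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def diagnose_hook(path):
    """Return a list of likely problems with a Git hook script."""
    if not os.path.exists(path):
        return ["missing"]
    problems = []
    if not os.stat(path).st_mode & EXEC_BITS:
        problems.append("not executable")
    with open(path, "rb") as fh:
        first_line = fh.readline()
    if not first_line.startswith(b"#!"):
        problems.append("no shebang")
    return problems


def make_executable(path):
    """Equivalent of `chmod +x`: add execute bits for user, group, and others."""
    os.chmod(path, os.stat(path).st_mode | EXEC_BITS)


# Demo on a temporary file instead of a real .git/hooks/post-commit
with tempfile.TemporaryDirectory() as tmp:
    hook = os.path.join(tmp, "post-commit")
    with open(hook, "w") as fh:
        fh.write("#!/bin/sh\necho done\n")
    os.chmod(hook, 0o644)
    before = diagnose_hook(hook)
    make_executable(hook)
    after = diagnose_hook(hook)

print(before, after)
```

In the lab repo you would call `diagnose_hook(".git/hooks/post-commit")` after changing into the clone directory.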
2.10.10 9) Push from Colab with a short‑lived token (safe method)
Create a fine‑grained PAT at GitHub → Settings → Developer settings → Fine‑grained tokens
- Resource owner: your username/org
- Repositories: only select repositories
- Permissions: Contents (Read/Write), Pull requests (Read/Write)
- Expiration: short (e.g., 7 days)
# Colab cell: push using a temporary URL with token (not saved to git config)
from getpass import getpass
token = getpass("Enter your GitHub token (input hidden; not stored): ")

push_url = f"https://{token}@github.com/{REPO_OWNER}/{REPO_NAME}.git"
!git push {push_url} {BRANCH}:{BRANCH}
# Optional: immediately clear the token and the URL that embeds it
del token, push_url
If the push fails with a hook error, check the pre-push hook:
2.10.11 1. Check permissions
ls -l .git/hooks/pre-push
If it looks like -rw-r--r--, then it’s missing the executable bit. Fix:
chmod +x .git/hooks/pre-push
2.10.12 2. Check the first line (shebang)
Open it:
head -n 1 .git/hooks/pre-push
You should see something like:
#!/bin/sh
or
#!/usr/bin/env bash
If it’s missing, add a valid shebang.
2.10.13 3. Test the hook manually
.git/hooks/pre-push
# or explicitly:
bash .git/hooks/pre-push
If the command prints the URL, clear this cell’s output after a successful push (Colab: “⋮” → “Clear output”).
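Besides clearing the output, one option is to redact the token from any text before it is printed. The sketch below uses a made-up token and a made-up output string purely for illustration; `redact` is a hypothetical helper, not part of Git or Colab.

```python
def redact(text, secret):
    """Replace every occurrence of the secret with a placeholder before printing."""
    if not secret:
        return text
    return text.replace(secret, "***")


# Hypothetical values for illustration; a real token would come from getpass()
fake_token = "github_pat_EXAMPLE123"
push_url = f"https://{fake_token}@github.com/OWNER/REPO.git"
sample_output = f"To {push_url}\n * [new branch]      setup/git-lfs -> setup/git-lfs\n"

print(redact(sample_output, fake_token))
```

In a notebook, this could be paired with `subprocess.run([...], capture_output=True, text=True)` so that the command's output passes through `redact` before it is displayed.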
2.10.14 10) Open a Pull Request
The name “pull request” can be confusing at first — it sounds like you are “pushing” your code, but really you’re asking someone else to pull it.
2.11 Origin of the term
The phrase comes from distributed version control (like Git before GitHub’s UI popularized it).
If you had changes in your branch/repo and wanted them in the upstream project, you’d contact the maintainer and say:
“Please pull these changes from my branch into yours.”
So a pull request is literally a request for someone else to pull your commits.
2.12 How it works (e.g., on GitHub, GitLab, Bitbucket)
- You push your branch to your fork or to the remote repository.
- You open a pull request against the target branch (usually main or develop).
- The repository maintainers review your code.
- If accepted, they “pull” your commits into their branch (though under the hood it’s often implemented as a merge or rebase).
2.13 Contrast with “push”
- Push: You directly upload commits to a remote branch you have permission to write to.
- Pull request: You don’t merge directly — instead, you ask maintainers to pull your changes, review them, and integrate them.
Summary: It’s called a pull request because you’re not pushing your changes into the target branch; you’re asking the project owner/maintainer to pull your branch into theirs.
Recommended (web UI): Navigate to your repo on GitHub → Compare & pull request → base: main, compare: setup/git-lfs. Fill title/description, tag your partner, and create the PR.
Optional (API): open a PR programmatically from Colab:
# OPTIONAL: Create PR via GitHub API (requires token again)
from getpass import getpass
import requests, json

token = getpass("GitHub token (again, not stored): ")
headers = {"Authorization": f"Bearer {token}",
           "Accept": "application/vnd.github+json"}
payload = {
    "title": "Setup: .gitignore + Git-LFS + sample Parquet",
    "head": BRANCH,
    "base": "main",
    "body": "Adds .gitignore, configures Git-LFS for parquet/db/pdf/model files, and commits a sample Parquet for verification."
}
r = requests.post(f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/pulls",
                  headers=headers, data=json.dumps(payload))
print("PR status:", r.status_code)
try:
    pr_url = r.json()["html_url"]
    print("PR URL:", pr_url)
except Exception as e:
    # No html_url in the response: show the raw body to see what went wrong
    print("Response:", r.text)
# Remove the token and the headers that embed it
del token, headers
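When the API call does not return a PR URL, the status code usually hints at the cause. The mapping below is an assumed summary of common GitHub REST API responses, not an exhaustive or official list; `explain_status` is a hypothetical helper.

```python
# Assumed mapping of common status codes to likely causes; adjust to what you observe.
HINTS = {
    201: "PR created.",
    401: "Bad or expired token.",
    403: "Token lacks permission (check Pull requests: Read/Write).",
    404: "Repo not found or not visible to this token (check owner, name, and repo selection).",
    422: "Validation failed (a PR for this branch may already exist, or there are no new commits).",
}


def explain_status(code):
    """Translate an HTTP status code into a troubleshooting hint."""
    return HINTS.get(code, f"Unexpected status {code}; inspect the response body.")


for code in (201, 422, 500):
    print(code, "->", explain_status(code))
```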
2.13.1 11) Quick verification checklist
- git lfs ls-files shows data/processed/sample_returns.parquet:
!git lfs ls-files
- PR diff shows a small pointer for the Parquet, not raw binary content.
- .gitignore present; no secrets or raw data committed.
2.14 Wrap‑up
- Keep PRs small and focused; write helpful titles and descriptions.
- Don’t commit secrets or large data. Use .env + .env.example.
- Use LFS selectively: version only small, important binaries (e.g., sample processed sets, posters).
- Next time: Quarto polish (already started) and Unix automation to fetch raw data reproducibly.
2.15 Homework (due before Session 3)
Goal: Cement branch/PR hygiene, add review scaffolding, and add a small guard against large files accidentally committed outside LFS.
2.15.1 Part A — Add a PR template and CODEOWNERS
Create a PR template so every PR includes key info.
# Run in your repo root
import os, pathlib, textwrap
pathlib.Path(".github").mkdir(exist_ok=True)
tpl = textwrap.dedent("""\
## Summary
What does this PR do and why?
## Changes
-
## How to test
- From a fresh clone: steps to run
## Checklist
- [ ] Runs from a fresh clone (README steps)
- [ ] No secrets committed; `.env` only (and `.env.example` updated if needed)
- [ ] Large artifacts tracked by LFS (`git lfs ls-files` shows expected files)
- [ ] Clear, small diff; comments where useful
""")
open(".github/pull_request_template.md","w").write(tpl)
print("Wrote .github/pull_request_template.md")
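(Optional) Have GitHub request a review from both teammates automatically by adding CODEOWNERS (edit the handles). Reviews become mandatory only if the repository’s branch protection also requires code owner review:
owners = """\
# Replace with your GitHub handles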
* @teammate1 @teammate2
"""
open(".github/CODEOWNERS","w").write(owners)
print("Wrote .github/CODEOWNERS (edit handles!)")
Commit and push on a new branch (example: chore/pr-template), open a PR, and merge after review. If working on Google Drive, run the following before any git operations:
chmod +x .git/hooks/*
2.15.2 Part B — Add a large‑file guard (simple Python script)
Create a small tool that fails if files > 10 MB are found and aren’t tracked by LFS. This will be used manually for now (automation later in CI).
# tools/guard_large_files.py
import os, subprocess, sys

LIMIT_MB = 10
ROOT = os.getcwd()

def lfs_tracked_paths():  # find all files tracked by LFS
    try:
        out = subprocess.check_output(["git", "lfs", "ls-files"], text=True)
        tracked = set()
        for line in out.strip().splitlines():
            # line format: "<oid> <*|-> <path>"; keep only the path
            p = line.split(None, 2)[-1].strip()
            tracked.add(os.path.normpath(p))
        return tracked
    except Exception:
        return set()

def humanize(bytes_):
    return f"{bytes_/(1024*1024):.2f} MB"

lfs_set = lfs_tracked_paths()
bad = []
for dirpath, dirnames, filenames in os.walk(ROOT):
    # skip .git directory
    if ".git" in dirpath.split(os.sep):
        continue
    for fn in filenames:
        path = os.path.normpath(os.path.join(dirpath, fn))
        try:
            size = os.path.getsize(path)
        except FileNotFoundError:
            continue
        if size >= LIMIT_MB * 1024 * 1024:
            rel = os.path.relpath(path, ROOT)
            if rel not in lfs_set:
                bad.append((rel, size))

if bad:
    print("ERROR: Large non-LFS files found:")
    for rel, size in bad:
        print(f" - {rel} ({humanize(size)})")
    sys.exit(1)
else:
    print("OK: No large non-LFS files detected.")
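The guard depends on reading paths out of git lfs ls-files, whose lines contain a short object id, a marker (* when the full file is present in the working tree, - when only the pointer is), and the path. The sketch below parses such output; the sample text and object ids are made up, and `parse_lfs_ls_files` is a hypothetical helper.

```python
def parse_lfs_ls_files(output):
    """Map each LFS-tracked path to (short oid, marker)."""
    tracked = {}
    for line in output.splitlines():
        parts = line.split(None, 2)   # maxsplit=2 keeps spaces inside the path
        if len(parts) == 3:
            oid, marker, path = parts
            tracked[path] = (oid, marker)
    return tracked


# Made-up sample output for illustration
sample = (
    "4d7a214614 * data/processed/sample_returns.parquet\n"
    "9f3c2b1a0e - reports/final poster.pdf\n"
)
for path, (oid, marker) in parse_lfs_ls_files(sample).items():
    print(marker, oid, path)
```

Splitting at most twice matters: it drops both the object id and the marker while leaving a path that contains spaces intact.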
Add a Makefile target to run it. The snippet below creates the tools directory (save the script above there as tools/guard_large_files.py) and adds the target:
from pathlib import Path

# Define the path to the tools directory
tools_dir = Path("tools")

# Create it if it doesn't exist (including any parents)
tools_dir.mkdir(parents=True, exist_ok=True)
print(f"Directory '{tools_dir}' is ready.")

# Create/append Makefile target
text = "\n\nguard:\n\tpython tools/guard_large_files.py\n"  # guard: Makefile target; the recipe line must start with a tab (\t)
p = Path("Makefile")  # point to the Makefile
# Appending unconditionally would duplicate the target on every run and break make,
# so append only when no "guard:" target exists yet.
if p.exists():
    content = p.read_text()
    if "guard:" not in content:
        p.write_text(content + text)
else:
    p.write_text(text)
print("Makefile has a 'guard' target")
After running the snippet: Your repo has a Makefile with a guard target. Running:
make guard
will execute your Python script:
python tools/guard_large_files.py
Run locally/Colab:
!python tools/guard_large_files.py
Commit on a new branch (e.g., chore/large-file-guard), push, open PR, and merge after review.
2.15.3 Part C — Branch/PR practice (each student)
Each student creates their own branch (e.g., docs/readme-username) and:
- Adds a “Development workflow” section in README.md (1–2 paragraphs): how to clone, mount Drive in Colab, install requirements, and where outputs go.
- Adds themselves to README.md “Contributors” section with a GitHub handle link.
Push branch and open a PR.
Partner reviews the PR:
- Leave at least 2 useful comments (nits vs blockers).
- Approve when ready; the author merges.
Expected files touched: README.md, .github/pull_request_template.md, optional .github/CODEOWNERS, tools/guard_large_files.py, Makefile.
2.15.4 Part D — Prove LFS is working
- On main, run:
!git lfs ls-files
- You should see data/processed/sample_returns.parquet (and any other tracked binaries).
- In the GitHub web UI, click the file to confirm it’s an LFS pointer, not full binary contents.
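The pointer check can also be done on raw bytes. The sketch below tests whether a blob looks like a Git‑LFS pointer rather than real Parquet content; it assumes the pointer layout of the LFS spec (a version line, an oid line, under 1024 bytes), and `is_lfs_pointer` is a hypothetical helper.

```python
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1\n"


def is_lfs_pointer(blob: bytes) -> bool:
    """True if the bytes look like a Git-LFS pointer rather than real file content."""
    return (
        len(blob) < 1024
        and blob.startswith(LFS_HEADER)
        and b"\noid sha256:" in blob
    )


pointer_blob = LFS_HEADER + b"oid sha256:" + b"0" * 64 + b"\nsize 123456\n"
parquet_blob = b"PAR1" + b"\x00" * 2000      # Parquet files begin with the magic bytes PAR1

print(is_lfs_pointer(pointer_blob))   # True
print(is_lfs_pointer(parquet_blob))   # False
```

One way to obtain the blob that Git itself stores is `git cat-file -p HEAD:data/processed/sample_returns.parquet`, which should print the pointer text if LFS tracking worked.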
2.15.5 Submission checklist (pass/revise)
- Two merged PRs (template + guard) with clear titles and descriptions.
- README updated with development workflow and contributors.
- git lfs ls-files shows expected files.
- tools/guard_large_files.py present and passes (OK) on main.
2.16 Key points
- Small PRs win. Short diffs → fast, focused reviews.
- Don’t commit secrets. .env only; keep .env.example up to date.
- Use LFS sparingly and purposefully; prefer regenerating big raw data.
- Colab pushes: use a short‑lived token, and clear outputs after use.
Next session: Quarto reporting polish and pipeline hooks; soon after, Unix automation so make get-data can reproducibly fetch raw data for the unified‑stocks project.