Git Internals: Understanding How Git Actually Works

Understanding Git's internal architecture transforms you from someone who uses Git commands to someone who truly understands version control. This deep dive explores Git's data structures, storage mechanisms, and internal processes.

Git's Core Philosophy

Git as a Content-Addressable Filesystem

Git is fundamentally a content-addressable filesystem with a version control system built on top. Every object is stored and retrieved by a hash of its content: SHA-1 by default, computed over a short type-and-size header followed by the raw bytes.

# Git stores everything as objects identified by SHA-1 hashes
echo "Hello, World!" | git hash-object --stdin
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d

# The content determines the hash - same content = same hash
echo "Hello, World!" | git hash-object --stdin
# Always produces: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
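
The hash is not taken over the file bytes alone. As a minimal sketch, assuming a SHA-1 repository, the following reproduces what `git hash-object --stdin` computes for a blob; the function name `git_blob_hash` is made up for illustration.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to a blob with this content."""
    # Git hashes a header ("blob <size>" followed by a NUL byte) plus the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The empty blob has the same well-known ID in every SHA-1 repository.
print(git_blob_hash(b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Identical content always produces the identical ID.
print(git_blob_hash(b"same bytes\n") == git_blob_hash(b"same bytes\n"))
# True
```

Because the size is part of the hashed header, two blobs can only share an ID if their full content is byte-for-byte identical.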

Key Principles

  • Immutability: Once created, objects never change
  • Content-Addressable: Objects are identified by their content hash
  • Distributed: Every full clone contains the complete history
  • Integrity: SHA-1 hashes ensure data integrity
  • Efficiency: Delta compression and pack files optimize storage

The .git Directory Structure

Exploring the .git Directory

# Create a sample repository to explore
mkdir git-internals-demo && cd git-internals-demo
git init

# Examine the .git directory structure
find .git -type f | head -20

Directory Layout

.git/
├── HEAD                    # Points to current branch
├── config                  # Repository configuration
├── description             # Repository description for GitWeb
├── index                   # Staging area (binary file)
├── hooks/                  # Git hooks directory
├── info/                   # Additional repository information
│   └── exclude             # Local .gitignore patterns
├── objects/                # Object database
│   ├── info/               # Object metadata
│   └── pack/               # Pack files for compression
├── refs/                   # References (branches, tags)
│   ├── heads/              # Local branches
│   ├── remotes/            # Remote-tracking branches
│   └── tags/               # Tags
├── logs/                   # Reference logs (reflog)
│   ├── HEAD                # HEAD reflog
│   └── refs/               # Branch-specific reflogs
└── COMMIT_EDITMSG          # Last commit message

Key Files Explained

# HEAD points to the current branch
cat .git/HEAD
# ref: refs/heads/main   (or refs/heads/master, depending on init.defaultBranch)

# Config contains repository settings
cat .git/config
# [core]
#     repositoryformatversion = 0
#     filemode = true
#     bare = false

# Index is the staging area (binary file)
git ls-files --stage
# Shows staged files with their blob hashes

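# View current branch reference (the file exists only after the first commit)
cat .git/refs/heads/main
# Contains the SHA-1 hash of the latest commit on main
# If the file is missing later on, the ref may live in .git/packed-refs instead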

Git Object Types

The Four Object Types

Git stores all data in four types of objects:

Object Type   Purpose               Contains
Blob          File content          Raw file data
Tree          Directory structure   References to blobs and other trees
Commit        Version snapshots     Tree reference + metadata
Tag           Named references      Commit reference + annotation

Blob Objects

Blobs store file content:

# Create a file and add it to Git
echo "Hello, Git Internals!" > file.txt
git add file.txt

# Find the blob hash
git ls-files --stage
# 100644 blob_hash 0    file.txt

# Examine the blob object
blob_hash=$(git ls-files --stage file.txt | cut -d' ' -f2)
git cat-file -t "$blob_hash"    # Shows object type: blob
git cat-file -p "$blob_hash"    # Shows content: Hello, Git Internals!
git cat-file -s "$blob_hash"    # Shows size in bytes

# Create blob directly
echo "Direct blob creation" | git hash-object --stdin -w
# -w flag writes to object database

Blob structure:

blob <content-size>\0<content>

Tree Objects

Trees represent directory snapshots:

# Create a more complex structure
mkdir subdir
echo "File in subdirectory" > subdir/nested.txt
echo "Root file" > root.txt
git add .
git commit -m "Create tree structure"

# Find the root tree
commit_hash=$(git rev-parse HEAD)
git cat-file -p "$commit_hash" | grep '^tree'
# tree <tree_hash>

# Examine the tree
# (the pattern is anchored because the commit message also contains "tree")
tree_hash=$(git cat-file -p "$commit_hash" | grep '^tree' | cut -d' ' -f2)
git cat-file -p "$tree_hash"
# 100644 blob <hash>    file.txt
# 100644 blob <hash>    root.txt
# 040000 tree <hash>    subdir

# Examine the subdirectory tree
subdir_tree=$(git cat-file -p "$tree_hash" | grep subdir | cut -f1 | cut -d' ' -f3)
git cat-file -p "$subdir_tree"
# 100644 blob <hash>    nested.txt

Tree object format:

tree <size>\0
<mode> <filename>\0<20-byte SHA-1 hash>
<mode> <filename>\0<20-byte SHA-1 hash>
...

File modes in trees:

  • 100644: Normal file
  • 100755: Executable file
  • 120000: Symbolic link
  • 040000: Directory (tree)
  • 160000: Gitlink (submodule)
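
To make the binary tree layout concrete, here is a hedged sketch that serializes and parses tree entries in the format shown above. The helper names and the sample file names are illustrative only, and the sort is simplified (real Git orders directory names as if they ended in `/`). Inside the object, a directory mode is stored as `40000`, without the leading zero that `git cat-file -p` prints.

```python
import hashlib

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"


def build_tree_body(entries):
    """Serialize (mode, name, hex_id) tuples into a raw tree body."""
    body = b""
    # Simplified ordering: plain sort by name.
    for mode, name, oid_hex in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid_hex)
    return body


def parse_tree_body(body):
    """Return a list of (mode, name, hex_id) entries from a raw tree body."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        oid_hex = body[nul + 1:nul + 21].hex()  # 20 raw bytes, not hex text
        entries.append((mode, name, oid_hex))
        pos = nul + 21
    return entries


def tree_id(body):
    """Hash the tree the same way as any other object: header + body."""
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


body = build_tree_body([("100755", "run.sh", EMPTY_BLOB),
                        ("100644", "root.txt", EMPTY_BLOB)])
for entry in parse_tree_body(body):
    print(entry)

# A tree with no entries has a well-known ID.
print(tree_id(b""))
# 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```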

Commit Objects

Commits link trees to create history:

# Examine a commit object
git cat-file -p HEAD
# tree <tree_hash>
# parent <parent_commit_hash>  (if not initial commit)
# author Name <email> <timestamp> <timezone>
# committer Name <email> <timestamp> <timezone>
#
# Commit message

# See commit with multiple parents (merge commit)
# Create a merge to demonstrate
git checkout -b feature
echo "Feature work" > feature.txt
git add feature.txt
git commit -m "Add feature"
git checkout main
# main has no new commits, so a plain merge would only fast-forward;
# --no-ff forces a real merge commit with two parents
git merge --no-ff -m "Merge feature" feature

# Examine the merge commit
git cat-file -p HEAD
# tree <tree_hash>
# parent <first_parent_hash>
# parent <second_parent_hash>
# ...

Commit object structure:

commit <size>\0
tree <tree_SHA-1>
parent <parent_SHA-1>
author <name> <email> <timestamp> <timezone>
committer <name> <email> <timestamp> <timezone>

<commit message>
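
The same layout can be assembled and read back by hand. This is a rough sketch, not Git's own parser: it ignores multi-line headers such as `gpgsig`, and the identity string and function names are invented for the example.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
# Hypothetical identity, used for both author and committer.
IDENT = "A U Thor <author@example.com> 1700000000 +0000"


def build_commit_body(tree, parents, author, committer, message):
    """Assemble the text of a commit object (without the object header)."""
    headers = ["tree " + tree]
    headers += ["parent " + p for p in parents]  # zero, one, or several
    headers += ["author " + author, "committer " + committer]
    return ("\n".join(headers) + "\n\n" + message).encode("utf-8")


def commit_id(body):
    """Hash the body with its 'commit <size>\\0' header."""
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def parse_commit_body(body):
    """Split a commit body into header fields and message."""
    head, _, message = body.decode("utf-8").partition("\n\n")
    info = {"parents": [], "message": message}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            info["parents"].append(value)
        else:
            info[key] = value
    return info


root = build_commit_body(EMPTY_TREE, [], IDENT, IDENT, "Initial commit\n")
child = build_commit_body(EMPTY_TREE, [commit_id(root)], IDENT, IDENT, "Second commit\n")
print(parse_commit_body(child)["parents"] == [commit_id(root)])
# True
```

Since the parent ID is part of the hashed text, changing any ancestor changes the ID of every descendant commit.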

Tag Objects

Annotated tags are objects that attach a name, tagger, and message to another object, usually a commit:

# Create annotated tag
git tag -a v1.0 -m "Version 1.0 release"

# Examine tag object
git cat-file -p v1.0
# object <commit_hash>
# type commit
# tag v1.0
# tagger Name <email> <timestamp> <timezone>
#
# Version 1.0 release

# Lightweight tags are just references
git tag v1.0-light
cat .git/refs/tags/v1.0-light
# Contains commit hash directly (not a tag object)

Object Storage and Retrieval

How Git Stores Objects

# Objects are stored in .git/objects/
# Hash: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# Stored at: .git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d

# Create an object to see storage
echo "Test content" | git hash-object --stdin -w
hash=$(echo "Test content" | git hash-object --stdin)
echo "Hash: $hash"

# Find the object file
ls -la .git/objects/${hash:0:2}/
# File name is remaining 38 characters of hash

# Objects are compressed with zlib
file .git/objects/${hash:0:2}/${hash:2}
# Should show: zlib compressed data

# Decompress manually (advanced)
python3 -c "
import zlib
with open('.git/objects/${hash:0:2}/${hash:2}', 'rb') as f:
    content = zlib.decompress(f.read())
    print(repr(content))
"
# b'blob 13\x00Test content\n'
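
The write path mirrors the read path. Below is a small sketch of a loose-object store that works in any directory, so it can be tried without touching a real repository; `write_loose_object` and `read_loose_object` are illustrative names, not Git APIs.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir, obj_type, content):
    """Store content as a zlib-compressed loose object and return its ID."""
    store = obj_type.encode() + b" " + str(len(content)).encode() + b"\0" + content
    oid = hashlib.sha1(store).hexdigest()
    # First 2 hex digits name the directory, the remaining 38 the file.
    path = os.path.join(objects_dir, oid[:2], oid[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    if not os.path.exists(path):  # objects are immutable, so never rewrite
        with open(path, "wb") as f:
            f.write(zlib.compress(store))
    return oid


def read_loose_object(objects_dir, oid):
    """Return (type, content) for a loose object."""
    with open(os.path.join(objects_dir, oid[:2], oid[2:]), "rb") as f:
        store = zlib.decompress(f.read())
    header, _, content = store.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


with tempfile.TemporaryDirectory() as objects_dir:
    oid = write_loose_object(objects_dir, "blob", b"Test content\n")
    print(read_loose_object(objects_dir, oid))
    # ('blob', b'Test content\n')
```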

Object Lookup Process

# Git resolves partial hashes
git rev-parse 8ab6      # Expands to full hash if unique
git show 8ab6           # Shows object if unique

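# Simplified lookup order for a name such as "main" or "8ab6":
# 1. Split off suffix operators (@{n}, ^, ~) and resolve the base name first
# 2. Treat a full 40-hex-digit string as an object ID
# 3. Try the name as a ref, in order: .git/<name>, refs/<name>,
#    refs/tags/<name>, refs/heads/<name>, refs/remotes/<name>
# 4. Otherwise expand it as an abbreviated object ID (at least 4 hex digits) if unique
# 5. Apply the suffix operators to the result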
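
The partial-hash expansion above boils down to a unique-prefix search. A toy version, assuming the set of known object IDs is already in memory (Git itself searches loose object directories and pack indexes instead), could look like this; `expand_prefix` is a made-up name.

```python
def expand_prefix(prefix, object_ids):
    """Expand an abbreviated object ID if exactly one known ID starts with it."""
    prefix = prefix.lower()
    if len(prefix) < 4:
        raise ValueError("abbreviation must be at least 4 hex digits")
    matches = sorted(oid for oid in object_ids if oid.startswith(prefix))
    if not matches:
        raise KeyError("no object matches " + prefix)
    if len(matches) > 1:
        raise ValueError("ambiguous prefix " + prefix)
    return matches[0]


known = {
    "8ab686eafeb1f44702738c8b0f24f2567c36da6d",
    "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391",
    "e69d0000000000000000000000000000000000aa",  # invented ID to force ambiguity
}
print(expand_prefix("8ab6", known))
# 8ab686eafeb1f44702738c8b0f24f2567c36da6d
print(expand_prefix("e69de", known))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Calling `expand_prefix("e69d", known)` raises `ValueError`, which corresponds to Git's "short object ID is ambiguous" error.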

The Index (Staging Area)

Understanding the Index

The index is a binary file that acts as a staging area between your working directory and the repository.

# View index contents
git ls-files --stage
# Shows: <mode> <hash> <stage> <filename>

# Index is more than just file tracking
git ls-files --debug
# Shows detailed index information including timestamps, device info, etc.

# The index tracks:
# - File mode and permissions
# - Timestamps (mtime, ctime)
# - File size
# - Device and inode numbers
# - SHA-1 hash of content
# - Stage number (0=normal, 1-3=conflict resolution)
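
The fields listed above map directly onto the on-disk layout. The sketch below handles version 2 of the index format only, skips extensions and the trailing checksum, and builds its own sample data rather than reading a real `.git/index`; all function names are illustrative.

```python
import struct

HEADER = struct.Struct(">4sII")          # signature, version, entry count
ENTRY_FIXED = struct.Struct(">10I20sH")  # stat fields, object ID, flags (62 bytes)
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"


def entry_length(name_len):
    # Each entry is NUL-padded (1 to 8 bytes) to a multiple of 8 bytes.
    return (ENTRY_FIXED.size + name_len + 8) // 8 * 8


def pack_entry(path, oid_hex, mode=0o100644, size=0, stage=0):
    """Build one version-2 entry; timestamps, dev, ino, uid, gid are zeroed."""
    name = path.encode("utf-8")
    flags = (stage << 12) | min(len(name), 0xFFF)
    fixed = ENTRY_FIXED.pack(0, 0, 0, 0, 0, 0, mode, 0, 0, size,
                             bytes.fromhex(oid_hex), flags)
    return (fixed + name).ljust(entry_length(len(name)), b"\0")


def parse_index(data):
    """Return (version, entries) from raw version-2 index bytes."""
    signature, version, count = HEADER.unpack_from(data, 0)
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    if version != 2:
        raise ValueError("this sketch only handles index version 2")
    entries, pos = [], HEADER.size
    for _ in range(count):
        fields = ENTRY_FIXED.unpack_from(data, pos)
        mode, size, oid, flags = fields[6], fields[9], fields[10], fields[11]
        name_len = flags & 0xFFF
        start = pos + ENTRY_FIXED.size
        entries.append({
            "mode": oct(mode)[2:],
            "oid": oid.hex(),
            "stage": (flags >> 12) & 3,
            "size": size,
            "path": data[start:start + name_len].decode("utf-8"),
        })
        pos += entry_length(name_len)
    return version, entries


raw = (HEADER.pack(b"DIRC", 2, 2)
       + pack_entry("modified.txt", EMPTY_BLOB)
       + pack_entry("staged.txt", EMPTY_BLOB))
version, entries = parse_index(raw)
for e in entries:
    print(e["mode"], e["oid"], e["stage"], e["path"])
```

The printed columns match what `git ls-files --stage` shows: mode, object ID, stage number, path.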

Index File Structure

# Create files in different states
echo "Staged content" > staged.txt
echo "Modified content" > modified.txt
git add staged.txt modified.txt
echo "New modified content" > modified.txt  # Modify after staging

# Compare working tree vs index vs HEAD
git diff                    # Working tree vs index
git diff --cached          # Index vs HEAD  
git diff HEAD              # Working tree vs HEAD

# Status shows index state
git status --porcelain
# A  staged.txt     (added to index)
# AM modified.txt   (added to index, modified in working tree)

Index Operations

# Low-level index operations
git update-index --add --cacheinfo "100644,$blob_hash,filename"
git update-index --force-remove filename  # --remove alone only drops entries whose file is gone
git update-index --refresh  # Update stat info

# Write tree from index
git write-tree              # Creates tree object from index
# Returns tree hash

# Read tree to index
git read-tree <tree_hash>   # Loads tree into index
git read-tree --reset HEAD  # Reset index to HEAD

References and the Reflog

Understanding References

# References are human-readable names for SHA-1 hashes
ls -la .git/refs/heads/     # Local branches
ls -la .git/refs/remotes/   # Remote-tracking branches
ls -la .git/refs/tags/      # Tags

# Each ref file contains a SHA-1 hash
cat .git/refs/heads/main
# 1a2b3c4d5e6f7890abcdef1234567890abcdef12

# HEAD is special - it points to current branch
cat .git/HEAD
# ref: refs/heads/main

Symbolic References

# HEAD is a symbolic reference
git symbolic-ref HEAD
# refs/heads/main

# Create symbolic reference
git symbolic-ref refs/heads/alias refs/heads/main
# Now 'alias' points to whatever 'main' points to

# Update references safely
git update-ref refs/heads/main $new_commit_hash
git update-ref -d refs/heads/branch-to-delete  # Delete reference
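
Resolving a reference is mostly a matter of following `ref:` lines until a hash appears. The following sketch works on a throwaway directory that imitates a `.git` layout; it covers loose ref files and `packed-refs` only, and the function names are invented for the example.

```python
import os
import tempfile


def read_packed_ref(git_dir, name):
    """Look a ref up in packed-refs; return its hash or None."""
    packed = os.path.join(git_dir, "packed-refs")
    if not os.path.isfile(packed):
        return None
    with open(packed) as f:
        for line in f:
            line = line.strip()
            # Skip blanks, "#" comment lines, and "^" peeled-tag lines.
            if not line or line[0] in "#^":
                continue
            oid, _, ref = line.partition(" ")
            if ref == name:
                return oid
    return None


def resolve_ref(git_dir, name, max_depth=5):
    """Follow symbolic refs from name until a hash is found."""
    for _ in range(max_depth):
        path = os.path.join(git_dir, *name.split("/"))
        if not os.path.isfile(path):
            return read_packed_ref(git_dir, name)
        with open(path) as f:
            value = f.read().strip()
        if not value.startswith("ref: "):
            return value
        name = value[len("ref: "):]
    raise ValueError("symbolic ref chain is too deep")


with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/main\n")
    with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
        f.write("1a2b3c4d5e6f7890abcdef1234567890abcdef12\n")
    print(resolve_ref(git_dir, "HEAD"))
    # 1a2b3c4d5e6f7890abcdef1234567890abcdef12
```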

The Reflog

The reflog tracks reference changes over time:

# View HEAD reflog
git reflog
# <hash> HEAD@{0}: commit: Latest commit
# <hash> HEAD@{1}: commit: Previous commit
# <hash> HEAD@{2}: checkout: moving from feature to main

# Reflog for specific branch
git reflog show main
git reflog show origin/main

# Reflog entries are stored as files
ls -la .git/logs/refs/heads/
cat .git/logs/refs/heads/main
# Each line: <old_hash> <new_hash> <name> <email> <timestamp> <tz> <message>

# Reflog is local only - not shared between repositories
# By default, entries expire after 90 days (30 days if unreachable from the current tip)
git config --get gc.reflogExpire             # Prints nothing when the 90-day default applies
git config --get gc.reflogExpireUnreachable  # Prints nothing when the 30-day default applies
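
Each reflog line can be split with plain string operations, because a tab separates the metadata from the message. The sample line and identity below are invented; `parse_reflog_line` is an illustrative helper, not part of Git.

```python
def parse_reflog_line(line):
    """Split one line of .git/logs/... into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    # The name may contain spaces, so peel timestamp and zone off the right.
    identity, timestamp, tz = rest.rsplit(" ", 2)
    name, _, email = identity.rpartition(" <")
    return {
        "old": old,
        "new": new,
        "name": name,
        "email": email.rstrip(">"),
        "timestamp": int(timestamp),
        "tz": tz,
        "message": message,
    }


sample = (
    "0000000000000000000000000000000000000000 "
    "1a2b3c4d5e6f7890abcdef1234567890abcdef12 "
    "A U Thor <author@example.com> 1700000000 +0000"
    "\tcommit (initial): First commit\n"
)
entry = parse_reflog_line(sample)
print(entry["name"], entry["email"], entry["message"])
# A U Thor author@example.com commit (initial): First commit
```

An all-zero old hash, as in the sample, marks the creation of the reference.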

Pack Files and Garbage Collection

Why Pack Files?

Git optimizes storage using pack files that contain multiple objects with delta compression:

# Initially, each object is stored separately
git count-objects -v
# count: number of loose objects
# size: disk space used by loose objects  
# in-pack: objects stored in pack files
# packs: number of pack files

# Force garbage collection to create pack files
git gc
git count-objects -v
# Notice how loose objects become packed

Pack File Structure

# Examine pack files
ls -la .git/objects/pack/
# pack-<SHA-1>.idx    # Index file for quick lookup
# pack-<SHA-1>.pack   # Actual packed objects

# Pack info
git verify-pack -v .git/objects/pack/pack-*.idx
# Shows: SHA-1 type size packed-size offset depth base-SHA-1

# Delta compression chains
# Objects can be stored as deltas against base objects
# Chain depth is capped (pack.depth, default 50) to keep reads fast
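
A `.pack` file starts with a 12-byte header, and every object inside begins with a variable-length type-and-size field. The sketch below decodes both from hand-built bytes; it does not decompress object data or read `.idx` files, and the function names are illustrative.

```python
import struct

TYPE_NAMES = {1: "commit", 2: "tree", 3: "blob", 4: "tag",
              6: "ofs-delta", 7: "ref-delta"}


def parse_pack_header(data):
    """Return (version, object_count) from the first 12 bytes of a pack."""
    signature, version, count = struct.unpack_from(">4sII", data, 0)
    if signature != b"PACK":
        raise ValueError("not a pack file")
    return version, count


def parse_object_header(data, pos):
    """Decode the type/size field at pos; return (type, size, next_pos)."""
    byte = data[pos]
    pos += 1
    obj_type = (byte >> 4) & 0x07   # bits 6-4: object type
    size = byte & 0x0F              # bits 3-0: lowest 4 bits of the size
    shift = 4
    while byte & 0x80:              # high bit set: another size byte follows
        byte = data[pos]
        pos += 1
        size |= (byte & 0x7F) << shift
        shift += 7
    return TYPE_NAMES[obj_type], size, pos


# Pack header for 1 object, then the header of a 300-byte blob.
sample = struct.pack(">4sII", b"PACK", 2, 1) + bytes([0xBC, 0x12])
print(parse_pack_header(sample))
# (2, 1)
print(parse_object_header(sample, 12))
# ('blob', 300, 14)
```

The size recorded here is the uncompressed size, which is the `size` column that `git verify-pack -v` prints.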

Garbage Collection Process

# What git gc does:
# 1. Pack loose objects into pack files
# 2. Remove redundant pack files  
# 3. Remove unreachable objects (after grace period)
# 4. Update pack file indexes
# 5. Prune reflog entries
# 6. Re-pack for optimal delta compression

# Manual garbage collection
git gc --aggressive --prune=now
# --aggressive: more thorough but slower
# --prune=now: remove unreachable objects immediately

# Auto garbage collection triggers (both print nothing when unset)
git config --get gc.auto              # Loose-object threshold (default 6700)
git config --get gc.autoPackLimit     # Pack-count threshold (default 50)

Pack File Algorithms

# Git uses delta compression for similar objects
# Example: Two versions of the same file

# Version 1
echo "Line 1
Line 2
Line 3" > version1.txt
git add version1.txt && git commit -m "Version 1"

# Version 2 (small change)
echo "Line 1 - MODIFIED
Line 2
Line 3" > version1.txt
git add version1.txt && git commit -m "Version 2"

# After packing, version 2 might be stored as:
# - Base: version 1 content
# - Delta: instructions to transform base to version 2
# This is much smaller than storing both complete files

Git's Graph Structure

Commit Graph Visualization

# Git history is a directed acyclic graph (DAG)
git log --graph --oneline --all
# Shows branch structure visually

# Each commit points to its parent(s)
# Merge commits have multiple parents
# Initial commits have no parents

# Navigate the graph
git rev-list --parents HEAD  # Show commit and parent hashes
git show-branch --all        # Show branch relationships

Graph Traversal

# Git uses graph traversal for many operations

# Find merge base (common ancestor)
git merge-base branch1 branch2

# List all commits reachable from HEAD
git rev-list HEAD

# List commits in topological order
git rev-list --topo-order HEAD

# Find commits that touch specific paths
git rev-list HEAD -- path/to/file

# Two-dot vs three-dot notation
git rev-list branch1..branch2    # Commits in branch2 but not branch1
git rev-list branch1...branch2   # Commits in either branch but not both
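
On a toy graph, the range notations reduce to set operations over reachable commits. This is a conceptual sketch with single-letter commit names; real Git walks the graph lazily and uses commit dates and generation numbers to avoid visiting everything.

```python
# Toy history: A <- B <- C (main), and B <- D <- E (feature)
PARENTS = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}


def reachable(tip, parents=PARENTS):
    """All commits reachable from tip by following parent links."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def two_dot(a, b):
    """a..b: commits reachable from b but not from a."""
    return reachable(b) - reachable(a)


def three_dot(a, b):
    """a...b: commits reachable from exactly one of a and b."""
    return reachable(a) ^ reachable(b)


def merge_bases(a, b):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = reachable(a) & reachable(b)
    return {c for c in common
            if not any(c in reachable(d) - {d} for d in common)}


print(sorted(two_dot("C", "E")))      # ['D', 'E']
print(sorted(three_dot("C", "E")))    # ['C', 'D', 'E']
print(sorted(merge_bases("C", "E")))  # ['B']
```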

Reachability and Ancestry

# Understanding commit relationships
git rev-parse HEAD~1     # Parent of HEAD
git rev-parse HEAD^1     # First parent of HEAD (same as HEAD~1)
git rev-parse HEAD^2     # Second parent of HEAD (merge commits only)
git rev-parse HEAD~2     # Grandparent of HEAD

# Complex navigation (requires HEAD~2 to be a merge commit)
git rev-parse HEAD~2^2~1  # First parent of the second parent of HEAD's grandparent

# Check if commit is ancestor
git merge-base --is-ancestor commit1 commit2
echo $?  # 0 if commit1 is ancestor of commit2, 1 otherwise
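
The `~` and `^` operators are applied left to right, one step at a time. A small interpreter over a toy graph makes the reading order explicit; commit names are single letters and the parser accepts only this simplified syntax.

```python
import re

# Toy history: M merges C (first parent) and E (second parent).
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["D"],
    "M": ["C", "E"],
}


def resolve(rev, parents=PARENTS):
    """Resolve expressions such as 'M~2' or 'M^2~1' to a commit name."""
    match = re.fullmatch(r"([A-Za-z]+)((?:[~^]\d*)*)", rev)
    if not match or match.group(1) not in parents:
        raise ValueError("cannot parse " + rev)
    commit = match.group(1)
    for op, digits in re.findall(r"([~^])(\d*)", match.group(2)):
        n = int(digits) if digits else 1
        if op == "~":                 # ~n: follow the first parent n times
            for _ in range(n):
                if not parents[commit]:
                    raise ValueError(rev + " walks past the root commit")
                commit = parents[commit][0]
        elif n > 0:                   # ^n: pick the n-th parent (^0 is the commit itself)
            if n > len(parents[commit]):
                raise ValueError(commit + " has no parent number " + str(n))
            commit = parents[commit][n - 1]
    return commit


def is_ancestor(a, b, parents=PARENTS):
    """True if a is reachable from b (a commit counts as its own ancestor)."""
    seen, stack = set(), [b]
    while stack:
        commit = stack.pop()
        if commit == a:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return False


print(resolve("M~1"), resolve("M^2"), resolve("M^2~1"), resolve("M~2"))
# C E D B
print(is_ancestor("D", "M"), is_ancestor("C", "E"))
# True False
```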

Low-Level Git Commands (Plumbing)

Reading Objects

# Object examination (plumbing commands)
git cat-file -t <hash>      # Object type
git cat-file -s <hash>      # Object size  
git cat-file -p <hash>      # Object content (pretty print)
git cat-file -e <hash>      # Check if object exists (exit code)

# List tree contents
git ls-tree <tree-hash>     # List tree entries
git ls-tree -r <tree-hash>  # Recursive listing
git ls-tree -r -t <tree-hash>  # Recursive listing that also shows the tree entries

Creating Objects

# Create objects manually
echo "content" | git hash-object --stdin -w     # Create blob
git mktree                                      # Create tree from stdin
git commit-tree <tree> -p <parent> -m "msg"   # Create commit

# Example: Create commit manually
tree_hash=$(git write-tree)
parent_hash=$(git rev-parse HEAD)
commit_hash=$(echo "Manual commit" | git commit-tree $tree_hash -p $parent_hash)
git update-ref refs/heads/manual-branch $commit_hash

Index Manipulation

# Low-level index operations
git ls-files --stage                           # Show index contents
git update-index --add file                    # Add file to index
git update-index --remove file                 # Remove from index if the file is gone from the working tree
git update-index --refresh                     # Refresh stat info

# Index and trees
git write-tree                                 # Create tree from index
git read-tree <tree-hash>                      # Load tree into index
git checkout-index -a                          # Checkout all files from index

Reference Management

# Reference operations
git update-ref refs/heads/branch <hash>        # Update branch reference
git symbolic-ref HEAD refs/heads/branch        # Update symbolic reference
git for-each-ref                               # List all references
git show-ref                                   # Show reference values

# Reference logs
git reflog show HEAD                           # Show HEAD reflog
git reflog expire --expire=30.days HEAD       # Expire old reflog entries

Advanced Internal Concepts

Content-Addressable Storage Benefits

# Same content = same hash = storage efficiency
echo "Duplicate content" | git hash-object --stdin -w
echo "Duplicate content" | git hash-object --stdin -w
# Both commands return the same hash - content stored only once

# Integrity checking
git fsck                    # Check object integrity and connectivity
git fsck --full            # Also checks packs and alternates (already the default in modern Git)
git fsck --connectivity-only  # Check reachability only, skipping blob content checks

Delta Compression Details

# Pack files use delta compression
git verify-pack -v .git/objects/pack/pack-*.idx | head -10
# Look for entries with base-SHA-1 (delta objects)

# Delta compression is smart:
# - Recent versions often stored as base
# - Older versions stored as deltas
# - Deltas can chain, but with limits to prevent performance issues

Object Naming and Collision

# SHA-1 collision handling
# Accidental collisions are practically impossible, but a deliberately
# crafted collision was demonstrated in 2017 (SHAttered)
# Since 2.13, Git's default SHA-1 implementation detects such crafted inputs
# SHA-256 repositories are available as an opt-in object format (Git 2.29+)

# Partial hash resolution
git rev-parse 1a2b        # Expands to full hash if unambiguous
git rev-parse --short HEAD  # Generate short hash

# Accidental hash collision probability
# With SHA-1: 2^160 possible hashes
# Need ~2^80 random objects for a roughly 50% collision probability (birthday bound)
# Practically impossible by accident for any real repository
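
The birthday estimate above can be checked numerically. This sketch uses the standard approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)) and assumes hashes behave like uniformly random values, which says nothing about deliberate attacks.

```python
import math


def collision_probability(num_objects, hash_bits=160):
    """Approximate chance that at least two of num_objects random hashes collide."""
    exponent = -num_objects * (num_objects - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# A repository with a billion objects.
print(collision_probability(10 ** 9))

# At 2^80 objects the probability is about 39%, close to the 50% rule of thumb.
print(round(collision_probability(2 ** 80), 3))
# 0.393
```

Reaching exactly 50% takes about 1.18 × 2^80 objects, so "~2^80" is the right order of magnitude.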

Performance Implications

Repository Size Optimization

# Monitor repository size
git count-objects -v -H    # Human-readable sizes
du -sh .git                # Total .git directory size

# Identify large objects
git rev-list --objects --all |
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
sed -n 's/^blob //p' |
sort -k2 -n |
tail -10
# Shows the 10 largest blobs as: <hash> <size> <path>

# Remove large files from history (dangerous!)
# git filter-branch --tree-filter 'rm -f large-file' HEAD
# Better: use git filter-repo or BFG Repo-Cleaner

Index Performance

# Index performance factors:
# - File count (index size grows linearly)
# - Path depth (affects sorting/lookup)
# - Filesystem speed (one lstat() call per tracked file)

# Optimize index operations
git config core.preloadindex true      # Parallel index loading
git config core.fscache true          # File system cache on Windows
git config status.showUntrackedFiles no # Disable untracked file scanning

Pack File Performance

# Pack file optimization
git config pack.deltaCacheSize 256m    # Delta cache size
git config pack.windowMemory 512m      # Window memory for packing
git config pack.threads 0             # Use all CPU cores

# Repack for optimal performance
git repack -ad --depth=50 --window=250
# -a: pack all objects
# -d: remove redundant packs  
# --depth: maximum delta chain depth
# --window: consider N objects for delta compression

Debugging Git Issues

Object Database Corruption

# Check repository integrity
git fsck --full --strict
# Reports: missing objects, corruption, dangling commits, etc.

# Recover from corruption
git fsck --lost-found     # Place unreachable objects in .git/lost-found/

# Find dangling commits
git fsck --no-reflog | grep "dangling commit"
git show <dangling-commit-hash>  # Examine recovered commits

Reference Issues

# Fix broken HEAD
git symbolic-ref HEAD refs/heads/main

# Recover deleted branches (if reflog exists)
git reflog show --all | grep branch-name
git checkout -b recovered-branch <hash-from-reflog>

# Find unreferenced commits
git fsck --unreachable | grep commit
git log --oneline <unreachable-commit>

Performance Debugging

# Debug slow Git operations
GIT_TRACE=1 git status              # Trace Git execution
GIT_TRACE_PERFORMANCE=1 git log     # Performance timing
GIT_TRACE_PACK_ACCESS=1 git log     # Pack file access

# Profile specific operations
time git status
time git log --oneline

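# Check pack file efficiency
# (column 4 is the size inside the pack; summary lines are filtered out)
git verify-pack -v .git/objects/pack/pack-*.idx |
awk '$2 ~ /^(commit|tree|blob|tag)$/ {sum += $4} END {print "Total packed size:", sum}'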

Git Configuration Internals

Configuration Hierarchy

# Configuration precedence (highest to lowest):
# 1. Command line: -c key=value
# 2. Environment variables: GIT_CONFIG_COUNT, GIT_CONFIG_KEY_<n>, GIT_CONFIG_VALUE_<n>
# 3. Repository: .git/config
# 4. User: ~/.gitconfig or ~/.config/git/config
# 5. System: /etc/gitconfig

# View effective configuration
git config --list --show-origin
# Shows where each config value is defined
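
Precedence can be pictured as layers applied from lowest to highest, with later layers overwriting earlier ones. The sketch below is a deliberate simplification: it treats every key as single-valued and the sample values are invented, whereas Git also supports multi-valued keys and includes.

```python
def effective_config(layers):
    """Merge (origin, settings) layers given from lowest to highest precedence.

    Returns {key: (value, origin)}, similar in spirit to --show-origin.
    """
    merged = {}
    for origin, settings in layers:
        for key, value in settings.items():
            merged[key] = (value, origin)  # a later layer wins
    return merged


layers = [
    ("system", {"core.autocrlf": "false", "user.name": "System Default"}),
    ("user", {"user.name": "A U Thor", "user.email": "author@example.com"}),
    ("repository", {"user.email": "work@example.com"}),
    ("environment", {}),
    ("command line", {"core.autocrlf": "input"}),
]
for key, (value, origin) in sorted(effective_config(layers).items()):
    print(key, "=", value, "(from " + origin + ")")
# core.autocrlf = input (from command line)
# user.email = work@example.com (from repository)
# user.name = A U Thor (from user)
```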

# Configuration sections
git config --list | grep -E '^(core|user|remote)\.'

Environment Variables

# Key Git environment variables (usually unset, in which case Git uses its defaults)
echo $GIT_DIR              # .git directory location
echo $GIT_WORK_TREE        # Working tree location
echo $GIT_INDEX_FILE       # Index file location
echo $GIT_OBJECT_DIRECTORY # Objects directory

# Temporary overrides
GIT_AUTHOR_NAME="Temp Name" GIT_AUTHOR_EMAIL="temp@example.com" git commit -m "Temp commit"
GIT_COMMITTER_DATE="2023-01-01 12:00:00" git commit --amend --no-edit

Building Custom Git Tools

Using Git Plumbing Commands

#!/bin/bash
# Custom tool: Find largest files in repository history

echo "Finding largest files in Git history..."

# Print "<size> <path>" for every blob, largest first.
# numfmt is part of GNU coreutils.
git rev-list --objects --all |
git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
sed -n 's/^blob //p' |
sort -nr |
head -10 |
while read -r size file; do
    echo "$(numfmt --to=iec-i --suffix=B "$size") $file"
done

Repository Analysis Script

#!/bin/bash
# Repository analysis tool

echo "=== Git Repository Analysis ==="
echo

echo "Repository size:"
du -sh .git
echo

echo "Object counts:"
git count-objects -v
echo

echo "Largest pack files:"
ls -lhS .git/objects/pack/*.pack 2>/dev/null || echo "No pack files"
echo

echo "Branch information:"
git for-each-ref --format='%(refname:short) %(objectname:short) %(contents:subject)' refs/heads/
echo

echo "Recent activity:"
git reflog --oneline -10

Next Steps

🔬 Congratulations! You now understand Git's internal architecture.

Continue exploring:

  1. Performance Optimization - Optimize Git for large repositories
  2. Git Security - Implement security best practices
  3. Git Hooks - Automate workflows with hooks

Practical Applications

  • Repository forensics - Investigate corruption and recover lost data
  • Custom tooling - Build specialized Git tools using plumbing commands
  • Performance tuning - Optimize Git for specific use cases
  • Migration scripts - Convert between version control systems
  • Advanced workflows - Design complex branching strategies

Advanced Topics to Explore

  • Implement custom merge drivers
  • Create specialized clean/smudge filters
  • Build repository migration tools
  • Develop Git protocol extensions
  • Design custom storage backends
  • Optimize for monorepo workflows