GitSnap: A lightweight tool for creating Git repositories, committing, and pushing to GitHub.
Wrangling Your Code with Python and Git – No Lasso Required!
Table of contents
- Initializing a Repo
- Hashing Objects
- Bonus Functions
- The Git Index
- Checking the Status
- Writing the Index
- Committing Changes
- The Commit Process
- Talking to a Git Server: The "Smart Protocol"
- The Smart Protocol
- The pkt-line Format
- Functions for Handling pkt-lines
- Making an HTTPS Request (without requests)
- Checking the Remote's Commit Hash
- Bringing it All Together
- Determining Missing Objects and Pushing to the Server
- Finding Missing Objects
- Functions for Finding Tree and Commit Objects:
- Determining Missing Objects
- Creating the Pack File
- The Push Process
- Command Line Parsing and Usage
- Using argparse for Command-Line Parsing
- Example Usage of gitsnap
- Let's Wrap Up: Your New Git Sidekick!
December 17th 2024
Summary: In this article, I share how I wrote 500 lines of Python code to create a simple Git client. It can set up a repository, add files, commit changes, and even push itself to GitHub. I’ll walk you through the process and explain the code behind this fun project.
Git is famous for its simple object model, and for good reason. When I first started learning Git, I was surprised to find that the local object database is just a bunch of regular files in the .git folder. Aside from the index (.git/index) and pack files (which are optional), the setup and format are pretty straightforward. Inspired by Mary Rose Cook’s similar project, I decided to see if I could build enough of Git to create a repository, commit changes, and even push to a real server (GitHub, in this case).
While Mary’s gitlet program focuses more on teaching, mine goes a step further and pushes itself to GitHub, which is definitely a hacky bonus. Her version handles more Git features like merging, but it uses a simpler text-based index instead of Git’s binary format. Plus, her gitlet can only push to a local repository, not to a remote server like GitHub.
My goal was to create something that could handle all the steps, including pushing to a real Git server, and use the same binary index format that Git uses. This way, I could check my progress with regular Git commands along the way. I called my version GitSnap, written in Python (3.11+), and it uses only standard library modules. It’s just over 500 lines of code (including blank lines and comments). It covers the basics like init, add, commit, and push, but also includes commands like status, diff, cat-file, ls-files, and hash-object. These extra commands were not only useful on their own but also helped me debug GitSnap.
Let’s jump into the code! You can check out the full GitSnap.py on GitHub or follow along as I break down key parts of it.
Initializing a Repo
Initializing a local Git repository is pretty simple: it's all about creating the .git directory and a few other files and folders inside it. Once we’ve got that set up, we can start adding the necessary bits. Here's the init() function, which relies on a small write_file() helper (part of the full source) to write files:
def init(repo):
    """Create the repo directory and set up the .git directory."""
    os.mkdir(repo)  # Create the main repo folder
    os.mkdir(os.path.join(repo, '.git'))  # Create the .git directory
    for name in ['objects', 'refs', 'refs/heads']:  # Create subdirectories inside .git
        os.mkdir(os.path.join(repo, '.git', name))
    write_file(os.path.join(repo, '.git', 'HEAD'),
               b'ref: refs/heads/master')  # Point HEAD to the master branch
    print('Initialized empty repository: {}'.format(repo))  # Let’s celebrate!
One thing to note: this is a simple, no-frills approach. There’s no fancy error handling because, let’s face it, this is a 500-line project. So, if the repo directory already exists, init() will just fail with a traceback. The repository is open for pull requests if you'd like to improve on this. But hey, it's a start, and it’s good enough for the basics!
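The init() function above leans on a write_file() helper that isn't shown in this article. Here is a minimal sketch of what such helpers might look like, assuming they simply read and write whole files in binary mode (the versions in the GitSnap source may differ):

```python
import os
import tempfile


def read_file(path):
    """Read the contents of the file at the given path as bytes."""
    with open(path, 'rb') as f:
        return f.read()


def write_file(path, data):
    """Write data bytes to the file at the given path."""
    with open(path, 'wb') as f:
        f.write(data)


# Usage: round-trip a HEAD-style file inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    head_path = os.path.join(tmp, 'HEAD')
    write_file(head_path, b'ref: refs/heads/master')
    head = read_file(head_path)
```

Working in bytes throughout keeps things simple, since Git objects and the index are binary formats anyway.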
Hashing Objects
The hash_object function takes care of hashing and saving a single object to the .git/objects "database." In Git, there are three main types of objects: blobs (regular files), commits, and trees (representing the state of a directory).
Each object is made up of a small header that includes the type and size of the object, followed by a NUL byte and the actual data. This whole thing is then zlib-compressed and saved to .git/objects/ab/cd..., where the first two characters of the 40-character SHA-1 hash are used as a folder name (ab), and the rest (cd...) is the filename.
Here’s the code that handles this process, and as usual, we stick to Python’s standard library for everything (shoutout to os, hashlib, and zlib):
def hash_object(data, obj_type, write=True):
    """Compute the hash of object data of the given type and write to the
    object store if "write" is True. Return the SHA-1 object hash as a hex string.
    """
    # Create the object header with type and size
    header = '{} {}'.format(obj_type, len(data)).encode()
    # Combine the header, a NUL byte, and the data
    full_data = header + b'\x00' + data
    # Hash the combined data with SHA-1
    sha1 = hashlib.sha1(full_data).hexdigest()
    # If write is True, save the compressed object to the .git/objects directory
    if write:
        path = os.path.join('.git', 'objects', sha1[:2], sha1[2:])
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            write_file(path, zlib.compress(full_data))  # Write the compressed object
    return sha1  # Return the SHA-1 hash as a hex string
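A quick way to convince yourself the header format is right is to hash a couple of small blobs and compare against real Git. The sketch below uses a hypothetical blob_hash() helper (not part of GitSnap) that only computes the hash and writes nothing:

```python
import hashlib


def blob_hash(data):
    """Return the SHA-1 hex digest Git would assign to a blob with this data."""
    header = 'blob {}'.format(len(data)).encode()
    return hashlib.sha1(header + b'\x00' + data).hexdigest()


empty_hash = blob_hash(b'')          # hash of the empty blob
hello_hash = blob_hash(b'hello\n')   # same input as: echo hello | git hash-object --stdin
```

The empty blob should come out as e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is the value git hash-object reports for an empty file.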
Bonus Functions
There are a few other helpful functions here:
- find_object(): Looks up an object by its hash (or hash prefix).
- read_object(): Reads an object and its type, kind of the reverse of hash_object().
- cat_file(): This is the GitSnap version of git cat-file, which pretty-prints an object’s contents (or its size or type) to the terminal.
With all this in place, you’ve got the ability to store, retrieve, and inspect Git objects, just like Git does under the hood!
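Since read_object() isn't listed here, the following is a rough sketch of how reading could work as the reverse of hashing. It is an illustration under assumptions: the objects_dir parameter and the store_object() helper are hypothetical (GitSnap works relative to .git/objects and accepts hash prefixes via find_object()), and only full hashes are handled.

```python
import hashlib
import os
import tempfile
import zlib


def store_object(objects_dir, data, obj_type):
    """Hash and store an object under objects_dir; return its SHA-1 hex string."""
    full_data = '{} {}'.format(obj_type, len(data)).encode() + b'\x00' + data
    sha1 = hashlib.sha1(full_data).hexdigest()
    path = os.path.join(objects_dir, sha1[:2], sha1[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'wb') as f:
        f.write(zlib.compress(full_data))
    return sha1


def read_object(objects_dir, sha1):
    """Read the object with the given full SHA-1 and return (obj_type, data)."""
    path = os.path.join(objects_dir, sha1[:2], sha1[2:])
    with open(path, 'rb') as f:
        full_data = zlib.decompress(f.read())
    # The header ("type size") ends at the first NUL byte
    nul_index = full_data.index(b'\x00')
    obj_type, size_str = full_data[:nul_index].decode().split()
    data = full_data[nul_index + 1:]
    assert int(size_str) == len(data), 'size in header does not match data'
    return obj_type, data


with tempfile.TemporaryDirectory() as objects_dir:
    sha1 = store_object(objects_dir, b'hello\n', 'blob')
    obj_type, data = read_object(objects_dir, sha1)
```

Checking the size recorded in the header against the actual data length is a cheap sanity check against truncated or corrupted object files.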
The Git Index
Next up, let's talk about the Git index, or as it's often called, the staging area. When you add files to the index, you’re basically preparing them for a commit. The index is a list of file entries, ordered by their path, and each entry includes information like the file's path, modification time, SHA-1 hash, and more. It’s important to note that the index lists all tracked files in the current tree, not just the ones you’re currently staging for a commit.
The actual index file is stored at .git/index, and it uses a custom binary format. While it’s not super complicated, it does involve some work with the struct module and dealing with variable-length path fields. Here’s the basic breakdown:
- Header: The first 12 bytes (signature, version, and number of entries)
- Entries: Each entry is 62 bytes, plus the length of the path, plus NUL padding up to a multiple of 8 bytes
- Footer: The last 20 bytes are a SHA-1 hash of the rest of the index, for integrity checking
Here’s how we can define an IndexEntry and a function to read the index:
# Data for one entry in the git index (.git/index)
IndexEntry = collections.namedtuple('IndexEntry', [
    'ctime_s', 'ctime_n', 'mtime_s', 'mtime_n', 'dev', 'ino', 'mode',
    'uid', 'gid', 'size', 'sha1', 'flags', 'path',
])
def read_index():
    """Read the git index file and return a list of IndexEntry objects."""
    try:
        data = read_file(os.path.join('.git', 'index'))
    except FileNotFoundError:
        return []
    digest = hashlib.sha1(data[:-20]).digest()
    assert digest == data[-20:], 'Invalid index checksum'
    # Read the signature, version, and number of entries
    signature, version, num_entries = struct.unpack('!4sLL', data[:12])
    assert signature == b'DIRC', f'Invalid index signature {signature}'
    assert version == 2, f'Unknown index version {version}'
    entry_data = data[12:-20]
    entries = []
    i = 0
    while i + 62 < len(entry_data):
        fields_end = i + 62
        fields = struct.unpack('!LLLLLLLLLL20sH', entry_data[i:fields_end])
        path_end = entry_data.index(b'\x00', fields_end)
        path = entry_data[fields_end:path_end]
        entry = IndexEntry(*(fields + (path.decode(),)))
        entries.append(entry)
        # Entries are NUL-padded to a multiple of 8 bytes
        entry_len = ((62 + len(path) + 8) // 8) * 8
        i += entry_len
    assert len(entries) == num_entries
    return entries
Checking the Status
Once we've got the index in place, we can start tracking the status of files with commands like ls_files, status, and diff, each of which shows the state of the files in the index in a different way.
- ls_files(): Lists all files in the index, showing their mode and hash if the -s option is used.
- status(): Uses get_status() to compare the files in the index to those in the working directory, and prints out which files are new, modified, or deleted.
- diff(): Shows the difference between the files in the index and the working directory (using Python’s difflib module).
Real Git handles these operations more efficiently than this, for example by consulting file modification times before hashing anything. But in GitSnap, I take a more basic approach: I do a full directory listing with os.walk(), compare paths using set operations, and then check if the hashes match. For example, this set comprehension finds changed files:
changed = {p for p in (paths & entry_paths)
           if hash_object(read_file(p), 'blob', write=False) !=
              entries_by_path[p].sha1.hex()}
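To make the set operations concrete, here is a small self-contained sketch. The compare_status() function and the plain dictionaries are hypothetical stand-ins for get_status(), the working directory, and the index; the hash strings are placeholders rather than real SHA-1 values.

```python
def compare_status(working_hashes, index_hashes):
    """Return (changed, new, deleted) sets of paths.

    Both arguments map path -> hash string.
    """
    paths = set(working_hashes)
    entry_paths = set(index_hashes)
    # In both places, but the content hash differs
    changed = {p for p in (paths & entry_paths)
               if working_hashes[p] != index_hashes[p]}
    new = paths - entry_paths        # on disk, not in the index
    deleted = entry_paths - paths    # in the index, gone from disk
    return changed, new, deleted


working = {'README.md': 'aaa', 'gitsnap.py': 'bbb', 'notes.txt': 'ccc'}
index = {'README.md': 'aaa', 'gitsnap.py': 'bb0', 'LICENSE.txt': 'ddd'}
changed, new, deleted = compare_status(working, index)
```

Here gitsnap.py ends up in changed, notes.txt in new, and LICENSE.txt in deleted.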
Writing the Index
Once we’ve checked the status and figured out which files have changed, we need to update the index. The write_index() function does exactly that, and the add() function is used to add new files to the index. It works by reading the current index, adding the new paths, sorting the entries, and then writing it all back.
Now we’re ready to commit! With the files added to the index, we can move on to the next step in the Git process.
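The write_index() code isn't reproduced in this article, so here is a hedged sketch of the serialization step, mirroring the layout that read_index() expects. The pack_index() name is hypothetical, and unlike the real function it returns bytes instead of writing .git/index; the sample entry uses zeroed stat fields purely for illustration.

```python
import collections
import hashlib
import struct

IndexEntry = collections.namedtuple('IndexEntry', [
    'ctime_s', 'ctime_n', 'mtime_s', 'mtime_n', 'dev', 'ino', 'mode',
    'uid', 'gid', 'size', 'sha1', 'flags', 'path',
])


def pack_index(entries):
    """Serialize IndexEntry objects into version-2 index file bytes."""
    packed_entries = []
    for entry in sorted(entries, key=lambda e: e.path):
        # Fixed-size part: ten 32-bit fields, 20-byte SHA-1, 16-bit flags (62 bytes)
        head = struct.pack('!LLLLLLLLLL20sH', *entry[:12])
        path = entry.path.encode()
        # Pad with 1-8 NUL bytes so the entry length is a multiple of 8
        length = ((62 + len(path) + 8) // 8) * 8
        packed_entries.append(head + path + b'\x00' * (length - 62 - len(path)))
    header = struct.pack('!4sLL', b'DIRC', 2, len(entries))
    body = header + b''.join(packed_entries)
    # Footer: SHA-1 of everything before it
    return body + hashlib.sha1(body).digest()


entry = IndexEntry(0, 0, 0, 0, 0, 0, 0o100644, 0, 0, 6,
                   bytes.fromhex('ce013625030ba8dba906f756967f9e9ca394464a'),
                   len('hello.txt'), 'hello.txt')
index_data = pack_index([entry])
```

For this single entry the result is 104 bytes: a 12-byte header, one 72-byte entry (62 fixed bytes, a 9-byte path, and 1 byte of padding), and the 20-byte checksum.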
Committing Changes
When you make a commit in Git, you're essentially creating two objects:
- Tree Object: This is a snapshot of the directory (or really the index) at the time of the commit. It’s like taking a picture of all the files in your project at that point. The tree lists the hashes of the files (blobs) and sub-trees in the directory, and it’s recursive. That means if a file changes, the hash of the entire tree changes too. But if a file or sub-tree doesn’t change, it’s just referred to by the same hash, making it super efficient.
Here’s an example of how a tree object might look when pretty-printed with the cat-file command:
100644 blob 4aab5f560862b45d7a9f1370b1c163b74484a24d LICENSE.txt
100644 blob 43ab992ed09fa756c56ff162d5fe303003b5ae0f README.md
100644 blob c10cb8bc2c114aba5a1cb20dea4c1597e5a3c193 gitsnap.py
The function write_tree is used to create this tree object. Git stores the information as a mix of binary and text: the mode and path as text, followed by a NUL byte and then the binary SHA-1 hash. Here’s how write_tree() works in GitSnap:
def write_tree():
    """Write a tree object from the current index entries."""
    tree_entries = []
    for entry in read_index():
        assert '/' not in entry.path, 'Currently only supports a single, top-level directory'
        mode_path = '{:o} {}'.format(entry.mode, entry.path).encode()
        tree_entry = mode_path + b'\x00' + entry.sha1
        tree_entries.append(tree_entry)
    return hash_object(b''.join(tree_entries), 'tree')
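Going the other way, the mixed text/binary layout can be decoded with a short loop. This is a sketch of the idea only: parse_tree() is a hypothetical name, and it works on raw (already decompressed, header-stripped) tree data rather than reading from the object store.

```python
def parse_tree(data):
    """Parse raw tree object data into a list of (mode, path, sha1_hex) tuples."""
    entries = []
    i = 0
    while i < len(data):
        # "<octal mode> <path>" as text, terminated by a NUL byte
        nul = data.index(b'\x00', i)
        mode_str, path = data[i:nul].decode().split(' ', 1)
        # Followed by the 20-byte binary SHA-1
        digest = data[nul + 1:nul + 21]
        entries.append((int(mode_str, 8), path, digest.hex()))
        i = nul + 21
    return entries


sha1 = bytes.fromhex('43ab992ed09fa756c56ff162d5fe303003b5ae0f')
raw_tree = b'100644 README.md\x00' + sha1
entries = parse_tree(raw_tree)
```

Splitting on the first space only keeps paths that contain spaces intact.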
- Commit Object: This object records the tree hash, the parent commit, the author, timestamp, and the commit message. Git handles branching and merging like a pro, but GitSnap only supports a single branch (master), so there’s always just one parent commit — or no parents for the first commit.
Here’s an example of what a commit object looks like:
tree 22264ec0ce9da29d0c420e46627fa0cf057e709a
parent 03f882ade69ad898aba73664740641d909883cdc
author Ben Hoyt <benhoyt@gmail.com> 1493170892 -0500
committer Ben Hoyt <benhoyt@gmail.com> 1493170892 -0500

Fix cat-file size/type/pretty handling
Now let’s look at how we create a commit in GitSnap. We first generate the tree object, get the parent commit (if any), and then format the commit data. Here's the commit function:
def commit(message, author):
    """Commit the current state of the index to master with the given message.
    Returns the hash of the commit object.
    """
    tree = write_tree()
    parent = get_local_master_hash()
    timestamp = int(time.mktime(time.localtime()))
    utc_offset = -time.timezone
    author_time = '{} {}{:02}{:02}'.format(
            timestamp,
            '+' if utc_offset > 0 else '-',
            abs(utc_offset) // 3600,
            (abs(utc_offset) // 60) % 60)
    # Building the commit data
    lines = ['tree ' + tree]
    if parent:
        lines.append('parent ' + parent)
    lines.append('author {} {}'.format(author, author_time))
    lines.append('committer {} {}'.format(author, author_time))
    lines.append('')
    lines.append(message)
    lines.append('')
    # Creating the commit object
    data = '\n'.join(lines).encode()
    sha1 = hash_object(data, 'commit')
    # Writing the commit hash to the master branch
    master_path = os.path.join('.git', 'refs', 'heads', 'master')
    write_file(master_path, (sha1 + '\n').encode())
    print('Committed to master: {:7}'.format(sha1))
    return sha1
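The timezone arithmetic in commit() is easier to follow in isolation. Below is a small sketch of the same +HHMM / -HHMM formatting as a standalone function; format_utc_offset() is a hypothetical helper, and it deliberately treats an offset of zero as +0000, a small deviation from the code above.

```python
def format_utc_offset(offset_seconds):
    """Format a UTC offset in seconds as a +HHMM / -HHMM string."""
    sign = '+' if offset_seconds >= 0 else '-'
    hours, remainder = divmod(abs(offset_seconds), 3600)
    minutes = remainder // 60
    return '{}{:02}{:02}'.format(sign, hours, minutes)


us_eastern = format_utc_offset(-5 * 3600)      # five hours behind UTC
india = format_utc_offset(5 * 3600 + 1800)     # five and a half hours ahead
utc = format_utc_offset(0)
```

Note that time.timezone reflects the non-DST offset only, so a version that cares about daylight saving time would need to look at time.altzone as well.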
The Commit Process
Create the Tree: First, we create a tree object using the current index. It’s a snapshot of the directory.
Parent Commit: If there’s already a commit (i.e., it's not the first one), we record the parent commit.
Commit Data: We prepare the commit data, including author details, timestamp, and the commit message.
Generate the Commit Object: Using the tree and commit data, we hash it and create a commit object.
Update the Master Branch: Finally, we save the commit hash in the refs/heads/master file to track the commit.
Now that we’ve committed our changes, we can see the new state of our project in the master branch. With GitSnap, you’ve just made your first commit!
Talking to a Git Server: The "Smart Protocol"
In this section, we'll talk about how to make GitSnap interact with a real Git server (like GitHub, Bitbucket, etc.). When you push changes to a remote repository, you need to figure out how to synchronize your local commits with the remote ones. The challenge here is using the smart protocol to transfer the necessary commit data and objects.
The Smart Protocol
Git's HTTP transport comes in two flavors. The older "dumb" protocol is simpler but inefficient, and GitHub stopped supporting it in 2011. That leaves the (somewhat ironically named) "smart" protocol, which is more efficient and uses "pack files" to bundle up and transfer missing objects. While the smart protocol is more complex, it’s what GitHub and other servers use today.
The core idea is simple:
Check the remote's master branch to see what commit it’s on.
Identify the missing objects (commits, trees, blobs): the ones your local repository has that the remote doesn't.
Send a pack file that contains all the missing objects.
The pkt-line Format
At the heart of the smart protocol is the pkt-line format, which is used to send metadata, like commit hashes, between the client and the server. A "pkt-line" is a length-prefixed packet: the first 4 bytes give the packet length as four hexadecimal digits (a length that includes those 4 bytes themselves), followed by the actual data. A length of 0000 is a special "flush" packet that marks the end of a section.
Here’s an example of the response GitHub sends when you make a git-receive-pack request:
001f# service=git-receive-pack\n
0000
00b20000000000000000000000000000000000000000 capabilities^{}\x00
report-status delete-refs side-band-64k quiet atomic ofs-delta
agent=git/2.9.3~peff-merge-upstream-2-9-1788-gef730f7\n
0000
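To see where a prefix like 001f comes from, here is a minimal sketch that encodes and decodes a single pkt-line. The encode_pkt_line() and decode_pkt_line() names are hypothetical helpers for illustration, not GitSnap functions.

```python
def encode_pkt_line(payload):
    """Prefix payload bytes with a 4-hex-digit length that includes the prefix."""
    return '{:04x}'.format(len(payload) + 4).encode() + payload


def decode_pkt_line(data, offset=0):
    """Return (payload, next_offset) for the pkt-line starting at offset.

    A flush packet (0000) is returned as an empty payload.
    """
    length = int(data[offset:offset + 4], 16)
    if length == 0:
        return b'', offset + 4
    return data[offset + 4:offset + length], offset + length


packet = encode_pkt_line(b'# service=git-receive-pack\n')
payload, next_offset = decode_pkt_line(packet + b'0000')
```

The payload is 27 bytes, so the prefix is 27 + 4 = 31, or 001f in hex, matching the first line of the response above.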
Functions for Handling pkt-lines
To interact with the server using the smart protocol, we need two key functions:
Extracting pkt-line data: Convert the raw server data into individual lines.
Building pkt-line data: Convert a list of lines into the correct pkt-line format to send back to the server.
Here’s how you can implement both:
def extract_lines(data):
    """Extract list of lines from given server data."""
    lines = []
    i = 0
    for _ in range(1000):
        line_length = int(data[i:i + 4], 16)
        line = data[i + 4:i + line_length]
        lines.append(line)
        if line_length == 0:
            i += 4
        else:
            i += line_length
        if i >= len(data):
            break
    return lines
def build_lines_data(lines):
    """Build byte string from given lines to send to server."""
    result = []
    for line in lines:
        # Length covers the 4-byte prefix, the line, and the trailing newline
        result.append('{:04x}'.format(len(line) + 5).encode())
        result.append(line)
        result.append(b'\n')
    result.append(b'0000')
    return b''.join(result)
Making an HTTPS Request (without requests)
To interact with a remote Git server, you'll need to make authenticated HTTPS requests. While libraries like requests make this easier, you can also use Python’s built-in urllib library. Here's a simple function that makes an authenticated request (a GET by default, or a POST when data is supplied):
import urllib.request
def http_request(url, username, password, data=None):
    """Make an authenticated HTTP request to the given URL."""
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_manager.add_password(None, url, username, password)
    auth_handler = urllib.request.HTTPBasicAuthHandler(password_manager)
    opener = urllib.request.build_opener(auth_handler)
    f = opener.open(url, data=data)
    return f.read()
In contrast, the requests library shortens this to a few lines (only the GET case is shown here):
import requests
def http_request(url, username, password):
    response = requests.get(url, auth=(username, password))
    response.raise_for_status()
    return response.content
Checking the Remote's Commit Hash
To figure out what commit the remote repository's master branch is on, we can send a request to the server and use the info/refs endpoint. This gives us the commit hash of the remote master branch.
Here’s the code to retrieve the remote master branch's commit hash:
def get_remote_master_hash(git_url, username, password):
    """Get commit hash of remote master branch, return SHA-1 hex string."""
    url = git_url + '/info/refs?service=git-receive-pack'
    response = http_request(url, username, password)
    lines = extract_lines(response)
    assert lines[0] == b'# service=git-receive-pack\n'
    assert lines[1] == b''
    if lines[2][:40] == b'0' * 40:
        return None  # No commits on remote
    master_sha1, master_ref = lines[2].split(b'\x00')[0].split()
    assert master_ref == b'refs/heads/master'
    assert len(master_sha1) == 40
    return master_sha1.decode()
Bringing it All Together
Once you have the remote commit hash, you can compare it to your local commit. If the remote is behind, you can send a pack file containing the missing objects. This process is more complex than using the "dumb" protocol, but it’s much more efficient and scalable for modern Git hosting services like GitHub and Bitbucket.
By following these steps, you're using the smart protocol to talk to a Git server and sync your repository with the remote, ensuring that all missing commits and objects are transferred properly.
Determining Missing Objects and Pushing to the Server
In this section, we explore how to identify the missing objects that need to be pushed to a remote Git server and how to actually push those changes. We’ll break it down into the following steps:
Finding the Missing Objects: The first task is to identify which objects (commits, trees, blobs) are missing on the remote server compared to the local repository.
Creating the Pack File: After identifying the missing objects, we need to bundle them into a "pack file," which is a compressed file containing all the necessary objects.
Pushing the Changes: Finally, we send the updated commit hash and the pack file to the remote server via an HTTP request.
Finding Missing Objects
To determine which objects are missing, we first need to identify the objects referenced by the local commit and the remote commit. This is done by recursively traversing the commit tree and gathering the object hashes.
Functions for Finding Tree and Commit Objects:
find_tree_objects(tree_sha1): This function recursively finds all objects within a tree, including the hash of the tree itself.
def find_tree_objects(tree_sha1):
    """Return set of SHA-1 hashes of all objects in this tree (recursively),
    including the hash of the tree itself."""
    objects = {tree_sha1}
    for mode, path, sha1 in read_tree(sha1=tree_sha1):
        if stat.S_ISDIR(mode):
            objects.update(find_tree_objects(sha1))
        else:
            objects.add(sha1)
    return objects
find_commit_objects(commit_sha1): This function recursively collects all objects referenced by a commit, including the commit itself, its tree, and its parent commits.
def find_commit_objects(commit_sha1):
    """Return set of SHA-1 hashes of all objects in this commit (recursively),
    its tree, its parents, and the hash of the commit itself."""
    objects = {commit_sha1}
    obj_type, commit = read_object(commit_sha1)
    assert obj_type == 'commit'
    lines = commit.decode().splitlines()
    tree = next(l[5:45] for l in lines if l.startswith('tree '))
    objects.update(find_tree_objects(tree))
    parents = (l[7:47] for l in lines if l.startswith('parent '))
    for parent in parents:
        objects.update(find_commit_objects(parent))
    return objects
Determining Missing Objects
Once we have the set of objects for both the local and remote commits, we can determine the missing objects by taking the set difference between the two sets.
def find_missing_objects(local_sha1, remote_sha1):
    """Return set of SHA-1 hashes of objects in local commit that are missing at the remote."""
    local_objects = find_commit_objects(local_sha1)
    if remote_sha1 is None:
        return local_objects
    remote_objects = find_commit_objects(remote_sha1)
    return local_objects - remote_objects
Creating the Pack File
After determining the missing objects, the next step is to create a pack file. This file contains the missing objects and is sent to the server during the push process. A pack file has a specific structure, starting with a 12-byte header, followed by the objects themselves, encoded and compressed.
encode_pack_object(obj): This function encodes a single object to be included in the pack file. The object is compressed using zlib to reduce its size.
def encode_pack_object(obj):
    """Encode a single object for a pack file and return bytes
    (variable-length header followed by compressed data bytes)."""
    obj_type, data = read_object(obj)
    type_num = ObjectType[obj_type].value
    size = len(data)
    byte = (type_num << 4) | (size & 0x0f)
    size >>= 4
    header = []
    while size:
        header.append(byte | 0x80)
        byte = size & 0x7f
        size >>= 7
    header.append(byte)
    return bytes(header) + zlib.compress(data)
create_pack(objects): This function creates the full pack file by combining the header and the encoded objects.
def create_pack(objects):
    """Create pack file containing all objects in given set of SHA-1 hashes,
    return data bytes of full pack file."""
    header = struct.pack('!4sLL', b'PACK', 2, len(objects))
    body = b''.join(encode_pack_object(o) for o in sorted(objects))
    contents = header + body
    sha1 = hashlib.sha1(contents).digest()
    data = contents + sha1
    return data
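The variable-length header in encode_pack_object() is the least obvious part, so here is a sketch that isolates it and adds a matching decoder for round-trip checking. Both function names are hypothetical; the type numbers follow Git's pack format (commit is 1, tree is 2, blob is 3).

```python
def encode_object_header(type_num, size):
    """Encode a pack object's type and size as a variable-length header."""
    # First byte: 3-bit type and the low 4 bits of the size
    byte = (type_num << 4) | (size & 0x0f)
    size >>= 4
    header = []
    while size:
        header.append(byte | 0x80)  # High bit set: more size bytes follow
        byte = size & 0x7f          # Next 7 bits of the size
        size >>= 7
    header.append(byte)
    return bytes(header)


def decode_object_header(data):
    """Decode a header produced above; return (type_num, size, bytes_used)."""
    byte = data[0]
    type_num = (byte >> 4) & 0x07
    size = byte & 0x0f
    shift = 4
    i = 1
    while byte & 0x80:
        byte = data[i]
        size |= (byte & 0x7f) << shift
        shift += 7
        i += 1
    return type_num, size, i


BLOB = 3
small = encode_object_header(BLOB, 10)    # fits in a single byte
large = encode_object_header(BLOB, 300)   # needs a continuation byte
```

A 10-byte blob fits in the single byte 0x3a, while a 300-byte blob needs two bytes, 0xbc followed by 0x12. Small objects therefore cost only one byte of header.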
The Push Process
The final step is to push the changes to the remote repository. This involves sending a pkt-line request to update the master branch with the new commit hash, followed by the pack file containing the missing objects.
def push(git_url, username, password):
    """Push master branch to given git repo URL."""
    remote_sha1 = get_remote_master_hash(git_url, username, password)
    local_sha1 = get_local_master_hash()
    missing = find_missing_objects(local_sha1, remote_sha1)
    lines = ['{} {} refs/heads/master\x00 report-status'.format(
            remote_sha1 or ('0' * 40), local_sha1).encode()]
    data = build_lines_data(lines) + create_pack(missing)
    url = git_url + '/git-receive-pack'
    response = http_request(url, username, password, data=data)
    lines = extract_lines(response)
    assert lines[0] == b'unpack ok\n', \
        "expected line 1 b'unpack ok', got: {}".format(lines[0])
In the push function:
- The remote commit hash is retrieved using get_remote_master_hash().
- The local commit hash is obtained using get_local_master_hash().
- The missing objects are found using find_missing_objects().
- The pkt-line request is sent, along with the pack file containing the missing objects.
- The response is checked to ensure the push was successful (indicated by the line b'unpack ok\n').
Command Line Parsing and Usage
In this section, we look at how gitsnap uses Python’s argparse module to parse command-line arguments and execute various Git-like actions. The focus is on creating a user-friendly command line interface (CLI) with functionality similar to Git commands like git init, git commit, git status, and others.
Using argparse for Command-Line Parsing
argparse is a powerful module in Python's standard library that helps parse command-line arguments and provides a clean syntax for writing CLI tools. gitsnap uses it to let users perform Git-like operations through simple commands. I won’t copy the full parser here (take a look at the argparse code in the source), but the key idea is subcommands:
- Subcommands: Just like Git uses subcommands (git init, git commit, etc.), gitsnap defines multiple subcommands for initializing a repo, committing changes, checking status, etc.
Here's a simplified example of how argparse could be used for a gitsnap CLI:
import argparse
def init_repo(name):
    print(f"Initialized empty repository: {name}")
def commit_changes(message):
    print(f"Committed changes with message: {message}")
def status():
    print("Showing status...")
parser = argparse.ArgumentParser(prog="gitsnap")
# Require a subcommand so that args.func is always set
subparsers = parser.add_subparsers(dest="command", required=True)
# Subcommand for init
init_parser = subparsers.add_parser("init")
init_parser.add_argument("name")
init_parser.set_defaults(func=lambda args: init_repo(args.name))
# Subcommand for commit
commit_parser = subparsers.add_parser("commit")
commit_parser.add_argument("-m", "--message", required=True)
commit_parser.set_defaults(func=lambda args: commit_changes(args.message))
# Subcommand for status (status() takes no arguments, so wrap it)
status_parser = subparsers.add_parser("status")
status_parser.set_defaults(func=lambda args: status())
# Parse arguments and call corresponding function
args = parser.parse_args()
args.func(args)
Example Usage of gitsnap
Once gitsnap is set up, the command-line tool can be used just like Git for various operations:
- Initializing a repository: You can initialize a new repository by running the following command:
  $ python3 gitsnap.py init gitsnap
  This creates a new repository called gitsnap.
- Checking status: After making changes, you can check the status of your repository with:
  $ python3 gitsnap.py status
  This will show the current changes in the working directory.
- Adding files to the staging area: You can add files to be committed using the add subcommand:
  $ python3 gitsnap.py add gitsnap.py
- Committing changes: To commit changes, use the commit subcommand with a message:
  $ python3 gitsnap.py commit -m "First working version of gitsnap"
  This will commit the changes with the provided message.
- Viewing commit details: You can view commit details with the cat-file subcommand:
  $ python3 gitsnap.py cat-file commit 00d5
  This will output the commit details, including the tree hash, author, committer, and the commit message.
- Viewing changes (diff): To see what has changed between the index and the working copy, use the diff subcommand:
  $ python3 gitsnap.py diff
  This shows the differences between the staged changes and the working directory.
- Pushing changes to a remote: Finally, to push the committed changes to a remote repository:
  $ python3 gitsnap.py push https://github.com/yashrajtarte/gitsnap.git
  This will push the changes to the specified GitHub repository, including the missing objects.
Let's Wrap Up: Your New Git Sidekick!
gitsnap is like your trusty sidekick for Git! It’s a simple CLI tool that lets you interact with both local repos and remote Git servers, all while keeping that familiar Git vibe.
Thanks to argparse, gitsnap is super flexible—just add new commands whenever you need them, no hassle, no stress. It's as comfy as your favorite pair of jeans!
The best part? If you're already a Git pro, gitsnap will feel like second nature. It follows the same flow as Git, making repo management a total breeze.