GitSnap: A lightweight tool for creating Git repositories, committing, and pushing to GitHub.

Wrangling Your Code with Python and Git – No Lasso Required!

December 17th 2024

Summary: In this article, I share how I wrote 500 lines of Python code to create a simple Git client. It can set up a repository, add files, commit changes, and even push itself to GitHub. I’ll walk you through the process and explain the code behind this fun project.

Git is famous for its simple object model, and for good reason. When I first started learning Git, I was surprised to find that the local object database is just a bunch of regular files in the .git folder. Aside from the index (.git/index) and pack files (which are optional), the setup and format are pretty straightforward. Inspired by Mary Rose Cook’s similar project, I decided to see if I could build enough of Git to create a repository, commit changes, and even push to a real server—GitHub, in this case.

While Mary’s gitlet program focuses more on teaching, mine goes a step further and pushes itself to GitHub—definitely a hacky bonus. Her version handles more Git features like merging, but it uses a simpler text-based index instead of Git’s binary format. Plus, her gitlet can only push to a local repository, not to a remote server like GitHub.

My goal was to create something that could handle all the steps, including pushing to a real Git server, and use the same binary index format that Git uses. This way, I could check my progress with regular Git commands along the way. I called my version GitSnap, written in Python (3.11+), and it uses only standard library modules. It’s just over 500 lines of code (including blank lines and comments). It covers the basics like init, add, commit, and push, but also includes commands like status, diff, cat-file, ls-files, and hash-object. These extra commands were not only useful on their own but also helped me debug GitSnap.

Let’s jump into the code! You can check out the full gitsnap.py on GitHub or follow along as I break down key parts of it.

Initializing a Repo

Initializing a local Git repository is pretty simple—it's all about creating the .git directory and a few other files and folders inside it. Once we’ve got that set up, we can start adding the necessary bits. Here's how we can do it, starting with a couple of small helpers to read and write files, followed by the init() function:
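
The read_file() and write_file() helpers are nothing fancy; a minimal version might look like this:

def read_file(path):
    """Read contents of file at given path as bytes."""
    with open(path, 'rb') as f:
        return f.read()

def write_file(path, data):
    """Write data bytes to file at given path."""
    with open(path, 'wb') as f:
        f.write(data)

And here's init() itself: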

def init(repo):
    """Create the repo directory and set up the .git directory."""
    os.mkdir(repo)  # Create the main repo folder
    os.mkdir(os.path.join(repo, '.git'))  # Create the .git directory
    for name in ['objects', 'refs', 'refs/heads']:  # Create subdirectories inside .git
        os.mkdir(os.path.join(repo, '.git', name))
    write_file(os.path.join(repo, '.git', 'HEAD'), 
               b'ref: refs/heads/master')  # Point HEAD to the master branch
    print('Initialized empty repository: {}'.format(repo))  # Let’s celebrate!

Now, a couple of things to note: First, this is a simple, no-frills approach. There’s no fancy error handling, because, let’s face it, this is a 500-line project. So, if the repo directory already exists, it will simply fail with a traceback. The repo is open to pull requests, so feel free to make it more robust.

But hey, it's a start, and it’s good enough for the basics!

Hashing Objects

The hash_object function takes care of hashing and saving a single object to the .git/objects "database." In Git, there are three main types of objects: blobs (regular files), commits, and trees (representing the state of a directory).

Each object is made up of a small header that includes the type and size of the object, followed by a NUL byte and the actual data. This whole thing is then zlib-compressed and saved to .git/objects/ab/cd..., where the first two characters of the 40-character SHA-1 hash are used as a folder name (ab), and the rest (cd...) is the filename.

Here’s the code that handles this process, and as usual, we stick to Python’s standard library for everything (shoutout to os, hashlib, and zlib):

def hash_object(data, obj_type, write=True):
    """Compute the hash of object data of the given type and write to the object store if "write" is True. 
    Return the SHA-1 object hash as a hex string."""

    # Create the object header with type and size
    header = '{} {}'.format(obj_type, len(data)).encode()

    # Combine the header, a NUL byte, and the data
    full_data = header + b'\x00' + data

    # Hash the combined data with SHA-1
    sha1 = hashlib.sha1(full_data).hexdigest()

    # If write is True, save the compressed object to the .git/objects directory
    if write:
        path = os.path.join('.git', 'objects', sha1[:2], sha1[2:])
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            write_file(path, zlib.compress(full_data))  # Write the compressed object

    return sha1  # Return the SHA-1 hash as a hex string
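
To get a feel for it, here's a quick, hypothetical usage example; since GitSnap writes objects in the same zlib-compressed format real Git uses, you can cross-check the result with git cat-file:

# Run from inside a repo created with init():
sha1 = hash_object(b'hello gitsnap\n', 'blob')
print(sha1)  # 40-char hex string; the object now lives at .git/objects/<sha1[:2]>/<sha1[2:]>
# Verify with real Git: `git cat-file -p <sha1>` prints "hello gitsnap"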

Bonus Functions

There are a few other helpful functions here:

  • find_object(): Looks up an object by its hash (or hash prefix).

  • read_object(): Reads an object and its type—kind of the reverse of hash_object().

  • cat_file(): This is the GitSnap version of git cat-file, which pretty-prints an object’s contents (or its size or type) to the terminal.
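
For a sense of how the first two work, here's a rough sketch (not the exact code) of find_object() and read_object(); cat_file() is mostly pretty-printing layered on top of read_object():

def find_object(sha1_prefix):
    """Find object with given SHA-1 prefix and return the path to the object
    in the object store, or raise ValueError if not found or ambiguous."""
    if len(sha1_prefix) < 2:
        raise ValueError('hash prefix must be 2 or more characters')
    obj_dir = os.path.join('.git', 'objects', sha1_prefix[:2])
    rest = sha1_prefix[2:]
    objects = [name for name in os.listdir(obj_dir) if name.startswith(rest)]
    if not objects:
        raise ValueError('object {!r} not found'.format(sha1_prefix))
    if len(objects) >= 2:
        raise ValueError('multiple objects with prefix {!r}'.format(sha1_prefix))
    return os.path.join(obj_dir, objects[0])

def read_object(sha1_prefix):
    """Read object with given SHA-1 prefix and return tuple of
    (object_type, data_bytes) -- the reverse of hash_object()."""
    path = find_object(sha1_prefix)
    full_data = zlib.decompress(read_file(path))
    nul_index = full_data.index(b'\x00')
    obj_type, size_str = full_data[:nul_index].decode().split()
    data = full_data[nul_index + 1:]
    assert len(data) == int(size_str), 'expected size {}, got {} bytes'.format(size_str, len(data))
    return (obj_type, data)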

With all this in place, you’ve got the ability to store, retrieve, and inspect Git objects, just like Git does under the hood!

The Git Index

Next up, let's talk about the Git index, or as it's often called, the staging area. When you add files to the index, you’re basically preparing them for a commit. The index is a list of file entries, ordered by their path, and each entry includes information like the file's path, modification time, SHA-1 hash, and more. It’s important to note that the index lists every tracked file in the tree, not just the ones being staged for the next commit.

The actual index file is stored at .git/index, and it uses a custom binary format. It’s not super complicated, but it does involve some struct packing and unpacking, plus dealing with variable-length path fields. Here’s the basic breakdown:

  • Header: The first 12 bytes

  • Entries: Each entry is 62 bytes, plus the length of the path, with some padding

  • Footer: The last 20 bytes are a SHA-1 hash of the index for integrity checking

Here’s how we can define an IndexEntry and a function to read the index:

# Data for one entry in the git index (.git/index)
IndexEntry = collections.namedtuple('IndexEntry', [
    'ctime_s', 'ctime_n', 'mtime_s', 'mtime_n', 'dev', 'ino', 'mode',
    'uid', 'gid', 'size', 'sha1', 'flags', 'path',
])

def read_index():
    """Read the git index file and return a list of IndexEntry objects."""
    try:
        data = read_file(os.path.join('.git', 'index'))
    except FileNotFoundError:
        return []

    digest = hashlib.sha1(data[:-20]).digest()
    assert digest == data[-20:], 'Invalid index checksum'

    # Read the signature and version
    signature, version, num_entries = struct.unpack('!4sLL', data[:12])
    assert signature == b'DIRC', f'Invalid index signature {signature}'
    assert version == 2, f'Unknown index version {version}'

    entry_data = data[12:-20]
    entries = []
    i = 0
    while i + 62 < len(entry_data):
        fields_end = i + 62
        fields = struct.unpack('!LLLLLLLLLL20sH', entry_data[i:fields_end])
        path_end = entry_data.index(b'\x00', fields_end)
        path = entry_data[fields_end:path_end]

        entry = IndexEntry(*(fields + (path.decode(),)))
        entries.append(entry)

        entry_len = ((62 + len(path) + 8) // 8) * 8
        i += entry_len

    assert len(entries) == num_entries
    return entries

Checking the Status

Once we've got the index in place, we can start tracking the status of files with commands like ls_files, status, and diff—each of which shows the state of the files in the index in different ways.

  • ls_files: Lists all files in the index, showing their mode and hash if the -s option is used.

  • status() uses get_status(): Compares the files in the index to those in the working directory and prints out which files are new, modified, or deleted.

  • diff(): Shows the difference between the files in the index and the working directory (using Python’s difflib module).

In reality, Git handles these operations more efficiently, using the modification times recorded in the index to skip re-hashing files that haven't changed. In GitSnap, I take a more basic approach: I do a full directory listing with os.walk(), compare paths using set operations, and then check whether the hashes match. For example, this set comprehension checks for changed files:

changed = {p for p in (paths & entry_paths)
           if hash_object(read_file(p), 'blob', write=False) !=
              entries_by_path[p].sha1.hex()}
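
For context, that comprehension lives inside get_status(); here are stripped-down sketches (not the exact code) of get_status() and diff(), reusing read_index(), read_object(), and the file helpers from earlier:

def get_status():
    """Get status of working copy, return tuple of
    (changed_paths, new_paths, deleted_paths)."""
    paths = set()
    for root, dirs, files in os.walk('.'):
        dirs[:] = [d for d in dirs if d != '.git']  # skip the .git directory
        for file in files:
            path = os.path.join(root, file).replace('\\', '/')
            if path.startswith('./'):
                path = path[2:]
            paths.add(path)
    entries_by_path = {e.path: e for e in read_index()}
    entry_paths = set(entries_by_path)
    changed = {p for p in (paths & entry_paths)
               if hash_object(read_file(p), 'blob', write=False) !=
                  entries_by_path[p].sha1.hex()}
    new = paths - entry_paths
    deleted = entry_paths - paths
    return (sorted(changed), sorted(new), sorted(deleted))

def diff():
    """Show diff of changed files (between index and working copy)."""
    changed, _, _ = get_status()
    entries_by_path = {e.path: e for e in read_index()}
    for i, path in enumerate(changed):
        sha1 = entries_by_path[path].sha1.hex()
        obj_type, data = read_object(sha1)
        assert obj_type == 'blob'
        index_lines = data.decode().splitlines()
        working_lines = read_file(path).decode().splitlines()
        diff_lines = difflib.unified_diff(
            index_lines, working_lines,
            '{} (index)'.format(path),
            '{} (working copy)'.format(path),
            lineterm='')
        for line in diff_lines:
            print(line)
        if i < len(changed) - 1:
            print('-' * 70)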

Writing the Index

Once we’ve checked the status and figured out which files have changed, we need to update the index. The write_index() function does exactly that, and the add() function adds new files to the index: it reads the current index, hashes and appends the new paths, sorts the entries, and writes it all back (see the sketch below).
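
Here's roughly what the pair looks like; treat this as a sketch rather than the exact code. write_index() packs each entry with the same struct layout that read_index() unpacks (padding entries to a multiple of 8 bytes) and appends a SHA-1 checksum, while add() builds new IndexEntry records from os.stat() and the blob hash:

def write_index(entries):
    """Write list of IndexEntry objects to the git index file."""
    packed_entries = []
    for entry in entries:
        entry_head = struct.pack('!LLLLLLLLLL20sH',
            entry.ctime_s, entry.ctime_n, entry.mtime_s, entry.mtime_n,
            entry.dev, entry.ino, entry.mode, entry.uid, entry.gid,
            entry.size, entry.sha1, entry.flags)
        path = entry.path.encode()
        length = ((62 + len(path) + 8) // 8) * 8  # pad entry to a multiple of 8 bytes
        packed_entries.append(entry_head + path + b'\x00' * (length - 62 - len(path)))
    header = struct.pack('!4sLL', b'DIRC', 2, len(entries))
    all_data = header + b''.join(packed_entries)
    digest = hashlib.sha1(all_data).digest()
    write_file(os.path.join('.git', 'index'), all_data + digest)

def add(paths):
    """Add the given file paths to the git index."""
    paths = [p.replace('\\', '/') for p in paths]
    entries = [e for e in read_index() if e.path not in paths]  # keep entries we're not replacing
    for path in paths:
        sha1 = hash_object(read_file(path), 'blob')  # store the blob and get its hash
        st = os.stat(path)
        flags = len(path.encode())
        assert flags < (1 << 12)
        entries.append(IndexEntry(
            int(st.st_ctime), 0, int(st.st_mtime), 0, st.st_dev,
            st.st_ino, st.st_mode, st.st_uid, st.st_gid, st.st_size,
            bytes.fromhex(sha1), flags, path))
    entries.sort(key=lambda e: e.path)
    write_index(entries)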

Now we’re ready to commit! With the files added to the index, we can move on to the next step in the Git process.

Committing Changes

When you make a commit in Git, you're essentially creating two objects:

  1. Tree Object: This is a snapshot of the directory (or really the index) at the time of the commit. It’s like taking a picture of all the files in your project at that point. The tree lists the hashes of the files (blobs) and sub-trees in the directory, and it’s recursive. That means if a file changes, the hash of the entire tree changes too. But if a file or sub-tree doesn’t change, it’s just referred to by the same hash, making it super efficient.

Here’s an example of how a tree object might look when pretty-printed with git cat-file -p (GitSnap's equivalent is cat-file pretty):

100644 blob 4aab5f560862b45d7a9f1370b1c163b74484a24d    LICENSE.txt
100644 blob 43ab992ed09fa756c56ff162d5fe303003b5ae0f    README.md
100644 blob c10cb8bc2c114aba5a1cb20dea4c1597e5a3c193    gitsnap.py

The function write_tree is used to create this tree object. Git stores the information as a mix of binary and text — the mode and path as text, followed by a NUL byte and then the binary SHA-1 hash. Here’s how write_tree() works in GitSnap:

def write_tree():
    """Write a tree object from the current index entries."""
    tree_entries = []
    for entry in read_index():
        assert '/' not in entry.path, 'Currently only supports a single, top-level directory'
        mode_path = '{:o} {}'.format(entry.mode, entry.path).encode()
        tree_entry = mode_path + b'\x00' + entry.sha1
        tree_entries.append(tree_entry)
    return hash_object(b''.join(tree_entries), 'tree')

  2. Commit Object: This object records the tree hash, the parent commit, the author, timestamp, and the commit message. Git handles branching and merging like a pro, but GitSnap only supports a single branch (master), so there’s always just one parent commit — or no parents for the first commit.

Here’s an example of what a commit object looks like:

tree 22264ec0ce9da29d0c420e46627fa0cf057e709a
parent 03f882ade69ad898aba73664740641d909883cdc
author Ben Hoyt <benhoyt@gmail.com> 1493170892 -0500
committer Ben Hoyt <benhoyt@gmail.com> 1493170892 -0500

Fix cat-file size/type/pretty handling

Now let’s look at how we create a commit in GitSnap. We first generate the tree object, get the parent commit (if any), and then format the commit data. Here's the commit function:

def commit(message, author):
    """Commit the current state of the index to master with the given message.
    Returns the hash of the commit object.
    """
    tree = write_tree()
    parent = get_local_master_hash()
    timestamp = int(time.mktime(time.localtime()))
    utc_offset = -time.timezone
    author_time = '{} {}{:02}{:02}'.format(
            timestamp,
            '+' if utc_offset > 0 else '-',
            abs(utc_offset) // 3600,
            (abs(utc_offset) // 60) % 60)

    # Building the commit data
    lines = ['tree ' + tree]
    if parent:
        lines.append('parent ' + parent)
    lines.append('author {} {}'.format(author, author_time))
    lines.append('committer {} {}'.format(author, author_time))
    lines.append('')
    lines.append(message)
    lines.append('')

    # Creating the commit object
    data = '\n'.join(lines).encode()
    sha1 = hash_object(data, 'commit')

    # Writing the commit hash to the master branch
    master_path = os.path.join('.git', 'refs', 'heads', 'master')
    write_file(master_path, (sha1 + '\n').encode())

    print('Committed to master: {}'.format(sha1[:7]))  # show the abbreviated commit hash
    return sha1
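
One helper commit() relies on is get_local_master_hash(), which finds the parent commit; a minimal version, using the read_file() helper from earlier, could look like this:

def get_local_master_hash():
    """Return current commit hash (SHA-1 hex string) of the local master
    branch, or None if there are no commits yet."""
    master_path = os.path.join('.git', 'refs', 'heads', 'master')
    try:
        return read_file(master_path).decode().strip()
    except FileNotFoundError:
        return None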

The Commit Process

  1. Create the Tree: First, we create a tree object using the current index. It’s a snapshot of the directory.

  2. Parent Commit: If there’s already a commit (i.e., it's not the first one), we record the parent commit.

  3. Commit Data: We prepare the commit data, including author details, timestamp, and the commit message.

  4. Generate the Commit Object: Using the tree and commit data, we hash it and create a commit object.

  5. Update the Master Branch: Finally, we save the commit hash in the refs/heads/master file to track the commit.

Now that we’ve committed our changes, we can see the new state of our project in the master branch. With GitSnap, you’ve just made your first commit!

Talking to a Git Server: The "Smart Protocol"

In this section, we'll talk about how to make GitSnap interact with a real Git server (like GitHub, Bitbucket, etc.). When you push changes to a remote repository, you need to figure out how to synchronize your local commits with the remote ones. The challenge here is using the smart protocol to transfer the necessary commit data and objects.

The Smart Protocol

Git used to use a "dumb" transfer protocol, which was simpler but inefficient. GitHub stopped supporting it back in 2011 in favor of the smart protocol, which is more efficient and uses "pack files" to bundle up and transfer missing objects. While the smart protocol is more complex, it’s what GitHub and other servers use today.

The core idea is simple:

  1. Check the remote's master branch to see what commit it’s on.

  2. Identify missing objects (commits, trees, blobs) that your local repository needs to sync up with the remote.

  3. Send a pack file that contains all the missing objects.

The pkt-line Format

At the heart of the smart protocol is the pkt-line format, which is used to send metadata, like commit hashes, between the client and the server. A "pkt-line" is essentially a length-prefixed packet: the first 4 bytes are the packet length as hexadecimal ASCII (a length that includes the 4 length bytes themselves), followed by the actual data. A length of 0000 is a special "flush packet" that marks the end of a section. In the example below, 001f is 31: 4 length bytes plus the 27-byte line # service=git-receive-pack\n.

Here’s an example of the response GitHub sends when you make a git-receive-pack request:

001f# service=git-receive-pack\n
0000
00b20000000000000000000000000000000000000000 capabilities^{}\x00
        report-status delete-refs side-band-64k quiet atomic ofs-delta
        agent=git/2.9.3~peff-merge-upstream-2-9-1788-gef730f7\n
0000

Functions for Handling pkt-lines

To interact with the server using the smart protocol, we need two key functions:

  1. Extracting pkt-line data: Convert the raw server data into individual lines.

  2. Building pkt-line data: Convert a list of lines into the correct pkt-line format to send back to the server.

Here’s how you can implement both:

def extract_lines(data):
    """Extract list of lines from given server data."""
    lines = []
    i = 0
    for _ in range(1000):  # bounded loop as a simple safety guard against malformed data
        line_length = int(data[i:i + 4], 16)  # 4 hex digits give the total packet length
        line = data[i + 4:i + line_length]
        lines.append(line)
        if line_length == 0:  # flush packet ('0000') carries no data
            i += 4
        else:
            i += line_length
        if i >= len(data):
            break
    return lines

def build_lines_data(lines):
    """Build byte string from given lines to send to server."""
    result = []
    for line in lines:
        # The length prefix counts itself (4 bytes) plus the line and a trailing newline
        result.append('{:04x}'.format(len(line) + 5).encode())
        result.append(line)
        result.append(b'\n')
    result.append(b'0000')  # terminate with a flush packet
    return b''.join(result)
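
A quick sanity check of these two helpers, using made-up data rather than a real server exchange:

data = build_lines_data([b'hello'])
print(data)                 # b'000ahello\n0000' -- 0x000a = 10 bytes: length prefix (4) + data (5) + newline (1)
print(extract_lines(data))  # [b'hello\n', b''] -- the flush packet '0000' comes back as an empty line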

Making an HTTPS Request (without requests)

To interact with a remote Git server, you'll need to make authenticated HTTPS requests. While libraries like requests make this easier, you can also use Python’s built-in urllib library. Here's a simple function to make a GET request with authentication:

import urllib.request

def http_request(url, username, password, data=None):
    """Make an authenticated HTTP request to the given URL."""
    password_manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    password_manager.add_password(None, url, username, password)
    auth_handler = urllib.request.HTTPBasicAuthHandler(password_manager)
    opener = urllib.request.build_opener(auth_handler)
    f = opener.open(url, data=data)
    return f.read()

In contrast, the requests library makes this much shorter (using a GET when there's no body to send, and a POST when pushing data):

import requests

def http_request(url, username, password, data=None):
    if data is None:
        response = requests.get(url, auth=(username, password))
    else:
        response = requests.post(url, auth=(username, password), data=data)
    response.raise_for_status()
    return response.content

Checking the Remote's Commit Hash

To figure out what commit the remote repository's master branch is on, we can send a request to the server and use the info/refs endpoint. This gives us the commit hash of the remote master branch.

Here’s the code to retrieve the remote master branch's commit hash:

def get_remote_master_hash(git_url, username, password):
    """Get commit hash of remote master branch, return SHA-1 hex string."""
    url = git_url + '/info/refs?service=git-receive-pack'
    response = http_request(url, username, password)
    lines = extract_lines(response)

    assert lines[0] == b'# service=git-receive-pack\n'
    assert lines[1] == b''
    if lines[2][:40] == b'0' * 40:
        return None  # No commits on remote

    master_sha1, master_ref = lines[2].split(b'\x00')[0].split()
    assert master_ref == b'refs/heads/master'
    assert len(master_sha1) == 40
    return master_sha1.decode()

Bringing it All Together

Once you have the remote commit hash, you can compare it to your local commit. If the remote is behind, you can send a pack file containing the missing objects. This process is more complex than using the "dumb" protocol, but it’s much more efficient and scalable for modern Git hosting services like GitHub and Bitbucket.

By following these steps, you're using the smart protocol to talk to a Git server and sync your repository with the remote, ensuring that all missing commits and objects are transferred properly.

Determining Missing Objects and Pushing to the Server

In this section, we explore how to identify the missing objects that need to be pushed to a remote Git server and how to actually push those changes. We’ll break it down into the following steps:

  1. Finding the Missing Objects: The first task is to identify which objects (commits, trees, blobs) are missing on the remote server compared to the local repository.

  2. Creating the Pack File: After identifying the missing objects, we need to bundle them into a "pack file," which is a compressed file containing all the necessary objects.

  3. Pushing the Changes: Finally, we send the updated commit hash and the pack file to the remote server via an HTTP request.

Finding Missing Objects

To determine which objects are missing, we first need to identify the objects referenced by the local commit and the remote commit. This is done by recursively traversing the commit tree and gathering the object hashes.

Functions for Finding Tree and Commit Objects:
  1. find_tree_objects(tree_sha1): This function recursively finds all objects within a tree, including the hash of the tree itself.

     def find_tree_objects(tree_sha1):
         """Return set of SHA-1 hashes of all objects in this tree (recursively), including the hash of the tree itself."""
         objects = {tree_sha1}
         for mode, path, sha1 in read_tree(sha1=tree_sha1):
             if stat.S_ISDIR(mode):
                 objects.update(find_tree_objects(sha1))
             else:
                 objects.add(sha1)
         return objects
    
  2. find_commit_objects(commit_sha1): This function recursively collects all objects referenced by a commit, including the commit itself, its tree, and its parent commits.

     def find_commit_objects(commit_sha1):
         """Return set of SHA-1 hashes of all objects in this commit (recursively), its tree, its parents, and the hash of the commit itself."""
         objects = {commit_sha1}
         obj_type, commit = read_object(commit_sha1)
         assert obj_type == 'commit'
         lines = commit.decode().splitlines()
         tree = next(l[5:45] for l in lines if l.startswith('tree '))
         objects.update(find_tree_objects(tree))
         parents = (l[7:47] for l in lines if l.startswith('parent '))
         for parent in parents:
             objects.update(find_commit_objects(parent))
         return objects
    
Determining Missing Objects

Once we have the set of objects for both the local and remote commits, we can determine the missing objects by taking the set difference between the two sets.

def find_missing_objects(local_sha1, remote_sha1):
    """Return set of SHA-1 hashes of objects in local commit that are missing at the remote."""
    local_objects = find_commit_objects(local_sha1)
    if remote_sha1 is None:
        return local_objects
    remote_objects = find_commit_objects(remote_sha1)
    return local_objects - remote_objects
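
One helper used by find_tree_objects() above but not shown yet is read_tree(), which parses a tree object into (mode, path, sha1) tuples. A rough sketch, assuming the read_object() helper from earlier:

def read_tree(sha1):
    """Read tree object with given SHA-1 hex string and return a list of
    (mode, path, sha1_hex) tuples."""
    obj_type, data = read_object(sha1)
    assert obj_type == 'tree'
    entries = []
    i = 0
    while i < len(data):
        end = data.index(b'\x00', i)
        mode_str, path = data[i:end].decode().split(' ', 1)  # e.g. '100644 gitsnap.py'
        digest = data[end + 1:end + 21]  # 20-byte binary SHA-1 follows the NUL
        entries.append((int(mode_str, 8), path, digest.hex()))
        i = end + 21
    return entries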

Creating the Pack File

After determining the missing objects, the next step is to create a pack file. This file contains the missing objects and is sent to the server during the push process. A pack file has a specific structure, starting with a 12-byte header, followed by the objects themselves, encoded and compressed.

  1. encode_pack_object(obj): This function encodes a single object to be included in the pack file. The object is compressed using zlib to reduce its size.

     def encode_pack_object(obj):
         """Encode a single object for a pack file and return bytes (variable-length header followed by compressed data bytes)."""
         obj_type, data = read_object(obj)
         type_num = ObjectType[obj_type].value
         size = len(data)
         # First header byte: object type in bits 4-6, low 4 bits of the size in bits 0-3
         byte = (type_num << 4) | (size & 0x0f)
         size >>= 4
         header = []
         # Remaining size bits are emitted 7 at a time; the high bit means "more bytes follow"
         while size:
             header.append(byte | 0x80)
             byte = size & 0x7f
             size >>= 7
         header.append(byte)
         return bytes(header) + zlib.compress(data)
    
  2. create_pack(objects): This function creates the full pack file by combining the header and the encoded objects.

     def create_pack(objects):
         """Create pack file containing all objects in given set of SHA-1 hashes, return data bytes of full pack file."""
         header = struct.pack('!4sLL', b'PACK', 2, len(objects))
         body = b''.join(encode_pack_object(o) for o in sorted(objects))
         contents = header + body
         sha1 = hashlib.sha1(contents).digest()
         data = contents + sha1
         return data
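
One detail not shown above: encode_pack_object() refers to an ObjectType enum that maps object type names to the numeric codes used in pack files (commit = 1, tree = 2, blob = 3). A definition along these lines does the job (sketched here with the standard enum module):

import enum

class ObjectType(enum.Enum):
    """Object type enum (pack file type numbers). Git has more types, but
    these three are all GitSnap needs."""
    commit = 1
    tree = 2
    blob = 3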
    

The Push Process

The final step is to push the changes to the remote repository. This involves sending a pkt-line request to update the master branch with the new commit hash, followed by the pack file containing the missing objects.

def push(git_url, username, password):
    """Push master branch to given git repo URL."""
    remote_sha1 = get_remote_master_hash(git_url, username, password)
    local_sha1 = get_local_master_hash()
    missing = find_missing_objects(local_sha1, remote_sha1)
    lines = ['{} {} refs/heads/master\x00 report-status'.format(
            remote_sha1 or ('0' * 40), local_sha1).encode()]
    data = build_lines_data(lines) + create_pack(missing)
    url = git_url + '/git-receive-pack'
    response = http_request(url, username, password, data=data)
    lines = extract_lines(response)
    assert lines[0] == b'unpack ok\n', \
        "expected line 1 b'unpack ok', got: {}".format(lines[0])

In the push function:

  1. The remote commit hash is retrieved using get_remote_master_hash().

  2. The local commit hash is obtained using get_local_master_hash().

  3. The missing objects are found using find_missing_objects().

  4. The pkt-line request is sent, along with the pack file containing the missing objects.

  5. The response is checked to ensure the push was successful (indicated by the line b'unpack ok\n').

Command Line Parsing and Usage

In this section, we look at how gitsnap utilizes Python’s argparse module to parse command-line arguments and execute various Git-like actions. The focus is on creating a user-friendly command line interface (CLI) with functionality similar to Git commands like git init, git commit, git status, and others.

Using argparse for Command-Line Parsing

argparse is a powerful module in Python's standard library that helps parse command-line arguments and provides a clean syntax for writing CLI tools. I won’t copy the full argparse code here (take a look at the source), but gitsnap uses it to expose Git-like operations as simple subcommands.

  • Subcommands: Just like Git uses subcommands (git init, git commit, etc.), gitsnap also defines multiple subcommands for initializing a repo, committing changes, checking status, etc.

Here's an example structure of how argparse could be used for a gitsnap CLI:

import argparse

def init_repo(name):
    print(f"Initialized empty repository: {name}")

def commit_changes(message):
    print(f"Committed changes with message: {message}")

def status():
    print("Showing status...")

parser = argparse.ArgumentParser(prog="gitsnap")
subparsers = parser.add_subparsers(dest="command", required=True)  # every invocation needs a subcommand

# Subcommand for init
init_parser = subparsers.add_parser("init")
init_parser.add_argument("name")
init_parser.set_defaults(func=lambda args: init_repo(args.name))

# Subcommand for commit
commit_parser = subparsers.add_parser("commit")
commit_parser.add_argument("-m", "--message", required=True)
commit_parser.set_defaults(func=lambda args: commit_changes(args.message))

# Subcommand for status
status_parser = subparsers.add_parser("status")
status_parser.set_defaults(func=lambda args: status())  # wrap so every handler takes the parsed args

# Parse arguments and call corresponding function
args = parser.parse_args()
args.func(args)

Example Usage of gitsnap

Once gitsnap is set up, the command-line tool can be used just like Git for various operations:

  1. Initializing a repository: You can initialize a new repository by running the following command:

     $ python3 gitsnap.py init gitsnap
    

    This creates a new repository called gitsnap.

  2. Checking status: After making changes, you can check the status of your repository with:

     $ python3 gitsnap.py status
    

    This will show the current changes in the working directory.

  3. Adding files to the staging area: You can add files to be committed using the add subcommand:

     $ python3 gitsnap.py add gitsnap.py
    
  4. Committing changes: To commit changes, use the commit subcommand with a message:

     $ python3 gitsnap.py commit -m "First working version of gitsnap"
    

    This will commit the changes with the provided message.

  5. Viewing commit details: You can view commit details with the cat-file subcommand:

     $ python3 gitsnap.py cat-file commit 00d5
    

    This will output the commit details, including the tree hash, author, committer, and the commit message.

  6. Viewing changes (diff): To see what has changed between the index and the working copy, use the diff subcommand:

     $ python3 gitsnap.py diff
    

    This shows the differences between the staged changes and the working directory.

  7. Pushing changes to a remote: Finally, to push the committed changes to a remote repository:

     $ python3 gitsnap.py push https://github.com/yashrajtarte/gitsnap.git
    

    This will push the changes to the specified GitHub repository, including the missing objects.

Let's Wrap Up: Your New Git Sidekick!

  • gitsnap is like your trusty sidekick for Git! It’s a simple CLI tool that lets you interact with both local repos and remote Git servers, all while keeping that familiar Git vibe.

  • Thanks to argparse, gitsnap is super flexible—just add new commands whenever you need them, no hassle, no stress. It's as comfy as your favorite pair of jeans!

  • The best part? If you're already a Git pro, gitsnap will feel like second nature. It follows the same flow as Git, making repo management a total breeze.