Sunday, 27 March 2016

Git - The Onion Model

Disclaimer: This post doesn't teach you git basics. This will help you understand how git works.

In this article we will look at Git's Onion Model - the layers that together make it a powerful Distributed Version/Revision Control System.

The four layers are -
4. Distributed Version Control System - Supports push, fetch, pull etc. operations
3. Version Control System - History, branches, merges, rebase etc.
2. Simple Content Tracker - Commits, versions (labels/tags)
1. Map - key mapped to value, persisted on disk




Layer 1 - Map
At its core, git is just a map persisted on disk: a table of keys and values, also called the object database.

The key is a SHA1 hash. The value can be of type
  • blob
  • tree
  • commit
  • tag
You can give git any value and it will calculate the SHA1 hash for it.

E.g. using the plumbing command hash-object

$ echo "sibtain" | git hash-object --stdin 
64219700ea1c10634d4141fcd1c3f01163cb03d1
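As a rough illustration of what hash-object computes: git does not hash the bare content, it hashes a short header (the word "blob", the content length and a NUL byte) followed by the content. The Python sketch below assumes that layout; the function name git_blob_hash is made up for this example.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Return the SHA1 git would assign to `content` stored as a blob."""
    # Git hashes "<type> <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# echo appends a newline, so the hashed value is "sibtain\n", not "sibtain".
print(git_blob_hash(b"sibtain\n"))
```

If the input bytes are exactly what echo produced, this should print the same id as the command above.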

To persist this value, use the -w flag of hash-object.
Note: you need to execute the following command inside a git repository (use $ git init to initialize a new one).

$ echo "sibtain" | git hash-object --stdin -w
64219700ea1c10634d4141fcd1c3f01163cb03d1

To find where and how it is persisted in the object store,

$ ls -l .git/objects/
total 12
drwxr-xr-x 2 sibtain sibtain 4096 Mar 27 11:34 64
drwxr-xr-x 2 sibtain sibtain 4096 Mar 27 11:34 info
drwxr-xr-x 2 sibtain sibtain 4096 Mar 27 11:34 pack

You see a directory named 64. These are the first two characters of the SHA1. Inside that directory you will find a file whose name is the remaining 38 characters of the SHA1. The file is zlib-compressed, so it is not readable as plain text.

$ ls -l .git/objects/64
total 4
-r--r--r-- 1 sibtain sibtain 23 Mar 27 11:34 219700ea1c10634d4141fcd1c3f01163cb03d1


To get the content I'll use another plumbing command, cat-file -

$ git cat-file -p 64219700ea1c10634d4141fcd1c3f01163cb03d1
sibtain

To get the type of the object -
$ git cat-file -t 64219700ea1c10634d4141fcd1c3f01163cb03d1
blob
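To make the storage layout concrete, here is a minimal Python sketch of a loose object being written and read back. It assumes the object is the header plus content, zlib-compressed, and stored under a directory named after the first two hex characters. The helper names write_object and read_object are made up, and a temporary directory stands in for .git/objects.

```python
import hashlib
import os
import tempfile
import zlib

def write_object(objects_dir: str, obj_type: str, payload: bytes) -> str:
    """Store payload as a loose object and return its SHA1 (like hash-object -w)."""
    raw = obj_type.encode("ascii") + b" " + str(len(payload)).encode("ascii") + b"\0" + payload
    sha = hashlib.sha1(raw).hexdigest()
    folder = os.path.join(objects_dir, sha[:2])            # first two hex characters
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, sha[2:]), "wb") as fh:  # remaining 38 characters
        fh.write(zlib.compress(raw))
    return sha

def read_object(objects_dir: str, sha: str):
    """Return (type, payload) for a loose object (like cat-file -t and -p)."""
    with open(os.path.join(objects_dir, sha[:2], sha[2:]), "rb") as fh:
        raw = zlib.decompress(fh.read())
    header, _, payload = raw.partition(b"\0")
    obj_type, _size = header.decode("ascii").split(" ")
    return obj_type, payload

objects_dir = tempfile.mkdtemp()  # stands in for .git/objects
sha = write_object(objects_dir, "blob", b"sibtain\n")
print(read_object(objects_dir, sha))
```

Packed objects (the pack directory) use a different format and are not covered by this sketch.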

This covers the innermost layer of git: the map of key and value pairs and how it is persisted.

The directory structure of a nearly empty git repo is as follows.

$ tree -a
.
`-- .git
    |-- branches
    |-- config
    |-- description
    |-- HEAD
    |-- hooks
    |   |-- applypatch-msg.sample
    |   |-- commit-msg.sample
    |   |-- post-update.sample
    |   |-- pre-applypatch.sample
    |   |-- pre-commit.sample
    |   |-- prepare-commit-msg.sample
    |   |-- pre-rebase.sample
    |   `-- update.sample
    |-- info
    |   `-- exclude
    |-- objects
    |   |-- 64
    |   |   `-- 219700ea1c10634d4141fcd1c3f01163cb03d1
    |   |-- info
    |   `-- pack
    `-- refs
        |-- heads
        `-- tags
11 directories, 13 files

Layer 2 - Simple Content Tracker

A content tracking system has to provide a way to maintain versions and to record commit checkpoints.

Here we will explore where and how a commit is persisted.

The following directory structure is committed.

$ tree
.
|-- city.lst
`-- city_profile
    |-- mumbai.txt
    `-- pune.txt
 
$ git log
commit 2130ce8e0f697af309e47ab1f0dc916fece0eb9a
Author: sibtain <sibtain@sibtain-linuxmint.(none)>
Date:   Sun Mar 27 17:09:25 2016 +0530

    Adds city details

commit 8408f82db302b32f02510a7afd1749210a3ab9bc
Author: sibtain <sibtain@sibtain-linuxmint.(none)>
Date:   Sun Mar 27 17:09:07 2016 +0530

    Adds City list

Let's focus on commit 2130ce8 and check what's in the object database.

$ ls .git/objects
10  21  5e  64  84  8f  a6  ab  info  pack
 
$ ls -l .git/objects/21/
total 4
-r--r--r-- 1 sibtain sibtain 166 Mar 27 17:09 30ce8e0f697af309e47ab1f0dc916fece0eb9a

What is the type of this object?

$ git cat-file -t 2130ce8e0f697af309e47ab1f0dc916fece0eb9a
commit

OK. So it is a commit object. What does it contain?

$ git cat-file -p 2130ce8e0f697af309e47ab1f0dc916fece0eb9a
tree 8f7b3eb4e75d78e50dd9d37a8464c3855c1c190e
parent 8408f82db302b32f02510a7afd1749210a3ab9bc
author sibtain <sibtain@sibtain-linuxmint.(none)> 1459078765 +0530
committer sibtain <sibtain@sibtain-linuxmint.(none)> 1459078765 +0530

Adds city details

So a commit is a simple piece of text that git generates and stores as an object in the object database. It holds the message, the author and committer details with timestamps, and tree and parent references given as SHA1 values.

Parent points to the previous commit. A merge commit that joins two branches has 2 parent entries, one for each branch.
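Since the commit payload is plain text, it is easy to take apart. The Python sketch below splits the text printed by cat-file -p into header fields and message; parse_commit is a made-up helper for illustration, not something git ships.

```python
def parse_commit(text: str) -> dict:
    """Split a commit payload into its header fields and message."""
    # A blank line separates the headers from the message.
    headers, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # merge commits carry several parent lines
        else:
            commit[key] = value
    return commit

payload = """tree 8f7b3eb4e75d78e50dd9d37a8464c3855c1c190e
parent 8408f82db302b32f02510a7afd1749210a3ab9bc
author sibtain <sibtain@sibtain-linuxmint.(none)> 1459078765 +0530
committer sibtain <sibtain@sibtain-linuxmint.(none)> 1459078765 +0530

Adds city details
"""
commit = parse_commit(payload)
print(commit["tree"], commit["parents"], commit["message"])
```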

It also has a pointer to a tree. Let's explore that.

$ git cat-file -t 8f7b3eb4e75d78e50dd9d37a8464c3855c1c190e
tree
$ git cat-file -p 8f7b3eb4e75d78e50dd9d37a8464c3855c1c190e
100644 blob 8f4272c240a23d814ee963abcccf9f871aae9be8    city.lst
040000 tree a6ec82fc89c19390894fd7685d32b5124bb24516    city_profile

The tree object holds 2 references: one to a blob (city.lst) and another to a tree (city_profile). The leading numbers give the mode of each entry (its type and permission bits) in octal. File names and permissions are not stored in blobs; they are stored in the tree. A blob is just the file's content.

$ git cat-file -p 8f4272c240a23d814ee963abcccf9f871aae9be8
Mumbai
Pune

$ git cat-file -p a6ec82fc89c19390894fd7685d32b5124bb24516
100644 blob 1013a5511947260b727bd9f79946517121c682ef    mumbai.txt
100644 blob ab4f45300c9270dbb2ba92bc06c0a670271b8f33    pune.txt
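Unlike commits, a tree is stored in a binary form: each entry is the octal mode, a space, the name, a NUL byte and then the 20 raw bytes of the SHA1. The Python sketch below assumes that layout and the rule that directories sort as if their name ended in "/". The helper names are made up. Note that cat-file -p prints a directory mode padded to 040000, while the stored form is 40000.

```python
import hashlib

def tree_payload(entries) -> bytes:
    """Serialize (mode, name, sha_hex) entries the way a tree object stores them."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders directories as if their name ended with "/".
        return (name + "/" if mode == "40000" else name).encode()
    out = b""
    for mode, name, sha_hex in sorted(entries, key=sort_key):
        # "<octal mode> <name>\0" followed by the 20 raw bytes of the SHA1.
        out += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(sha_hex)
    return out

def object_hash(obj_type: str, payload: bytes) -> str:
    """SHA1 over "<type> <size>\\0" plus the payload."""
    header = obj_type.encode() + b" " + str(len(payload)).encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

city_profile = tree_payload([
    ("100644", "mumbai.txt", "1013a5511947260b727bd9f79946517121c682ef"),
    ("100644", "pune.txt", "ab4f45300c9270dbb2ba92bc06c0a670271b8f33"),
])
print(object_hash("tree", city_profile))
print(oct(0o100644 & 0o777))  # permission bits of a regular file entry
```

If the layout assumed here is right, the first line printed should be the id of the city_profile tree shown above.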

Note: You can also use just the first few characters of a SHA1 in any of these commands, as long as they identify a single object.
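A short prefix works because git only needs it to be unambiguous among the objects it knows. A minimal Python sketch of that lookup, using ids from this post (resolve_prefix is a made-up name):

```python
def resolve_prefix(prefix: str, object_ids) -> str:
    """Expand an abbreviated id to the full one, insisting that it is unambiguous."""
    matches = [oid for oid in object_ids if oid.startswith(prefix)]
    if len(matches) != 1:
        raise LookupError(f"{prefix!r} matches {len(matches)} objects")
    return matches[0]

known = [
    "2130ce8e0f697af309e47ab1f0dc916fece0eb9a",
    "8408f82db302b32f02510a7afd1749210a3ab9bc",
    "8f7b3eb4e75d78e50dd9d37a8464c3855c1c190e",
    "8f4272c240a23d814ee963abcccf9f871aae9be8",
]
print(resolve_prefix("2130ce8", known))
```

Here the prefix 8f would be rejected, because both a tree and a blob in this repository start with it.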

I'll add a new name to city.lst and then commit changes.

$ git log --oneline
1c57ddc Adds a city to list
2130ce8 Adds city details
8408f82 Adds City list

$ git cat-file -p 1c57ddc
tree 949bd0423891cee02f38c75ac8ec8623ea3f59ff
parent 2130ce8e0f697af309e47ab1f0dc916fece0eb9a
author sibtain <sibtain.masih@gmail.com> 1459079797 +0530
committer sibtain <sibtain.masih@gmail.com> 1459079797 +0530

Adds a city to list

$ git cat-file -p 949bd0
100644 blob 7dc571a82b903bbe28a391600ad9b2a68f752f62    city.lst
040000 tree a6ec82fc89c19390894fd7685d32b5124bb24516    city_profile

Observe that the SHA1 of city_profile has not changed, so this commit points to the same object in the database for city_profile as the previous commit. The only new blob is the one created and referenced for city.lst.



To find how many objects are persisted in the object database -

$ git count-objects
12 objects, 48 kilobytes

The count of 12 breaks down as follows.
1 - blob object for the hash-object demo text - "sibtain"
3 - commit objects
3 - tree objects as commit trees
2 - blob objects for city.lst
1 - tree object for directory city_profile
2 - blob objects for 2 files inside city_profile directory
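The same count can be reproduced with a toy content-addressed store. The Python sketch below is a model, not git: trees use a simplified text encoding, commits omit author details, and the contents of mumbai.txt, pune.txt and the third city are invented. What it does show is that identical content always lands on the same key, so unchanged files and directories add no new objects.

```python
import hashlib

store = {}  # sha -> (type, payload); a stand-in for .git/objects

def put(obj_type: str, payload: bytes) -> str:
    header = obj_type.encode() + b" " + str(len(payload)).encode() + b"\0"
    sha = hashlib.sha1(header + payload).hexdigest()
    store[sha] = (obj_type, payload)  # identical content lands on the same key
    return sha

def tree(entries) -> str:
    # Simplified text encoding of a tree; real trees use a binary layout.
    lines = [f"{mode} {name} {sha}" for mode, name, sha in sorted(entries, key=lambda e: e[1])]
    return put("tree", "\n".join(lines).encode())

def commit(tree_sha: str, parent, message: str) -> str:
    parent_line = f"parent {parent}\n" if parent else ""
    return put("commit", f"tree {tree_sha}\n{parent_line}\n{message}\n".encode())

put("blob", b"sibtain\n")  # the hash-object demo
city_v1 = put("blob", b"Mumbai\nPune\n")
c1 = commit(tree([("100644", "city.lst", city_v1)]), None, "Adds City list")

profile = tree([("100644", "mumbai.txt", put("blob", b"about Mumbai\n")),  # invented content
                ("100644", "pune.txt", put("blob", b"about Pune\n"))])     # invented content
c2 = commit(tree([("100644", "city.lst", city_v1), ("40000", "city_profile", profile)]),
            c1, "Adds city details")

city_v2 = put("blob", b"Mumbai\nPune\nNagpur\n")  # invented third city
c3 = commit(tree([("100644", "city.lst", city_v2), ("40000", "city_profile", profile)]),
            c2, "Adds a city to list")

print(len(store))
```

In this model the store ends up with 12 entries: 5 blobs, 4 trees and 3 commits.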

So far we have discussed commits. Another feature of a Simple Content Tracker is versioning via tags or labels. A tag is a label for the current state of the project. Git supports two types of tags, viz.
  1. Lightweight 
  2. Annotated

Lightweight Tags

A lightweight tag is just a file containing a SHA1 value that refers to a commit.

$ git tag lw-1.1

$ ls .git/refs/tags/
lw-1.1


$ cat .git/refs/tags/lw-1.1
1c57ddc8852ecfd621a35df5a93caf7c8f6987d6


$ git cat-file -t 1c57ddc
commit


Annotated Tags

An annotated tag comes with a message and creates an object in git's object database.

$ git tag -a 1.0 -m "Stable 1.0 version"

You will find an entry for this tag in .git/refs/tags

$ ls -l .git/refs/tags/
total 4
-rw-r--r-- 1 sibtain sibtain 41 Mar 27 20:21 1.0

It contains a reference to a tag object in git's object database.

$ cat .git/refs/tags/1.0
14498a628e939bda2ec6d53032f944a6889c0ecd

The object name starts with 14:

$ ls .git/objects/14
498a628e939bda2ec6d53032f944a6889c0ecd

What is the type of this object and what does it contain?

$ git cat-file -t 14498a628e939bda2ec6d53032f944a6889c0ecd
tag

$ git cat-file -p 14498a628e939bda2ec6d53032f944a6889c0ecd
object 1c57ddc8852ecfd621a35df5a93caf7c8f6987d6
type commit
tag 1.0
tagger sibtain <sibtain.masih@gmail.com> Sun Mar 27 20:21:00 2016 +0530

Stable 1.0 version

It contains a pointer to a commit object, the tag name, tagger details with a timestamp, and the message.

Another way to retrieve the same information is to use the tag name directly.

$ git cat-file -t 1.0
tag

$ git cat-file -p 1.0
object 1c57ddc8852ecfd621a35df5a93caf7c8f6987d6
type commit
tag 1.0
tagger sibtain <sibtain.masih@gmail.com> Sun Mar 27 20:21:00 2016 +0530

Stable 1.0 version

While branches move, tags don't. They stay with the same object forever.
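The difference between the two kinds of tag is one extra hop. In the Python sketch below, two dictionaries stand in for .git/refs/tags and the object database; the payloads are shortened versions of the ones shown above, and peel is a made-up helper that follows a tag name down to a commit.

```python
TIP = "1c57ddc8852ecfd621a35df5a93caf7c8f6987d6"
TAG_OBJECT = "14498a628e939bda2ec6d53032f944a6889c0ecd"

# refs: tag name -> SHA1, mirroring the files under .git/refs/tags
refs = {"lw-1.1": TIP, "1.0": TAG_OBJECT}

# objects: SHA1 -> (type, payload), mirroring the object database (payloads shortened)
objects = {
    TIP: ("commit", "tree 949bd0423891cee02f38c75ac8ec8623ea3f59ff\n\nAdds a city to list\n"),
    TAG_OBJECT: ("tag", f"object {TIP}\ntype commit\ntag 1.0\n\nStable 1.0 version\n"),
}

def peel(tag_name: str) -> str:
    """Follow a tag name to the commit it ultimately labels."""
    sha = refs[tag_name]
    obj_type, payload = objects[sha]
    while obj_type == "tag":
        # The first header line of a tag object is "object <sha>".
        sha = payload.splitlines()[0].split(" ", 1)[1]
        obj_type, payload = objects[sha]
    return sha

print(peel("lw-1.1"))  # the ref already holds the commit id
print(peel("1.0"))     # one hop through the tag object first
```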

Just to recap, the four types of objects that git's object database can store are -
  • Blobs
  • Trees
  • Commits
  • Annotated Tags

You can think of git as a high-level file system built on top of a native file system.

Layer 3 - Version Control System

A version control system is just a single repository. It has history, branches, merges and tags.

History

References between commits are used to track history. All other references, viz. commit to tree, tree to another tree and tree to blob, are used to track the content of each commit.
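Walking history is therefore just following parent references. A minimal Python sketch, with the three commits of this post keyed by their abbreviated ids (the dictionary is a stand-in for reading the parent lines of each commit object):

```python
# Each commit maps to the list of its parents, as read from the "parent" lines.
parents = {
    "1c57ddc": ["2130ce8"],
    "2130ce8": ["8408f82"],
    "8408f82": [],  # the root commit has no parent
}

def history(tip: str):
    """Yield every commit reachable from tip, starting at the tip."""
    seen, stack = set(), [tip]
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue  # a merge can reach the same ancestor twice
        seen.add(sha)
        yield sha
        stack.extend(parents[sha])

print(list(history("1c57ddc")))
```

For a linear chain like this one the result has the same order as git log. With merges, git log also sorts by date, which this sketch does not attempt.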

Branches

Creating a branch creates a file in the .git/refs/heads directory. The file has the same name as the branch and contains the SHA1 of the commit the branch points to.

$ git checkout -b villages

$ ls .git/refs/heads/
master 
villages

$ cat .git/refs/heads/villages
1c57ddc8852ecfd621a35df5a93caf7c8f6987d6

$ git cat-file -t 1c57dd
commit

How does git find the current branch?

HEAD contains a reference to a file in .git/refs/heads, and that file is the current branch.

$ cat .git/HEAD
ref: refs/heads/villages

When you make a new commit, the value of HEAD does not change. The villages branch pointer moves, and since HEAD points to villages, it looks as if HEAD moved too.
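This can be mimicked with two small files. The Python sketch below uses a temporary directory in place of .git and a placeholder id for the new commit; resolve_head and record_commit are made-up names. It only models the case where HEAD points at a branch.

```python
import os
import tempfile

git_dir = tempfile.mkdtemp()  # stands in for .git
os.makedirs(os.path.join(git_dir, "refs", "heads"))

def write(path: str, text: str) -> None:
    with open(os.path.join(git_dir, path), "w") as fh:
        fh.write(text + "\n")

def read(path: str) -> str:
    with open(os.path.join(git_dir, path)) as fh:
        return fh.read().strip()

def resolve_head() -> str:
    """Return the commit HEAD points at, following the symbolic ref if present."""
    value = read("HEAD")
    if value.startswith("ref: "):
        return read(value[len("ref: "):])
    return value  # detached HEAD: the file holds a SHA1 directly

def record_commit(new_sha: str) -> None:
    """Move the current branch to new_sha; the HEAD file itself is left alone."""
    value = read("HEAD")
    write(value[len("ref: "):], new_sha)  # assumes HEAD points at a branch

write("refs/heads/villages", "1c57ddc8852ecfd621a35df5a93caf7c8f6987d6")
write("HEAD", "ref: refs/heads/villages")
record_commit("0123456789abcdef0123456789abcdef01234567")  # placeholder id of a new commit
print(read("HEAD"), resolve_head())
```

After the simulated commit, the HEAD file still reads ref: refs/heads/villages, while resolving it gives the new id.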

Garbage collection in git looks for objects that cannot be reached, directly or indirectly, from a branch or a tag. Such objects are garbage collected. Since an object is a file in .git/objects/, garbage collection means removing the files of those objects.
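The idea is a classic mark and sweep over the object graph. In the Python sketch below the node names are placeholders for SHA1 values and the graph is invented. Real git gc has further rules (the reflog and grace periods, for instance) that are ignored here.

```python
# Edges of the object graph: commit -> tree and parent, tree -> blobs and subtrees.
# The names are made-up placeholders standing in for SHA1 values.
edges = {
    "commit-2": ["tree-2", "commit-1"],
    "commit-1": ["tree-1"],
    "tree-2": ["blob-a", "blob-b"],
    "tree-1": ["blob-a"],
    "orphan-commit": ["orphan-tree"],  # e.g. left behind by a rebase
    "orphan-tree": ["blob-a", "blob-c"],
    "blob-a": [], "blob-b": [], "blob-c": [],
}
roots = {"refs/heads/master": "commit-2", "refs/tags/lw-1.1": "commit-1"}

def reachable(starts):
    """Mark phase: everything that can be reached from the given objects."""
    seen, stack = set(), list(starts)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges[node])
    return seen

def garbage():
    """Sweep phase: objects that no branch or tag can reach."""
    return set(edges) - reachable(roots.values())

print(sorted(garbage()))
```

Note that blob-a survives even though the orphaned tree references it, because a reachable tree references it too.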

Rebase

There is a twist here. Remember that -
Commits are database objects & database objects are immutable.
When I do a rebase, the parent of one of the commits has to change to a new commit. As the parent is part of the commit's content, changing it would change the commit's SHA1. But commits are immutable.

Therefore, when we do a rebase, git creates new commits that carry the same changes and messages as the old ones but point to new parents. The first copy points to the new base, and every later copy points to the copy before it, so each one gets a new SHA1. The branch pointer is then moved to the tip of the new chain. As the old commits become unreachable, they are eventually garbage collected, together with any trees and blobs that nothing else references.
Rebasing is an operation that creates new commits. Q.E.D.
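The chain reaction is easy to see with a toy commit id that depends only on the parent id and the message. This Python sketch is a deliberate simplification (real commit ids also cover the tree, author and timestamps), and the helper names and messages are invented.

```python
import hashlib

def commit_id(parent, message: str) -> str:
    """Toy commit id: a hash over the parent id and the message only."""
    return hashlib.sha1(f"parent {parent}\n\n{message}".encode()).hexdigest()[:7]

def build(base, messages):
    """Stack one commit per message on top of base; return the list of ids."""
    ids, parent = [], base
    for message in messages:
        parent = commit_id(parent, message)
        ids.append(parent)
    return ids

root = commit_id(None, "root")
master_tip = build(root, ["work on master"])[-1]
feature = ["feature step 1", "feature step 2"]

before = build(root, feature)       # feature branch forked from root
after = build(master_tip, feature)  # same messages replayed on top of master

print(before, after)
```

Every id in after differs from its counterpart in before, even for the second commit whose message is unchanged, because its parent is now a different commit.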
History, Branches, Merges, Rebases - that's pretty much a Version Control System.

Layer 4 - Distributed Version Control System

A few points to note here -

.git/refs/remotes/<remote>/ - after a fresh clone this typically contains only a reference to HEAD. As an optimization, the references to all other branches are kept in the .git/packed-refs file.

$ git show-ref master
lists every reference (local and remote) whose name ends in master as a whole path component, such as refs/heads/master and, for a remote named origin, refs/remotes/origin/master.
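Both points can be sketched in a few lines of Python. The packed-refs sample below is invented for illustration (including the remote name origin and the webmaster branch), and the matching rule is my reading of show-ref: patterns match from the end of the full name, on whole path components only.

```python
def parse_packed_refs(text: str) -> dict:
    """Return {refname: sha} from the contents of a packed-refs file."""
    refs = {}
    for line in text.splitlines():
        if not line or line.startswith("#") or line.startswith("^"):
            continue  # header comment, or the peeled target of an annotated tag
        sha, name = line.split(" ", 1)
        refs[name] = sha
    return refs

def show_ref(refs: dict, pattern: str) -> list:
    """Names matching pattern from the end, on whole path components only."""
    return sorted(n for n in refs if n == pattern or n.endswith("/" + pattern))

sample = """# pack-refs with: peeled fully-peeled sorted
1c57ddc8852ecfd621a35df5a93caf7c8f6987d6 refs/heads/master
1c57ddc8852ecfd621a35df5a93caf7c8f6987d6 refs/remotes/origin/master
2130ce8e0f697af309e47ab1f0dc916fece0eb9a refs/remotes/origin/webmaster
14498a628e939bda2ec6d53032f944a6889c0ecd refs/tags/1.0
^1c57ddc8852ecfd621a35df5a93caf7c8f6987d6
"""
refs = parse_packed_refs(sample)
print(show_ref(refs, "master"))
```

The webmaster branch is left out: it contains the text master but does not end in master as a full path component.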

That's it. All four layers of Git explained. This completes our Onion Model!