Sunday, 11 December 2016

GraphDB: Introduction

NoSQL databases fall into the following four categories.
  1. Graph - e.g. Neo4j
  2. Document Store - e.g. MongoDB, CouchDB
  3. Key Value - e.g. Redis (REmote DIctionary Server)
  4. Columnar - e.g. HBase, Cassandra
This series of posts will highlight the marquee features of Graph DB and help you digest them using Neo4j.

But before that let's have a high-level overview of Graph Space.

Graph Space

The graph space can be divided into two parts.
Graph Space Classification
1- Graph Databases:

A graph database is an online DBMS which exposes CRUD operations on an underlying graph data model, is accessed in realtime from an application and is tuned for ACID transactions.

The two key factors to consider while evaluating any GraphDB product are -

a) Underlying Storage

Graph DB uses some mechanism to persist graph data. This can be
* Native Graph Storage - this is optimized for storing graphs
* Others - this can be relational, object-oriented, or some other general purpose data store.

b) Processing Engine  

* Non-native Graph Processing
Relationships are first-class citizens in the graph data model, whereas in a non-native processing engine we have to infer a relationship, e.g. in an RDBMS a relationship is inferred using a combination of primary and foreign keys.

* Native Graph Processing
Here connected nodes physically point to each other in the database. This gives a significant performance advantage for traversals and is also known as index-free adjacency.

Note: Native graph storage and native graph processing are neither good nor bad — they come with their own trade-offs.
 
2- Graph Compute Engines (GCE):

GCE is for offline graph analytics performed as a series of batch steps. It executes global graph computational algorithms against (large) datasets. The information is fed to the GCE from a system of record (OLTP) database (e.g. PostgreSQL, Neo4j) by a periodic ETL job. The GCE then processes the information in batches (OLAP) and answers user queries, e.g. “What does a user usually purchase if s/he buys product X?”.
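
To make the batch idea concrete, here is a minimal sketch in Python. It does not show how any particular graph compute engine works; it only illustrates precomputing co-purchase counts offline and answering the query from the precomputed result. The ORDERS data and both function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical order data, as if exported by a periodic ETL job.
ORDERS = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "mouse"},
]

def build_co_purchase_index(orders):
    """Batch step: count how often each ordered pair of products shares a basket."""
    index = defaultdict(Counter)
    for basket in orders:
        for bought, other in permutations(basket, 2):
            index[bought][other] += 1
    return index

def usually_bought_with(index, product, top_n=1):
    """Query step: the most frequent companions of `product`."""
    return [name for name, _ in index[product].most_common(top_n)]

index = build_co_purchase_index(ORDERS)
print(usually_bought_with(index, "laptop"))
```

The expensive part (building the index) runs in the batch step, so the user-facing query is only a lookup.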

High Level Overview of GCE (courtesy: Graph Databases, published by O'Reilly)

Why Graph Databases?

To replace a well-established well-understood data platform with Graph DB, we need some compelling reasons. Here I give you a few -

1] Performance

Graph DB typically provides better query performance for connected data than an RDBMS or other NoSQL databases.

RDBMS are join intensive. Consider a typical social network analysis query like "who is a friend of a friend of a friend of Amit and also a friend of Ajit?". Such a query is hard to write in SQL, and its performance deteriorates with every additional join.

Graph DB performance remains relatively constant because queries are localized to a graph portion.  The execution time is proportional to the limited part of the graph traversed to satisfy the query rather than the entire graph.
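
A minimal sketch of that localized traversal, in Python. It assumes a hypothetical in-memory adjacency map rather than a real graph database, and it reads "friend of friend of friend" as "exactly three hops away". The query touches only the nodes near Amit; the unrelated node is never visited.

```python
# Hypothetical friendship graph stored as adjacency sets.
FRIENDS = {
    "Amit": {"Bina"},
    "Bina": {"Amit", "Chetan"},
    "Chetan": {"Bina", "Deepa"},
    "Deepa": {"Chetan", "Ajit"},
    "Ajit": {"Deepa"},
    "Zoya": set(),  # unrelated node, never visited by the query below
}

def friends_at_depth(graph, start, depth):
    """Return the people exactly `depth` hops from `start`, plus all nodes touched."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        # Follow each person's direct links; drop anyone already reached.
        frontier = {n for person in frontier for n in graph[person]} - seen
        seen |= frontier
    return frontier, seen

third_degree, touched = friends_at_depth(FRIENDS, "Amit", 3)
answer = {p for p in third_degree if "Ajit" in FRIENDS[p]}
print(answer)
```

The work done depends on the size of the neighbourhood walked, not on the total number of entries in FRIENDS.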

2] Flexibility

As developers we want to connect data as the domain dictates. This allows  structure and schema to emerge in tandem with our growing understanding of the problem space.

Graph DBs address this need directly. We can add new relationships, nodes, labels, and subgraphs to an existing structure without disturbing existing queries and application functionality.

The additive nature of graphs also means we tend to perform fewer migrations, thereby reducing maintenance overhead and risk.

3] Agility

Graph databases offer an extremely flexible data model, and a mode of delivery aligned with today’s agile software delivery practices.

The schema-free nature of the graph data model, coupled with the testable API and query language of a Graph DB, empowers us to evolve an application in a controlled and agile manner.

Before closing this post let's describe a Graph in a Graph database.

A Graph in Graph DB is -
  • Set of vertices and edges.
  • Vertices represent nodes and edges represent relationships between them
  • Each node has a label. A label defines its type. e.g. User, tweet
  • A node can have more than one label.
  • Each relationship is directional and is tagged by a label. e.g. follows, tweets
  • A relationship always has a start and an end node. 
  • Each node and relation can hold a document store i.e. properties/key-value pairs.
Sample Graph Representation
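
The list above can be sketched as plain Python data structures. This is only an illustration of the model, not Neo4j's API; the class names, the sample nodes and the helper function are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                      # a node may carry more than one label
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    label: str                       # e.g. "follows", "tweets"
    start: Node                      # a relationship always has a start node
    end: Node                        # and an end node, so it is directional
    properties: dict = field(default_factory=dict)

alice = Node({"User"}, {"name": "alice"})
bob = Node({"User", "Admin"}, {"name": "bob"})
tweet = Node({"Tweet"}, {"text": "hello graph"})

edges = [
    Relationship("follows", alice, bob, {"since": 2016}),
    Relationship("tweets", bob, tweet),
]

def outgoing(node, label):
    """Nodes reached from `node` over relationships carrying `label`."""
    return [e.end for e in edges if e.start is node and e.label == label]

print(outgoing(alice, "follows"))
```

Because the relationship is directional, alice follows bob here but bob does not follow alice.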

Sunday, 27 November 2016

Django Channels - Realtime Web Apps

Since its birth, Django has supported the request-response style of communication over the network. With the advent of WebSockets and the need for realtime web applications, Channels was introduced for
  • avoiding long pollings/auto refreshes, 
  • implementing asynchronous tasks (e.g. image thumbnailing, sending email) and
  • pushing changes to Web UIs.
Click Here to access documentation.

This post isn't going to be too verbose; rather, I'll quickly take you through the steps required to get your first Django application with Channels running.

Aim: Client opens a websocket, sends a string message. It gets length of string as response.

Solution Steps:

1] Activate a virtual env (click here to know HowTo)

2] On  my Linux system, I was getting error for Python.h, which I resolved referring to this StackOverflow thread.

sudo apt-get  update; sudo apt-get install  python-dev -y

3] Install django, djangorestframework (optional) and channels

pip install django
pip install djangorestframework
pip install channels

This will also download the dependencies viz. asgiref, daphne, twisted.

4] Create the Django project tutorial and the chatbox app inside it.

django-admin.py startproject tutorial
cd tutorial
django-admin.py startapp chatbox


5] Add routing.py and consumers.py in chatbox app

consumers.py - It will contain methods which will consume message placed on a channel.

def ws_echo(message):
    message.reply_channel.send({
        'text': str(len(message.content['text'])),
    })

A consumer function takes a message as input. The two important attributes of message to note here are -
  • message.content : holds a dictionary as message content
  • message.reply_channel : to send a response to the sender websocket
A message always has a reply_channel attached. The consumer function takes the message and responds with a dictionary whose key text holds the length of the text sent by the sender websocket.

IMP: The value for the key text should be a str (or something which supports encode), else the server throws the following exception -

ERROR - server - HTTP/WS send decode error: 'int' object has no attribute 'encode'
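
One way to check the consumer without starting a server is to call it with stand-in objects. The sketch below is an assumption-laden illustration: FakeMessage and FakeReplyChannel are hypothetical classes that only mimic the two attributes used above; they are not part of Channels.

```python
class FakeReplyChannel:
    """Hypothetical stand-in that records what the consumer sends."""
    def __init__(self):
        self.sent = []

    def send(self, payload):
        self.sent.append(payload)

class FakeMessage:
    """Hypothetical stand-in exposing the two attributes the consumer uses."""
    def __init__(self, text):
        self.content = {'text': text}
        self.reply_channel = FakeReplyChannel()

def ws_echo(message):
    # Same consumer as in consumers.py: reply with the length, converted to str.
    message.reply_channel.send({
        'text': str(len(message.content['text'])),
    })

message = FakeMessage("Hello World")
ws_echo(message)
print(message.reply_channel.sent)
```

The recorded payload carries the length as a str, which is what avoids the encode error shown above.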

routing.py - Mapping between consumer method and channel is defined in routing.py

from channels.routing import route

channel_routing = [
    route('websocket.receive', 'chatbox.consumers.ws_echo'),
]

chatbox.consumers.ws_echo - chatbox is the Django app, consumers is the module inside it and ws_echo is the function defined inside the module. This consumer will listen on the channel websocket.receive

6] Specify INSTALLED_APPS in tutorial/settings.py

INSTALLED_APPS = [
    'django.contrib.admin',
    .........
    .........
    'channels',
    'chatbox',
]

7] Add the Channels backend in tutorial/settings.py

CHANNEL_LAYERS = {
    'default': {
        'BACKEND': 'asgiref.inmemory.ChannelLayer',
        'ROUTING': 'chatbox.routing.channel_routing',
    },
}

BACKEND - Specifies that the channel layer is in-memory (later we will use Redis).
ROUTING - Points to the channel_routing list in routing.py of the chatbox Django app.

Note: You may create a routing.py in each app, merge them all in the routing.py of tutorial and pass that in ROUTING, the same way we do for urls.py.

8] Start Django Webserver

$ python manage.py runserver 0.0.0.0:8000
Performing system checks...

System check identified no issues (0 silenced).
November 27, 2016 - 06:20:33
Django version 1.10.3, using settings 'tutorial.settings'
Starting Channels development server at http://0.0.0.0:8000/
Channel layer default (asgiref.inmemory.ChannelLayer)
Quit the server with CONTROL-C.
2016-11-27 06:20:33,190 - INFO - worker - Listening on channels http.request, websocket.connect, websocket.receive
2016-11-27 06:20:33,191 - INFO - worker - Listening on channels http.request, websocket.connect, websocket.receive
2016-11-27 06:20:33,191 - INFO - worker - Listening on channels http.request, websocket.connect, websocket.receive
2016-11-27 06:20:33,192 - INFO - worker - Listening on channels http.request, websocket.connect, websocket.receive
2016-11-27 06:20:33,194 - INFO - server - Using busy-loop synchronous mode on channel layer

9] From Console of chrome debugger, send a message as follows.

ws = new WebSocket('ws://localhost:8000/')

ws.onmessage = function(message) {
  alert(message.data);
}

ws.send("Hello World")

Once you create the WebSocket object, the server log will show the following entry.

[2016/11/27 06:28:30] WebSocket CONNECT / [127.0.0.1:52394]

We have attached an onmessage handler to the websocket. It will be invoked each time a message is received from the server.

After sending a message via the websocket, you will get a response. This will invoke the onmessage handler and you will see an alert showing 11.

Question: How did the message get delivered to websocket.receive? I have nowhere specified a channel in my client socket.

Answer: There are channels which are already available for us. For example –
  • http.request channel can be listened on if we want to handle incoming http messages
  • websocket.receive can be used to process incoming websocket messages.
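
The idea can be sketched with a toy model in Python. This is a deliberately simplified picture, not the Channels implementation: the server side places each incoming frame on a predefined channel name, and a worker hands it to whichever consumer the routing table lists for that name. All names below are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "channel layer": channel name -> queue of messages.
channels = {'websocket.receive': deque(), 'http.request': deque()}

# Routing table in the spirit of channel_routing: channel name -> consumer.
handled = []
routing = {
    'websocket.receive': lambda content: handled.append(len(content['text'])),
}

def interface_server_receives(frame_text):
    """The server side picks the channel; the client never names it."""
    channels['websocket.receive'].append({'text': frame_text})

def worker_run_once():
    """A worker drains the channels it has consumers for."""
    for name, consumer in routing.items():
        while channels[name]:
            consumer(channels[name].popleft())

interface_server_receives("Hello World")
worker_run_once()
print(handled)
```

The client only opens a socket and sends text; the channel name is decided entirely on the server side.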

Reference:
Chatting in Realtime with WebSockets and Django Channels.

Sunday, 28 August 2016

VS Code Setup - TypeScript, Angular 1.x

This is going to be a quick take on TypeScript.
TypeScript Introduction
  • Open source project by Microsoft
  • Follows ES6 syntax, even ES7 if possible
  • Adds typing to vars (which has its own advantages)
  • The compiler for TS is tsc (the TypeScript compiler). It is a transpiler: it transpiles TS code into JS.
Environment Setup
Package Installation

sudo apt-get install npm
sudo apt-get install nodejs-legacy
sudo npm install -g typescript

VS Code Installation

Click here to download & install Visual Studio Code

Preparing VS Code for TS

Follow the link to prepare VS Code for TS. This involves -
defining tsconfig.json
creating tasks.json

Note: If Ctrl+Shift+B doesn't invoke the task, check whether tsc works from the command line.

How did I set it up?

Created project_dir/src/student.ts with the following code.

module main{

    class Subject{

        name:string;
        marks:number;

        constructor(name:string){
            this.name=name;
        }

        print():string{
            return this.name;
        }
    }

    class Student{

        name:string;
        subjects:Array<Subject> = [];

        constructor(name:string){
            this.name=name;
        }

        print():string{
            return this.name        
        }

        getSubjects():Array<Subject>{
            return this.subjects;
        }

        addSubject(name:string):void{
            this.subjects.push(new Subject(name));
        }
    }

    let stud1:Student = new Student("John");
    console.log(stud1.print())
}

Press Ctrl+Shift+B. This will make a strip appear on top of editor area. Click on Configure Task Runner.

Select TypeScript - tsconfig.json

This will add tasks.json.

Create src/tsconfig.json (I altered it a bit. See github repo)

{
    "compilerOptions": {
        "target": "es5",
        "module": "commonjs",
        "sourceMap": true
    }
}

Go to View in the menu bar and click on Output. Then press Ctrl+Shift+B. I got the following error.

error TS5057: Cannot find a tsconfig.json file at the specified directory: '.'

Go to tasks.json and update the args line to the following.

"args": ["-p", "./src"],

Then press ctrl+shift+B. You will find student.js and student.js.map files created
 

Debugging a TS file

Place a debug point in app.ts file, press F5 and select environment Node.js . This will create  launch.json file. Edit the content to make it similar to the following.

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Launch",
            "type": "node",
            "request": "launch",
            "program": "${workspaceRoot}/src/app/student.ts",
            "stopOnEntry": false,
            "args": [],
            "cwd": "${workspaceRoot}",
            "preLaunchTask": null,            
            "runtimeExecutable": null,
            "runtimeArgs": [
                "--nolazy"
            ],
            "env": {
                "NODE_ENV": "development"
            },
            "externalConsole": false,
            "sourceMaps": true,
            "outDir": "${workspaceRoot}/src/app"
        },
        {
            "name": "Attach",
            "type": "node",
            "request": "attach",
            "port": 5858,
            "address": "localhost",
            "restart": false,
            "sourceMaps": false,
            "outDir": null,
            "localRoot": "${workspaceRoot}/src/app",
            "remoteRoot": null
        }
    ]
}

Save it and press F5. The debugger will start.

Configuring local server

To serve AngularJS code, I need *.html files to be served from a local server. Click here to view the reference thread.

$ sudo npm install -g live-server

From VSCode, do a right click on index.html, open in console and execute

$ live-server

Downloading TSD files for Angular

$ typings search --name angular
$ typings install  dt~angular --save --global

When I searched for angular, it showed SOURCE=dt, therefore dt~ is prepended to angular while installing.

Also install jquery.
$ typings search --name jquery
$ typings install  dt~jquery --save --global

This creates an index.d.ts file containing references to all the TypeScript definition files. Rename it to tsd.d.ts and add its reference in app.ts

/// <reference path="../../typings/tsd.d.ts" />

Add an empty jsconfig.json file at the same level as typings.json

Finally  I developed first demo app and pushed to github repo.

Sunday, 27 March 2016

Git - The Onion Model

Disclaimer: This post doesn't teach you git basics. This will help you understand how git works.

In this article we will understand Git's Onion Model - the layers that together make it a powerful Distributed Version/Revision Control System.

The four layers are -
4. Distributed Version Control System - Supports push, fetch, pull etc. operations
3. Version Control System - History, branches, merges, rebase etc.
2. Simple Content Tracker - Commits, versions (labels/tags)
1. Map -  key mapped to value persisted on disk




Layer 1 - Map
At its core, git is just a map persisted on disk. This map is a table of keys and values. It is also called the object database.

The key is SHA1 hash. The value can be of type
  • blob
  • tree
  • commit
  • tag
You can give any value to git and it will calculate its SHA1 hash.

E.g. using the plumbing command hash-object

$ echo "sibtain" | git hash-object --stdin 
64219700ea1c10634d4141fcd1c3f01163cb03d1
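
For a blob, the key can be reproduced outside git. The sketch below assumes a repository using SHA1 object names: git hashes a short header ("blob", a space, the content size in bytes, a NUL byte) followed by the content itself. The function name is my own.

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """SHA1 of a blob as git computes it: 'blob <size>' + NUL byte + content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# echo appends a newline, so the content being hashed is b"sibtain\n".
print(git_blob_sha1(b"sibtain\n"))
```

Because the header is part of the input, the result differs from a plain SHA1 of the file content.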

To persist this value use the -w flag of hash-object.
Note: you need to execute the following command inside a git repository (use $ git init to initialize a new repository).

$ echo "sibtain" | git hash-object --stdin -w
64219700ea1c10634d4141fcd1c3f01163cb03d1

To find where/how it is persisted in object store,

$ ls -l .git/objects/
total 12
drwxr-xr-x 2 sibtain sibtain 4096 Mar 27 11:34 64
drwxr-xr-x 2 sibtain sibtain 4096 Mar 27 11:34 info
drwxr-xr-x 2 sibtain sibtain 4096 Mar 27 11:34 pack

You see a directory 64. These are the first two characters of the SHA1. Inside that directory you will find a file named after the remaining part of the SHA1. This is a binary (zlib-compressed) file.

$ ls -l .git/objects/64
total 4
-r--r--r-- 1 sibtain sibtain 23 Mar 27 11:34 219700ea1c10634d4141fcd1c3f01163cb03d1


To get content I'll use another plumbing command cat-file -

$ git cat-file -p 64219700ea1c10634d4141fcd1c3f01163cb03d1
sibtain

To get type of the object
$ git cat-file -t 64219700ea1c10634d4141fcd1c3f01163cb03d1
blob
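
What the two plumbing commands do for a loose object can be approximated in a few lines of Python. This is a sketch under assumptions: it handles only loose (unpacked) objects, uses a temporary directory as a stand-in for .git/objects, and the function names are hypothetical.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(objects_dir, obj_type, data):
    """Store `data` in the loose-object layout: <first 2 hex>/<remaining 38 hex>."""
    raw = obj_type.encode() + b" " + str(len(data)).encode() + b"\0" + data
    sha = hashlib.sha1(raw).hexdigest()
    folder = os.path.join(objects_dir, sha[:2])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, sha[2:]), "wb") as fh:
        fh.write(zlib.compress(raw))      # loose objects are zlib-compressed
    return sha

def read_loose_object(objects_dir, sha):
    """Return (type, content), roughly what cat-file -t and -p report."""
    with open(os.path.join(objects_dir, sha[:2], sha[2:]), "rb") as fh:
        raw = zlib.decompress(fh.read())
    header, _, content = raw.partition(b"\0")
    obj_type, _size = header.decode().split(" ")
    return obj_type, content

objects_dir = tempfile.mkdtemp()          # stand-in for .git/objects
sha = write_loose_object(objects_dir, "blob", b"sibtain\n")
print(read_loose_object(objects_dir, sha))
```

The type is stored inside the object itself, which is why cat-file -t can answer without any external index.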

This explains the innermost layer of git - the map of key and value pairs and how it is persisted.

The directory structure of a nearly empty git repo is as follows.

$ tree -a
.
`-- .git
    |-- branches
    |-- config
    |-- description
    |-- HEAD
    |-- hooks
    |   |-- applypatch-msg.sample
    |   |-- commit-msg.sample
    |   |-- post-update.sample
    |   |-- pre-applypatch.sample
    |   |-- pre-commit.sample
    |   |-- prepare-commit-msg.sample
    |   |-- pre-rebase.sample
    |   `-- update.sample
    |-- info
    |   `-- exclude
    |-- objects
    |   |-- 64
    |   |   `-- 219700ea1c10634d4141fcd1c3f01163cb03d1
    |   |-- info
    |   `-- pack
    `-- refs
        |-- heads
        `-- tags
11 directories, 13 files

Layer 2 - Simple Content Tracker

A content tracking system needs a provision for maintaining versions and commit checkpoints.

Here we will explore where/how a commit is persisted.

The following directory structure is committed.

$ tree
.
|-- city.lst
`-- city_profile
    |-- mumbai.txt
    `-- pune.txt
 
$ git log
commit 2130ce8e0f697af309e47ab1f0dc916fece0eb9a
Author: sibtain <sibtain@sibtain-linuxmint.(none)>
Date:   Sun Mar 27 17:09:25 2016 +0530

    Adds city details

commit 8408f82db302b32f02510a7afd1749210a3ab9bc
Author: sibtain <sibtain@sibtain-linuxmint.(none)>
Date:   Sun Mar 27 17:09:07 2016 +0530

    Adds City list

Let's focus on commit 2130ce8. Check what's in object database.

$ ls .git/objects
10  21  5e  64  84  8f  a6  ab  info  pack
 
$ ls -l .git/objects/21/
total 4
-r--r--r-- 1 sibtain sibtain 166 Mar 27 17:09 30ce8e0f697af309e47ab1f0dc916fece0eb9a

What is the type of this object?

$ git cat-file -t 2130ce8e0f697af309e47ab1f0dc916fece0eb9a
commit

OK. So it is a commit object. What does it contain?

$ git cat-file -p 2130ce8e0f697af309e47ab1f0dc916fece0eb9a
tree 8f7b3eb4e75d78e50dd9d37a8464c3855c1c190e
parent 8408f82db302b32f02510a7afd1749210a3ab9bc
author sibtain <sibtain@sibtain-linuxmint.(none)> 1459078765 +0530
committer sibtain <sibtain@sibtain-linuxmint.(none)> 1459078765 +0530

Adds city details

Therefore, a commit is a simple piece of text generated and stored by git as an object in the object database. It has a message, committer/author details with timestamps, and tree and parent references holding SHA1 values.

Parent points to the previous commit. In the case of a merge commit (e.g. a 3-way merge) there will be 2 parent entries.
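
Since the commit body is plain text, it can be picked apart with ordinary string handling. The sketch below parses the output shown above; the function name is hypothetical, and it keeps parents in a list so that a merge commit with several parent lines is handled as well.

```python
COMMIT_TEXT = """tree 8f7b3eb4e75d78e50dd9d37a8464c3855c1c190e
parent 8408f82db302b32f02510a7afd1749210a3ab9bc
author sibtain <sibtain@sibtain-linuxmint.(none)> 1459078765 +0530
committer sibtain <sibtain@sibtain-linuxmint.(none)> 1459078765 +0530

Adds city details
"""

def parse_commit(text):
    """Split a commit body into header fields and the message."""
    head, _, message = text.partition("\n\n")
    fields = {"parent": []}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)   # merge commits list several parents
        else:
            fields[key] = value
    return fields, message.strip()

fields, message = parse_commit(COMMIT_TEXT)
print(fields["tree"], fields["parent"], message)
```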

It also has a pointer to a tree. Let's explore that.

$ git cat-file -t 8f7b3eb4e75d78e50dd9d37a8464c3855c1c190e
tree
$ git cat-file -p 8f7b3eb4e75d78e50dd9d37a8464c3855c1c190e
100644 blob 8f4272c240a23d814ee963abcccf9f871aae9be8    city.lst
040000 tree a6ec82fc89c19390894fd7685d32b5124bb24516    city_profile

The tree object holds 2 references: one for a blob (city.lst) and another for a tree (city_profile). The initial numbers are the file mode (type and permissions) of each entry in octal. File names and modes are not stored in blobs; they are stored in the tree. A blob is just the file content.

$ git cat-file -p 8f4272c240a23d814ee963abcccf9f871aae9be8
Mumbai
Pune

$ git cat-file -p a6ec82fc89c19390894fd7685d32b5124bb24516
100644 blob 1013a5511947260b727bd9f79946517121c682ef    mumbai.txt
100644 blob ab4f45300c9270dbb2ba92bc06c0a670271b8f33    pune.txt
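
Underneath the pretty-printed listing, a tree body is a sequence of binary entries: the mode as octal digits, a space, the name, a NUL byte, then the 20 raw bytes of the SHA1. The Python sketch below builds such a body from the two entries shown earlier and parses it back; the helper names are hypothetical and no real repository is read.

```python
import binascii

def parse_tree(raw: bytes):
    """Yield (mode, name, sha_hex) from the raw body of a tree object."""
    pos = 0
    while pos < len(raw):
        space = raw.index(b" ", pos)
        nul = raw.index(b"\0", space)
        mode = raw[pos:space].decode()
        name = raw[space + 1:nul].decode()
        sha_hex = binascii.hexlify(raw[nul + 1:nul + 21]).decode()
        yield mode, name, sha_hex
        pos = nul + 21                     # skip past the 20 raw SHA1 bytes

def entry(mode, name, sha_hex):
    """Build one raw entry: octal mode, space, name, NUL byte, 20 raw SHA1 bytes."""
    return mode.encode() + b" " + name.encode() + b"\0" + binascii.unhexlify(sha_hex)

raw_tree = (
    entry("100644", "city.lst", "8f4272c240a23d814ee963abcccf9f871aae9be8")
    + entry("40000", "city_profile", "a6ec82fc89c19390894fd7685d32b5124bb24516")
)
for mode, name, sha_hex in parse_tree(raw_tree):
    kind = "tree" if mode == "40000" else "blob"
    print(mode.zfill(6), kind, sha_hex, name)
```

The directory mode is stored as 40000; cat-file -p pads it to six digits for display, which the zfill call imitates.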

Note: You can also use just first few digits of SHA1 in any of the commands.

I'll add a new name to city.lst and then commit changes.

$ git log --oneline
1c57ddc Adds a city to list
2130ce8 Adds city details
8408f82 Adds City list

$ git cat-file -p 1c57ddc
tree 949bd0423891cee02f38c75ac8ec8623ea3f59ff
parent 2130ce8e0f697af309e47ab1f0dc916fece0eb9a
author sibtain <sibtain.masih@gmail.com> 1459079797 +0530
committer sibtain <sibtain.masih@gmail.com> 1459079797 +0530

Adds a city to list

$ git cat-file -p 949bd0
100644 blob 7dc571a82b903bbe28a391600ad9b2a68f752f62    city.lst
040000 tree a6ec82fc89c19390894fd7685d32b5124bb24516    city_profile

Observe that the SHA1 of city_profile has not changed. So this commit points to the same object in the database for city_profile as the previous commit. Only for city.lst is a new blob object created and referenced.



To find how many objects are persisted in object database - 

$ git count-objects
12 objects, 48 kilobytes

The count of 12 comes from the following breakdown.
1 - blob object for demo of hash-objects text - "sibtain"
3 - commit objects
3 - tree objects as commit trees
2 - blob objects for city.lst
1 - tree object for directory city_profile
2 - blob objects for 2 files inside city_profile directory

We have discussed commits so far. Another feature of a Simple Content Tracker is versioning via tags or labels. A tag is a label for the current state of the project. Git supports two types of tags viz.
  1. Lightweight 
  2. Annotated

Lightweight Tags

A lightweight tag just contains a SHA1 value as reference to a commit.

$ git tag lw-1.1

$ ls .git/refs/tags/
lw-1.1


$ cat .git/refs/tags/lw-1.1
1c57ddc8852ecfd621a35df5a93caf7c8f6987d6


$ git cat-file -t 1c57ddc
commit


Annotated Tags

Annotated tag comes with a message and creates an object in git's object db.

$ git tag -a 1.0 -m "Stable 1.0 version"

You  will find an entry for this tag in .git/refs/tags

$ ls -l .git/refs/tags/
total 4
-rw-r--r-- 1 sibtain sibtain 41 Mar 27 20:21 1.0

It contains a reference to a tag object in git's object database.

$ cat .git/refs/tags/1.0
14498a628e939bda2ec6d53032f944a6889c0ecd

The object starts with 14,

$ ls .git/objects/14
498a628e939bda2ec6d53032f944a6889c0ecd

What is the type of this object and what does it contain?

$ git cat-file -t 14498a628e939bda2ec6d53032f944a6889c0ecd
tag

$ git cat-file -p 14498a628e939bda2ec6d53032f944a6889c0ecd
object 1c57ddc8852ecfd621a35df5a93caf7c8f6987d6
type commit
tag 1.0
tagger sibtain <sibtain.masih@gmail.com> Sun Mar 27 20:21:00 2016 +0530

Stable 1.0 version

It contains pointer to a commit object, tag name, tagger details with timestamp and message.

Another way to retrieve same information is by using the tag directly.

$ git cat-file -t 1.0
tag

$ git cat-file -p 1.0
object 1c57ddc8852ecfd621a35df5a93caf7c8f6987d6
type commit
tag 1.0
tagger sibtain <sibtain.masih@gmail.com> Sun Mar 27 20:21:00 2016 +0530

Stable 1.0 version

While branches move, tags don't. They stay with same object forever.

Just to revise, the four types of objects that git's object database can store are -
  • Blobs
  • Trees
  • Commits
  • Annotated Tags
You can think of git as - a high level file system built on top of a native file system.

Layer 3 - Version Control System

A version control system is just a single repository. It has history, branches, merges and tags.

History

References between commits are used to track history. All other references viz. commit to a tree, tree to another tree and tree to blob are used to track content of each commit.

Branches

Creating a branch creates a file in the .git/refs/heads directory. The file has the same name as the branch and contains a SHA1 value as a reference to the commit to which the branch points.

$ git checkout -b villages

$ ls .git/refs/heads/
master 
villages

$ cat .git/refs/heads/villages
1c57ddc8852ecfd621a35df5a93caf7c8f6987d6

$ git cat-file -t 1c57dd
commit

How does git find the current branch?

HEAD contains a reference to a file in .git/refs/heads, and that file is the current branch.

$ cat .git/HEAD
ref: refs/heads/villages

When you make a new commit, the value of HEAD does not change. The villages branch pointer moves and, as HEAD points to villages, it looks as if HEAD has moved too.
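
The two-step lookup can be sketched in Python. This is a simplified illustration that builds a throwaway directory mimicking the files shown above; it ignores packed-refs, and the function name is hypothetical.

```python
import os
import tempfile

def resolve_head(git_dir):
    """Follow HEAD to the branch file it names and return (ref, sha)."""
    with open(os.path.join(git_dir, "HEAD")) as fh:
        head = fh.read().strip()
    if not head.startswith("ref: "):
        return None, head                  # detached HEAD holds a SHA1 directly
    ref = head[len("ref: "):]
    with open(os.path.join(git_dir, ref)) as fh:
        return ref, fh.read().strip()

# Build a throwaway layout that mimics the files shown above.
git_dir = tempfile.mkdtemp()
branch_file = os.path.join(git_dir, "refs", "heads", "villages")
os.makedirs(os.path.dirname(branch_file))
with open(os.path.join(git_dir, "HEAD"), "w") as fh:
    fh.write("ref: refs/heads/villages\n")
with open(branch_file, "w") as fh:
    fh.write("1c57ddc8852ecfd621a35df5a93caf7c8f6987d6\n")

print(resolve_head(git_dir))
```

A new commit would rewrite only the villages file; the HEAD file keeps saying "ref: refs/heads/villages".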

Garbage collection in git looks for objects that cannot ultimately be reached from a branch or a tag. Such objects are garbage collected. As an object is a file in .git/objects/, garbage collection means removing the files of those objects.
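
Reachability is an ordinary graph walk. The sketch below uses a hypothetical, hand-written object database with made-up ids, and it looks only at branches and tags as described above; real git also honours other roots and grace periods, which are left out here.

```python
def reachable(objects, roots):
    """Walk references from the root ids and collect every object reached."""
    seen = set()
    stack = list(roots)
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue
        seen.add(sha)
        stack.extend(objects[sha])
    return seen

# Hypothetical object database: id -> ids it references.
OBJECTS = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blobA", "blobB"],
    "tree1": ["blobA"],
    "blobA": [],
    "blobB": [],
    "orphan_commit": ["orphan_tree"],
    "orphan_tree": ["blobA"],
}
REFS = {"refs/heads/master": "commit2", "refs/tags/lw-1.1": "commit1"}

garbage = set(OBJECTS) - reachable(OBJECTS, REFS.values())
print(sorted(garbage))
```

Note that blobA survives even though an orphaned tree references it, because a reachable tree references it too.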

Rebase

Click Here to learn how to do rebasing in git.

There is a twist here. Remember that commits are database objects, and database objects are immutable.
When I do a rebase, the parent of one of the commits is set to a new commit. As the parent value changes, the commit gets a new SHA1. But commits are immutable.

Therefore, when we do a rebase, new commits are created as copies of the old ones. The first copy points to the new parent commit, and every commit after it also gets a new SHA1 because the SHA1 of its own parent has changed. The branch pointer is moved to the tip of the new commit chain. As the old commits become unreachable, they are garbage collected, including their trees and blobs (if any are no longer referenced).
Rebasing is an operation that creates new commits. Q.E.D.
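
The argument can be checked with a small Python sketch. It hashes a simplified commit body (the author and committer lines are left out, so the ids will not match the real ones above); the point is only that the parent is part of the hashed text. The function name is hypothetical.

```python
import hashlib

def commit_id(tree, parent, message):
    """Hash a simplified commit body; the parent is part of what gets hashed."""
    body = f"tree {tree}\nparent {parent}\n\n{message}\n".encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

tree = "949bd0423891cee02f38c75ac8ec8623ea3f59ff"
old = commit_id(tree, "2130ce8e0f697af309e47ab1f0dc916fece0eb9a", "Adds a city to list")
new = commit_id(tree, "8408f82db302b32f02510a7afd1749210a3ab9bc", "Adds a city to list")
print(old != new)
```

Same tree, same message, different parent: the key changes, so the "moved" commit has to be a new object.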
History, Branches, Merges, Rebases - that's pretty much a Version Control System.

Layer 4 - Distributed Version Control System

To learn how to work with Git as D-VCS, refer to my posts -

Few points to note here -

.git/refs/remotes/origin/ – after a clone this contains only the reference to HEAD. As an optimization, the references to all other remote branches are kept in the .git/packed-refs file.

$ git show-ref master
will show references of all branches (local+remote) having master in their name.

That's it.. All four layers of the Git explained - This completes our Onion Model !!

Saturday, 23 January 2016

Git - Working with Remotes (2)

In this post we will see how we can follow a git workflow and collaborate on project development.

Aim
Build a website for Imperial College of Engineering (ICE). (from 3 Idiots ;)

Team
unckle-bob [role = project owner, maintainer]
sibtainmasih [role = contributor]
+ many others (including you :)

Steps

1. unckle-bob creates a new repository on github.

Repository name = ice-website
Description = Dummy website project for Imperial College of Engineering from 3 Idiots
Select initialize this repository with README


2. sibtainmasih forks this repository.

Search for ice-website, go to the repository's page and click on Fork. He now has a fork under his name.

3. sibtainmasih clones his repository

$ git clone https://github.com/sibtainmasih/ice-website.git
Cloning into 'ice-website'...
remote: Counting objects: 4, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), done.
Checking connectivity... done.


Then he sets config for user.name and user.email

4. sibtainmasih prepares skeleton file structure, commits changes and pushes them to his repo.

$ git log --oneline --decorate --all
77bfa0c (HEAD -> master) Add home file
59c9189 Add index file
a9780bf (origin/master, origin/HEAD) Initial commit

$ git push
Counting objects: 6, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (6/6), 647 bytes | 0 bytes/s, done.
Total 6 (delta 1), reused 0 (delta 0)
To https://github.com/sibtainmasih/ice-website.git
   a9780bf..77bfa0c  master -> master



5. unckle-bob takes responsibility of adding courses.html page.

He clones his repository, commits a courses.html file and pushes to unckle-bob/ice-website.

sibtainmasih will not see these changes in his forked repository.

6. Meanwhile sibtainmasih adds few more commits and then creates a Pull Request (PR).

He clicks on Create pull request. It takes him to the Comparing changes page, which warns that the branches can't be merged automatically due to conflicts.


He continues, clicks on Create Pull Request, provides a title and details, and completes it. It ends in a warning that the branch has conflicts which must be resolved by someone who has write access, i.e. unckle-bob.

Hold On! As a contributor it is your responsibility to resolve the conflict(s) before making a pull request. Therefore sibtainmasih does the following.

A) Configure a remote upstream to project's central repository -

$ git remote add upstream https://github.com/unckle-bob/ice-website.git

B) Fetch upstream -

$ git fetch upstream
remote: Counting objects: 12, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 12 (delta 4), reused 12 (delta 4), pack-reused 0
Unpacking objects: 100% (12/12), done.
From https://github.com/unckle-bob/ice-website
 * [new branch]      master     -> upstream/master


C) Check commit history -








D) Merge local master with upstream/master -

$ git merge upstream/master
Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.


Check which file(s) have conflicts,

$ git status -s
UU README.md
A  courses.html
A  m_c_a.html


Resolve the conflict and run $ git commit

Finally see merged commit history






E) Push changes

$ git push


F) And then create a PR. This will now show a green "Able to merge" text.


7. unckle-bob reviews the PR and merges in main project repository.

This will show that there is no conflict and the pull request can be merged.

unckle-bob clicks on Merge pull request & the history is properly interlaced.


Remember - There is no auto-sync. You need to set up upstream and merge/resolve conflicts to make your PRs readily mergeable.

That's it ! This is how git helps to do development in collaboration with other team members.

Git - Working with Remotes (1)

Create a repository for your project named "demo-project" on github. Google will help if you don't know.

What is a Remote Repository?
Remote repository of a project is a git repository hosted on a network or Internet. E.g. github repository named "demo-project" which we just created. These repositories are created to enable collaboration among team members. Everyone will PUSH their code (after resolving conflicts if any) to the remote repository and others will get latest copy of project code with a PULL from remote repository. 

There can be two kinds of remote repositories for a project.
1. Central Remote Repo [Only 1]
2. Forked Remote Repo [1 per team member]

The central remote repo (CRR) of a project will be a read-only repository. All the team members will fork their own remote repos from CRR and will push their changes to forked remote repos (FRR). Once they are ready to merge code in CRR, they need to create a Pull Request (PR) from branch of their FRR to a branch of CRR. Then the code will be reviewed before merging into CRR and making it available to all the FRR of other team members. 

Note: There is option to keep FRR in sync with CRR so that if any pull requests are accepted in CRR, the FRR also gets updated.

With backdrop set, let's dig deeper and understand how to work with remote repositories.

Understanding Git Remote

Scenario 1 
I am already having a local git repository. How can I connect it to remote repository and push my commits?

Solution
Ok. So I already have a git repository which I created using git init command and have my commits in it.

You can use following commands for preparation.

1. Create a project directory -
$ mkdir demo_project_repo
$ cd demo_project_repo/

2. Initialize it as git repository
$ git init

3. Set configs
$ git config --global user.name unckle-bob
$ git config --global user.email bob@gmail.com

4. Check branches (you will not find any branch till first commit).
$ git branch

5. Create index.html
$ vi index.html

6. Status will show index.html as untracked file
$ git status -s
?? index.html

7. Add index.html to staging and check status
$ git add index.html
$ git status -s
A  index.html

8. Commit snapshot
$ git commit -m "Add index.html of project"

(I have made one more commit - total 2)

9. Now check branch & you can see master branch created.
$ git branch
* master

10. -r flag is used to list remote branches
$ git branch -r 

11. -a flag is to list all branches (local + remote)
$ git branch -a
* master

12. To list all the remote repositories configured in the local repo, use -v for verbose. (No output as we haven't configured any.)
$ git remote -v

Now I want to connect it to a remote git repository. On github you will find the URL for the repo; copy that.



13. Add remote
Syntax - $ git remote add <alias> <url>
$ git remote add origin https://github.com/unckle-bob/demo-project.git

It is just a convention to give alias to central repo as origin. You can use any other alias.

14. Check remotes with -v (verbose) flag 
$ git remote -v
origin  https://github.com/unckle-bob/demo-project.git (fetch)
origin  https://github.com/unckle-bob/demo-project.git (push)

15. You can also see the origin remote added in the config file.
$ cat .git/config
[core]
        repositoryformatversion = 0
        filemode = false
        bare = false
        logallrefupdates = true
        symlinks = false
        ignorecase = true
        hideDotFiles = dotGitOnly
[user]
        name = unckle-bob
        email = vjtimca11@gmail.com
[remote "origin"]
        url = https://github.com/unckle-bob/demo-project.git
        fetch = +refs/heads/*:refs/remotes/origin/*

16. Let's push our master branch to origin. Note that master in the following command specifies the local branch.
$ git push -u origin master
Counting objects: 3, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 280 bytes | 0 bytes/s, done.
Total 3 (delta 0), reused 0 (delta 0)
To https://github.com/unckle-bob/demo-project.git
 * [new branch]      master -> master

17. The -u (--set-upstream) flag in the command of step 16 tells git to record the mapping between the local master branch and the remote master branch in the config. Next time, a simple git push will automatically use the mapped remote branch.
$ cat .git/config
[core]
        repositoryformatversion = 0
        filemode = false
        bare = false
        logallrefupdates = true
        symlinks = false
        ignorecase = true
        hideDotFiles = dotGitOnly
[user]
        name = unckle-bob
        email = vjtimca11@gmail.com
[remote "origin"]
        url = https://github.com/unckle-bob/demo-project.git
        fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
        remote = origin
        merge = refs/heads/master

18. You can now see the commits pushed to the remote repository on github.


19. To view remote branches from your local repo 
$ git branch -r
  origin/master

20. To view all branches i.e. local + remote
$ git branch -a
* master
  remotes/origin/master

21. Just as local branches are pointers to the SHA-1 values of commits, so are remote branches.
$ ls -l .git/refs/remotes/origin/
total 1
-rw-r--r-- 1 Usern 197121 41 Jan 21 09:02 master

$ cat .git/refs/remotes/origin/master
02de0b7e6dac98d27b61e31cb0d2722e768f0135
$ git log --oneline --decorate
02de0b7 (HEAD -> master, origin/master) Update index file
00ab6f6 Add index.html of project

$ git remote show origin
* remote origin
  Fetch URL: https://github.com/unckle-bob/demo-project.git
  Push  URL: https://github.com/unckle-bob/demo-project.git
  HEAD branch: master
  Remote branch:
    master tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (up to date)
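
The steps of Scenario 1 can be rehearsed offline. Here is a minimal sketch, assuming a local bare repository takes the place of the github URL. The names central.git and demo_project_repo inside a temporary directory are only illustrative, not part of the original setup.

```shell
#!/bin/sh
set -e

# Scratch area; central.git is a hypothetical stand-in for the github repo.
work=$(mktemp -d)
central="$work/central.git"
repo="$work/demo_project_repo"

# The "remote": a bare repository has history but no working directory.
git init -q --bare "$central"

# The local repository with one commit on master.
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name unckle-bob           # stored in this repo's .git/config
git config user.email bob@gmail.com
echo "<html></html>" > index.html
git add index.html
git commit -q -m "Add index.html of project"

# Steps 13-16: add the remote under the alias origin and push with -u.
git remote add origin "$central"
git push -q -u origin master

# Steps 17-21: inspect the mapping written by -u and the remote branch pointer.
git config branch.master.remote     # alias recorded by -u
git config branch.master.merge      # remote ref that master is mapped to
git branch -r
git rev-parse master origin/master  # both names resolve to the same commit
```

Because the remote is just a path on disk, you can delete the temporary directory afterwards and nothing on github is touched.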



Scenario 2
I am starting with a new project & have created a repository on github. How do I clone it on my local system and continue with development?

Solution
In this case we use the clone command. It takes the remote repo URL and an optional directory name, which will be created for the local repository.

$ git clone <repo-url> [directory_name]
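
To see what clone sets up for you, here is a small sketch that again assumes a local bare repository in place of a github URL. The names central.git, seed and my_project are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
central="$work/central.git"

# Build a small "remote" with one commit so there is something to clone.
git init -q --bare "$central"
git --git-dir="$central" symbolic-ref HEAD refs/heads/master
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/master
git config user.name unckle-bob
git config user.email bob@gmail.com
echo "hello" > README
git add README
git commit -q -m "Initial commit"
git push -q "$central" master

# The clone itself: repo URL (here a local path) plus an optional directory name.
git clone -q "$central" "$work/my_project"
cd "$work/my_project"

# clone has already configured the origin alias and the remote branch mapping.
git remote -v
git branch -a
```

Unlike Scenario 1, there is no need for git remote add or push -u here; clone does both parts of the wiring on its own.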

Scenario 3
I want to contribute to an open source project "django-rest-framework" hosted on github. How can I do that?

Solution

1. Go to central repo of the project on github and fork it.

2. Get URL of your fork of repository and execute following command on your system.

$ git clone https://github.com/unckle-bob/django-rest-framework.git drf

Now make your changes and push them to your forked github repo.

3. To merge with the central project repo, raise a Pull Request (PR).

4. The project owner will review the PR and merge/close it.
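
One common way to keep your fork in sync with the central repo is to add the central repo as a second remote. The sketch below is only an illustration under assumptions: the alias upstream is a widespread convention and not something git requires, and the bare repos central.git and fork.git are hypothetical local stand-ins for the CRR and the FRR on github.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
central="$work/central.git"   # stand-in for the central repo (CRR)
fork="$work/fork.git"         # stand-in for your fork (FRR)

# Central repo with one commit.
git init -q --bare "$central"
git --git-dir="$central" symbolic-ref HEAD refs/heads/master
git init -q "$work/owner"
cd "$work/owner"
git symbolic-ref HEAD refs/heads/master
git config user.name owner
git config user.email owner@example.com
echo "v1" > README
git add README
git commit -q -m "Initial commit"
git push -q "$central" master

# Forking is a server-side copy; a bare clone imitates it here.
git clone -q --bare "$central" "$fork"

# Your local clone of the fork, with the central repo added as upstream.
git clone -q "$fork" "$work/drf"
cd "$work/drf"
git remote add upstream "$central"

# Meanwhile the central repo moves on (e.g. another PR was merged).
cd "$work/owner"
echo "v2" > README
git commit -q -am "Update README"
git push -q "$central" master

# Sync: fetch from upstream, fast-forward local master, then update the fork.
cd "$work/drf"
git fetch -q upstream
git merge -q --ff-only upstream/master
git push -q origin master
```

The --ff-only flag makes the merge refuse to do anything if your local master has diverged, which is a cautious default for a branch that is only meant to mirror the central repo.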


Finally, a few more commands before closing this post.

To rename a remote alias -
$ git remote rename origin gitrepo

To delete a remote alias
$ git remote rm origin

To delete a remote branch
$ git push origin :remote_branch_name

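What is the colon (:) magic in the last command? - The full form of the push argument is <local_branch_name>:<remote_branch_name>. When you run $ git push origin <local_branch_name>, git fills in the remote part for you (by default the same branch name), so the command effectively becomes -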
$ git push origin local_branch_name:remote_branch_name

Now, by not providing any local branch name before the colon (:), we push nothing to the remote branch, which deletes it. An equivalent command is -
$ git push origin --delete remote_branch_name
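
The colon syntax can be tried safely against a throwaway remote. This is a sketch under assumptions: a local bare repository (central.git, a made-up name) plays the role of origin, and feature-home is just an example branch name.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
central="$work/central.git"

git init -q --bare "$central"
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name unckle-bob
git config user.email bob@gmail.com
echo "hello" > index.html
git add index.html
git commit -q -m "Add index.html of project"
git remote add origin "$central"

# Full form local:remote, pushing local master to a remote branch feature-home.
git push -q origin master:feature-home
before=$(git ls-remote --heads origin feature-home)

# Nothing before the colon: delete feature-home on the remote.
git push -q origin :feature-home
after=$(git ls-remote --heads origin feature-home)

echo "before: $before"
echo "after:  $after"
```

The before line shows the SHA-1 and ref name of the remote branch, while the after line is empty because the branch no longer exists on the remote.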

You can also do a force push using the -f (--force) flag. Use it with care, as it overwrites the history of the remote branch.

Git - Using diff command

In this post I am going straight to the command line to demonstrate how $ git diff works.

Use Case 1
I changed a tracked file. Now I want to see the difference between the working directory & repo versions.

Solution
$ git diff HEAD index.html
diff --git a/index.html b/index.html
index 2a6a819..02b838e 100644
--- a/index.html
+++ b/index.html
@@ -3,6 +3,6 @@

        </head>
        <body>
-               This is index.html page. Adding some more content.
+               Incredible India
        </body>
 </html>

Use Case 2
I staged my changes. Now I want to see the difference between the staged and repo versions of a file.

Solution
In this case, if you execute $ git diff there will be no results, because the working directory and the staging area are identical. Use the --cached flag to compare the staged and committed versions.

$ git diff --cached index.html

Use Case 3
I edited a staged file & now want to see the difference between the staged and working copies of the file.

Solution
$ git diff index.html
diff --git a/index.html b/index.html
index 02b838e..6b97ecf 100644
--- a/index.html
+++ b/index.html
@@ -4,5 +4,6 @@
        </head>
        <body>
                Incredible India
+               New Ambassadors - Big B & PC
        </body>
 </html>

The following diagram summarizes which diff command to use to compare versions between any of the 3 git repository states (working directory, staging area and repository).
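
The three comparisons from Use Cases 1 to 3 can also be seen side by side in one run. The following is a small sketch in a throwaway repository; the file contents (committed, staged, working) are made up so that each diff is easy to recognize.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name unckle-bob
git config user.email bob@gmail.com

# Committed version.
echo "committed" > index.html
git add index.html
git commit -q -m "Add index.html"

# Staged version, then a further edit that stays in the working directory only.
echo "staged" > index.html
git add index.html
echo "working" > index.html

# Working directory vs staging area (Use Case 3).
wd_vs_stage=$(git diff index.html)
# Staging area vs last commit (Use Case 2).
stage_vs_head=$(git diff --cached index.html)
# Working directory vs last commit (Use Case 1).
wd_vs_head=$(git diff HEAD index.html)

printf '%s\n\n%s\n\n%s\n' "$wd_vs_stage" "$stage_vs_head" "$wd_vs_head"
```

Each of the three outputs has a different pair of - and + lines, which shows that the commands really compare different pairs of states.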



Use Case 4
Find difference made to a file between two commits.

Solution
I have an index.html file in my repo, with two commits.
$ git log --oneline
02de0b7 Update index file
00ab6f6 Add index.html of project

I want to find what has changed in index.html between the old commit and the current commit (HEAD).

$ git diff 00ab6f6..HEAD index.html
diff --git a/index.html b/index.html
index 0ce595f..2a6a819 100644
--- a/index.html
+++ b/index.html
@@ -3,6 +3,6 @@

        </head>
        <body>
-               This is index.html page.
+               This is index.html page. Adding some more content.
        </body>
 </html>

It shows the removed line (red, prefixed with -) and the new line that replaces it (green, prefixed with +).

Note: The order of commits in the command is important. Put the old commit first, then the latest commit.
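
To see why the order matters, you can run the same diff in both directions. This is a sketch in a scratch repository that recreates the two commits from the example above; the variables old and new are only helpers for this illustration.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name unckle-bob
git config user.email bob@gmail.com

echo "This is index.html page." > index.html
git add index.html
git commit -q -m "Add index.html of project"
old=$(git rev-parse HEAD)

echo "This is index.html page. Adding some more content." > index.html
git commit -q -am "Update index file"
new=$(git rev-parse HEAD)

# Old commit first: the newer line shows up as an addition (+).
forward=$(git diff "$old".."$new" index.html)
# Commits swapped: the very same line now shows up as a removal (-).
backward=$(git diff "$new".."$old" index.html)

echo "$forward"
echo "$backward"
```

With the commits swapped, the output reads as if the newer content had been deleted, which is easy to misinterpret when reviewing changes.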

I want to see the words changed.
$ git diff --color-words 00ab6f6..HEAD index.html
diff --git a/index.html b/index.html
index 0ce595f..2a6a819 100644
--- a/index.html
+++ b/index.html
@@ -3,6 +3,6 @@

        </head>
        <body>
                This is index.html page. Adding some more content.
        </body>
</html>

It shows only the added words in green. If any content is removed from the file, it is shown in red.
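
Colors get lost when console output is copied into a text file or a post like this one. In such cases the --word-diff flag of git diff may be handy, since it marks changes with plain text markers: [-...-] around removed words and {+...+} around added words. A minimal sketch, again in a scratch repository with the same two commits:

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name unckle-bob
git config user.email bob@gmail.com

echo "This is index.html page." > index.html
git add index.html
git commit -q -m "Add index.html of project"
old=$(git rev-parse HEAD)

echo "This is index.html page. Adding some more content." > index.html
git commit -q -am "Update index file"

# Word-level diff with text markers instead of colors.
words=$(git diff --word-diff "$old"..HEAD index.html)
echo "$words"
```

The added words appear wrapped in {+ and +} on the changed line, so the result stays readable even without a color terminal.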

You can use $ git diff with any tree-ish, e.g. commit SHA-1 values or branch names.

E.g.
1. To get difference between two branches -
$ git diff master..feature-home

2. To get differences between two commits with summary and stat -
$ git diff --summary --stat 02de0b7..18ddc0d
 home.html  | 1 +
 index.html | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)
 create mode 100644 home.html

You can use file name(s) in all the above commands to get difference made (if any) in a particular file or set of files.
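
Here is a small sketch of limiting a branch comparison to one file. It assumes a scratch repository with a master branch and a feature-home branch created just for this example. The -- separator is optional in many cases, but it tells git clearly where revisions end and paths begin.

```shell
#!/bin/sh
set -e

work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master
git config user.name unckle-bob
git config user.email bob@gmail.com

echo "index v1" > index.html
git add index.html
git commit -q -m "Add index.html of project"

# A feature branch that adds one file and changes another.
git checkout -q -b feature-home
echo "home" > home.html
echo "index v2" > index.html
git add home.html index.html
git commit -q -m "Add home page"

# Every file that differs between the two branches.
all_files=$(git diff --name-only master..feature-home)
# The same comparison limited to one path.
one_file=$(git diff --name-only master..feature-home -- home.html)

echo "$all_files"
echo "$one_file"
```

The --name-only flag is used here just to keep the output short; the path limit works the same way with a plain git diff or with --stat.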

Setup p4Merge as diff and merge tool

I found it difficult to understand the difference between versions of a file when using a simple console. A better approach is to use P4Merge Tool.

Click here to access the blog post I referred for configuring P4Merge as merge tool in git on Windows.

After downloading and installing it, I executed the following commands in my git repo.

$ git config --global merge.tool p4merge
$ git config --global mergetool.p4merge.path "C:/Program Files/Perforce/p4merge.exe"

And then to launch the P4Merge tool (git difftool falls back to the merge.tool settings when diff.tool is not set) -

$ git difftool
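
If you prefer to configure the diff tool explicitly instead of relying on the merge tool settings, git has matching diff.tool and difftool.* config keys. The helper below is only a sketch: configure_p4merge is a name I made up, and the P4Merge path you pass to it depends on your installation.

```shell
#!/bin/sh
# Usage: configure_p4merge <path-to-p4merge> [git config scope options...]
configure_p4merge() {
    p4merge_path=$1
    shift
    # "$@" carries the scope, e.g. --global; with no option git uses the current repo.
    git config "$@" diff.tool p4merge
    git config "$@" difftool.p4merge.path "$p4merge_path"
    git config "$@" difftool.prompt false   # do not ask before opening each file
    git config "$@" merge.tool p4merge
    git config "$@" mergetool.p4merge.path "$p4merge_path"
}
```

For the setup described above you would call it as configure_p4merge "C:/Program Files/Perforce/p4merge.exe" --global, and after that $ git difftool HEAD index.html opens the comparison without prompting.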