Sunday, 11 December 2016

GraphDB: Introduction

NoSQL databases broadly fall into the following four categories.
  1. Graph - e.g. Neo4j
  2. Document Store - e.g. MongoDB, CouchDB
  3. Key Value - e.g. Redis (REmote DIctionary Server)
  4. Columnar - e.g. HBase, Cassandra
This series of posts will highlight the marquee features of Graph DB and help you digest them using Neo4j.

But before that let's have a high-level overview of Graph Space.

Graph Space

The graph space can be divided into two parts.
Graph Space Classification
1- Graph Databases:

A graph database is an online DBMS that exposes CRUD (create, read, update, delete) operations on an underlying graph data model. It is accessed in real time from an application and is generally tuned for transactional (ACID) workloads.

The two key factors to consider while evaluating any GraphDB product are -

a) Underlying Storage

A graph DB uses some mechanism to persist graph data. This can be
* Native Graph Storage - storage that is designed and optimized for storing graphs
* Others - a relational, object-oriented, or some other general-purpose data store.

b) Processing Engine  

* Non-native Graph Processing
Relationships are first-class citizens in the graph data model. In a non-native processing engine, by contrast, a relationship has to be inferred; in an RDBMS, for example, it is inferred from a combination of primary and foreign keys.

* Native Graph Processing
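Here connected nodes physically point to each other in the database, so a traversal follows direct references instead of looking relationships up in an index. This is known as index-free adjacency, and it can give a significant performance advantage for traversal-heavy queries.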
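
To make index-free adjacency a bit more concrete, here is a minimal in-memory sketch in Python. It is not how Neo4j (or any other product) stores data on disk; the `Node` class, the `edge_index` dictionary and the sample names are all made up for illustration. The first function follows direct references from node to node, the second goes through a global lookup structure on every hop.

```python
class Node:
    """A node that holds direct references to its neighbours (hypothetical class)."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct object references, no lookup table in between

    def connect(self, other):
        self.neighbours.append(other)


amit, ajit, neha = Node("Amit"), Node("Ajit"), Node("Neha")
amit.connect(ajit)
ajit.connect(neha)


def names_two_hops_away(start):
    """Native style: each hop just follows the references stored on the node."""
    return [n2.name for n1 in start.neighbours for n2 in n1.neighbours]


# Non-native style, for contrast: every hop consults a global edge index.
edge_index = {"Amit": ["Ajit"], "Ajit": ["Neha"], "Neha": []}


def names_two_hops_away_indexed(start_name):
    """Each hop is a lookup in the shared index rather than a pointer dereference."""
    return [n2 for n1 in edge_index[start_name] for n2 in edge_index[n1]]


print(names_two_hops_away(amit))
print(names_two_hops_away_indexed("Amit"))
```

Both functions return the same answer. In this toy the index is a Python dict, so the lookup is cheap; in a disk-based store the index is usually a larger structure whose lookup cost grows with the dataset, which is where following direct references tends to pay off.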

Note: Native graph storage and native graph processing are neither good nor bad; they come with their own trade-offs.
 
2- Graph Compute Engines (GCE):

A GCE is meant for offline graph analytics performed as a series of batch steps. It executes global graph algorithms against (large) datasets. Data is fed to the GCE from a system-of-record (OLTP) database (e.g. PostgreSQL, Neo4j) by a periodic ETL job. The GCE then processes this data in batches (OLAP) and answers user queries, e.g. “What does a user usually purchase if s/he buys product X?”.
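
As a rough idea of what such a batch step might look like, here is a small Python sketch. It assumes the ETL job has already extracted orders as plain lists of product names; the function names and sample data are hypothetical, and a real graph compute engine would distribute this work over a much larger dataset.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_co_purchase_table(orders):
    """Batch step: for every product, count what else appears in the same order."""
    table = defaultdict(Counter)
    for items in orders:
        # set() so that a product listed twice in one order is counted once
        for a, b in permutations(set(items), 2):
            table[a][b] += 1
    return table


def usually_bought_with(table, product, top_n=1):
    """Query step: answer from the precomputed table, no scan of the raw orders."""
    return [item for item, _ in table[product].most_common(top_n)]


# Hypothetical data, standing in for what the periodic ETL job would deliver.
orders = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger", "case"],
    ["laptop", "mouse"],
]

table = build_co_purchase_table(orders)
print(usually_bought_with(table, "phone"))
```

The expensive part (building the table) runs offline in a batch, while the user-facing question is answered from the precomputed result.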

High Level Overview of GCE (courtesy: Graph Databases, published by O'Reilly)

Why Graph Databases?

To replace a well-established, well-understood data platform with a graph DB, we need some compelling reasons. Here I give you a few -

1] Performance

For connected data, a graph DB typically provides better query performance than an RDBMS or other NoSQL databases.

RDBMSs are join-intensive. Suppose I ask a typical social network analysis query like "who is a friend of a friend of a friend of Amit and also a friend of Ajit?" .. OMG, how do we even write this query, and how badly will performance deteriorate? Each extra hop means another self-join.
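
For the curious, this is roughly what that query looks like in SQL, run here through Python's built-in `sqlite3` module. The `person` and `friendship` tables, their columns and the sample rows are assumptions made up for this sketch, and each friendship is stored in one direction only to keep it short.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE friendship (
        person_id INTEGER NOT NULL REFERENCES person(id),
        friend_id INTEGER NOT NULL REFERENCES person(id),
        PRIMARY KEY (person_id, friend_id)
    );
    INSERT INTO person VALUES
        (1, 'Amit'), (2, 'Ajit'), (3, 'Bala'),
        (4, 'Chitra'), (5, 'Deepa'), (6, 'Esha');
    INSERT INTO friendship VALUES (1, 3), (3, 4), (4, 5), (4, 6), (2, 5);
""")

# One self-join on friendship per hop, plus one more for the "also a friend of" part.
QUERY = """
    SELECT DISTINCT target.name
    FROM person AS start
    JOIN friendship AS hop1 ON hop1.person_id = start.id
    JOIN friendship AS hop2 ON hop2.person_id = hop1.friend_id
    JOIN friendship AS hop3 ON hop3.person_id = hop2.friend_id
    JOIN person AS target ON target.id = hop3.friend_id
    JOIN friendship AS other_link ON other_link.friend_id = target.id
    JOIN person AS other ON other.id = other_link.person_id
    WHERE start.name = ? AND other.name = ?
    ORDER BY target.name
"""


def friends_of_friends_of_friends(start_name, other_name):
    """Friends three hops from start_name who are also direct friends of other_name."""
    rows = conn.execute(QUERY, (start_name, other_name)).fetchall()
    return [row[0] for row in rows]


print(friends_of_friends_of_friends("Amit", "Ajit"))
```

Going one hop deeper means adding yet another join to the query, and each join works against the whole `friendship` table (or its index), not just the neighbourhood of Amit.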

Graph DB performance remains relatively constant as the dataset grows, because queries are localized to a portion of the graph. The execution time is proportional to the part of the graph traversed to satisfy the query rather than to the size of the entire graph.
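
A small Python sketch can illustrate this locality. It is only a toy: the graph is a plain adjacency dictionary with made-up names, friendships are treated as one-directional for brevity, and `friends_query` is a hypothetical helper, not an API of any graph database. The point is that the traversal touches a handful of nodes even though the graph holds about a thousand.

```python
def friends_query(graph, start, other, depth=3):
    """Find people `depth` hops from `start` who are also direct friends of `other`.

    Returns the matches and the number of nodes the traversal touched.
    """
    frontier = {start}
    touched = {start}
    for _ in range(depth):
        # Expand only the current frontier; the rest of the graph is never read.
        frontier = {friend for person in frontier for friend in graph.get(person, ())}
        touched |= frontier
    matches = frontier & set(graph.get(other, ()))
    return matches, len(touched)


# Hypothetical data: a small neighbourhood around Amit and Ajit...
graph = {
    "Amit": ["Bala"],
    "Bala": ["Chitra"],
    "Chitra": ["Deepa", "Esha"],
    "Ajit": ["Deepa"],
}
# ...plus a thousand unrelated users that the query never needs to visit.
for i in range(1000):
    graph[f"user{i}"] = [f"user{i + 1}"]

result, touched = friends_query(graph, "Amit", "Ajit")
print(result, touched, len(graph))
```

Adding more unrelated users makes `len(graph)` grow, but the number of touched nodes for this query stays the same.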

2] Flexibility

As developers we want to connect data as the domain dictates. This allows structure and schema to emerge in tandem with our growing understanding of the problem space.

Graph DBs address this need directly. We can add new relationships, nodes, labels, and subgraphs to an existing structure without disturbing existing queries and application functionality.

The additive nature of graphs also means we tend to perform fewer migrations, thereby reducing maintenance overhead and risk.

3] Agility

Graph databases offer an extremely flexible data model, and a mode of delivery aligned with today’s agile software delivery practices.

The schema-free nature of the graph data model, coupled with the testable nature of a graph DB’s API and query language, empowers us to evolve an application in a controlled and agile manner.

Before closing this post let's describe a Graph in a Graph database.

A Graph in Graph DB is -
  • A set of vertices and edges.
  • Vertices represent nodes, and edges represent the relationships between them.
  • Each node has a label, which defines its type, e.g. User, Tweet.
  • A node can have more than one label.
  • Each relationship is directional and is tagged with a label, e.g. follows, tweets.
  • A relationship always has a start node and an end node.
  • Each node and relationship can hold properties, i.e. key-value pairs, much like a small document.
Sample Graph Representation
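
The points above can also be sketched as a tiny in-memory model in Python. This is only an illustration of the data model, not the API of Neo4j or any other product; the class names, the extra `Admin` label and the sample properties are all made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                      # one or more labels, e.g. {"User"}
    properties: dict = field(default_factory=dict)   # key-value pairs


@dataclass
class Relationship:
    label: str                                       # e.g. "follows", "tweets"
    start: Node                                      # always has a start node...
    end: Node                                        # ...and an end node (directional)
    properties: dict = field(default_factory=dict)


# Hypothetical sample data.
amit = Node({"User"}, {"name": "Amit"})
ajit = Node({"User", "Admin"}, {"name": "Ajit"})     # a node with two labels
tweet = Node({"Tweet"}, {"text": "Hello graphs"})

relationships = [
    Relationship("follows", amit, ajit, {"since": 2016}),
    Relationship("tweets", ajit, tweet),
]


def outgoing(node, label):
    """Nodes reached from `node` over relationships with the given label."""
    return [r.end for r in relationships if r.start is node and r.label == label]


print([n.properties["name"] for n in outgoing(amit, "follows")])
```

Because relationships are directional, `outgoing(ajit, "follows")` comes back empty here: Amit follows Ajit, not the other way round.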