Tuesday, 10 July 2012

Hashing : Introduction and Explanation

Why Indexing to Hashing?

The size of a record is very large as compared to the size of an index. In B Tree indexing, large number of indices can be accommodated on a frame. Each entry consists of Key Value + Address of the record where it is stored on disk. A B Tree node can be half full to completely full.

The structure of B Tree has to be stored somewhere on the disk. To fetch any record
  1. Perform disk access to load the root node into a frame
  2. Follow a path and traverse to particular node by recursive disk access to load a particular node
  3. Traverse through the key values to reach the desired key
  4. Read address of the record associated with the key
  5. Perform disk access for the particular address and load the record in the main memory
Hence extra space and efforts are required.

The solution is hashing. Hashing is a key-to-address mapping process. The objective is to get the address in a single calculation. Hashing gives transformation, the hash function: h (key) -> Address 
On hash function apply key values to get corresponding addresses i.e. the address of the block. Hashing saves
  • Multiple disk accesses that happens in B Tree
  • Space to store B Tree
The disk access will be only knowing the address and fetching the record.

Hashing is a way to reach a particular record in the file without scanning the entire file. A block (group of records) is usually fetched in primary memory where data is traversed sequentially. This is applicable to structures where records are stored continuously.

Prerequisite – There must be a well defined definite place where records are stored. The place is identified by address (may be a block number). The records need not to be sorted. Hashing provide us way to keep records in order (something other than sorting). The records will be divided into blocks b1, b2, b3 …

We are not going to access individual records. Block is the thing to be fetched.

h (key/5) = block number

Divide key by 5 and round to the ceiling. Therefore for records 1, 2, 3, 4, 5 the block number will be 1. E.g. 2/5 = (0.4) = 1. Hence block number 1 contains record with key value 2. In this organization within block arrangement is not important. The records can be in any order like 5, 2, 3, 1, and 4. Hence this technique is not so demanding. This hash technique pins record at particular file location.

h (key/x) = block [+ | -] k

Assumption- Key values are more or less in order.
Here we are adding or subtracting some value k from the block address.

Hash Functions:

Following are the ways to generate Hash Functions. Click on the link to learn about each method in detail.

  1. Direct Hashing
  2. Relative Addressing
  3. Division-Remainder Method
  4. Mid Square Method
  5. Folding

Collision:

Collision occurs when different key values collide on same address.
h (kwy1) -> Address 1
h (key2) -> Address1

E.g. Consider two keys -2 and 2 and h (k) = k2. As h (-2) = 4 and h (2) = 4 hence there is a collision. A bad choice of hash function will result in too many collisions. A bad hash function may result in triple collision as h (k1) -> A1, h (k2) - > A1 and h (k3) -> A1.

There is nothing as good or bad hash function at absolute point. The good or bad hash function depends on valid key values. If all the hash functions giving collisions then select the one with minimum collisions. If we get non-colliding block address then it is a good hash function.

Click here to learn about different methods to avoid collisions

With Special Thanks to,

Prof. [Ms] L. C. Nene
Lecturer,
MCA Department,
Veermata Jijabai Technological Institute,
Mumbai.

No comments:

Post a Comment

Your comments are very much valuable for us. Thanks for giving your precious time.