hashing

Hash functions

http://install4install.com

The mapping of keys to indices of a hash table is called as hash function.It is the usually the composition of two maps:

  • Hash Code Map – Maps keys to integers when they are not integers.
  • Compression Map – Maps wide range of integers to size of hash table.If S is size of table then integers need to brought to range 0 to [S -1]

Hash Code Maps

Integer Cast – Reinterpret the bits of number as integer for numeric type with 32 bits or less.

Component sum – For numeric type with more than 32 bits,we can chunk into group of 32 bits and then interpret as integers.Component sum is not a good choice for  strings as it leads to more number of collisions.For e.g. if I want to convert ‘cusp‘ to integer value firstly ASCII value will be fetched  and then   added up  together to obtain the sum and integer value will be obtained.But there are huge no of possibilities that many words of english  dictionary have the same sum.If sum is same,multiple keys are mapped to same location resulting in high number of collisions.One of the  technique’s used to overcome this  problem is Polynomial accumulation

Polynomial Accumulation

For strings of english language,combine the ASCII values(ao, a1 ,a2,..an-1)  and view them as coefficients of a polynomial.

ao + a1x + a2x2 + .an-1xn-1

The integer value is computed with Horner’s rule for certain fixed value of x.The choice of x=33,37,39 or 41 gives at most 6 collisions on vocabulary of 50,000 words,

ao + x(a1+ x(a2 + .x(an-2 +x an-1)))

Compression Maps

Use the remainder method – Use the modulo method for mapping high range of integers to size of table.If h is the hash function,k is the key and s is the size of table then use

h(k) = k mod s

Double Hashing

It uses two hash functions h1 and h2.h1(k) is the position in the table to check for key k and h2 determines the offset from first position we use when searching again for k.Offset is how much distance has to covered while mapping the key to slot.The offset for linear probing is always 1,as we jumped one location ahead to store the element in the table.

Example

Consider a series of number 18,41,22,44,59,32,31,73 to be stored in hash table.

h1(k) = k mod 13(Fig 1)

h2(k) =  8 – (k mod 8)

Fig 1 : Animation of Double Hashing

Fig 1 : Animation of Double Hashing

 

Hashing with non-integer keys

We can find some technique/way to covert keys to integers.For e.g if we have a mobile number like  98134-93789,we can  remove hyphen store it like 9813493789 or if we have a  string then we can  add ASCII values and then use hash functions on integer values.

 

Features of Good Hash Function

  • It is faster to compute.
  • Distribution of keys is uniform throughout the table.
  • It minimizes the probability of collisions.(No matter how good hash function we use,there is some probability of collisions in the slots).

Short URL: http://tinyit.cc/80f90e

Hashing

Hashing is a method and useful technique to implement dictionaries.This method is used to perform searching,insertion and deletion at a faster rate.A function called Hash Function is used to compute and return position of the record instead of searching with comparisons.The data is stored in array called as Hash table.The Hash Function maps keys into positions in a hash table.The mapping of keys to indices of a hash table is known as Hash Function.The major requirement of hash function is to map equal keys to equal indices.

Given a key, the algorithm computes an index that suggests where the entry can be found:

index = f(key, array_size)

The value of index is determined by 2 steps

  1. hash = hash_func(key)
  2. index = hash % array_size

A hash function is purely a user choice and a good hash function has following features :

  • It is quick to compute.
  • Keys are distributed uniformly throughout the table.
  • The load factor is defined by  t= n/m where m is the slots/buckets and n is the elements in m of hash table t.

Example

Say we want to perform certain operations on  salaries of employes from employee id’s  51000 to 52000.In order to access the data at a quicker rate we can use hashing(Fig 1 and Fig 2).

  • Now keys(k) are are id’s from 51000 to 52000, i.e the array_size would be 1000.
  • Say table[1000] is the array.
  • We choose a Hash Function
hash = k - 51000

index = hash % array_size
Screen Shot 2014-05-15 at 2.12.46 PM

Fig 1 : Computing hash function

Screen Shot 2014-05-15 at 2.14.29 PM

Fig 2 : Hash Table of Fig 1

Collision

When two or more values are mapped at same location/bucket by the hash function,then collision arises.

Example 2

Say we have 100 students of subject CS501 of batch 2014 and  the roll no of first student is like 2014 CS501210,second is like 2014 CS501211 and the list goes on till 2104 CS501310.

  • There are 100 students in class so we create a hash table of size 100.
  • Say we chose a hash function of last two digits.
  • 2014 CS501210 is stored at 10 location.
  • Where will 2014 CS501310  go?Again at same location i.e at 10 ?
  • A concept called collision arises when two or more entries are mapped at same location.

Collision Resolution

Chaining

In chaining,an array of tables/links is set up indexed by the keys to the list of items with the same key.Say we have a hash function which computed modulo 5 of any series and say  hash table has 6 locations in it.If two or more keys are mapped into same location we will keep on adding to the linked list.By example given below 1 key is mapped to location 0,2 keys are mapped to location 1,1 key is mapped to location 2,2 keys to location 3,3keys to location 4 and  no keys are mapped to location 5.The hash values ending at 0 are stored at 0 location,ending at 1 is stored at first location and the list goes on in the similar pattern..Thus ,the problem of collision is resolved by chaining.

Screen Shot 2014-05-15 at 4.34.25 PM

Fig 3 : Chaining method

 

Linear probing

It is a method to resolve collision.The item will be stored at next available slot in the table if there is clash,assuming that the table is already not full.It uses less memory than chaining as links are not required but is slower than chaining as one has to move across the table for long time.

Consider Fig 4.Say a list of numbers has to be stored in a hash table of size 8 and hash function is h(k) = k%8.The no. in series are 48,26,43,61,86,10,4 ,13. Mapping is done and numbers are stored in the table.10 and 26 have same remainder(2),so 10 goes to next empty slot.i.e. at location 4,Number 4 was to be stored at location 4 but since it is occupied it will go next empty slot.i.e. at 7 and lastly 13 and 61 have same remainder (5) but 61 is already occupied so 13 will go to location 1.

 

Screen Shot 2014-05-16 at 2.50.55 PM

Fig 4 : Linear Probing

 

 


Short URL: http://tinyit.cc/087d20