Direct Address Table | Hashing | Probing | Data Structures

Direct Address Table, Hashing, and Probing are fundamental concepts in data structures and algorithms used to efficiently store and retrieve data.

 

Direct Address Table is a simple and straightforward data structure that maps keys directly to memory addresses. It works well when the keys are small and there are no collisions. However, it becomes impractical or even impossible to use when the keys are large or the range of possible keys is vast.

 

Hashing is a technique that overcomes the limitations of Direct Address Table by using a hash function to map keys to array indices. A hash function takes a key as input and computes a hash code, which is used to determine the index in the hash table where the key-value pair will be stored. Hashing allows for efficient insertion, retrieval, and deletion operations, and it provides a more compact representation of data.

 

Probing techniques, such as linear probing, quadratic probing, and double hashing, are used to handle collisions that occur when two or more keys produce the same hash code. Probing resolves collisions by searching for the next available slot in the hash table. It utilizes a probing sequence to systematically probe different indices until an empty slot is found. By intelligently selecting the probing sequence, clustering and performance issues caused by collisions can be mitigated.

 

Together, Direct Address Table, Hashing, and Probing form the foundation for implementing efficient and scalable data structures like hash tables. These techniques enable fast access to data by minimizing collisions and providing a compact representation of the stored information. Understanding these concepts is essential for designing and implementing high-performance data structures and algorithms in various domains, including databases, caching systems, and information retrieval systems.

 

NOTE: The main intension of data structures is to store the data. Either to give it as an input to an algorithm or take the output from an algorithm.

Common operations –

  1. Insertion
  2. Searching – dominant operation that can be performed.
  3. Delete

Data sets or data are generally static, ie, once you create the data, it rarely changes.

Many data structure are designed in order to search the data faster.

 

Worst case complexity for searching in various data structure are –

  1. Unsorted array = O(n)
  2. Sorted array = O(log n)
  3. Linked list = O(n)
  4. Binary Tree = O(n)
  5. Binary Search Tree = O(n)
  6. Balanced Binary Search Tree = O(log n)
  7. Priority Queue (Min. & Max heap) = O(n)

The minimum search time is (log n) in the above searching for DS.

 

So, in order to minimize the search time, we use hashing. Therefore, hashing is the soln. to minimize the search time. If we use hashing technique, then the avg. time to search an element is O(1), i.e., constant time. 

 

Before using hashing, Direct Address Tables were used which is similar to arrays.

In DAT, we are going to have an array and we are going to place the elements in order of key value. This method is useful only when we know the no. of keys which are going to be in a small range and the no. of keys which are inserted is very less.

eg. {1, 2, 3,…, 100} -> Key range

 

Suppose for 100 elements to be inserted into DS. Suppose 1 is to be inserted at index 10. And we have to search that 1 later, we can directly go to the index and access 1 since array has random accessibility over data.

 

Therefore, search time = O(1), we place elements according to the key value.

This method is useful only when,

  1. You know the no. small range and of keys which are going to be inserted is to be in a small range and
  2. If no. of keys which are inserted is very less.

 

Limitations:-

Since we are going to use key values in array indexes. Therefore, there has to be linear relationship (directly proportional) between no. of keys used and size of arrays.

In case, the key used are large in no. eg 1000000, and the no. of keys are less.

{1000000, 1000001, 1000002,…, 100009}

So, in order to put the keys, we have to define an array whose size is the key. So, if the key size and array elements are close to each other then we can use DAT.

Suppose if the key size is in 100s and no. of elements are also in 100s, then defining one slot for each element is useful in such a way.

If the key size is very large and the total no. of keys to be inserted into the array is very small, then having such an architecture is waste of space.

So, there we are going to use hashing.

 

 

Introduction to Hashing

Hashing is mapping a set of elements from a large set to a smaller set using hashing function.

Hashing
Hashing

Potential candidates to be placed in DS is very large but not all elements will be present in our DS, only few will be present.

Suppose, value of keys range = 0-999, but 10 elements can be placed in our DS at a time.

Therefore, declaring a DS which is of 1000 spots is a waste of space. Instead we declare a space of 10 elements and we could try to map.

The mapping function is called hash function. If we are sure that our data is not going to go beyond 10 elements, then we can use hashing.

We apply a function (hash function) on the data so that the function itself will say where to put the data in constant time. And when we apply search, we use the function to search and go to that element directly in case the element is present. If it is not present, we say it is not present.

Therefore, search time = O(1)

 

In simple words, Hashing is a technique used in data structures to efficiently store and retrieve data. It involves mapping data elements to a fixed-size array, called a hash table or hash map, based on their values. The mapping is done using a hash function, which converts the data element into a numerical value called a hash code or hash value.

Here’s a detailed explanation of how hashing works:

  1. Hash Functions:
    – A hash function takes an input, such as a string or an object, and produces a fixed-size hash code as its output.
    – The hash code is typically an integer, but it can also be a character string or a sequence of bits.
    – A good hash function should be fast to compute, distribute hash codes evenly across the hash table, and minimize collisions (two different data elements having the same hash code).
  2. Hash Table:
    – A hash table is an array-like data structure with a fixed number of slots or buckets, where each slot can hold one or more data elements.
    – The size of the hash table is determined during its initialization and is typically chosen based on the expected number of data elements to be stored.
    – The hash table provides fast access to data by computing the hash code of a data element and using it to determine the index or bucket where the element should be stored or retrieved from.
  3. Hashing Process:
    – When inserting a data element into the hash table, the hash function is applied to compute its hash code.
    – The hash code is then mapped to a valid index in the hash table by performing a modulo operation using the size of the hash table.
    – If there is no collision (i.e., the bucket at the computed index is empty), the data element is inserted into that bucket.
    – If there is a collision (i.e., the bucket is already occupied), a collision resolution technique is employed to handle it. Common techniques include chaining (using linked lists or other data structures to store multiple elements in the same bucket) or open addressing (probing for the next available empty bucket).
    – When retrieving a data element, the hash function is again applied to compute its hash code, which is used to locate the appropriate bucket in the hash table. The element is then searched for within that bucket.
  4. Collision Handling:
    – Collisions occur when two different data elements produce the same hash code, resulting in them being mapped to the same index in the hash table.
    – Collision resolution techniques aim to minimize collisions and ensure efficient storage and retrieval of data.
    – Chaining: In this technique, each bucket in the hash table contains a linked list or any other suitable data structure to hold multiple elements with the same hash code. New elements are appended to the end of the list.
    – Open Addressing: In this technique, when a collision occurs, the algorithm probes for the next available empty slot in the hash table by following a predefined sequence of steps. Common approaches include linear probing (checking the next slot in a linear fashion) and quadratic probing (probing using a quadratic sequence).
    – Collision resolution techniques should ensure that all data elements can eventually be stored and retrieved from the hash table, and the performance should be good even in the presence of collisions.
  5. Performance:
    – Hashing provides fast average-case time complexity for inserting, retrieving, and deleting elements, typically O(1) or constant time.
    – However, in the worst-case scenario, when many collisions occur, the performance can degrade to O(n), where n is the number of elements in the hash table.
    – The choice of a good hash function, appropriate hash table size, and collision resolution technique greatly influence the performance of the hashing data structure.

Hashing is widely used in various applications, such as database indexing. Certainly! Here are some additional applications and considerations related to hashing in data structures:

 

Additional Applications of Hashing:

  1. Data Retrieval: Hashing allows for fast retrieval of data. By computing the hash code of a data element, you can directly access the corresponding bucket in the hash table, reducing the search time compared to other data structures like arrays or linked lists.
  2. Caching: Hashing is commonly used in caching systems to quickly determine if a requested item is already present in the cache. The hash code acts as a key to identify and locate cached data, enabling efficient retrieval and storage.
  3. Dictionaries and Symbol Tables: Hash tables are often used to implement dictionaries and symbol tables, where data elements are stored as key-value pairs. The hash function is applied to the keys, allowing for efficient lookup and manipulation of values associated with specific keys.
  4. Password Storage: Hash functions are extensively used to store password information securely. Instead of storing actual passwords, their hash codes are stored. When a user enters their password, the hash function is applied to it, and the resulting hash code is compared with the stored hash code for verification.
  5. Checksums and Data Integrity: Hash functions are employed to generate checksums for data verification. By comparing the hash code of a received file or message with the expected hash code, you can determine if any data has been modified during transmission.

Considerations:

  1. Hash Function Quality: The quality of the hash function is crucial. It should produce a uniform distribution of hash codes to minimize collisions. A poor hash function can lead to clustering (uneven distribution of elements) and increased collisions, negatively impacting performance.
  2. Load Factor and Resizing: The load factor of a hash table represents the ratio of the number of elements to the number of buckets. Maintaining a low load factor reduces collisions. When the load factor exceeds a certain threshold, the hash table may need to be resized (expanded) to maintain performance.
  3. Memory Usage: Hash tables consume memory, and the size of the hash table needs to be chosen carefully. A smaller hash table may result in more collisions, while a larger one may waste memory. Balancing memory usage and performance is an important consideration.
  4. Collision Resolution Overhead: The chosen collision resolution technique impacts performance. Chaining incurs additional memory overhead for storing and managing linked lists, while open addressing may require more probing steps and result in increased search time.
  5. Hash Function Stability: Hash functions should produce consistent hash codes for the same input over time and across different executions. Otherwise, data retrieval and consistency could be compromised.

By understanding the principles and considerations of hashing, you can effectively use and implement hash-based data structures for efficient storage and retrieval of data in various applications.

 

Probing

Probing is a technique used in data structures, such as hash tables, that deal with collisions, which occur when two or more data elements map to the same index in the underlying data structure. Probing is a way to resolve collisions by finding an alternative slot or position to store the collided element. It involves a series of steps to search for the next available slot in the data structure.

 

Let’s explore probing in more detail:

  1. Collision in Hash Tables:
    – In a hash table, collisions occur when different data elements produce the same hash code or map to the same index in the hash table’s underlying array.
    – Collisions can be a result of imperfect hash functions or limitations in the size of the hash table relative to the number of data elements.
  2. Probing Techniques:
    – Probing techniques aim to find an alternative position for storing the collided element within the data structure.
    – There are different approaches to probing, including linear probing, quadratic probing, and double hashing.
  3. Linear Probing:
    – Linear probing is a simple and commonly used probing technique.
    – When a collision occurs, linear probing searches for the next available slot by linearly checking the subsequent slots in the data structure.
    – The probing sequence is typically defined as incrementing the index by a fixed value (usually 1) until an empty slot is found.
  4. Quadratic Probing:
    – Quadratic probing is another probing technique that reduces primary clustering, which is the tendency for collided elements to form clusters near each other.
    – When a collision occurs, quadratic probing uses a quadratic sequence to search for the next available slot.
    – The probing sequence is defined by incrementing the index by a quadratic function of an incrementing value.
    – Common quadratic functions include quadratic, cubic, or other polynomial functions.
  5. Double Hashing:
    – Double hashing is a more advanced probing technique that utilizes a second hash function to determine the next probing position.
    – When a collision occurs, double hashing calculates the offset for the next position by applying the secondary hash function to the collided element.
    – The secondary hash function should be carefully chosen to avoid producing the same offsets for different elements.
    – The probing sequence for double hashing can be more varied than linear or quadratic probing, depending on the implementation.
  6. Probing Limitations and Considerations:
    – Probing techniques have limitations and considerations that should be taken into account.
    – Load Factor: As the number of elements in the data structure increases, the probability of collisions also increases, affecting probing performance. Maintaining a low load factor is essential to minimize collisions and probing time.
    – Clustering: Probing can lead to secondary clustering, where elements tend to cluster around occupied positions. This clustering can impact performance, as it increases the likelihood of further collisions.
    – Probing Order: The order in which elements are probed during collision resolution can affect performance. Different probing orders can have varying degrees of clustering or spread of elements.

 

Probing techniques in detail

Linear probing is a simple and widely used probing technique in hash tables for resolving collisions. When a collision occurs, linear probing searches for the next available slot by incrementing the index linearly until an empty slot is found.

 

Here’s a detailed explanation of linear probing:

  1. Collision Occurrence:
    – Collisions happen when two or more data elements produce the same hash code or map to the same index in the hash table.
    – In such cases, linear probing is used to find an alternative slot to store the collided element.
  2. Probing Sequence:
    – Linear probing uses a probing sequence that increments the index by a fixed value (typically 1) for each probing attempt.
    – The probing sequence can be represented by the formula: `probe(i) = (hash(key) + i) % table_size`, where:
    – `probe(i)` is the index being probed.
    – `hash(key)` is the hash code of the key.
    – `i` is the number of probing attempts.
    – `table_size` is the size of the hash table.
  3. Collision Resolution:
    – When a collision occurs during an insert operation, linear probing starts searching for the next available slot.
    – It starts at the computed index based on the hash code and increments the index by 1 in each probing attempt.
    – Linear probing continues until it finds an empty slot or an available position in the hash table.
  4. Insertion Process:
    – To insert an element using linear probing:
    1. Compute the hash code of the element’s key.
    2. Map the hash code to an initial index in the hash table.
    3. If the initial index is unoccupied, insert the element at that index.
    4. If the initial index is occupied, start linear probing by incrementing the index by 1 and check the next slot.
    5. Repeat the probing process until an empty slot is found, and insert the element at that index.
  5. Retrieval Process:
    – To retrieve an element using linear probing:
    1. Compute the hash code of the key for the element to be retrieved.
    2. Map the hash code to the initial index.
    3. If the element is found at the initial index, return it.
    4. If not, start linear probing by incrementing the index by 1 and check the next slot.
    5. Repeat the probing process until the element is found or an empty slot is encountered.
    6. If an empty slot is encountered, the element does not exist in the hash table.
  6. Deletion Process:
    – Deletion using linear probing can be a bit more complex due to the need to maintain the integrity of the probing sequence.
    – When deleting an element, it is marked as deleted or replaced with a special value to signify its absence without breaking the probing sequence.
    – During subsequent searches, if a deleted element is encountered, the search continues until a non-deleted element or an empty slot is found.

Linear probing has some advantages, including simplicity and cache-friendly behavior. It often provides good performance when the load factor (the ratio of the number of elements to the size of the hash table) is relatively low. However, linear probing can suffer from primary clustering, where collided elements tend to form clusters near each other, leading to increased collisions and slower performance. To mitigate clustering, techniques like perturbation can be applied to perturb the linear probing sequence and improve the distribution of elements.

 

Quadratic probing is a probing technique used in hash tables to resolve collisions. It aims to reduce primary clustering, where collided elements tend to form clusters near each other. Instead of linearly incrementing the index, quadratic probing uses a quadratic sequence to determine the next probing position.

 

Here’s a detailed explanation of quadratic probing:

  1. Collision Occurrence:
    – When two or more elements produce the same hash code or map to the same index in the hash table, collisions occur.
    – Quadratic probing is used to find an alternative slot for storing collided elements.
  2. Probing Sequence:
    – Quadratic probing uses a quadratic sequence to determine the next probing position.
    – The probing sequence can be represented by the formula: `probe(i) = (hash(key) + c1 * i + c2 * i^2) % table_size`, where:
    – `probe(i)` is the index being probed.
    – `hash(key)` is the hash code of the key.
    – `i` is the number of probing attempts.
    – `c1` and `c2` are constants used in the quadratic function.
    – `table_size` is the size of the hash table.
  3. Collision Resolution:
    – When a collision occurs during an insert operation, quadratic probing starts searching for the next available slot.
    – It computes the next probing position using the quadratic sequence.
    – The probing sequence calculates the index by adding the hash code, a linear term multiplied by the probing attempt, and a quadratic term multiplied by the square of the probing attempt.
    – The modulus operation `table_size` is applied to ensure the index wraps around within the bounds of the hash table.
  4. Insertion Process:
    – To insert an element using quadratic probing:
    1. Compute the hash code of the element’s key.
    2. Map the hash code to an initial index in the hash table.
    3. If the initial index is unoccupied, insert the element at that index.
    4. If the initial index is occupied, start quadratic probing by calculating the next probing position using the probing sequence formula.
    5. Repeat the probing process until an empty slot is found, and insert the element at that index.
  5. Retrieval Process:
    – To retrieve an element using quadratic probing:
    1. Compute the hash code of the key for the element to be retrieved.
    2. Map the hash code to the initial index.
    3. If the element is found at the initial index, return it.
    4. If not, start quadratic probing by calculating the next probing position using the probing sequence formula.
    5. Repeat the probing process until the element is found or an empty slot is encountered.
    6. If an empty slot is encountered, the element does not exist in the hash table.
  6. Deletion Process:
    – Deletion using quadratic probing can be more complex due to the need to maintain the integrity of the probing sequence.
    – When deleting an element, it is marked as deleted or replaced with a special value to signify its absence without breaking the probing sequence.
    – During subsequent searches, if a deleted element is encountered, the search continues until a non-deleted element or an empty slot is found.

Quadratic probing offers better spreading of elements compared to linear probing, reducing primary clustering. However, it can still suffer from secondary clustering, where elements end up being placed in the same positions due to the nature of the quadratic sequence. The choice of `c1` and `c2` in the quadratic function affects the probing sequence, and careful selection can help improve the distribution of elements and reduce clustering.

 

By using quadratic probing, collisions in hash tables can be resolved more effectively, resulting in improved performance and reduced clustering.

 

However, there are a few considerations when using quadratic probing:

  1. Probing Sequence: The quadratic probing sequence allows for better spreading of elements compared to linear probing. However, it’s crucial to select appropriate values for `c1` and `c2` in the quadratic function. Careful selection helps avoid clustering and ensures that all positions in the hash table are probed eventually.
  2. Quadratic Probing Limitations: Quadratic probing can still suffer from secondary clustering, where elements tend to cluster in certain regions of the hash table. This occurs when the probing sequence encounters previously probed positions, leading to further collisions. As the load factor increases, the likelihood of clustering also increases, impacting performance.
  3. Table Size: The size of the hash table is crucial when using quadratic probing. It should be chosen carefully to balance the number of elements and the available slots. If the table size is too small relative to the number of elements, the probing sequence may not cover all positions, resulting in inefficient probing and increased collisions.
  4. Performance: Quadratic probing can provide better performance than linear probing, but it may not be as efficient as other advanced techniques like double hashing. In certain scenarios, the secondary clustering effect can still impact performance. Therefore, it’s important to consider the expected data distribution and load factor when choosing a probing technique.

Overall, quadratic probing offers an improvement over linear probing by reducing primary clustering. It provides a more evenly spread probing sequence, leading to better utilization of the hash table. However, it’s important to carefully select the parameters and monitor the load factor to ensure efficient collision resolution and minimize clustering effects.

 

Double hashing is an advanced probing technique used to resolve collisions in hash tables. It employs a secondary hash function to determine the next probing position when a collision occurs. By using two hash functions, double hashing offers a more varied and spread-out probing sequence, reducing clustering and improving the overall performance of the hash table.

 

Here’s a detailed explanation of double hashing:

  1. Collision Occurrence:
    – Collisions happen when two or more elements produce the same hash code or map to the same index in the hash table.
    – Double hashing is used to find an alternative slot for storing collided elements.
  2. Probing Sequence:
    – Double hashing uses a secondary hash function to determine the next probing position.
    – The probing sequence can be represented by the formula: `probe(i) = (hash1(key) + i * hash2(key)) % table_size`, where:
    – `probe(i)` is the index being probed.
    – `hash1(key)` is the primary hash function applied to the key.
    – `hash2(key)` is the secondary hash function applied to the key.
    – `i` is the number of probing attempts.
    – `table_size` is the size of the hash table.
  3. Collision Resolution:
    – When a collision occurs during an insert operation, double hashing calculates the next probing position using the secondary hash function.
    – The secondary hash function is applied to the key, and the resulting value is multiplied by the probing attempt `i`.
    – The primary hash function result and the multiplied secondary hash function result are then summed together.
    – The modulus operation `table_size` is applied to ensure the index wraps around within the bounds of the hash table.
  4. Insertion Process:
    – To insert an element using double hashing:
    1. Compute the hash code of the element’s key using the primary hash function.
    2. Map the hash code to an initial index in the hash table.
    3. If the initial index is unoccupied, insert the element at that index.
    4. If the initial index is occupied, start double hashing by calculating the next probing position using the probing sequence formula.
    5. Repeat the probing process until an empty slot is found, and insert the element at that index.
  5. Retrieval Process:
    – To retrieve an element using double hashing:
    1. Compute the hash code of the key for the element to be retrieved using the primary hash function.
    2. Map the hash code to the initial index.
    3. If the element is found at the initial index, return it.
    4. If not, start double hashing by calculating the next probing position using the probing sequence formula.
    5. Repeat the probing process until the element is found or an empty slot is encountered.
    6. If an empty slot is encountered, the element does not exist in the hash table.
  6. Deletion Process:
    – Deletion using double hashing can be more complex due to the need to maintain the integrity of the probing sequence.
    – When deleting an element, it is marked as deleted or replaced with a special value to signify its absence without breaking the probing sequence.
    – During subsequent searches, if a deleted element is encountered, the search continues until a non-deleted element or an empty slot is found.

Double hashing provides a more varied and spread-out probing sequence compared to linear probing or quadratic probing. This reduces clustering and improves the performance of the hash table. However, it requires evaluating both hash functions for each probing attempt, which may introduce a performance overhead compared to simpler probing techniques. Therefore, the choice of double hashing should consider the trade-off between reduced clustering and the additional computational cost of evaluating two hash functions.

  1. Choosing Hash Functions:
    – The selection of appropriate hash functions is critical for effective double hashing.
    – The primary and secondary hash functions should be independent and produce evenly distributed hash values to minimize collisions and clustering.
    – Ideally, the secondary hash function should have a different distribution than the primary hash function to ensure a more varied probing sequence.
  2. Resolving Clustering:
    – Double hashing helps alleviate primary clustering, where elements cluster around the same index due to collisions.
    – By using a secondary hash function, the probing sequence becomes more varied, distributing collided elements across different positions in the hash table.
    – This reduces the likelihood of elements repeatedly colliding and forming clusters, leading to improved overall performance.
  3. Handling Load Factor:
    – Like other probing techniques, the performance of double hashing can be influenced by the load factor, which is the ratio of the number of elements to the size of the hash table.
    – As the load factor increases, the likelihood of collisions also increases, affecting the efficiency of double hashing.
    – To maintain optimal performance, it is recommended to resize the hash table and rehash the elements if the load factor exceeds a certain threshold.
  4. Trade-offs:
    – Double hashing generally provides better performance compared to linear probing and quadratic probing by reducing clustering and improving the distribution of elements.
    – However, double hashing requires evaluating two hash functions, which may introduce additional computational overhead.
    – The choice of probing technique, including double hashing, depends on factors such as expected data distribution, load factor, and desired trade-offs between performance and complexity.

Double hashing is a powerful technique for collision resolution in hash tables. By using a secondary hash function, it offers a more varied and evenly spread probing sequence, reducing clustering and improving the overall efficiency of the hash table. With careful selection of hash functions and proper handling of the load factor, double hashing can be a highly effective solution for resolving collisions.

 

It’s worth noting that different probing techniques have their advantages and disadvantages. The choice of a probing technique depends on the specific requirements of the data structure and the expected data distribution.

 

By using probing techniques, collisions in data structures can be resolved, allowing for efficient storage and retrieval of data elements even when they map to the same position. Probing strategies help maintain a balanced distribution of elements and improve the overall performance of the data structure.

Related Links

Leave a Reply