Hashing
Concept
key -> [hash function] --> value ------------------> | hash table |
| | |
(collision) | |
| | |
[rehash func] -> value -----------> |____________|
- buckets:
- uni-address(index), with several data in each one.
- slots:
- one data in a single slot.
- Identifier Density:
- n: 表內目前有幾筆資料
- T: Key possibilities
- Loading Density:
- Usability
用途
Hashing的insertion, deletion可以在constant time作完,因為只需要hash function就可以知道資料是在哪個bucket。
分類
- Static hashing
- Hash func是固定的,不論資料的分佈及數量是多少都不會改變
- Dynamic hashing (extensible hashing、linear hashing)
- 資料的觀點來說是動態,根據資料的不同來調整hash function
- Uniform hashing function
- 計算出的hashing address對應到的buckets機率均勻相同。
- Perfect hashing function
- 任兩個key的value均不相同。
Hash Overflow Handling
- Open addressing (Closed Hashing)
- When a new identifier is hashed into a full bucket, find the closest unfilled bucket.
- Linear Probing, Quadratic Probing.
- Chaining (Open Hashing, Separate Chaining)
- 用linked list串接同個位置。
Double Hashing
Double hashing uses the idea of applying a second hash function to the key when a collision occurs. The result of the second hash function will be the number of positions from the point of collision to insert.
- A popular second hash function is:
- R is a prime number (smaller than table size)
- Example:
Performance
The performance of closed hashing becomes very bad when the load factor approaches 1, because a long sequence of array indices may need to be tried for any given element -- possibly every element in the array! Therefore it is important to resize the array when the load factor exceeds 2/3 or so. Open hashing, by contrast, suffers gradually declining performance as the load factor grows, and no fixed point beyond which resizing is absolutely needed.
Refer to: http://www.cs.cornell.edu/courses/cs312/2008sp/lectures/lec20.html
Naming Confusion
Meaning of Open hashing and Closed hashing
The use of "closed" vs. "open" reflects whether or not we are locked in to using a certain position or data structure.
For instance, the "open" in "open addressing" tells us the index (aka. address) at which an object will be stored in the hash table is not completely determined by its hash code. Instead, the index may vary depending on what's already in the hash table.
The "closed" in "closed hashing" refers to the fact that we never leave the hash table; every object is stored directly at an index in the hash table's internal array. Note that this is only possible by using some sort of open addressing strategy.
Contrast this with open hashing - in this strategy, none of the objects are actually stored in the hash table's array; instead once an object is hashed, it is stored in a list which is separate from the hash table's internal array. "open" refers to the freedom we get by leaving the hash table, and using a separate list.
In short, "closed" always refers to some sort of strict guarantee, like when we guarantee that objects are always stored directly within the hash table (closed hashing).
