Chapter 5   Hashing & Sorting
5.1   Hashing
- Hash 定義
 - Dictionary problem, maintain a dictionary and support operations Get, Insert, and Delete be done in O(1) expected time
 - Static hash存在table中(array),hash table size = Bucket * Slot,例如, int hash_table[B][S],則B為分類數,S為每一類的承載量。
 - Hash 相關術語
 - Key density:n/T,例如,本文單字量/全球單字量,通常Key density的數值非常小。
 - Loading density:n /B*S,loading高表示hash table利用度高,collision也會偏高。
 - Collision:home bucket of a new pair is not empty.
 - Overflow:home bucket for a new pair is full. (有collision 不一定overflow,若Bucket只有一個slot, 則collision = overflow)
 - Hash 設計:一個良好的hash function須滿足
 - 計算簡單
 - collision少
 - hash table 不局部偏重
 - perfect hash function : 保證無collision,data已知。
 - uniform hash function : 均勻的,對應到每個bucket機率相等。
 
- 常見的Hash function
 - Division:h(k) = k%D,In practice, we choose D that has no prime factor smaller than 20
 - Multiplication:
h(k) = (A·k mod 2^w) right_shift (w–r) 
- Mid-Square:compute the home bucket by taking square of k, using an appropriate number of bits from the middle of the square
 - Folding:If k = 12320324111220, partition into 3 decimal digits long,  we have 
P1 =123, P2 =203, P3 =241, P4 =112, P5=20,  - Shift, 直接相加:h(k)= Pi =123+203+241+112+20=699
 - Boundary, 折紙, 偶數片段反向:h(k)=123+302+241+211+20=897
 - Hash的Overflow處理
 - Chaining:a bucket is a pointer to a list of keys having same home bucket
 - Open Addressing (rehashing)
 - Linear probing:(h(k) + i)%b
 - 缺點:形成primary clustering,具有相同hash address的資料容易聚集在鄰近的Bucket中,造成search time增加
 - Quadratic probing:(h(k) + i^2)%b and (h(k) − i^2)%b.
 - 缺點:hash table 不保證充分利用,會有secondary clustering problem, 探測順序皆相同, 增加search time
 - Double hashing:h(k,i) = ( h1(k) + i*h2(k) ) mod m
 - This method generally produces excellent results, but h2(k) must be relatively prime to m. One way is to make m a power of 2 and design h2(k) to produce only odd numbers
 
- Security Hash function 必要的特性
 - input 可變動大小
 - output 長度固定
 - Hash function 計算有效率
 - 已知H(x), 難找x (preimage resistant)
 - 已知x, 難找x != y 且 H(y) = H(x) (second preimage resistant, weak collision resistant
 - 難找任一對x, y 使H(x) = H(y) (strong collision resistant)
 - peudorandomness
 
5.2   Sorting
- 排序的種類
 - Internal sort 與 External sort
 - Internal sort:排序工作主要在memory完成,資料量較少者適用。
 - External sort:排序工作主要是在disk完成,資料量較大者適用。
 - stable sorting 與 unstable sorting
 - stable sorting:如果鍵值相同之資料,在排序後相對位置與排序前相同。
排序前 : 3,5,19,1,3*,10
排序後 : 1,3,3*,5,10,19
(因為兩個 3, 3*的相對位置在排序前與後皆相同。) - simple sorting 與 Advanced sorting
 - 常見的排序(重要)
 
- Bubble sorting:每一回合逐一比較相臨資料,依排序之順序交換位置,每回合至少會有一次交換位置,至沒交換位置則停止。
 - Selection sorting:未排序前之第一筆資料可視為已排序好之資料,每一回合逐一比較相臨資料,依排序之順序交換位置。
 - Insertion sorting:將待排序之資料逐一與已排序後之資料比較,再將資料放入適當位置
 - Merge sorting:將兩個已排序過的記錄合併而得到另一個排序好的記錄。
 - Heap Sort:將資料先以 “Bottom-up” 的方式建立Max-Heap,再執行 n-1 回合的 “Delete Max.” 動作
 
5.3   Quick sorting
- 原理:找資料pivot,將小於pivot放右邊,大於pivot放左邊,再以相同方法分左右兩邊序列直到排序完成。
 - 性質:最差 O(n2)與平均時間 O(nlog2n),需要額外堆疊空間,不穩定排序,快速排序是平均時間最快之內部排序法。
 - pseudocode for quick sort :Initial call: QUICKSORT(A, 1, n)
 - QUICKSORT(A, p, r) 
if p < r
then q ← PARTITION(A, p, r) - QUICKSORT(A, p, q–1)
 - QUICKSORT(A, q+1, r)
 
- Partitioning subroutine (in-place version):
 
- Analysis of quick sort:Randomized quick sort, we can make sure we are usually lucky
 - L(n) = 2U(n/2) + Θ (n) lucky
 - U(n)= L(n –1) + Θ (n) unlucky
 
- Summary of randomized order-statistic selection
 
- Works fast: linear expected time.
 - Excellent algorithm in practice.
 - But, the worst case is very bad: Θ(n^2).
 - Q. Is there an algorithm that runs in linear time in the worst case? 
A. Yes, due to Blum, Floyd, Pratt, Rivest, and Tarjan [1973].
IDEA : Generate a good pivot recursively. - [改善pivot方法1] Median-of-3 Partition
 - choose the pivot as the median of a set of 3 elements randomly selected from the array,for example…
 
- What is the probability of getting an OK split if the pivot is chosen at random? (A split is “OK” if the smaller piece has at least n/4 elements.)
 - what is the probability of getting an OK split with Median-of-3 Partition?
 
- [改善pivot方法2] columns-of-5 Partition
 
- Choose the pivot : At least half the group medians are ≤ x , which is at least n/5/2= n/10 group medians.
 - Therefore, at least 3n/10 elements are ≤ x.
 - In practice, this algorithm runs slowly, because the constant in front of n is large. 
The randomized algorithm is far more practical. - Exercise: Why not divide into groups of 3? 會失敗!
 
5.4   Linear-Time Sorting
- Decision-tree model
 - Theorem:A decision tree can model the execution of any comparison sort:
Theorem:Any decision tree that can sort n elements must have height Ω(nlgn). 
- Counting sort: No comparisons between elements.
 - Input: A[1 . . n], where A[j]∈{1, 2, ..., k}.
 - Output: B[1 . . n], sorted.
 - Auxiliary storage: C[1 . . k].
 
- 原理
 - 計算每個key的出現次數
 - 計算「之後排好串列」的起始位置
 - 放進去,C[A[j]] - 1使insert時按順序且stable
 - 結論
 - O(n + k), 假設key的範圍是受到限制的則 k 可視為constant
 - 手上有counting sort可以排0~9 key的範圍可否排00~99, 000~999?   
可! 用radix sort , LSD 
- Radix sort:
 - 可以降成O(n)的理由:若可知key的範圍限制,就可知要做幾回合
ex : 3位數需要3回合 - 原理
 
- radix sort 結論
 - time : 每回合O(n + r), 總共花O(d*(n+r))
 - 分派 : O(n), n個數字要分
 - 合併 : O(r), r個桶子
 - d:回合數
 - space : Buckets space需求龐大, O(r*n)
 - stable, linear time
 - Downside: Unlike quicksort, radix sort displays little locality of reference, and thus a well-tuned quicksort fares better on modern processors, which feature steep memory hierarchies.
 

















沒有留言:
張貼留言