Data Structure - Ch5 Hashing & Sorting | Mr. Opengate

2015年1月18日星期日

Data Structure - Ch5 Hashing & Sorting

於 1/18/2015 08:04:00 下午沒有留言:

標籤： Computer Science-Data Structure

Chapter 5 Hashing & Sorting

5.1 Hashing

Hash 定義

Dictionary problem, maintain a dictionary and support operations Get, Insert, and Delete be done in O(1) expected time
Static hash存在table中(array)，hash table size = Bucket * Slot，例如， int hash_table[B][S]，則B為分類數，S為每一類的承載量。

Hash 相關術語

Key density：n/T，例如，本文單字量/全球單字量，通常Key density的數值非常小。
Loading density：n /B*S，loading高表示hash table利用度高，collision也會偏高。
Collision：home bucket of a new pair is not empty.
Overflow：home bucket for a new pair is full. (有collision 不一定overflow，若Bucket只有一個slot, 則collision = overflow)

Hash 設計：一個良好的hash function須滿足

計算簡單
collision少
hash table 不局部偏重

perfect hash function : 保證無collision，data已知。
uniform hash function : 均勻的，對應到每個bucket機率相等。

常見的Hash function

Division：h(k) = k%D，In practice, we choose D that has no prime factor smaller than 20
Multiplication：
h(k) = (A·k mod 2^w) right_shift (w–r)

Mid-Square：compute the home bucket by taking square of k, using an appropriate number of bits from the middle of the square
Folding：If k = 12320324111220, partition into 3 decimal digits long, we have
P1 =123, P2 =203, P3 =241, P4 =112, P5=20, 􏰀

Shift, 直接相加：h(k)= Pi =123+203+241+112+20=699
Boundary, 折紙, 偶數片段反向：h(k)=123+302+241+211+20=897

Hash的Overflow處理

Chaining：a bucket is a pointer to a list of keys having same home bucket
Open Addressing (rehashing)

Linear probing：(h(k) + i)%b

缺點：形成primary clustering，具有相同hash address的資料容易聚集在鄰近的Bucket中，造成search time增加

Quadratic probing：(h(k) + i^2)%b and (h(k) − i^2)%b.

缺點：hash table 不保證充分利用，會有secondary clustering problem, 探測順序皆相同, 增加search time

Double hashing：h(k,i) = ( h1(k) + i*h2(k) ) mod m

This method generally produces excellent results, but h2(k) must be relatively prime to m. One way is to make m a power of 2 and design h2(k) to produce only odd numbers

Security Hash function 必要的特性

input 可變動大小
output 長度固定
Hash function 計算有效率
已知H(x), 難找x (preimage resistant)
已知x, 難找x != y 且 H(y) = H(x) (second preimage resistant, weak collision resistant
難找任一對x, y 使H(x) = H(y) (strong collision resistant)
peudorandomness

5.2 Sorting

排序的種類

Internal sort 與 External sort

Internal sort：排序工作主要在memory完成，資料量較少者適用。
External sort：排序工作主要是在disk完成，資料量較大者適用。

stable sorting 與 unstable sorting

stable sorting：如果鍵值相同之資料，在排序後相對位置與排序前相同。
排序前 : 3,5,19,1,3*,10
排序後 : 1,3,3*,5,10,19
(因為兩個 3, 3*的相對位置在排序前與後皆相同。)

simple sorting 與 Advanced sorting

常見的排序（重要）

Bubble sorting：每一回合逐一比較相臨資料，依排序之順序交換位置，每回合至少會有一次交換位置，至沒交換位置則停止。
Selection sorting：未排序前之第一筆資料可視為已排序好之資料，每一回合逐一比較相臨資料，依排序之順序交換位置。
Insertion sorting：將待排序之資料逐一與已排序後之資料比較，再將資料放入適當位置
Merge sorting：將兩個已排序過的記錄合併而得到另一個排序好的記錄。
Heap Sort：將資料先以 “Bottom-up” 的方式建立Max-Heap，再執行 n-1 回合的 “Delete Max.” 動作

5.3 Quick sorting

原理：找資料pivot，將小於pivot放右邊，大於pivot放左邊，再以相同方法分左右兩邊序列直到排序完成。
性質：最差 O(n2)與平均時間 O(nlog2n)，需要額外堆疊空間，不穩定排序，快速排序是平均時間最快之內部排序法。
pseudocode for quick sort ：Initial call: QUICKSORT(A, 1, n)

QUICKSORT(A, p, r)
if p < r
then q ← PARTITION(A, p, r)

QUICKSORT(A, p, q–1)
QUICKSORT(A, q+1, r)

Partitioning subroutine (in-place version)：

Analysis of quick sort：Randomized quick sort, we can make sure we are usually lucky

L(n) = 2U(n/2) + Θ (n) lucky
U(n)= L(n –1) + Θ (n) unlucky

Summary of randomized order-statistic selection

Works fast: linear expected time.
Excellent algorithm in practice.
But, the worst case is very bad: Θ(n^2).

Q. Is there an algorithm that runs in linear time in the worst case?
A. Yes, due to Blum, Floyd, Pratt, Rivest, and Tarjan [1973].
IDEA : Generate a good pivot recursively.

[改善pivot方法1] Median-of-3 Partition

choose the pivot as the median of a set of 3 elements randomly selected from the array，for example…

What is the probability of getting an OK split if the pivot is chosen at random? (A split is “OK” if the smaller piece has at least n/4 elements.)
what is the probability of getting an OK split with Median-of-3 Partition?

[改善pivot方法2] columns-of-5 Partition

Choose the pivot : At least half the group medians are ≤ x , which is at least n/5/2= n/10 group medians.
Therefore, at least 3n/10 elements are ≤ x.

In practice, this algorithm runs slowly, because the constant in front of n is large.
The randomized algorithm is far more practical.
Exercise: Why not divide into groups of 3? 會失敗！

5.4 Linear-Time Sorting

Decision-tree model

Theorem：A decision tree can model the execution of any comparison sort:
Theorem：Any decision tree that can sort n elements must have height Ω(nlgn).

Counting sort: No comparisons between elements.

Input: A[1 . . n], where A[j]∈{1, 2, ..., k}.
Output: B[1 . . n], sorted.
Auxiliary storage: C[1 . . k].

原理

計算每個key的出現次數
計算「之後排好串列」的起始位置
放進去,C[A[j]] - 1使insert時按順序且stable

結論

O(n + k), 假設key的範圍是受到限制的則 k 可視為constant
手上有counting sort可以排0~9 key的範圍可否排00~99, 000~999?
可! 用radix sort , LSD

Radix sort:

可以降成O(n)的理由：若可知key的範圍限制，就可知要做幾回合
ex : 3位數需要3回合
原理

radix sort 結論

time : 每回合O(n + r), 總共花O(d*(n+r))

分派 : O(n), n個數字要分
合併 : O(r), r個桶子
d:回合數

space : Buckets space需求龐大, O(r*n)
stable, linear time
Downside: Unlike quicksort, radix sort displays little locality of reference, and thus a well-tuned quicksort fares better on modern processors, which feature steep memory hierarchies.

沒有留言:

張貼留言

訂閱：張貼留言 (Atom)

© 2025 Mr. Opengate