University of Veszprem
Hungary
DATA STRUCTURES
AND
ALGORITHM ANALYSIS
lecture notes
Sandor Dominich
Adrienn Skrop
Maria Horvath
2001.
Mathematics Review
1. Exponents
a^b · a^c = a^(b+c)
a^b / a^c = a^(b−c)
(a^b)^c = a^(bc)
a^n + a^n = 2a^n ≠ a^(2n)
2^n + 2^n = 2^(n+1)
2. Logarithms
In these notes, log typically means log base 2: log₂ n is written simply log n.
Definition: log_a b = x  ⟺  a^x = b
Properties:
  log(ab) = log a + log b
  log(a/b) = log a − log b
  log_a b = log_c b / log_c a
  log 1 = 0
  log 2 = 1
Note: ln denotes log_e and lg denotes log_10; writing log for log₂ is just a convention (a compromise).
3. Series
Geometric series

  Σ_{i=0}^{n} a^i = (a^(n+1) − 1) / (a − 1)

  a = 2:  Σ_{i=0}^{n} 2^i = 2^(n+1) − 1

  0 < a < 1:  Σ_{i=0}^{∞} a^i = 1 / (1 − a)

Derivation (the '=' is understood for n → ∞):
  1 + a + a² + a³ + ... = S
  a + a² + a³ + a⁴ + ... = aS
Subtracting: S − aS = 1, so

  S = 1 / (1 − a)

Note: only valid if the series converges (|a| < 1).

HW: show that  Σ_{i=1}^{∞} i/2^i = 2
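The geometric-series formulas of this section can be checked numerically; the sketch below (an illustration, not part of the original notes) verifies the finite formula, the infinite limit, and the HW sum.

```python
# Numerical check of the geometric-series formulas above.

def geometric_sum(a, n):
    """Sum_{i=0}^{n} a^i computed term by term."""
    return sum(a ** i for i in range(n + 1))

# Finite case: sum_{i=0}^{n} a^i = (a^(n+1) - 1) / (a - 1); here a = 2, n = 10.
assert geometric_sum(2, 10) == (2 ** 11 - 1) // (2 - 1) == 2047

# Infinite case (0 < a < 1): partial sums approach 1 / (1 - a).
partial = geometric_sum(0.5, 60)
assert abs(partial - 1 / (1 - 0.5)) < 1e-12

# HW: sum_{i=1}^{inf} i / 2^i = 2 (partial sums approach 2).
hw = sum(i / 2 ** i for i in range(1, 60))
assert abs(hw - 2.0) < 1e-12
```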
Arithmetic series
  Σ_{i=1}^{n} i = n(n + 1) / 2

  Σ_{i=1}^{n} i² = n(n + 1)(2n + 1) / 6

Harmonic number

  H_n = Σ_{i=1}^{n} 1/i = log_e n + γ_n

  γ_n → γ = 0.57721566... (Euler's constant)

  H_n ≈ ln n
4. Proofs
1. Proof by induction
We want to prove that a proposition p(n) holds.
1. Prove that p(n) holds for a base case n = n0.
2. Assume p(n) holds for some n ≥ n0.
3. Prove p(n) ⟹ p(n + 1).

Fibonacci numbers
  F0 = 1, F1 = 1 (by definition)
  F2 = F1 + F0 = 2
  Fi = Fi−1 + Fi−2,  i ≥ 2
Claim (to be proved by induction below): Fi < (5/3)^i,  i ≥ 1
2. Proof by contradiction (indirect, reductio ad absurdum)
To prove: P
1. Assume that ¬P holds.
2. Derive a contradiction with some known property; then:
3. P must be true.
Ex.:
P: there are infinitely many primes
¬P: there are finitely many primes: P1, P2, ..., Pn
Let N = P1 · P2 · ... · Pn + 1.
N > Pi, i = 1, ..., n, so N is not prime (it is not on the list).
But rem(N : Pi) = 1 ≠ 0 for every i, so no prime divides N, which contradicts
the Fundamental Theorem of Arithmetic.
We MUST conclude: ¬P is false, so P is true.

Ex.: √2 is irrational, i.e. √2 ∈ I = R \ Q
P: √2 is irrational
¬P: √2 is rational
  √2 = a/b, gcd(a, b) = 1
  2b² = a², so a is even: a = 2c
  then b² = 2c², so b is even: b = 2d
  but then gcd(a, b) ≠ 1, a contradiction.

Note on forming ¬P:
P: The sun is shining.
¬P: The sun is NOT shining.
("It is not the sun that is shining" is not the negation; using it leads to a false contradiction.)
Negation needs care with complex statements, e.g.:
P: The rain, whose mean value in South Africa exceeds that of Central Europe in
August, is only half welcome in parts of a rain forest.
or with quantified statements such as the Cauchy criterion:
  ∀ε > 0 ∃S: |x_n − x_m| < ε for all n, m > S.
3. Evaluate:

  Σ_{i=0}^{∞} 1/4^i ;   Σ_{i=0}^{∞} i/4^i

Prove:  Σ_{i=1}^{n} (2i − 1) = n²
Asymptotic notation
Θ-notation

  Θ(g(n)) = { f(n) | ∃ c1, c2, n0 : 0 ≤ c1·g(n) ≤ f(n) ≤ c2·g(n), ∀n > n0 }   (def)

[Figure: f(n) running between c1·g(n) and c2·g(n) for n > n0]

Θ: asymptotically tight bound
Ex.:
  f(n) = n²/2 − 3n
  c1·n² ≤ f(n) ≤ c2·n²
  e.g. c1 = 1/6, c2 = 1/2: the upper bound always holds, and the lower bound
  n²/2 − 3n ≥ n²/6 holds for n ≥ 9, so n0 = 9 works;
  f(n) = Θ(n²)
HW: show that 6n³ ≠ Θ(n²)

Property:
  p(n) = Σ_{i=0}^{d} a_i n^i, a_d > 0  ⟹  p(n) = Θ(n^d)
  c = const  ⟹  c = Θ(n^0) = Θ(1)
O-notation

  O(g(n)) = { f(n) | ∃ c, n0 : 0 ≤ f(n) ≤ c·g(n), ∀n ≥ n0 }   (def)

  n² = O(n²)
  n = O(n²)

[Figure: c·g(n) above f(n) beyond n0, illustrating f(n) = O(g(n))]

O: upper bound for f(n)

Ω-notation

  Ω(g(n)) = { f(n) | ∃ c, n0 : 0 ≤ c·g(n) ≤ f(n), ∀n ≥ n0 }   (def)

[Figure: f(n) above c·g(n) beyond n0, illustrating f(n) = Ω(g(n))]

Ω: lower bound for f(n)
  Ex.: n³ = Ω(n²), n² = Ω(n²), n log n = Ω(n)
Ex.:
  i, S: integers                  (declaration takes no time)
  S := 0                          (assignment: 1 unit)
  FOR i = 1 TO n                  (loop control: n + 1 units)
      S = S + i * i               (1 multiplication + 1 addition + 1 assignment per iteration)

Measuring effective running time in such units is the core of algorithm analysis.
Total: 1 + (n + 1) + 3n = 4n + 2 = Θ(n) = O(n)
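The unit-counting convention above can be made concrete in code; the sketch below (illustrative, with the accounting convention stated in the comments as an assumption) counts one unit for the initial assignment, one per loop test, and three per loop body.

```python
def sum_of_squares_with_count(n):
    """Compute S = 1^2 + ... + n^2 while counting abstract time units:
    1 for the initial assignment, n + 1 for the loop tests,
    and 3 per iteration (multiply, add, assign)."""
    units = 0
    S = 0
    units += 1                 # S := 0
    for i in range(1, n + 1):
        units += 1             # loop test
        S = S + i * i
        units += 3             # *, +, assignment
    units += 1                 # final (failing) loop test
    return S, units

S, units = sum_of_squares_with_count(10)
assert S == 385                # 10 * 11 * 21 / 6
assert units == 4 * 10 + 2     # 4n + 2 = Theta(n)
```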
Similarly, for nested loops (a loop in a loop):
  FOR ...
      FOR ...
          ...
the running time is Θ(n²).
1. Calculate:

  S = Σ_{i=1}^{∞} i/2^i = 1/2 + 2/4 + 3/8 + ...

  S/2 = 1/4 + 2/8 + 3/16 + ...

Subtracting:

  S − S/2 = 1/2 + 1/4 + 1/8 + ... = 1

so S/2 = 1, i.e. S = 2.
2. Prove:
a)  Fi < (5/3)^i,  i ≥ 1

  i = 1:  F1 = 1 < 5/3
  i = 2:  F2 = 2 < 25/9

Assume the claim holds up to i; show it for i + 1:
  Fi+1 = Fi + Fi−1 < (5/3)^i + (5/3)^(i−1)
       = (3/5 + (3/5)²) · (5/3)^(i+1)
       = (24/25) · (5/3)^(i+1)
       < (5/3)^(i+1)
b)  Σ_{i=1}^{n} i = n(n + 1)/2

  n = 1:  1 = 1·2/2 = 1
  Assume Σ_{i=1}^{n} i = n(n + 1)/2. Then
  Σ_{i=1}^{n+1} i = n(n + 1)/2 + (n + 1) = (n + 1)(n + 2)/2,
  since n(n + 1) + 2(n + 1) = (n + 1)(n + 2).
c)  Σ_{i=1}^{n} i² = n(n + 1)(2n + 1)/6

  n = 1:  1 = (1·2·3)/6
  n → n + 1:
  Σ_{i=1}^{n+1} i² = Σ_{i=1}^{n} i² + (n + 1)²
                  = n(n + 1)(2n + 1)/6 + (n + 1)²
                  = (n + 1) · [n(2n + 1) + 6n + 6] / 6
                  = (n + 1)(2n² + 7n + 6)/6
                  = (n + 1)(n + 2)(2n + 3)/6,
  since (n + 2)(2n + 3) = 2n² + 7n + 6, which is true.
3.  Σ_{i=1}^{n} (2i − 1) = n²

  Σ_{i=1}^{n} (2i − 1) = 2·Σ_{i=1}^{n} i − Σ_{i=1}^{n} 1
                       = 2 · n(n + 1)/2 − n = n² + n − n = n²
4. Prove:  Σ_{i=1}^{n−2} Fi = Fn − 2

  n = 3:  Σ_{i=1}^{1} Fi = F1 = 1 = F3 − 2 = 3 − 2
  Assume Σ_{i=1}^{n−2} Fi = Fn − 2. Then
  Σ_{i=1}^{n−1} Fi = Σ_{i=1}^{n−2} Fi + Fn−1 = (Fn − 2) + Fn−1 = Fn+1 − 2,
  since Fn+1 − 2 = Fn + Fn−1 − 2.
Asymptotic notation in equations

  n = O(n²)
  2n² + 3n + 1 = 2n² + Θ(n)   (there exist some functions in Θ(n) such that the equation holds)
  2n² + Θ(n) = Θ(n²)

o- and ω-notation
  upper bound: O may be tight, o is strictly loose
  lower bound: Ω may be tight, ω is strictly loose

  f ∈ o(g)  ⟺  lim_{n→∞} f(n)/g(n) = 0     (g(n) grows much more rapidly than f(n))
  f ∈ ω(g)  ⟺  lim_{n→∞} f(n)/g(n) = ∞     (f(n) grows much more rapidly than g(n))
Properties
Transitivity
  ( f(n) = X(g(n)) ∧ g(n) = X(h(n)) ) ⟹ f(n) = X(h(n)),  where X ∈ {O, Ω, Θ, o, ω}
Reflexivity
  f(n) = X(f(n)),  where X ∈ {O, Ω, Θ}
Symmetry
  f(n) = Θ(g(n)) ⟺ g(n) = Θ(f(n))
Analogy between asymptotic notation and real numbers
  f(n) = O(g(n))  ~  a ≤ b
  f(n) = Ω(g(n))  ~  a ≥ b
  f(n) = Θ(g(n))  ~  a = b
  f(n) = o(g(n))  ~  a < b
  f(n) = ω(g(n))  ~  a > b
Trichotomy
Dichotomy: Boolean logic (two values)
Trichotomic: three different values
Trichotomy: a, b ∈ R: exactly one of the following holds:
  a < b
  a = b
  a > b
Note: this holds for every pair of real numbers, and we can prove it.
Asymptotic notation: trichotomy doesn't hold.
Not any two functions can be compared in an asymptotic sense.
Ex.: f(n) = n,  g(n) = n^(1+sin n)
They can't be compared (the exponent oscillates between 0 and 2).
f(n) = n,  g(n) = n^(sin n)
They can be compared: f(n) is a tight upper bound for g(n).
Floor, ceiling
Floor of x ∈ R:
  ⌊x⌋: the greatest integer less than or equal to x
Ceiling of x ∈ R:
  ⌈x⌉: the least integer greater than or equal to x.
  x − 1 < ⌊x⌋ ≤ x ≤ ⌈x⌉ < x + 1
  n ∈ Z:  ⌈n/2⌉ + ⌊n/2⌋ = n
Rate of growth of polynomials and exponentials
  lim_{x→∞} x^n / a^x = 0,   n ∈ N \ {0},  a ∈ (1, ∞)

Any exponential grows faster than any polynomial!

Series
  e^x = Σ_{i=0}^{∞} x^i / i!
  e^x ≥ 1 + x
  1 + x ≤ e^x ≤ 1 + x + x²,  |x| ≤ 1
  x → 0:  e^x = 1 + x + Θ(x²)
  e^x = lim_{n→∞} (1 + x/n)^n
  ln(1 + x) = x − x²/2 + x³/3 − ...,  |x| < 1
Stirling approximation:

  n! = √(2πn) · (n/e)^n · (1 + Θ(1/n))

  √(2πn) · (n/e)^n ≤ n! ≤ √(2πn) · (n/e)^n · e^(1/(12n))
Fibonacci numbers and the Golden ratio
  Fi = Fi−1 + Fi−2,  i ≥ 2,  F0 = 0, F1 = 1

Golden section of a segment AB by a point P:
  AB/AP = AP/PB
  AP = AB · (√5 − 1)/2
  AB = 1:  AP = (√5 − 1)/2 ≈ 0.61803

  φ = (1 + √5)/2,  φ' = (1 − √5)/2

The relation between the Fibonacci numbers and the Golden ratio (Binet's formula):

  Fi = (φ^i − φ'^i) / √5

Fibonacci numbers grow exponentially.
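A quick numeric check of the relation between the Fibonacci numbers and the golden ratio (a sketch; the function names are illustrative):

```python
import math

# Check Binet's formula F_i = (phi^i - phihat^i) / sqrt(5) against the
# recurrence, with F0 = 0, F1 = 1 as in this section.

phi = (1 + math.sqrt(5)) / 2
phihat = (1 - math.sqrt(5)) / 2

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

for i in range(20):
    binet = (phi ** i - phihat ** i) / math.sqrt(5)
    assert round(binet) == fib(i)

# Exponential growth: F_i is the nearest integer to phi^i / sqrt(5).
assert fib(30) == round(phi ** 30 / math.sqrt(5))
```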
Summations
Sequence: a1, ..., an  (finite)
Series: a1 + ... + an + ... = Σ_{i=1}^{∞} ai  (infinite)
  lim_{n→∞} Σ_{k=1}^{n} ak : if this limit exists, we have a convergent series (else divergent).

Linearity
  Σ_{k=1}^{n} (c·ak + bk) = c·Σ_{k=1}^{n} ak + Σ_{k=1}^{n} bk
  Σ_{k=1}^{n} Θ(f(k)) = Θ( Σ_{k=1}^{n} f(k) )
Differentiation
  Σ_{k=0}^{∞} x^k = 1/(1 − x),  |x| < 1
Differentiating both sides and multiplying by x:
  Σ_{k=0}^{∞} k·x^k = x / (1 − x)²
Telescoping
Telescoping series (sums)
  Σ_{k=1}^{n} (ak − ak−1) = (a1 − a0) + (a2 − a1) + (a3 − a2) + ... + (an − an−1) = an − a0

Ex.:
  Σ_{k=1}^{n−1} 1/(k(k+1)) = Σ_{k=1}^{n−1} (1/k − 1/(k+1)) = 1 − 1/n
Products
  lg Π_{k=1}^{n} ak = Σ_{k=1}^{n} lg ak
Bounding summations
By induction
Prove:
  Σ_{k=0}^{n} 3^k = O(3^n),  i.e.  ∃c: Σ_{k=0}^{n} 3^k ≤ c·3^n

Initial value: n = 0:  1 ≤ c·1, so c ≥ 1.
Assume Σ_{k=0}^{n} 3^k ≤ c·3^n. Then
  Σ_{k=0}^{n+1} 3^k = Σ_{k=0}^{n} 3^k + 3^(n+1)
                   ≤ c·3^n + 3^(n+1)
                   = (1/3 + 1/c) · c·3^(n+1)
                   ≤ c·3^(n+1),  provided 1/3 + 1/c ≤ 1 (e.g. c = 3/2).
By bounding terms
  Σ_{k=1}^{n} ak ≤ n·amax,  where amax = max_k ak

Ex.:
  Σ_{k=1}^{n} k ≤ n·n = n²
By splitting (n even):
  Σ_{k=1}^{n} k = Σ_{k=1}^{n/2} k + Σ_{k=n/2+1}^{n} k
              ≥ 0 + Σ_{k=n/2+1}^{n} (n/2)
              = (n/2)·(n/2) = n²/4
so Σ_{k=1}^{n} k = Ω(n²).
By integrals
For monotonically increasing f:
  ∫_{m−1}^{n} f(x)dx ≤ Σ_{k=m}^{n} f(k) ≤ ∫_{m}^{n+1} f(x)dx
(for decreasing f the inequalities are reversed).

Ex.:
  Σ_{k=1}^{n} 1/k ≥ ∫_{1}^{n+1} dx/x = ln(n + 1)

[Figure: unit-width rectangles of height f(k) between the curve y = f(x), x from m−1 to n+1]
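The integral-bounding idea can be checked on the harmonic number; the sketch below (illustrative) uses the standard consequence ln(n + 1) ≤ H_n ≤ ln(n) + 1 of the bounds above for f(x) = 1/x.

```python
import math

# Check ln(n + 1) <= H_n <= ln(n) + 1, which follows from the integral bounds
# applied to the decreasing function f(x) = 1/x.

def harmonic(n):
    return sum(1.0 / k for k in range(1, n + 1))

for n in (10, 100, 1000):
    H = harmonic(n)
    assert math.log(n + 1) <= H <= math.log(n) + 1
```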
Elementary Data Structures
Data should be organised into structures so that it can be processed as required by the problem.
Elementary (basic or fundamental):
There are just a few elementary data structures. All the others rely on the elementary ones.
STACK
a) LIFO (Last in, first out)
b) FIFO (First in, first out)

LIFO
The stack is organised into locations, each with an address.
p: pointer
stack pointer: the address of the next available location
(another view: p is the address of the topmost location that isn't empty).

[Figure: a column of numbered locations with the stack pointer p marking the top]

Two basic operations:
  write: PUSH: D (new data) → p (location); p ← p + 1
  read:  POP:  p ← p − 1; (location p) → D

Note:
In theory the stack is infinite; in practice it is finite.
When it's impossible to increment the stack pointer, the stack is full: overflow.
Before writing, an overflow check is needed.
Before reading, check whether the stack is empty (underflow check), otherwise we
would try to extract data from an empty stack.

Application:
  management of the main memory when the OS is loaded
  compilers: evaluating expressions
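The PUSH/POP operations with their overflow and underflow checks can be sketched as an array-based stack (names and capacity here are illustrative):

```python
# A minimal array-based stack with the overflow/underflow checks described above.

class Stack:
    def __init__(self, capacity):
        self.loc = [None] * capacity   # locations
        self.p = 0                     # stack pointer: next available location

    def push(self, d):
        if self.p == len(self.loc):    # overflow check before writing
            raise OverflowError("stack is full")
        self.loc[self.p] = d
        self.p += 1

    def pop(self):
        if self.p == 0:                # underflow check before reading
            raise IndexError("stack is empty")
        self.p -= 1
        return self.loc[self.p]

s = Stack(3)
s.push('A'); s.push('B')
assert s.pop() == 'B'      # LIFO: last in, first out
assert s.pop() == 'A'
```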
FIFO
It's called a QUEUE.

[Figure: a row of locations with Head at one end and Tail at the other]

In theory it is infinite at both ends, but data is written in from one direction only.
  Tail: where we write the data in
  Head: where we read the data out
The space we can allocate for a QUEUE is finite, so:
  write: ENQUEUE (with overflow check)
  read:  DEQUEUE (with underflow check)

Application:
  modelling the dataflow in online processing (pressure, temperature measurements)
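ENQUEUE/DEQUEUE with the two checks can be sketched as a circular buffer over a finite array (an illustration; the circular-buffer choice is an assumption, any finite backing store works):

```python
# A minimal circular-buffer queue with ENQUEUE/DEQUEUE and overflow/underflow checks.

class Queue:
    def __init__(self, capacity):
        self.loc = [None] * capacity
        self.head = 0      # where we read data out
        self.tail = 0      # where we write data in
        self.count = 0

    def enqueue(self, d):
        if self.count == len(self.loc):          # overflow check
            raise OverflowError("queue is full")
        self.loc[self.tail] = d
        self.tail = (self.tail + 1) % len(self.loc)
        self.count += 1

    def dequeue(self):
        if self.count == 0:                      # underflow check
            raise IndexError("queue is empty")
        d = self.loc[self.head]
        self.head = (self.head + 1) % len(self.loc)
        self.count -= 1
        return d

q = Queue(3)
q.enqueue(1); q.enqueue(2)
assert q.dequeue() == 1    # FIFO: first in, first out
assert q.dequeue() == 2
```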
ARRAY
1 dimensional
  sequence: a1, a2, a3, ..., an
2 dimensional
  matrix: (a_ij), n × n
higher dimensional
  matrix of matrices
Usage: represent stack, queue, list
LIST
L: E1, E2, ..., Ei, ..., En : the elements of the list
Ei = (Di, pi): data and pointer fields
pi can contain several pointers: pij, j ∈ {1, ...}
j = 1: singly linked list

[Figure: E1 = (D1, p1) → E2 = (D2, p2) → ... → En = (Dn, NIL)]

NIL: no more list elements
p1: the address of E2
pi: the address of Ei+1
Important: the order in which the elements are linked through the pointers (the
logical order) is not necessarily the same as the physical order.
Example application:
  to store data on disk (files)
  element = track: the elements are stored on the available tracks
Operations:
INSERT: write a new element E into the list
a. in front of the first element of the list
   [Figure: E → E1 → E2 → ... → En → NIL]
   the pointer of E points towards E1
b. after the last element
   [Figure: E1 → E2 → ... → En → E → NIL]
   replacing NIL with a pointer to E
c. between the existing elements
   [Figure: E1 → ... → Ei → E → Ej → ... → En → NIL]
   rearranging the pointers
SEARCHING: Θ(n)
D: the data to be searched for
Sequentially compare D with Di, i = 1, ...: we either find it or not.

[Figure: walking the list E1 = (D1, p1) → ... → Ei = (Di, pi) → ... → En → NIL]
DELETION: delete an existing element
D: find the element that contains D and take it out of the list
a) searching
b) deletion of Ei: the pointer of Ei−1 is set to point to Ei+1

[Figure: E1 → ... → Ei−1 → Ei+1 → ... → En, bypassing Ei]

Logical deletion: just the pointers are rearranged; there is no physical deletion.
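The three list operations above can be sketched as a singly linked list (an illustration; None plays the role of NIL, and deletion is logical, only pointers are rearranged):

```python
# A singly linked list with INSERT (front), SEARCH and (logical) DELETE.

class Node:
    def __init__(self, data, nxt=None):
        self.data = data   # D_i: data field
        self.next = nxt    # p_i: pointer field (None plays the role of NIL)

class LinkedList:
    def __init__(self):
        self.head = None

    def insert_front(self, data):              # case a: in front of E1
        self.head = Node(data, self.head)

    def search(self, data):                    # sequential search: Theta(n)
        e = self.head
        while e is not None and e.data != data:
            e = e.next
        return e                               # the node found, or None

    def delete(self, data):                    # search, then bypass the node
        prev, e = None, self.head
        while e is not None and e.data != data:
            prev, e = e, e.next
        if e is None:
            return False
        if prev is None:
            self.head = e.next
        else:
            prev.next = e.next                 # pointer of E_{i-1} -> E_{i+1}
        return True

L = LinkedList()
for d in ('C', 'B', 'A'):
    L.insert_front(d)                          # list is now A -> B -> C
assert L.search('B') is not None
assert L.delete('B') is True
assert L.search('B') is None
```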
Example: array representation of a list
A list with elements A, B, C can be stored in a two-dimensional array, one column
for the data and one for the pointer (the row index of the next element):

  row | data | pointer
   1  |  A   |   2
   2  |  B   |   3
   3  |  C   |  NIL
Doubly linked list

[Figure: E1 ⇄ E2 ⇄ ... ⇄ En, each element holding pointers to both neighbours]

Sentinel: contains no data, points to the first element.

Circular list
Circular singly linked list

[Figure: E1 = (D1, p1) → E2 = (D2, p2) → ... → En → back to E1]

The sentinel always tells us which is the first element of the list.
TREE
Mathematically: acyclic connected graph
ROOT, level 0: parent of its children, has no parent
level 1: children of the root, parents of level 2
level 2: children of the nodes of level 1, parents of level 3
...
last level:
  leaves: have no children

Binary tree: every node has at most two children (a left child and a right child).
Number of levels: HEIGHT (h) of the tree.
Number of leaves ≤ 2^h.

Implementation: by means of lists.
Example: a (linked) MULTI-LIST, one list node per tree node, with NIL marking a
missing child.
Binary search tree

[Figure: root 5; left child 3 with children 2 and 5; right child 7 with child 8]

Binary search tree property:
Every key in every left subtree is at most the key of its root.
Every key in every right subtree is at least the key of its root.

WALK:
  INORDER:   L, Root, R
  PREORDER:  Root, L, R
  POSTORDER: L, R, Root

Example (for the tree above):
  INORDER:   2 3 5 5 7 8   (ascending order of keys)
  PREORDER:  5 3 2 5 7 8
  POSTORDER: 2 5 3 8 7 5
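The BST property and the three walks can be sketched in code; the keys below reproduce the example tree of this section (the insertion order is an assumption chosen to produce that shape):

```python
# Binary search tree: insertion keeping the BST property, plus the three walks.

class TNode:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return TNode(key)
    if key <= root.key:                 # left subtree: keys at most the root's
        root.left = insert(root.left, key)
    else:                               # right subtree: keys at least the root's
        root.right = insert(root.right, key)
    return root

def inorder(t):
    return [] if t is None else inorder(t.left) + [t.key] + inorder(t.right)

def preorder(t):
    return [] if t is None else [t.key] + preorder(t.left) + preorder(t.right)

def postorder(t):
    return [] if t is None else postorder(t.left) + postorder(t.right) + [t.key]

root = None
for k in (5, 3, 7, 2, 5, 8):
    root = insert(root, k)

assert inorder(root) == [2, 3, 5, 5, 7, 8]    # ascending order of keys
assert preorder(root) == [5, 3, 2, 5, 7, 8]
assert postorder(root) == [2, 5, 3, 8, 7, 5]
```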
Application:
Evaluation of an arithmetic expression:
  (A + B) * C − D : the usual (infix) mathematical form, needing parentheses and
  operator precedence rules.
Polish form:
  no parentheses, no need to check the precedence of the operators.
  postorder walk of the expression tree: A B + C * D −  : POSTFIX POLISH FORM

[Figure: expression tree with − at the root (children * and D); * has children +
and C; + has children A and B]
Application:
SORTING: arrange data
  numbers 1, 7, 6 → 1, 6, 7
Comparison sorting can be modelled by a decision tree (a specialised binary tree):
for a1, a2, a3 the root is the comparison a1:a2, its children are comparisons such
as a2:a3 and a1:a3, and each leaf is one possible sorted order.
How many comparisons are needed to sort the numbers, i.e. h = ?
Lower bound for comparison-based sorting:
  n! : number of leaves (possible orders)
  n! ≤ 2^h
  h ≥ log(n!)
HASHING
Hash table

[Figure: a table of m slots, 0 ... m−1, with keys K1, K2 hashed into slots]

SLOTS (locations): every slot is identified by a code (e.g., a serial number) 0, 1, ..., m−1.
U: Universe of all possible key values (from where the keys can take values).
K1, K2, ..., K: keys, used to identify records.
Hash function h: maps the key values to the slots.
  h: U → {0, 1, ..., m−1}
  h: K ↦ h(K)
Collision: h(K1) = h(K2) when K1 ≠ K2.
Resolution: chaining; data that collide are chained into a list (with a sentinel).

Complexity
Analysing hashing with chaining:
  n: number of keys
  m: number of slots
  α = n/m: load factor, the average number of elements in a list
  (the analysis is made in terms of α)

Search for key K:
  computing h(K): Θ(1)
  search in the list at slot h(K): Θ(α)
  total: Θ(1) + Θ(α) = Θ(1 + α)
Assume n = O(m) (about as many keys as slots):
  α = n/m = O(m)/m = O(1)
The search takes constant time.
Uniform hashing: any given key is equally likely to hash into any of the slots
(probability 1/m).
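Chaining can be sketched as a list per slot (an illustration; the division-method hash and the table size are assumptions):

```python
# A chained hash table: collisions are resolved by keeping a list per slot.
# Search cost is Theta(1) for h(K) plus Theta(alpha) for the chain, alpha = n/m.

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]    # one chain per slot

    def h(self, key):                          # division method
        return key % self.m

    def insert(self, key):
        self.slots[self.h(key)].append(key)

    def search(self, key):
        return key in self.slots[self.h(key)]  # walk only one chain

t = ChainedHashTable(m=7)
for k in (10, 17, 24, 3):                      # 10, 17, 24 all land in slot 3
    t.insert(k)
assert t.h(10) == t.h(17) == t.h(24) == 3      # collisions, resolved by chaining
assert t.search(17) is True
assert t.search(99) is False
```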
Application
  spelling checkers
  compilers
  game playing
  graphs

Hash function
What makes a good hash function? (One that ensures uniform hashing: probability 1/m per slot.)
Usually we do not know the distribution of the key values, so it is difficult to design a good hash function.
In practice: spread the key values as much as possible.
Division method
  h(K) = K mod m
It does not work well in every case: if m = 2^p, then h(K) is just the lowest p
bits of K (the remainder).
  Ex.: p = 2, m = 2² = 4, slots 0, 1, 2, 3; K = 5 = 101₂: h(K) depends only on the
  last 2 bits, not on the third bit.
m should be a prime near n divided by the desired chain length;
  ex.: n = 2000 keys, about 3 per chain: m near 2000/3.

Multiplication method
  h(K) = ⌊m · ((K·A) mod 1)⌋
  0 < A < 1; A = 0.618033... (the golden section) gives good results.

Uniform method
  h(K) = ⌊K·m⌋
  K uniformly distributed over [0, 1)
Practically and theoretically a good function.
Interpreting strings as numbers
  SOUNDEX CODING
  RADIX 128: a string C1 C2 ... Cn of ASCII characters (0 ≤ Cj ≤ 127) is read as a
  number in radix 128.
Ex.: p = 112, t = 116; "pt" = 116 + 128 · 112 = 14452
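The radix-128 interpretation can be written in a few lines; the sketch below reproduces the "pt" example using Python's `ord` for the ASCII codes:

```python
# Interpreting a string as a radix-128 number built from its ASCII codes.

def radix128(s):
    value = 0
    for ch in s:
        code = ord(ch)
        assert 0 <= code <= 127          # ASCII only
        value = value * 128 + code       # shift one radix-128 digit, add the next
    return value

assert ord('p') == 112 and ord('t') == 116
assert radix128('pt') == 116 + 128 * 112 == 14452
```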
Universal hashing
For any fixed hash function there exist key values such that more than one of them
will be hashed into the same slot; this is unavoidable. The remedy: choose the
hash function at random from a suitable collection.
  H = {h1, h2, ..., hr}: a set of hash functions
  For a randomly chosen hi from a universal H, we get the behaviour of uniform hashing.
Statement
If h is chosen from a universal collection H of hash functions and is used to hash
n keys into a hash table of size m, where n ≤ m, then the expected number of
collisions involving a particular key x is less than 1.

Proof
For each pair of keys y, z let c_yz be a random variable:
  c_yz = 1 if h(y) = h(z), and 0 otherwise.
H universal: the probability for y and z to collide is 1/m (by definition), so
  E[c_yz] = 1/m.
Let c_x be the number of collisions involving key x among the other n − 1 keys
y1, ..., y_{n−1}:
  c_x = c_{x y1} + ... + c_{x y_{n−1}},  E[c_{x yi}] = 1/m for each i.
By linearity of expectation (E[Σ ξi] = Σ E[ξi]):
  E[c_x] = Σ_y E[c_xy] = (n − 1) · (1/m) = (n − 1)/m < 1,  since n ≤ m.
Construction of a universal set H:
  m prime
  a given key x is decomposed into r + 1 digits: x = ⟨x0, x1, ..., xr⟩, each xi < m
  a = ⟨a0, a1, ..., ar⟩, every ai chosen randomly from {0, 1, ..., m−1}
  h_a(x) = ( Σ_{i=0}^{r} ai·xi ) mod m
  H = ∪_a {h_a} : the universal set
Theorem
The set H is universal.

Proof (sketch)
Take two distinct keys x = ⟨x0, x1, ..., xr⟩ and y = ⟨y0, y1, ..., yr⟩; say x0 ≠ y0.
  h_a(x) = ( Σ_{i=0}^{r} ai·xi ) mod m
  h_a(y) = ( Σ_{i=0}^{r} ai·yi ) mod m
h_a(x) = h_a(y) means
  Σ_{i=0}^{r} ai·xi ≡ Σ_{i=0}^{r} ai·yi  (mod m)
i.e.
  a0·(x0 − y0) ≡ − Σ_{i=1}^{r} ai·(xi − yi)  (mod m)
Since m is prime and x0 ≠ y0, for each choice of a1, ..., ar exactly one value of
a0 satisfies this equation: there are as many collisions as such equations, i.e.
m^r of the |H| = m^(r+1) functions collide on x and y:
  m^r / m^(r+1) = 1/m
so H is universal.

Note: H assumes n ≤ m, so n and m should be comparable.
Choose h at random any time when hashing is to be done, then apply h.
Application
  databases: hash table + B-tree
  it is not easy to estimate n
  it is not easy to choose h
ESTIMATION OF COMPLEXITY
Running time: the time required for a computer to execute a program.
Running time depends on the following factors:
CPU: the higher the speed, the quicker the computer.
Memory: main and secondary memory available to execute the program.
Input data: size, type, operations.
Software: compiler, operating system, etc.
Algorithm (based on which the particular program is written).
Example: computing a1 + a2 + a3 + a4 (two equivalent programs):

  S = 0                    S = 0
  S = S + a1               FOR i = 1 TO 4
  S = S + a2                   S = S + ai
  S = S + a3
  S = S + a4
The asymptotic notation is a way to express how bad (slow) or good (fast) an algorithm is.
Expressing the complexity gives us a measure of the 'character' or behaviour of an
algorithm, and allows the comparison of algorithms in terms of complexity:
we can decide which algorithm is better, i.e. which one we should choose.
Note: the complexity will not guarantee that the algorithm will really be faster or
that the computer will always execute the program faster, because physical running
time, as seen above, depends on other factors, too.
Estimation of the complexity of an algorithm:
1. Assignment and elementary operations: 1 unit each (ex.: S = 0)
2. Consecutive statements: sum of the costs of each
3. Loop: time required to evaluate the body, multiplied by the number of iterations
4. Nested loops: cost of the inside multiplied by the product of the iteration counts
5. IF: cost of the test + max(cost of THEN, cost of ELSE)
This technique can overestimate (see rule 5), but it will never underestimate the complexity.
SORTING
To arrange given data according to given criteria.
Ex.:
1. data ci, i = 1, ..., n and criteria j = 1, ..., m: arrange the ci taking every criterion into account
2. List of names: alphabetic order
3. Temperature values: ascending or descending order
4. Sorting a1, a2, ..., an input data; criterion: ascending (descending)
BUBBLE SORTING
Main idea: find the smallest and put it on the top,
then find the second smallest and put it next to the top one, and so on.
Ex.: 2, 1, 7, 6
Fix the first number and compare it with all the others; on a wrong order, swap.
Then repeat from the second position, etc.:
  2, 1, 7, 6 → 1, 2, 7, 6
  1, | 2, 7, 6 → 1, 2, 7, 6
  1, 2, | 7, 6 → 1, 2, 6, 7

For a1, a2, ..., ai, ..., aj, ..., an:
  FOR i = 1 TO n − 1
      FOR j = i + 1 TO n
          IF ai > aj THEN swap(ai, aj)

Number of comparisons: (n − 1) + (n − 2) + ... + 1 = n(n − 1)/2 = O(n²)
Best case: the input is already in the right order (no swap): still O(n²) comparisons
Worst case: all the numbers are in the wrong order: O(n²)
Average case: the numbers are given at random (typical case): O(n²)
Bubble sorting (comparison-based): O(n²)
Comparison-based sorting: Ω(n log n) in general; the best comparison sorts achieve O(n log n).
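The double loop above translates directly into code (a sketch with 0-based indices, as is usual in Python):

```python
# The selection-style "bubble" sort described above, with the two nested loops.

def bubble_sort(a):
    n = len(a)
    for i in range(n - 1):
        for j in range(i + 1, n):
            if a[i] > a[j]:
                a[i], a[j] = a[j], a[i]   # swap when the pair is in the wrong order
    return a

assert bubble_sort([2, 1, 7, 6]) == [1, 2, 6, 7]
assert bubble_sort([]) == []
assert bubble_sort([3, 3, 1]) == [1, 3, 3]
```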
QUICK SORTING
One of the fastest known methods.
Divide and conquer philosophy: partition the region A(p..r) around a pivot value q
into A(p..q) with elements x ≤ q and A(q+1..r) with elements y > q, then sort the
two parts recursively.
q: a computed value, not necessarily an element of the sequence.
Ex.: 9, 2, 11, 20, 7
  pivot q = (9 + 7)/2 = 8:    7, 2 | 11, 20, 9
  q = 4:  2 | 7               q = 10:  9 | 11, 20
                              q = 15:  11 | 20
  result: 2 7 9 11 20
QUICKSORT(A, p, r)
  IF p < r THEN
      q ← PARTITION(A, p, r)
      QUICKSORT(A, p, q)
      QUICKSORT(A, q + 1, r)
Initial call: QUICKSORT(A, 1, length(A))

PARTITION(A, p, r)
  x ← A(p); i ← p − 1; j ← r + 1
  WHILE TRUE DO
      REPEAT j ← j − 1 UNTIL A(j) ≤ x
      REPEAT i ← i + 1 UNTIL A(i) ≥ x
      IF i < j THEN SWAP(A(i), A(j))
      ELSE RETURN j
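The pseudocode above (Hoare-style partitioning) can be sketched in Python with 0-based indices:

```python
# Quicksort with Hoare-style PARTITION, following the pseudocode above.

def partition(a, p, r):
    x = a[p]                      # pivot: the first element of the region
    i, j = p - 1, r + 1
    while True:
        j -= 1
        while a[j] > x:           # REPEAT j <- j - 1 UNTIL a[j] <= x
            j -= 1
        i += 1
        while a[i] < x:           # REPEAT i <- i + 1 UNTIL a[i] >= x
            i += 1
        if i < j:
            a[i], a[j] = a[j], a[i]
        else:
            return j

def quicksort(a, p=0, r=None):
    if r is None:
        r = len(a) - 1
    if p < r:
        q = partition(a, p, r)
        quicksort(a, p, q)        # sort A(p..q)
        quicksort(a, q + 1, r)    # sort A(q+1..r)

a = [9, 2, 11, 20, 7]
quicksort(a)
assert a == [2, 7, 9, 11, 20]
```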
Recurrence, resolved by telescoping:

  T(n) = 2T(n/2) + Θ(n)

Take T(n) = 2T(n/2) + n and divide by n:

  T(n)/n = T(n/2)/(n/2) + 1
  T(n/2)/(n/2) = T(n/4)/(n/4) + 1
  ...
  T(2)/2 = T(1)/1 + 1

Adding up the log n equations (the intermediate terms cancel):

  T(n)/n = T(1)/1 + log n
  T(n) = n·c + n·log n
  T(n) = O(n log n): average (and best) case

Worst case: one element per region:

  T(n) = T(n − 1) + Θ(n)
  T(n − 1) = T(n − 2) + Θ(n − 1)
  ...
  T(2) = T(1) + Θ(2)
  ________________________
  T(n) = T(1) + Σ_{k=2}^{n} Θ(k) = Θ( Σ_{k=2}^{n} k ) = Θ(n²)

Worst case: O(n²)
SEARCHING
There may be different situations where searching is performed.

SEQUENTIAL SEARCH
Given a1, ..., ai, ..., an (numbers / characters: the objects to be searched)
Find: x
(naive or brute force search, or straight search)
It is in fact one loop:
  FOR i = 1 TO n
      IF x = ai THEN STOP
O(n)
Note: this kind of search is very simple (primitive), but what if ai is a matrix,
or the records of a file? The comparison is more complicated there, but in
principle the search is still the same simple loop.
RANDOM SEARCH
Introduce some sort of probability.
Given a1, ..., ai, ..., an to be searched.
Find: x
Flip a coin (the probability element):
  If head: search from 1 to n.
  If tail: search from n to 1.
Two sequential searches are combined, applied together.
Assume: x = ai (ith position) and the coin is fair (head and tail occur with
equal probability).
  If head: i comparisons to find x.
  If tail: n − i + 1 comparisons to find x.
Averaging the two cases: (i + (n − i + 1))/2 = (n + 1)/2, better than n.
On average we need (n + 1)/2 comparisons rather than n.
The order of the elements is irrelevant (no need for pre-sorting) in
  sequential search,
  randomized search.
BINARY SEARCH
It is very quick and used almost everywhere.
The elements to be searched must be sorted.
Given a1 ≤ ... ≤ ai ≤ ... ≤ an
Find: x
Idea: "guess the number I am thinking of" by repeatedly halving the sequence and
comparing the middle element with x.
Ex.: 2 3 7 | 8 9 | 10 ; find: 9

  low: index of the leftmost element of the current half = 1
  high: index of the rightmost element of the current half = n
  REPEAT
      mid = (low + high) DIV 2
      IF low > high THEN mid = 0            (x is not present)
      ELSE
          IF a(mid) < x THEN low = mid + 1
          ELSE high = mid − 1
  UNTIL x = a(mid)
Complexity: the length of the region halves at every step:
  n/2⁰, n/2¹, n/2², ..., n/2^k = 1  ⟹  k = log n
O(log n): very fast.
Note: n is finite. If n were infinite, binary search (dividing by 2^m forever)
would not terminate; this may happen in theory, but not in practice.
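The halving scheme above can be sketched as follows (0-based indices; returning None instead of the sentinel mid = 0 is an assumption made for idiomatic Python):

```python
# Binary search over a sorted sequence: halve the region at every step.

def binary_search(a, x):
    low, high = 0, len(a) - 1
    while low <= high:
        mid = (low + high) // 2          # middle of the current region
        if a[mid] == x:
            return mid
        if a[mid] < x:
            low = mid + 1                # continue in the right half
        else:
            high = mid - 1               # continue in the left half
    return None                          # region empty: x is not present

a = [2, 3, 7, 8, 9, 10]
assert binary_search(a, 9) == 4
assert binary_search(a, 2) == 0
assert binary_search(a, 5) is None
```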
Note:
1. Will the search work when all elements are given at once (at the same time)?
2. Will the search work when the elements are given online (one by one)?

                        1. (at once)   2. (online)
  Sequential search         YES            YES
  Randomized search         YES            NO
  Binary search             YES            NO

In every case we assume that we have just one processor to do the job.
PARALLEL BINARY SEARCH
It speeds up the binary search.
p processors with shared memory (PRAM: parallel RAM).
CREW: Concurrent Read, Exclusive Write.
Given the elements a1, ..., ai, ..., an sorted.
Find: x
Divide the sequence into p + 1 parts:
  a1, ..., a_i1 | a_{i1+1}, ..., a_i2 | ..., a_ij | ..., an
x is compared with the boundary elements (the leftmost or the rightmost of each
part) in parallel:
  processor j compares x with the jth boundary a_ij
  processor j sets a variable cj:
      cj = 0, if x > a_ij
      cj = 1, otherwise
Thus the s with cs = 0 and c_{s+1} = 1 locates the part that can contain x.
Repeat recursively until x = a_ij.
Complexity:
  BS:  the region shrinks as n/2⁰, n/2¹, ..., n/2^k = 1        ⟹  k = log n steps
  PBS: the region shrinks as n/(p+1)⁰, n/(p+1)¹, n/(p+1)², ..., n/(p+1)^k = 1
                                                               ⟹  k = log_{p+1} n steps
O(log_{p+1} n): better than O(log n) if p > 1.
Note: PBS doesn't work online; PBS works offline.
STRING SEARCHING (straight search, naive, brute force)
Idea:
  text (string of characters) of length n, pointer (index) i
  find: pattern of length m, pointer j
Comparisons go from left to right:
  match: both i and j are incremented
  mismatch: reset j to the beginning of the pattern, and set i to the position
  corresponding to moving the pattern one position to the right
Ex.: text 3 2 4 5 6 (n = 5), pattern 4 5 (m = 2):
  3 2 4 5 6      3 2 4 5 6      3 2 4 5 6
  4 5              4 5              4 5     match
O(n·m)
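The brute-force scheme above can be sketched as follows (a sketch; returning the 0-based position of the first match, or −1, is an assumption):

```python
# Brute-force string search: on a mismatch, slide the pattern one position right.

def brute_force_search(text, pattern):
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):           # candidate alignment of the pattern
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1                       # match: advance both pointers
        if j == m:
            return i                     # full match found
    return -1                            # not found; O(n * m) in the worst case

assert brute_force_search("32456", "45") == 2
assert brute_force_search("32456", "47") == -1
```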
KNUTH-MORRIS-PRATT ALGORITHM
Given text: length n, pointer i; pattern: length m, pointer j.
Both pointers are incremented by 1 as long as there is a match.
On a mismatch at position k, j is reset (using what is already known about the
pattern) and the comparison is restarted without moving i backwards.
Ex.: text 3 2 3 4 5, pattern 3 4:
  I.   i = 1, j = 1: match (3)
  II.  i = 2, j = 2: mismatch (2 vs 4)
  III. i = 2, j = 1: mismatch (2 vs 3)
  IV.  i = 3, j = 1: match (3), then (4): found
O(n + m)
BOYER-MOORE ALGORITHM
  text: s1, length n
  pattern: s2, length m
Idea: the pattern is compared from right to left, while it is moved from left to
right an appropriate number of positions.
Ex.: text 3 2 7 4 5 6, pattern 4 5:
  I.   mismatch, and the text character does not occur in the pattern:
       move the pattern m positions to the right
  II.  mismatch, but the text character 4 occurs in the pattern: align the four
  III. match: stop
Best case O(n/m): really very fast.
Application: word processing (spelling checker)
SIGNATURE FILES
Given a text.
Every word is hashed (transformed) into a string of bits (ex.: radix 128, soundex coding).
synonym: in the hashing sense, not grammatically
Hash table: the locations contain the possible bit representations together with a
sector (block) address.
Find word x:
  x is transformed into a string of bits,
  the hash table is presorted,
  x is searched for in the hash table using binary search (or, better, a hash function),
  we get a block address on the disk; the word is somewhere in that block,
  we apply string searching within the block.
Suited to maximum-sized, relatively constant files, e.g. bibliographic DB searching.
INVERTED FILE
Huge file (DB), such as in a library or the WWW.
Search engines work like this to locate a word (not when the Web page is returned).
Crawler (programme): scans the Web all the time and builds (updates) an inverted file.
Inverted file:
  w1 : URL1, ...
  w2 : URL2, ...
  ...
  wn : URLn, ...
  wi: words on the Web
  records: the URLs containing that word
  search (whether the word exists in the inverted file): B-tree (to implement the
  inverted file)
  sorting (the inverted file is updated every time a new word appears)
When we enter a word to search:
The search engine locates the word by searching the inverted file (string searching).
If the word is found in the inverted file then, depending on the retrieval
technique, the document / URL will be presented or not (this depends not solely on
whether the word is present).