51 Stringsorts
51 Stringsorts
STRINGS
Algorithms
F O U R T H E D I T I O N
R O B E R T S E D G E W I C K K E V I N W A Y N E
! ! ! ! !
Copyright 20022012
String processing String. Sequence of characters. Important fundamental abstraction. Information processing. Genomic sequences. Communication systems (e.g., email). Programming systems (e.g., Java programs).
The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology. M. V. Olson
The char data type C char data type. Typically an 8-bit integer. Supports 7-bit ASCII. Need more bits to represent certain characters.
6.5
Data Compression
667
Supports 21-bit Unicode 3.0 (awkwardly). data working with compression requires us to reorient our thinking about
ng. When you HexDump a bit0 1 2 3 4 5 6 7 8 9 A B C D ntains ASCII-encoded charac- 0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR t right is useful for reference. 1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS hex number, use the rst hex 2 SP ! # $ % & ( ) * + , ndex and the second hex digit 3 0 1 2 3 4 5 6 7 8 9 : ; < = ference to nd the character 4 @ A B C D E F G H I J K L M . For example, 31 encodes the 5 P Q R S T U V W X Y Z [ \ ] odes the letter J, and so forth. 6 ` a b c d e f g h i j k l m r 7-bit ASCII, so the rst hex 7 p q r s t u v w x y z { | } or less. Hex numbers starting Hexadecimal to ASCII conversion table and the numbers 20 and 7F) non-printing control charache control characters are left over from the days when physical devices s were controlled by ASCII input; the table highlights a few that you Java char data type. A 16-bit unsigned integer. umps. For example SP is the space character, NUL is the null character, LF d CR is carriage-return. Supports original 16-bit Unicode.
E F
SO SI RS US
. / > ? N O ^ _ n o ~
DEL
and standard output to include binary encoding of data. BinaryStdIn dOut provide the methods that we need. They provide a way for you to
The String data type String data type. Sequence of characters (immutable). Indexing. Get the ith character. Substring extraction. Get a contiguous sequence of characters from a string. String concatenation. Append one character to end of another string.
s.length()
0
s
1 T
2 T
3 A
4 C
5 K
6 A
7 T
8 D
9 10 11 12 A W N
val[]
x
0
x
1
s
2
t
3
r
4
i
5
n
6
g
7
x
8
}
offset
private String(int offset, int length, char[] val) { this.offset = offset; this.length = length; this.val = val; copy of reference to } original char array public String substring(int from, int to) { return new String(offset + from, to - from, val);
The String data type: performance String data type. Sequence of characters (immutable). Underlying implementation. Immutable char[] array, offset, and length.
The StringBuilder data type StringBuilder data type. Sequence of characters (mutable). Underlying implementation. Resizing char[] array and length.
Remark. StringBuffer data type is similar, but thread safe (and slower).
A.
public static String reverse(String s) { String rev = ""; for (int i = s.length() - 1; i >= 0; i--) rev += s.charAt(i); return rev; }
quadratic time
B.
public static String reverse(String s) { StringBuilder rev = new StringBuilder(); for (int i = s.length() - 1; i >= 0; i--) rev.append(s.charAt(i)); return rev.toString(); }
linear time
a
1
c
2
a
3
a
4
g
5
t
6
t
7
t
8
a
9
10 11 12 13 14
su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a c a a g t t t a c a a g c
a c a a g t t t a c a a g c
c a a g t t t a c a a g c
a a g t t t a c a a g c
a g t t t a c a a g c
g t t t a c a a g c
t t t a c a a g c
t t a c a a g c
t a c a a g c
a c a a g c
c a a g c
a a g c a g c g c c
A.
public static String[] suffixes(String s) { int N = s.length(); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = s.substring(i, N); return suffixes; }
B.
public static String[] suffixes(String s) { int N = s.length(); StringBuilder sb = new StringBuilder(s); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = sb.substring(i, N); return suffixes; }
10
Longest common prefix Q. How long to compute length of longest common prefix?
p
0
r
1
e
2
f
3
e
4
t
5
c
6
h
7
public static int lcp(String s, String t) { int N = Math.min(s.length(), t.length()); for (int i = 0; i < N; i++) if (s.charAt(i) != t.charAt(i)) return i; return N; }
Running time. Proportional to length D of longest common prefix. Remark. Also can compute compareTo() in sublinear time.
11
name
R()
lgR()
characters
BINARY OCTAL DECIMAL HEXADECIMAL DNA LOWERCASE UPPERCASE PROTEIN BASE64 ASCII EXTENDED_ASCII UNICODE16
1 3 4 4 2 5 5 5 6 7 8 16
Standard alphabets
holds the frequencies in Count is an example of a character-indexed array. With a Java String, we have to use an array of size 256; with Alphabet, we just need an array with one entry for each alphabet character. This savings might seem modest, but, as you will
12
Algorithms
F O U R T H E D I T I O N
R O B E R T S E D G E W I C K K E V I N W A Y N E
! ! ! ! !
key-indexed counting LSD radix sort MSD radix sort 3-way radix quicksort suffix arrays
Copyright 20022012
Review: summary of the performance of sorting algorithms Frequency of operations = key compares.
algorithm
guarantee
random
extra space
stable?
insertion sort
N2 / 2
N2 / 4
yes
mergesort
N lg N
N lg N
yes
quicksort
1.39 N lg N
1.39 N lg N
c lg N
no
heapsort
2 N lg N
2 N lg N
no
Lower bound. ~ N lg N compares are required by any compare-based algorithm. Q. Can we do better (despite the lower bound)? A. Yes, if we don't depend on key compares.
14
! ! ! ! !
key-indexed counting LSD radix sort MSD radix sort 3-way radix quicksort suffix arrays
15
Key-indexed counting: assumptions about keys Assumption. Keys are integers between 0 and R - 1. Implication. Can use key as an array index. Applications. Sort string by first letter. Sort class roster by section. Sort phone numbers by area code. Subroutine in a sorting algorithm. [stay tuned]
name section Anderson 2 Brown 3 Davis 3 Garcia 4 Harris 1 Jackson 3 Johnson 4 Jones 3 Martin 1 Martinez 2 Miller 2 Moore 1 Robinson 2 Smith 4 Taylor 3 Thomas 4 Thompson 4 White 2 Williams 3 Wilson 4 keys are small integers
Typical candidate for key-indexed counting
16
input
Remark. Keys may have associated data ! can't just count up number of keys of each value.
(by section) Harris Martin Moore Anderson Martinez Miller Robinson White Brown Davis Jackson Jones Taylor Williams Garcia Johnson Smith Thomas Thompson Wilson
sorted result
1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4
Key-indexed counting demo Goal. Sort an array a[] of N integers between 0 and R - 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
i a[i] offset by 1 [stay tuned]
0 1 2 3 4 5 6 7 8 9 10 11
d a c f f b d b f b e a
r count[r]
for (int i = 0; i < N; i++) count[a[i]+1]++; for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i];
a b c d e f -
0 2 3 1 2 1 3
17
Key-indexed counting demo Goal. Sort an array a[] of N integers between 0 and R - 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
i a[i]
int N = a.length; int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i]+1]++;
compute cumulates
0 1 2 3 4 5 6 7 8 9 10 11
d a c f f b d b f b e a
r count[r]
a b c d e f -
0 2 5 6 8 9 12
for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i];
Key-indexed counting demo Goal. Sort an array a[] of N integers between 0 and R - 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
i a[i] i aux[i]
int N = a.length; int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i]+1]++; for (int r = 0; r < R; r++) count[r+1] += count[r];
move items
0 1 2 3 4 5 6 7 8 9 10 11
d a c f f b d b f b e a
r count[r]
0 1 2 3 4 5 6 7 8 9 10 11
a a b b b c d d e f f f
a b c d e f -
2 5 6 8 9 12 12
for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i];
19
Key-indexed counting demo Goal. Sort an array a[] of N integers between 0 and R - 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
i a[i] i aux[i]
int N = a.length; int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i]+1]++; for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i]]++] = a[i];
copy back
0 1 2 3 4 5 6 7 8 9 10 11
a a b b b c d d e f f f
r count[r]
0 1 2 3 4 5 6 7 8 9 10 11
a a b b b c d d e f f f
a b c d e f -
2 5 6 8 9 12 12
20
Key-indexed counting: analysis Proposition. Key-indexed counting uses ~ 11 N + 4 R array accesses to sort N items whose keys are integers between 0 and R - 1.
for (int i = 0; i < N; i++) aux[count[a[i].key(d)]++] = a[i];
Anderson Brown Davis Garcia Harris Jackson Johnson Jones Martin Martinez Miller Moore Robinson Smith Taylor Thomas Thompson White Williams Wilson
2 3 3 4 1 3 4 3 1 2 2 1 2 4 3 4 4 2 3 4
Harris Martin Moore Anderson Martinez Miller Robinson White Brown Davis Jackson Jones Taylor Williams Garcia Johnson Smith Thomas Thompson Wilson
1 1 1 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4
aux[0] aux[1] aux[2] aux[3] aux[4] aux[5] aux[6] aux[7] aux[8] aux[9] aux[10] aux[11] aux[12] aux[13] aux[14] aux[15] aux[16] aux[17] aux[18] aux[19]
21
! ! ! ! !
key-indexed counting LSD radix sort MSD radix sort 3-way radix quicksort suffix arrays
22
Least-significant-digit-first string sort LSD string (radix) sort. Consider characters from right to left. Stably sort using dth character as the key (using key-indexed counting).
sort key sort key sort key
0 1 2 3 4 5 6 7 8 9 10 11
d a c f f b d b f b e a
a d a a e a a e e e b c
b d b d e d d e d d b e
0 1 2 3 4 5 6 7 8 9 10 11
d c e a f b d f b f b a
a a b d a a a e e e e c
b b b d d d d d d e e e
0 1 2 3 4 5 6 7 8 9 10 11
d c f b d e a a f b f b
a a a a a b c d e e e e
b b d d d b e d d d e e
0 1 2 3 4 5 6 7 8 9 10 11
a a b b b c d d e f f f
c d a e e a a a b a e e
e d d d e b b d b d d e
LSD string sort: correctness proof Proposition. LSD sorts fixed-length strings in ascending order. Pf. [by induction]
0 d c f b d e a a f b f b a a a a a b c d e e e e b b d d d b e d d d e e sort key
0 1 2 3 4 5 6 7 8 9 10 11
a a b b b c d d e f f f
c d a e e a a a b a e e
e d d d e b b d b d d e
1 2 3 4 5 6 7 8 9 10 11
last i characters. If two strings differ on sort key, key-indexed sort puts them in proper relative order. If two strings agree on sort key, stability keeps them in proper relative order.
public class LSD { public static void sort(String[] a, int W) { int R = 256; int N = a.length; String[] aux = new String[N]; for (int d = W-1; d >= 0; d--) { int[] count = new int[R+1]; for (int i = 0; i < N; i++) count[a[i].charAt(d) + 1]++; for (int r = 0; r < R; r++) count[r+1] += count[r]; for (int i = 0; i < N; i++) aux[count[a[i].charAt(d)]++] = a[i]; for (int i = 0; i < N; i++) a[i] = aux[i]; } } }
key-indexed counting
25
algorithm
guarantee
random
extra space
stable?
insertion sort
N2 / 2
N2 / 4
yes
mergesort
N lg N
N lg N
yes
quicksort
1.39 N lg N
1.39 N lg N
c lg N
no
heapsort
2 N lg N
2 N lg N
no
LSD
2WN
2WN
N+R
yes
String sorting challenge 1 Problem. Sort a huge commercial database on a fixed-length key. Ex. Account number, date, Social Security number, ... Which sorting method to use? Insertion sort. Mergesort. Quicksort. Heapsort. LSD string sort.
B14-99-8765 756-12-AD46 CX6-92-0112 332-WX-9877 375-99-QWAX CV2-59-0221 387-SS-0321 KJ-00-12388 715-YT-013C MJ0-PP-983F 908-KK-33TY BBN-63-23RE 48G-BM-912D 982-ER-9P1B
27
String sorting challenge 2a Problem. Sort one million 32-bit integers. Ex. Google (or presidential) interview. Which sorting method to use? Insertion sort. Mergesort. Quicksort. Heapsort. LSD string sort.
String sorting challenge 2b Problem. Sort huge array of random 128-bit numbers. Ex. Supercomputer sort, internet router.
01110110111011011101...1011101
Which sorting method to use? Insertion sort. Mergesort. Quicksort. Heapsort. LSD string sort.
29
String sorting challenge 2b Problem. Sort huge array of random 128-bit numbers. Ex. Supercomputer sort, internet router.
01110110111011011101...1011101
Which sorting method to use? Insertion sort. Mergesort. Quicksort. Heapsort. LSD string sort.
Divide each word into eight 16-bit chars 216 = 65,536 counters. Sort in 8 passes.
30
String sorting challenge 2b Problem. Sort huge array of random 128-bit numbers. Ex. Supercomputer sort, internet router. Which sorting method to use? Mergesort. Quicksort. Heapsort. LSD string sort.
! Insertion sort.
Divide each word into eight 16-bit chars 216 = 65,536 counters LSD sort on leading 32 bits in 2 passes Finish with insertion sort Examines only ~25% of the data
31
How to take a census in 1900s? 1880 Census. Took 1,500 people 7 years to manually process data. Herman Hollerith. Developed counting and sorting machine to automate. Use punch cards to record data (e.g., gender, age). Machine sorts one column at a time (into one of 12 bins). Typical question: how many women of age 20 to 30?
How to get rich sorting in 1900s? Punch cards. [1900s to 1950s] Also useful for accounting, inventory, and business processes. Primary medium for data entry, storage, and processing.
Hollerith's company later merged with 3 others to form Computing Tabulating Recording Corporation (CTRC); the company was renamed in 1924.
card punch
punched cards
card reader
mainframe
line printer
To sort a card deck - start on right column - put cards into hopper - machine distributes into bins - pick up cards (stable) - move left one column - continue until sorted card sorter
! ! ! ! !
key-indexed counting LSD radix sort MSD radix sort 3-way radix quicksort suffix arrays
35
Most-significant-digit-first string sort MSD string (radix) sort. Partition array into R pieces according to first character (use key-indexed counting). Recursively sort all strings that start with each character (key-indexed counts delineate subarrays to sort).
0 a a d c d e
0 1 2 3 4 5 6 7 8 9 10 11
d a c f f b d b f b e a
a d a a e a a e e e b c
b d b d e d d e d d b e
0 1 2 3 4 5 6 7 8 9 10 11
a a b b b c d d e f f f
d c a e e a a a b a e e
d e d e d b b d b d e d
count[] a b c d e f 0 2 5 6 8 9 12
2 3 4
b b b
a e e
d e d
6 7 8 9 10
d d e f f f
a a b a e e
b d b d e d
36
sort key
11
she sells seashells by the sea shore the shells she sells are surely seashells
are by lo she sells seashells sea shore shells she sells surely seashells the hi the
are by sells seashells sea sells seashells she shore shells she surely the the
are by seashells sea seashells sells sells she shore shells she surely the the
are by sea seashells seashells sells sells she shore shells she surely the the
are by sea seashells seashells sells sells she shore shells she surely the the
are by sea seashells seashells sells sells she shore shells she surely the the
are by seas seashells seashells sells sells she shells shore she surely the the
are by sea seashells seashells sells sells she shells shore she surely the the
need to examine every character in equal keys are are are are by by by by sea sea sea sea seashells seashells seashells seashells seashells seashells seashells seashells sells sells sells sells sells sells sells sells she she she she shells shells shells shells she she she she shore shore shore shore surely surely surely surely the the the the the the the the
end-of-string goes before any char value are are are by by by sea sea sea seashells seashells seashells seashells seashells seashells sells sells sells sells sells sells she she she she she she shells shells shells shore shore shore surely surely surely the the the the the the
output
are by sea seashells seashells sells sells she she shells shore surely the the
37
Trace of recursive calls for MSD string sort (no cutoff for small subarrays, subarrays of size 0 and 1 omitted)
Variable-length strings Treat strings as if they had an extra char at end (smaller than any char).
why smaller? 0 1 2 3 4 5 6 7 s s s s s s s s e e e h h h h u a a l e e e o r -1 s l -1 -1 l r e l e l s -1 y -1 -1 she before shells h s e -1 l l s -1
private static int charAt(String s, int d) { if (d < s.length()) return s.charAt(d); else return -1; }
public static void sort(String[] a) { aux = new String[a.length]; sort(a, aux, 0, a.length, 0); }
private static void sort(String[] a, String[] aux, int lo, int hi, int d) { if (hi <= lo) return; int[] count = new int[R+2]; key-indexed counting for (int i = lo; i <= hi; i++) count[charAt(a[i], d) + 2]++; for (int r = 0; r < R+1; r++) count[r+1] += count[r]; for (int i = lo; i <= hi; i++) aux[count[charAt(a[i], d) + 1]++] = a[i]; for (int i = lo; i <= hi; i++) a[i] = aux[i - lo];
sort R subarrays recursively for (int r = 0; r < R; r++) sort(a, aux, lo + count[r], lo + count[r+1] - 1, d+1);
}
39
MSD string sort: potential for disastrous performance Observation 1. Much too slow for small subarrays. Each function call needs its own count[] array. ASCII (256 counts): 100x slower than copy pass for N = 2. Unicode (65,536 counts): 32,000x slower for N = 2.
count[]
a[]
0 1 b a 0 1
aux[]
a b
40
Cutoff to insertion sort Solution. Cutoff to insertion sort for small subarrays. Insertion sort, but start at dth character. Implement less() so that it compares starting at dth character.
public static void sort(String[] a, int lo, int hi, int d) { for (int i = lo; i <= hi; i++) for (int j = i; j > lo && less(a[j], a[j-1], d); j--) exch(a, j, j-1); } private static boolean less(String v, String w, int d) { return v.substring(d).compareTo(w.substring(d)) < 0;
in Java, forming and comparing substrings is faster than directly comparing chars with charAt()
41
MSD string sort: performance Number of characters examined. MSD examines just enough characters to sort the keys. Number of characters examined depends on keys. Can be sublinear in input size!
Random (sublinear)
1EIO402 1HYL490 1ROZ572 2HXE734 2IYE230 2XOR846 3CDB573 3CVP720 3IGJ319 3KNA382 3TAV879 4CQP781 4QGI284 4YHV229
are by sea seashells seashells sells sells she she shells shore surely the the
1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377 1DNB377
algorithm
guarantee
random
extra space
stable?
insertion sort
N2 / 2
N2 / 4
yes
mergesort
N lg N
N lg N
yes
quicksort
1.39 N lg N
1.39 N lg N
c lg N
no
heapsort
2 N lg N
2 N lg N
no
LSD
2NW
2NW
N+R
yes
MSD
2NW
N log R N
N+DR
yes
MSD string sort vs. quicksort for strings Disadvantages of MSD string sort. Accesses memory "randomly" (cache inefficient). Inner loop has a lot of instructions. Extra space for count[]. Extra space for aux[].
Disadvantage of quicksort. Linearithmic number of string compares (not linear). Has to rescan many characters in keys with long prefix matches.
44
! ! ! ! !
key-indexed counting LSD radix sort MSD radix sort 3-way radix quicksort suffix arrays
45
3-way string quicksort (Bentley and Sedgewick, 1997) Overview. Do 3-way partitioning on the dth character. Less overhead than R-way partitioning in MSD string sort. Does not re-examine characters equal to the partitioning char (but does re-examine characters not equal to the partitioning char).
partitioning item use first character to partition into "less", "equal", and "greater" subarrays
she sells seashells by the sea shore the shells she sells are surely seashells
by are seashells she seashells sea shore surely shells she sells sells the the
46
she sells seashells by the sea shore the shells she sells are surely seashells
by are seashells she seashells sea shore surely shells she sells sells the the
are by seashells she seashells sea shore surely shells she sells sells the the
are by seashells sells seashells sea sells shells she surely shore she the the
are by seashells sea seashells sells sells shells she surely shore she the the
Trace of rst few recursive calls for 3-way string quicksort (subarrays of size 1 not shown)
47
private static void sort(String[] a) { sort(a, 0, a.length - 1, 0); } private static void sort(String[] a, int lo, int hi, int d) { if (hi <= lo) return; 3-way partitioning (using dth character) int lt = lo, gt = hi; int v = charAt(a[lo], d); int i = lo + 1; while (i <= gt) to handle variable-length strings { int t = charAt(a[i], d); if (t < v) exch(a, lt++, i++); else if (t > v) exch(a, i, gt--); else i++; } sort(a, lo, lt-1, d); if (v >= 0) sort(a, lt, gt, d+1); sort(a, gt+1, hi, d); }
48
3-way string quicksort vs. standard quicksort Standard quicksort. Uses ~ 2 N ln N string compares on average. Costly for keys with long common prefixes (and this is a common case!)
3-way string (radix) quicksort. Uses ~ 2 N ln N character compares on average for random strings. Avoids re-comparing long common prefixes.
Abstract
We present theoretical algorithms for sorting and searching multikey data, and derive from them practical C implementations for applications in which keys are character strings. The sorting algorithm blends Quicksort and radix sort; it is competitive with the best known C sort codes. The searching algorithm blends tries and binary search trees; it is faster than hashing and other commonly used search methods. The basic ideas behind the algorithms date back at least to the 1960s but their practical utility has been overlooked. We also present extensions to more complex string problems, such as partial-match searching.
that is competitive with the most efficient string sorting programs known. The second program is a symbol table implementation that is faster than hashing, which is commonly regarded as the fastest symbol table implementation. The symbol table implementation is much more space-efficient than multiway trees, and supports more advanced searches. In many application programs, sorts use a Quicksort implementation based on an abstract compare operation, and searches use hashing or binary search trees. These do not take advantage of the properties of string keys, which are widely used in practice. Our algorithms provide a natural and elegant way to adapt classical algorithms to this important class of applications. Section 6 turns to more difficult string-searching problems. Partial-match queries allow dont care characters
49
1. Introduction
Section 2 briefly reviews Hoares [9] Quicksort and
3-way string quicksort vs. MSD string sort MSD string sort. Is cache-inefficient. Too much memory storing count[]. Too much overhead reinitializing count[] and aux[].
Bottom line. 3-way string quicksort is the method of choice for sorting strings.
50
insertion sort
N2 / 2
N2 / 4
yes
mergesort
N lg N
N lg N
yes
quicksort
1.39 N lg N
1.39 N lg N
c lg N
no
heapsort
2 N lg N
2 N lg N
no
LSD
2NW
2NW
N+R
yes
MSD
2NW
N log R N
N+DR
yes
1.39 W N lg N
1.39 N lg N
log N + W
no
! ! ! ! !
key-indexed counting LSD radix sort MSD radix sort 3-way radix quicksort suffix arrays
52
Longest repeated substring Given a string of N characters, find the longest repeated substring.
a g c c a a t c a g
a g c t c g a a a a
c a t g c a g c c c
a g a t g t a t a a
a a a c g a t c c g
g g t g a g c t a a
t t c t a a t c c a
t t c c g t a a t a
t a t g g a g c a a
a t t t c g c a c a
c a g c c a t c t a
a c t a g c a t a a
a t g t g c g c c a
g g t a a c c a g c
c g g t c c t a a t
a t t a a t a g c c
t c a t a a g a a t
g g c c g g c g g a
a t a g g a t t a t
t c c a c t c t c a
g a a g g a a a g t
c a c a g c t t a c
t a a t g a c a c t
g c c c g c g c c a
t c t a g a a t a t
a t a t g t t g a a
c g c c t a a g c a
t a t g a c c t c a
a a a a t a a c a a
53
http://www.bewitched.com
54
Longest repeated substring Given a string of N characters, find the longest repeated substring.
Brute-force algorithm. Try all indices i and j for start of possible match. Compute longest common prefix (LCP) for each pair.
55
a
0
a
1
c
2
a
3
a
4
g
5
t
6
t
7
t
8
a
9
10 11 12 13 14
form su"xes
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a c a a g t t t a c a a g c
a c a a g t t t a c a a g c
c a a g t t t a c a a g c
a a g t t t a c a a g c
a g t t t a c a a g c
g t t t a c a a g c
t t t a c a a g c
t t a c a a g c
t a c a a g c
a c a a g c
c a a g c
a a g c a g c g c c
0 11 3 9 1 12 4 14 10 2 13 5 8 7 6
a a a a a a a c c c g g t t t
a a a c c g g a a c t a t t
c g g a a c t
a c t a a
a g t t t a c a a g c t t a c a a g c g c g t t t a c a a g c
t t a c a a g c
a g c a g t t t a c a a g c t c a t t a c a a a a c c g a a a a g c c g c a g c
a
0
a
1
c
2
a
3
a
4
g
5
t
6
t
7
t
8
a
9
c
56
10 11 12 13 14
public String lrs(String s) { int N = s.length(); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = s.substring(i, N); Arrays.sort(suffixes); String lrs = ""; for (int i = 0; i < N-1; i++) { int len = lcp(suffixes[i], suffixes[i+1]); if (len > lrs.length()) lrs = suffixes[i].substring(0, len); } return lrs; }
create suffixes (linear time and space) sort suffixes find LCP between adjacent suffixes in sorted order
% java LRS < mobydick.txt ,- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th
57
characters 2,162 18,369 191,945 1.2 million 7.1 million 10 million 20 million
brute 0.6 sec 37 sec 1.2 hours 43 hours 2 months 4 months forever
suffix sort 0.14 sec 0.25 sec 1.0 sec 7.6 sec 61 sec 84 sec ???
estimated
58
Suffix sorting: worst-case input Bad input: longest repeated substring very long. Ex: same letter repeated N times. Ex: two copies of the same Java codebase.
form su"xes
0 1 2 3 4 5 6 7 8 9
sorted su"xes
a a a a a a a a a a
a a a a a a a a a
a a a a a a a a
a a a a a a a
a a a a a a
a a a a a
a a a a a a a a a a
9 8 7 6 5 4 3 2 1 0
a a a a a a a a a a
a a a a a a a a a
a a a a a a a a
a a a a a a a
a a a a a a
a a a a a
a a a a a a a a a a
Running time. Quadratic (or worse) in the length of the longest match.
59
! !
Q. What is worst-case running time of best algorithm for problem? Quadratic. Linearithmic. Linear. Nobody knows.
Manber's algorithm suffix trees (see COS 423)
60
Suffix sorting in linearithmic time Manber's MSD algorithm overview. Phase 0: sort on first character using key-indexed counting sort. Phase i: given array of suffixes sorted on first 2i-1 characters, create array of suffixes sorted on first 2i characters.
Worst-case running time. N lg N. Finishes after lg N phases. Can perform a phase in linear time. (!) [ahead]
61
b a b a a a a b c b a b a a a a a 0
a b a a a a b c b a b a a a a a 0
b a a a a b c b a b a a a a a 0
a a a a b c b a b a a a a a 0
a a a b c b a b a a a a a 0
a a b c b a b a a a a a 0
a b c b a b a a a a a 0
b c b a b a a a a a 0
c b a b a a a a a 0
b a b a a a a a 0
a b a a a a a 0
b a a a a a 0
a a a a a 0
a a a a 0
a a a 0 a a 0 a 0 0
17 1 16 3 4 5 6 15 14 13 12 10 0 9 11 7 2 8
0 a a a a a a a a a a a b b b b b c
b 0 a a a b a a a a b a a a c a b
a a a a b c b a b a a a a a 0 a a b c 0 a a a a b b a b a a a b c b 0 a a a a a a a a b b c b a c b a b b a b a a b a a b a a a a a a a a a a a a a a 0 a a 0 a 0 0
0 a a a a a b a a
0 a a a a a b a
a a a 0 a c a
0 b c b a b a a a a a 0 a 0 a a a 0 b a b a a a a a 0 a a 0
sorted
62
b a b a a a a b c b a b a a a a a 0
a b a a a a b c b a b a a a a a 0
b a a a a b c b a b a a a a a 0
a a a a b c b a b a a a a a 0
a a a b c b a b a a a a a 0
a a b c b a b a a a a a 0
a b c b a b a a a a a 0
b c b a b a a a a a 0
c b a b a a a a a 0
b a b a a a a a 0
a b a a a a a 0
b a a a a a 0
a a a a a 0
a a a a 0
a a a 0 a a 0 a 0 0
17 16 12 3 4 5 13 15 14 6 1 10 0 9 11 2 7 8
0 a a a a a a a a a a a b b b b b c
0 a a a a a a a b b b a a a a c b
a a a b a 0 a c a a b b a a b a
a a b c a 0 b a a a a a a a b
a b c b 0
0 c b a b a a a a a 0 b a b a a a a a 0 a b a a a a a 0
a a a a a a a b a
b a a a a a b a a
a b a a a 0 c a a
a c 0 b a
a a a 0 b a b a a a a a 0 c b a b a a a a a 0 0
b a b a a a a a 0 a a a 0 a a 0
sorted
63
b a b a a a a b c b a b a a a a a 0
a b a a a a b c b a b a a a a a 0
b a a a a b c b a b a a a a a 0
a a a a b c b a b a a a a a 0
a a a b c b a b a a a a a 0
a a b c b a b a a a a a 0
a b c b a b a a a a a 0
b c b a b a a a a a 0
c b a b a a a a a 0
b a b a a a a a 0
a b a a a a a 0
b a a a a a 0
a a a a a 0
a a a a 0
a a a 0 a a 0 a 0 0
17 16 15 14 3 12 13 4 5 1 10 6 2 11 0 9 7 8
0 a a a a a a a a a a a b b b b b c
0 a a a a a a a b b b a a a a c b
0 a a a a a b a a c a a b b b a
0 a a a b c a a b a a a a a b
b a 0 c b a a a a a a a b a
c b a b a a a a a 0 0 b a a a b b a a a a a a b b a a c 0 a a a a b a c 0 a b b a a a a a a a a 0 a a a a 0 b a b a a a a a 0 a a a 0 a b a a a a a 0 a 0 c b a b a a a a a 0 0 a a 0 a 0
sorted
64
b a b a a a a b c b a b a a a a a 0
a b a a a a b c b a b a a a a a 0
b a a a a b c b a b a a a a a 0
a a a a b c b a b a a a a a 0
a a a b c b a b a a a a a 0
a a b c b a b a a a a a 0
a b c b a b a a a a a 0
b c b a b a a a a a 0
c b a b a a a a a 0
b a b a a a a a 0
a b a a a a a 0
b a a a a a 0
a a a a a 0
a a a a 0
a a a 0 a a 0 a 0 0
17 16 15 14 13 12 3 4 5 10 1 6 11 2 9 0 7 8
0 a a a a a a a a a a a b b b b b c
0 a a a a a a a b b b a a a a c b
0 a a a a a b a a c a a b b b a
0 a a a b c a a b a a a a a b
0 a b c b a a a a a a a b a
0 c b a a a b a b a a a a
b a b a b a 0 c a a a a
a b a 0 c a b a b a a
b a a a a a 0 a a a a a 0 a a a a 0 b a b a a a a a 0 a a a 0 a 0 c a a b a a a a a 0 a 0 b a b a a a a a 0 a 0 0
65
inverse
0
b a b a a a a b c b a b a a a a a 0
a b a a a a b c b a b a a a a a 0
b a a a a b c b a b a a a a a 0
a a a a b c b a b a a a a a 0
a a a b c b a b a a a a a 0
a a b c b a b a a a a a 0
a b c b a b a a a a a 0
b c b a b a a a a a 0
c b a b a a a a a 0
b a b a a a a a 0
a b a a a a a 0
b a a a a a 0
a a a a a 0
a a a a 0
a a a 0 a a 0 a 0 0
17 16 15 14 3 12 13 4 5 1 10 6
0 + 4 = 4
2 11
9 + 4 = 13
0 9 7 8
0 a a a a a a a a a a a b b b b b c
14 9 12 4 7 8 11 16 17 15 10 13 5 6 3 2 1 0
0 a a a a a a a b b b a a a a c b
0 a a a a a b a a c a a b b b a
0 a a a b c a a b a a a a a b
b a 0 c b a a a a a a a b a
c b a b a a a a a 0 0 b a a a b b a a a a a a b b a a c 0 a a a a b a c 0 a b b a a a a a a a a 0 a a a a 0 b a b a a a a a 0 a a a 0 a b a a a a a 0 a 0 c b a b a a a a a 0 0 a a 0 a 0
4 5 6 7 8 9 10 11 12 13 14 15 16 17
suffixes4[13] # suffixes4[4]
so suffixes8[9] # suffixes8[0]
Keyword-in-context search Given a text of N characters, preprocess it to enable fast substring search (find all occurrences of query string context).
characters of % java more KWIC tale.txt tale.txt 15 surrounding context it was the best of times search itst o was giless the worst to search of times for contraband it was her unavailing the age search of wisdom for your fathe it and le was gone the age in search of foolishness of her husband itprovinces t was the epoch in search of belief of impoverishe it dispersing was the epoch in search of incredulity of other carri itthat n was bed the and season search of light the straw hold it was the season of darkness it was thing better the spring of hope itis t was a far the far winter better of despair thing that i do than ... some sense of better things else forgotte was capable of better things mr carton ent
Preprocess: suffix sort the text. Query: binary search for query; scan until mismatch.
!
632698 713727 660598 67610 4430 42705 499797 182045 143399 411801 158410 691536 536569 484763
s e a l e d _ m y _ l e t t e r _ a n d _ s e a m s t r e s s _ i s _ l i f t e d _ s e a m s t r e s s _ o f _ t w e n t y _ s e a m s t r e s s _ w h o _ w a s _ w i s e a r c h _ f o r _ c o n t r a b a n d s e a r c h _ f o r _ y o u r _ f a t h e s e a r c h _ o f _ h e r _ h u s b a n d s e a r c h _ o f _ i m p o v e r i s h e s e a r c h _ o f _ o t h e r _ c a r r i s e a r c h _ t h e _ s t r a w _ h o l d s e a r e d _ m a r k i n g _ a b o u t _ s e a s _ a n d _ m a d a m e _ d e f a r s e a s e _ a _ t e r r i b l e _ p a s s s e a s e _ t h a t _ h a d _ b r o u g h !
68
String sorting summary We can develop linear-time sorts. Key compares not necessary for string keys. Use characters as index in an array.
We can develop sublinear-time sorts. Should measure amount of data in keys, not number of keys. Not all of the data has to be examined.
3-way string quicksort is asymptotically optimal. 1.39 N lg N chars for random data.
Long strings are rarely random in practice. Goal is often to learn the structure! May need specialized algorithms.
69