Soft Eng Interview Prep
Table of Contents
Introduction 1.1
Complexity 1.2
Data Structures 1.3
Data Structures Examples 1.4
Algorithms 1.5
Algorithms Examples 1.6
Bit Operators 1.7
Numbers 1.8
Operating Systems 1.9
System Architecture 1.10
System Architecture Examples 1.11
Networking 1.12
Strings 1.13
Java 1.14
Java Examples 1.15
OOP 1.16
P,NP 1.17
Introduction
Start from SUMMARY (or see below). It's also available in GitBook format for easier reading
and navigation.
Topics
Complexity
Data Structures
Data Structures Examples
Algorithms
Algorithms Examples
Bit Operators
Numbers
Operating Systems
System Architecture
System Architecture Examples
Networking
Strings
Java
Java Examples
OOP
P,NP
Complexity
Big Oh – measuring run time by counting the number of steps an algorithm takes
(translating to actual time is trivial).
f(n) = O(g(n)) means c*g(n) is an upper bound on f(n) .
f(n) = Θ(g(n)) means c1*g(n) is an upper bound on f(n) and c2*g(n) is a lower
bound on f(n) .
With Big Oh we discard multiplication of constants: n^2 and 1000*n^2 are treated
identically.
Growth rates: 1 < log(n) < n < n*log(n) < n^2 < n^3 < 2^n < n!
Constant functions
Logarithmic: binary search, arises in any process where things are repeatedly
halved or doubled
Linear: traversing every item in an array
Super-linear: quicksort, mergesort
Quadratic: going thru all pairs of elements, insertion sort, selection sort
Cubic: enumerating all triples of items
Exponential: enumerating all subsets of n items
Factorial: generating all permutations or orderings of n items
O(f(n)) + O(g(n)) => O(max(f(n), g(n))) => n^3 + n^2 + n + 1 = O(n^3)
Geometric progression:
a(n) = a(1)*q^(n-1)
S(n) = a(1)*(q^n-1)/(q - 1)
If it converges (when |q| < 1 ):
S(inf) = a(1)/(1 - q)
Combinatorics
All pairs: O(n^2)
All triples: O(n^3)
Number of all permutations: n!
n over k : number of combinations for choosing k items from a set of n
(n over k) = n!/(k!*(n-k)!)
Handy Formulas
1 + 2 + ... + n = n * (n + 1) / 2
Data Structures
Contiguously-allocated structures are composed of single slabs of memory, and include
arrays, matrices, heaps, and hash tables.
Linked data structures are composed of distinct chunks of memory bound together by
pointers, and include lists, trees, and graph adjacency lists.
Arrays
Built from fixed-size records.
Constant access time given the index.
Space efficient – arrays consist only of the data, so no space is wasted with links or
formatting info.
Easy to iterate over quickly, because of memory locality.
Cannot adjust their size in the middle of a program's execution.
Dynamic arrays double in size whenever the insert index is out of bounds (Java's ArrayList
is dynamic, while int[] isn't).
Java array max length is Integer.MAX_VALUE = 2^31-1 (but could actually be a bit
shorter because of reserved memory).
Linked Lists
Singly-linked – each element links to the next one.
Doubly-linked – each element links to the next and previous elements (Java's
LinkedList is doubly-linked).
Stacks and Queues
Stacks – LIFO (push, pop)
Simple to implement; appropriate when retrieval order doesn't matter at all (e.g. processing batch
jobs).
LIFO usually happens in recursive algorithms.
Queues – FIFO (enqueue, dequeue)
Average "waiting time" for jobs is identical for FIFO and LIFO. Maximum time varies
(FIFO minimizes max waiting time).
Harder to implement, appropriate when order is important.
Used for searches in graphs.
Stacks and Queues can be effectively implemented by dynamic arrays or linked lists. If
upper bound of size is known, static arrays can also be used.
Dictionary Structures
Enable access to data items by content (key).
Operations: search, insert, delete, max, min, predecessor/successor (next or previous
element in sorted order).
Implementations include: hash tables, (binary) search trees.
In general, a tree's height can go from log(n) (best) to n (worst); this depends on the
order of creation (insertion).
For random insertion order, on average, the tree will be h = log(n) .
There are balanced binary search trees, which guarantee at each insertion/deletion that
the tree is balanced, thus guaranteeing dictionary performance (insert, delete, query) of
O(log(n)) . Such implementations: Red-Black tree, Splay tree (see below).
Tree Rotation is an operation on a binary tree that changes the structure without
interfering with the order of the elements.
A tree rotation moves one node up in the tree and one node down. It is used to change
the shape of the tree, and in particular to decrease its height by moving smaller
subtrees down and larger subtrees up, resulting in improved performance of many tree
operations.
Tree Traversal
DFS (depth first)
Pre-order: root, left, right
In-order: left, root, right
Post-order: left, right, root
BFS (breadth first)
Implemented using a queue, visit all nodes at current level, advance to next level
(doesn't use recursion).
Balanced BST
A self-balancing binary search tree is any node-based binary search tree that
automatically keeps its height small in the face of arbitrary item insertions and deletions.
Red-Black Tree
Properties: each node is either red or black; the root is black; all leaves (null) are black; both
children of every red node are black; and every simple path from a given node to any of its descendant leaves
contains the same number of black nodes.
From this we get => the path from the root to the furthest leaf is no more than twice
as long as the path from the root to the nearest leaf. The result is that the tree is
roughly height-balanced.
Insertion/deletion may violate the properties of a red–black tree. Restoring the red-
black properties requires a small number ( O(log(n)) or amortized O(1) ) of color
changes. Although insert and delete operations are complicated, their times remain
O(log(n)) .
Splay Tree
A self-adjusting binary search tree with the additional property that recently
accessed elements are quick to access again (they will move nearer to the root).
Performs basic operations such as insertion, look-up and removal in O(log(n))
amortized time.
For many sequences of non-random operations, splay trees perform better than
other search trees, even when the specific pattern of the sequence is unknown.
All normal operations on a binary search tree are combined with one basic
operation, called splaying.
Splaying the tree for a certain element rearranges the tree so that the element is
placed at the root of the tree.
One way to do this is to first perform a standard binary tree search for the element
in question, and then use tree rotations in a specific fashion to bring the element to
the top.
Frequently accessed nodes will move nearer to the root where they can be
accessed more quickly, an advantage for nearly all practical applications and is
particularly useful for implementing caches and garbage collection algorithms.
Simpler to implement than red-black or AVL trees.
The most significant disadvantage of splay trees is that the height of a splay tree
can be linear (in worst case), on average log(n) .
By performing a splay operation on the node of interest after every access, the
recently accessed nodes are kept near the root and the tree remains roughly
balanced, so that we achieve the desired amortized time bounds.
The splay operation consists of rotating the tree around the accessed node and its
parent, and around the parent and the grandparent (depends on the specific
structure).
AVL Tree
The heights of the two child subtrees of any node differ by at most one; if at any
time they differ by more than one, re-balancing is done to restore this property.
Lookup, insertion, and deletion all take O(log(n)) time in both the average and
worst cases.
Insertions and deletions may require the tree to be rebalanced by one or more tree
rotations.
AVL trees are more rigidly balanced than red-black trees, leading to slower
insertion and removal but faster retrieval.
Hash Tables
An implementation of a dictionary.
A hash function is a mathematical function that maps keys to integers.
The value of the hash function is used as an index into an array, and the item is stored
at that position in the array.
The hash function H maps each string (key) to a unique (but large) integer by treating
the characters of the string as “digits” in a base-α number system (where α = number
of available characters, 26 for English).
Java's String.hashCode() function is computed as: s[0]*31^(n-1) + s[1]*31^(n-2) +
... + s[n-1] .
This produces a very large number, which is reduced by taking the remainder of H(S)
mod m (where m is the size of the hash table and H is the hashing function).
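As a sketch, the same polynomial hash can be written with Horner's rule (the method name is illustrative; the loop mirrors what String.hashCode() computes):

static int polynomialHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);   // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]; overflow simply wraps
    }
    return h;
}

The table index would then be something like Math.floorMod(h, m) , so that negative hash values still map into the table.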
Two distinct keys will occasionally hash to the same value, causing a collision.
One solution to collisions is chaining – representing the hash table as an array of m
linked lists, where each list holds the items that hash to that value.
A second solution to collision is open addressing – inserting an item in its place, or if it's
taken, in the next available place. Upon query, after finding the hash, you check if the
key is identical. If not, you move to the next cell in the array to check there (the array is
initiated with all null values, and moving on to the next cell is done until a null cell is
encountered). Upon deletion of an item, all items must be reinserted to the array.
A third solution is double hashing – repeatedly hashing on the first hash until the item is
found, an empty cell is found, or the entire table has been scanned.
All three solutions are O(m) for initializing an m -element hash table to set null
elements prior to first insertion.
Traversing all elements in the table is O(n + m) for chaining and O(m) for open
addressing ( n = inserted keys, m = max hash table size).
Pragmatically, a hash table is often the best data structure to maintain a dictionary.
In Java the size of a HashMap is always a power of 2, for the bit-mask (modulo
equivalent) to be effective.
In Java 8 the HashMap can store bins as trees in addition to linked lists (more than 8
objects get converted to a red-black tree).
String Hashing
Rabin-Karp algorithm – a way to find a substring in a string. Compute the hash of the
pattern p, and of every substring the length of p in the original string, then compare the
hashes. This usually takes O(n + m) , unless we have a hash collision in which case we
continue to check all collisions by the data itself.
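A rough sketch of the rolling-hash idea (the base, modulus and method name are arbitrary choices for illustration, not a specific library API):

// Returns the first index where pattern p occurs in text t, or -1.
static int rabinKarp(String t, String p) {
    int n = t.length(), m = p.length();
    if (m > n) return -1;
    long base = 256, mod = 1_000_000_007L, high = 1;
    for (int i = 0; i < m - 1; i++) high = (high * base) % mod;   // base^(m-1) % mod
    long ph = 0, th = 0;
    for (int i = 0; i < m; i++) {
        ph = (ph * base + p.charAt(i)) % mod;
        th = (th * base + t.charAt(i)) % mod;
    }
    for (int i = 0; i + m <= n; i++) {
        // on a hash match, verify the actual characters (handles collisions)
        if (ph == th && t.regionMatches(i, p, 0, m)) return i;
        if (i + m < n) {   // roll the hash: drop t[i], add t[i+m]
            th = (th - t.charAt(i) * high % mod + mod) % mod;
            th = (th * base + t.charAt(i + m)) % mod;
        }
    }
    return -1;
}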
Hashing is a fundamental idea in randomized algorithms, yielding linear expected-time
algorithms for problems otherwise Θ(n*log(n)) , or Θ(n^2) in the worst case.
Priority Queues
The heap is a priority queue.
Provide more flexibility than simple sorting, because they allow new elements to enter a
system at arbitrary intervals.
It is much more cost-effective to insert a new job into a priority queue than to re-sort
everything on each such arrival.
Operations: insert, find-minimum/maximum, delete-minimum/maximum.
The "priority" criteria can be "length", "size", "id", etc.
Heap
Heaps are a simple and elegant data structure for efficiently supporting the priority
queue operations insert and extract-min.
They maintain a partial order (which is weaker than the sorted order), but stronger than
random – the min item can be found quickly.
The heap is a slick data structure that enables us to represent binary trees without using
any pointers.
We will store data as an array of keys, and use the position of the keys to implicitly
satisfy the role of the pointers.
In general, we will store the 2^(l-1) keys of the l -th level of a complete binary tree from
left-to-right in positions 2^(l−1) to 2^l − 1 .
The positions of the parent and children of the key at position k are readily
determined. The left child of k sits in position 2k and the right child in 2k + 1 , while
the parent of k holds court in position floor(k/2) .
This structure demands that there are no holes in the tree – that each level is packed as
much as it can be – only the last level may be incomplete.
This structure is less flexible than a binary search tree: we cannot move subtrees
around easily by changing a single pointer. We also can't store arbitrary tree structures
(only full trees) without wasting a lot of space.
Performing a search in a heap requires linear scanning – we can't do a log(n) search
like we can in a binary search tree.
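A quick illustration of that index arithmetic for a 1-indexed array (helper names are illustrative):

static int parent(int k)     { return k / 2; }       // floor(k/2)
static int leftChild(int k)  { return 2 * k; }
static int rightChild(int k) { return 2 * k + 1; }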
Fibonacci Heap
A more efficient priority queue than binary heaps in terms of time complexity.
Harder to implement.
Graphs
A Graph is comprised of vertices V (the points) and edges E (the lines connecting
the points).
Assume n is the number of vertices and m is the number of edges.
Directed graphs: inner city streets (one-way), program execution flows.
Undirected graphs: family hierarchy, large roads between cities.
Data structures:
Adjacency Matrix: we can represent a graph using an n * n matrix, where
element [i,j] = 1 if (i, j) is an edge (an edge from point i to point j ), and
0 otherwise.
The matrix representation allows rapid updates for edges (insertion, deletion) or to
tell if an edge is connecting two vertices, but uses a lot of space for graphs with
many vertices but relatively few edges.
Adjacency Lists: we can represent a sparse graph by using linked lists to store the
neighbors adjacent to each vertex.
Lists make it harder to tell if an edge is in the graph, since it requires searching the
appropriate list, but this can be avoided.
The parent list represents each of the vertices, and a vertex's inner list represents
all the vertices the vertex is connected to (the edges for that vertex).
Each vertex appears at least twice in the structure – once as a vertex with a list of
connected vertices, and at least once in the lists of the vertices it's
connected to (in undirected graphs).
We'll mainly use adjacency lists as a graph's data structure.
Traversing a graph (breadth or depth) uses 3 states for vertices: undiscovered,
discovered, processed (all edges have been explored).
Most fundamental graph operation is traversal.
Additional Trees
Trie (Prefix Tree)
A trie is a tree structure, where each edge represents one character, and the root
represents the null string.
Each path from the root represents a string, described by the characters labeling the
edges traversed.
Any finite set of words defines a trie.
Two words with common prefixes branch off from each other at the first distinguishing
character.
Each leaf denotes the end of a string.
Tries are useful for testing whether a given query string q is in the set.
Can be used to implement auto-completion and auto-correction efficiently.
Supports ordered traversal and deletion is straightforward.
We traverse the trie from the root along branches defined by successive characters of
q . If a branch does not exist in the trie, then q cannot be in the set of strings.
Ternary Search Tree
Each node has a single character and pointers to three children: equal , lo and hi .
Can also contain pointer to parent and flag if it's the end of a word.
The lo pointer must point to a node whose character value is less than the current
node ( hi for higher).
The equal points to the next character in the word.
Average running time for operations: O(logn) ; Worst-case: O(n) .
Also relevant: Patricia Tree – A compact representation of a Trie in which any node that
is an only child is merged with its parent.
Radix Tree
A more memory-efficient (compressed) representation of a trie (prefix tree).
Space-optimized by merging each node that is an only child with its parent.
Suffix Tree
A data structure to quickly find all places where an arbitrary query string q occurs in
reference string S (check if S contains q ).
Proper use of suffix trees often speeds up string processing algorithms from O(n^2) to
linear time.
Suffix tree is simply a trie of the n - 1 suffixes of an n -character string S (to
construct just iterate over all suffixes and insert them into the trie).
Suffix tree is simply a trie of all the proper suffixes of S . The suffix tree enables you to
test whether q is a substring of S , because any substring of S is the prefix of some
suffix. The search time is again linear in the length of q .
B-Tree
Balanced search trees designed to work well on disks; used by many databases (or
variants of B-trees).
Typical applications of B-trees have amounts of data that can't fit into main memory at
once.
Similar to red-black trees but they are better at minimizing disk IO.
All leaves are at the same depth in the tree.
Nodes may have many children (tens to thousands).
A node has k keys, which separate (partition) the range of values handled by k+1
children.
When searching for a key in the tree we make a k+1 -way decision based on
comparisons with the k keys stored at any node ( k can differ between nodes).
Nodes are usually as large as a whole disk page to minimize the number of
reads/writes.
The root of a B-tree is usually kept in main memory.
Interval Tree
An interval tree is a red-black tree that maintains a dynamic set of elements, each
containing an interval.
Two intervals a and b overlap if: a.low <= b.high and a.high >= b.low .
The key of each tree node is the low endpoint of the node's interval (start time) => an in-
order traversal lists the intervals in sorted start time.
Additionally, each node stores a value max which is the maximum value of any
interval's end time stored in the subtree rooted at that node.
To find an overlapping interval in the tree with some external interval i , we begin to
traverse the tree. If i.low <= root.left.max we recurse into the left tree; otherwise to
the right one. Keep going until one of the intervals overlap or we reach null .
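The overlap test and the search loop can be sketched like this (the Interval / Node field names follow the description above and are assumptions, not a specific library):

static boolean overlaps(Interval a, Interval b) {
    return a.low <= b.high && a.high >= b.low;
}

// returns some node whose interval overlaps i, or null if none exists
static Node searchOverlap(Node root, Interval i) {
    Node x = root;
    while (x != null && !overlaps(x.interval, i)) {
        if (x.left != null && x.left.max >= i.low) x = x.left;   // an overlap, if any, must be in the left subtree
        else x = x.right;
    }
    return x;
}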
DAWG
Directed acyclic word graph.
Represents the set of all substrings of a string.
Similar in structure to a suffix tree.
Bloom Filter
A probabilistic data structure.
Allows testing whether an element is a member of a set.
False positive matches are possible, but false negatives are not.
A query returns either "possibly in set" or "definitely not in set".
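A tiny sketch of the idea (two hash functions derived from hashCode() ; a real filter derives the bit-array size and number of hash functions from the expected element count and target false-positive rate):

import java.util.BitSet;

class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;

    SimpleBloomFilter(int size) {
        this.bits = new BitSet(size);
        this.size = size;
    }

    void add(String item) {
        bits.set(h1(item));
        bits.set(h2(item));
    }

    // true = "possibly in set", false = "definitely not in set"
    boolean mightContain(String item) {
        return bits.get(h1(item)) && bits.get(h2(item));
    }

    private int h1(String s) { return Math.floorMod(s.hashCode(), size); }
    private int h2(String s) { return Math.floorMod(31 * s.hashCode() + 17, size); }
}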
Data Structures Examples
Stack
Backed by array:
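A minimal sketch (the class name IntStack and growth policy are illustrative):

class IntStack {
    private int[] a = new int[8];
    private int size;

    public void push(int x) {
        if (size == a.length) a = java.util.Arrays.copyOf(a, a.length * 2);   // grow like a dynamic array
        a[size++] = x;
    }

    public int pop() {
        return a[--size];   // caller is expected to check isEmpty() first
    }

    public int peek() {
        return a[size - 1];
    }

    public boolean isEmpty() {
        return size == 0;
    }
}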
Queue
Backed by array:
class IntQueue {   // class name, fields and constructor assumed; enqueue/dequeue follow the original fragment
    private int[] arr;
    private int front = -1;
    private int rear = -1;

    public IntQueue(int capacity) {
        arr = new int[capacity];
    }

    public int size() {
        if (front == -1) return 0;
        return (rear - front + arr.length) % arr.length + 1;
    }

    // add at the rear (no full-queue check, for brevity)
    public void enqueue(int x) {
        if (size() == 0) {
            front = 0;
            rear = 0;
            arr[front] = x;
        } else {
            rear = (rear + 1) % arr.length;
            arr[rear] = x;
        }
    }

    // remove from the front; assumes the queue is not empty
    public int dequeue() {
        int res = arr[front];
        if (front == rear) {
            front = -1;
            rear = -1;
        } else {
            front = (front + 1) % arr.length;
        }
        return res;
    }
}
Binary Tree
class Tree {
    public int value;
    public Tree left;
    public Tree right;
}
Graph
class Vertex implements Comparable<Vertex> {
    // fields assumed from their use in the BFS, DFS and Dijkstra examples below
    public int id;
    public List<Vertex> neighbors = new ArrayList<>();
    public boolean discovered;
    public Vertex parent;
    public int distance = Integer.MAX_VALUE;

    /*
     * For dijkstra priority queue
     */
    @Override
    public int compareTo(Vertex o) {
        if (this.distance < o.distance) return -1;
        else if (o.distance < this.distance) return 1;
        else return 0;
    }
}
Trie
class Trie {
    public Character c;
    public Map<Character, Trie> children = new HashMap<>();   // child nodes, assumed keyed by character
    public boolean isWord;                                    // marks the end of a word (assumed flag)

    /*
     * For root element that has no char
     */
    public Trie() {
        this(null);
    }

    public Trie(Character c) {
        this.c = c;
    }
}
BitSet
public class BitSet {
    private byte[] a;   // each byte holds 8 bits; bit i lives in a[i / 8] under mask (1 << (i % 8))
}
Algorithms
Dynamic programming – compute solutions for smaller instances of a given problem
and use these solutions to construct a solution to the problem. For example: adding a
cache to recursive Fibonacci calculation.
Greedy algorithms – compute a solution in stages, making choices that are locally
optimum at each step; these choices are never undone.
Divide-and-Conquer
Break a complex algorithm into small, often recursive parts.
Allow better parallelization.
To use divide-and-conquer as an algorithm design technique, we must divide the
problem into two smaller subproblems, solve each of them recursively, and then meld
the two partial solutions into one solution to the full problem. Whenever the merging
takes less time than solving the two subproblems, we get an efficient algorithm.
Mergesort is a classic example of divide and conquer (the merge operation is linear).
Sorting
"Naive" sorting algorithms run in O(n^2) while enumerating all pairs.
"Sophisticated" sorting algorithms run on O(nlog(n)) .
An important algorithm design technique is to use sorting as a basic building block,
because many other problems become easy once a set of items is sorted, like:
searching, finding closest pair, finding unique elements, frequency distribution, selection
by order, set of numbers intersection, etc.
Selection Sort – each time find the minimum item, remove it from the list, and continue
to find the next minimum; takes O(n^2) .
Insertion Sort – iterate thru i = 2 to n , j = i to 1 and swap needed items; takes
O(n^2) .
Insertion sort is a little more efficient than selection, because the inner j loop uses a
while, only scanning until the right place in the sorted part of the array is found for the
new item. Selection sort scans all items to always find the minimum item.
Selection and Insertion are very similar, with a difference that after k iterations
Selection will have the k smallest elements in the input, and Insertion will have the
arbitrary first k elements in the input that it processed.
Selection Sort writes less to memory (Insertion writes every step because of swapping),
so it can be preferable when writes are expensive.
Mergesort
A recursive approach to sorting involves partitioning the elements into two groups,
sorting each of the smaller problems recursively, and then interleaving (merging) the
two sorted lists to totally order the elements.
The efficiency of mergesort depends upon how efficiently we combine the two sorted
halves into a single sorted list.
Merging on each level is done by examining the first elements in the two merged lists.
The smallest element must be the head of either of the lists. Removing it, the next
element must again be the head of either of the lists, and so on. So the merge operation
in each level is linear.
This yields an efficiency of O(nlog(n)) .
Mergesort is great for sorting linked lists because it does not access random elements
directly (like heapsort or quicksort), but DON'T try to sort linked lists in an interview.
When sorting arrays with mergesort an additional 3rd array buffer is required for the
merging operation (it can be implemented in-place without an additional buffer, but that
requires complicated buffer manipulation).
Classic divide-and-conquer algorithm, the key is in the merge implementation.
Quicksort
Can be done in-place (using swaps), doesn't require an additional buffer.
Select a random item p from the items we want to sort, and split the remaining n-1
items to two groups, those that are above p and below p . Now sort each group in
itself. This leaves p in its exact place in the sorted array.
Partitioning the remaining n-1 items is linear (this step is equivalent to the "merge"
part in merge sort; merge => partition).
Total time is O(n * h) where h = height of the recursion tree (number of recursions).
If we pick the median element as the pivot in each step, h = log(n) ; this is the best
case of quicksort.
If we pick the left- or right-most element as the pivotal element each time (biggest or
smallest value), h = n (worst case).
On average quicksort will produce good pivots and have nlog(n) efficiency (like binary
search trees insertion).
If we are most unlucky, and select the extreme values, quicksort becomes selection sort
and runs O(n^2) .
For the best case to work, we need to actually select the pivot randomly (or, e.g., take the
median of a few sampled elements).
Heapsort
(See also Heap in Data Structures)
Distribution Sort
Split a big problem into sub (ordered) "buckets" which need to be sorted themselves,
but then all results can just be concatenated. An example: a phone book. Split names
into buckets for the starting letter of the last name, sort each bucket, and then combine
all strings to one big sorted phone book. We can further split each pile based on the
second letter of the last name, reducing the problem even further.
Bucketing is effective if we are confident that the distribution of data will be roughly
uniform (much like hash tables).
Performance can be terrible if data distribution is not like we thought it is.
External Sort
Sorting data that doesn't fit into main memory: sort chunks that do fit, write them out, and
merge the sorted chunks (e.g. with a k-way merge).
Counting sort – O(n + k) for keys in a known range of size k : count the occurrences of
each key into an array of counters,
and then accumulate each of the cells to know how many total elements were before
each index.
Radix sort – use counting sort multiple times to sort on the least significant digit, then
the next one, etc.
Bucket (Bin) sort – O(n) , sorts n uniformly spread real numbers in [0,M) , by
dividing the range into n equal-size buckets (each of width M/n ), then sorting each bucket. Uniform distribution
will yield linear time; non-uniform will be O(nlogn) .
Searching
Binary Search
Fast algorithm for searching in a sorted array of keys.
To search for key q , we compare q to the middle key S[n/2] . If q appears before
S[n/2] , it must reside in the top half of S ; if not, it must reside in the bottom half of
S . Repeat recursively.
Efficiency is O(logn) .
Several interesting algorithms follow from simple variants of binary search:
Eliminating the "equals check" in the algorithm (keeping the bigger/smaller checks)
23
Algorithms
Randomization
Randomize array ( O(nlogn) ) – create a new array with "priorities", which are random
numbers between 1 - n^3 , then sort the original array based on new priorities array as
the keys.
Randomize array in place ( O(n) ) – swap a[i] with a[rand(i, n)] .
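A sketch of the in-place version (this is the Fisher-Yates shuffle):

static void shuffleInPlace(int[] a) {
    java.util.Random rnd = new java.util.Random();
    for (int i = 0; i < a.length - 1; i++) {
        int j = i + rnd.nextInt(a.length - i);   // random index in [i, n)
        int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }
}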
Selection
Quickselect – find the k -th smallest element by partitioning around a pivot (as in
quicksort) and recursing only into the side that contains the k -th element; average O(n) .
The median-of-medians variant guarantees a good pivot: it divides the input into groups of
five elements and finds the median of each group.
It then finds the median of these medians by recursively calling itself, and selects
the median of medians as the pivot for partition.
Graph Algorithms
BFS Graph Traversal
Breadth-first search usually serves to find shortest-path distances from a given source
in terms of number of edges (not weight).
Start exploring the graph from the given root (can be any vertex) and slowly progress
out, while processing the oldest-encountered vertices first.
We assign a direction to each edge, from the discoverer (u) to the discovered (v)
( u is the parent of v ).
This defines (results in) a tree of the vertices starting with the root (breadth-first tree).
The tree defines the shortest-path from the root to every other node in the tree, making
this technique useful in shortest-path problems.
Once a vertex is discovered, it is placed in a queue. Since we process these vertices in
first-in, first-out order, the oldest vertices are expanded first, which are exactly those
closest to the root.
During the traversal, each vertex discovered has a parent that led to it. Because
vertices are discovered in order of increasing distance from the root, the unique path
from the root to each node uses the smallest number of edges on any root to node path
in the graph.
This path has to be constructed backwards, from the destination node to the root.
BFS can have any node as the root – depends on what paths we want to find.
BFS returns the shortest-path in terms of number of edges (for both directed and
undirected graphs), not in terms of weight (see shortest-paths below).
Traversal is O(n + m) where n = number of vertices and m = total number of edges.
Applications:
Find shortest-path in terms of number of edges
Garbage collection scanning
Connected components
"Connected graph" = there is a path between any two vertices.
"Connected component" = set of vertices such that there is a path between
every pair of vertices; there must not be a connection between two
components (otherwise they would be the same component).
Many problems can be reduced to finding or counting connected components
(like solving a Rubik's cube – is the graph of legal configurations connected?).
A connected component can be found with BFS – every node found in the tree belongs to
the same component; repeat from any still-undiscovered vertex to find the next component.
DFS Graph Traversal
Depth-first search explores as deep as possible before backtracking; it can be implemented
with recursion or an explicit stack, and classifies edges into tree edges and back edges.
Applications:
Topological sort – create a linear ordering of a DAG (directed acyclic graph), such
that for every edge (u, v) , u appears before v in the ordering; computed by
DFS and adding each "visited" vertex to the front of a linked list.
Finding cycles – any back edge means that we've found a cycle. If we only have
tree edges then there are no cycles (to detect a back edge, when you process the
edge (x,y) check if parent[x] != y ).
Finding articulation vertices (an articulation or cut vertex is one whose removal
disconnects a connected component, causing loss of connectivity). Finding
articulation vertices by brute-force is O(n(n + m)) .
Strongly connected components – in a SCC of a directed graph, every pair of
vertices u and v are reachable from each other.
Shortest-Paths
Find the path with minimal edge weights between a source vertex and every other
vertex in the graph.
Similarly, BFS finds a shortest-path for an unweighted graph in terms of number of
edges.
Sub-paths of shortest-paths are also shortest-paths.
Graphs with negative-weight cycles don't have a shortest-paths solution.
Some algorithms assume no negative weights.
Shortest-paths have no cycles.
We will store the path itself in vertex.parent attributes, similar to BFS trees; the result
of a shortest-paths algorithm is a shortest-paths tree: a rooted tree containing a
shortest-path from the source to every vertex that is reachable.
There may be more than one shortest-path between two vertices and more than one
shortest-paths tree for a source vertex.
Each vertex v maintains an attribute v.distance which is an upper bound on the
weight of a shortest-path ("shortest-path estimate") from the source to v ;
v.distance is initialized to Integer.MAX_VALUE and source.distance is initialized to
0 .
The action of relaxing an edge (from, to) tests whether we can improve the shortest-
path to to found so far by going through from , and if so updating to.parent and
to.distance (and the priority queue).
Dijkstra's Algorithm
Algorithm for finding the shortest-paths between nodes in a graph, which may
represent, for example, road networks.
Assumes that edge weight is nonnegative.
Fixes a single node as the source node and finds shortest-paths from the source to all
other nodes in the graph, producing a shortest-path tree.
The algorithm uses a min-priority queue for vertices based on their .distance value
(shortest-path estimate).
It also maintains a set of vertices whose final shortest-path weights from the source
have already been determined.
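A compact sketch of the main loop, using java.util.PriorityQueue and assuming each vertex keeps its outgoing weighted edges in a map called edges (the class and field names here are assumptions for illustration):

static void dijkstra(Collection<Vertex> vertices, Vertex source) {
    source.distance = 0;
    PriorityQueue<Vertex> q = new PriorityQueue<>(Comparator.comparingInt((Vertex v) -> v.distance));
    q.addAll(vertices);
    while (!q.isEmpty()) {
        Vertex from = q.remove();                       // smallest shortest-path estimate
        if (from.distance == Integer.MAX_VALUE) break;  // remaining vertices are unreachable
        for (Map.Entry<Vertex, Integer> e : from.edges.entrySet()) {
            Vertex to = e.getKey();
            int candidate = from.distance + e.getValue();
            if (candidate < to.distance) {              // relax the edge (from, to)
                to.distance = candidate;
                to.parent = from;
                q.remove(to);                           // re-insert so the queue re-orders it
                q.add(to);
            }
        }
    }
}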
A*
An extension of Dijkstra which achieves better time performance by using heuristics.
Backtracking
General algorithm for finding all (or some) solutions to a problem, that incrementally
builds candidates to the solution, and abandons ("backtracks") each partial candidate as
soon as it determines that it cannot possibly be completed to a valid solution.
The classic textbook example of the use of backtracking is the eight queens puzzle;
another one is the knapsack problem.
Backtracking can be applied only for problems which admit the concept of a "partial
candidate solution" and a relatively quick test of whether it can possibly be completed to
a valid solution.
When it is applicable, however, backtracking is often much faster than brute force
enumeration of all complete candidates, since it can eliminate a large number of
candidates with a single test.
Levenshtein Distance
A string metric for measuring the difference between two sequences.
The Levenshtein distance between two words is the minimum number of single-
character edits (i.e. insertions, deletions or substitutions) required to change one word
into the other.
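A standard dynamic-programming sketch:

// dist[i][j] = edit distance between the first i chars of a and the first j chars of b
static int levenshtein(String a, String b) {
    int[][] dist = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) dist[i][0] = i;   // delete all of a's prefix
    for (int j = 0; j <= b.length(); j++) dist[0][j] = j;   // insert all of b's prefix
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            dist[i][j] = Math.min(dist[i - 1][j - 1] + substitution,   // substitute (or match)
                         Math.min(dist[i - 1][j] + 1,                  // delete
                                  dist[i][j - 1] + 1));                // insert
        }
    }
    return dist[a.length()][b.length()];
}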
Algorithms Examples
Tree Traversal
In-Order
static void inOrder(Tree tree) {
    if (tree == null) return;
    inOrder(tree.left);
    // process tree.value
    inOrder(tree.right);
}
Without recursion:
Tree curr = tree;
Stack<Tree> stack = new Stack<>();
while (curr != null || !stack.isEmpty()) {
    while (curr != null) { stack.push(curr); curr = curr.left; }
    curr = stack.pop();
    // process curr.value
    curr = curr.right;
}
Pre-Order
static void preOrder(Tree tree) {
    if (tree == null) return;
    // process tree.value
    preOrder(tree.left);
    preOrder(tree.right);
}
Without recursion:
Stack<Tree> stack = new Stack<>();
Tree curr;
if (tree != null) stack.push(tree);
while (!stack.isEmpty()) {
    curr = stack.pop();
    // process curr.value
    if (curr.right != null) stack.push(curr.right);
    if (curr.left != null) stack.push(curr.left);
}
Post-Order
static void postOrder(Tree tree) {
    if (tree == null) return;
    postOrder(tree.left);
    postOrder(tree.right);
    // process tree.value
}
Without recursion:
Stack<Tree> tmp = new Stack<>();
Stack<Tree> all = new Stack<>();
Tree curr;
if (tree != null) tmp.push(tree);
while (!tmp.isEmpty()) {
    curr = tmp.pop();
    all.push(curr);
    if (curr.left != null) tmp.push(curr.left);
    if (curr.right != null) tmp.push(curr.right);
}
while (!all.isEmpty()) {
    curr = all.pop();
    // process curr.value
}
Sorting
Insertion Sort
static void insertionSort(int[] a) {
    for (int i = 1; i < a.length; i++) {
        int key = a[i], j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}
Selection Sort
return smallest;
}
Mergesort
return res;
}
Quicksort
return pivot;
}
Heapsort
static void heapsort(int[] a) {
    // build a max-heap by heapifying every non-leaf node, bottom-up
    for (int i = a.length / 2 - 1; i >= 0; i--) {
        heapify(a, i, a.length);
    }
    // extract max element from the head to the end and shrink the size of the heap
    for (int last = a.length - 1; last >= 0; last--) {
        swap(a, 0, last);
        heapify(a, 0, last);
    }
}

// sift the value at root down until its subtree is a max-heap (swap(a, i, j) is the usual two-element swap helper)
static void heapify(int[] a, int root, int length) {
    int left = 2 * root + 1, right = 2 * root + 2, largest = root;
    if (left < length && a[left] > a[largest]) largest = left;
    if (right < length && a[right] > a[largest]) largest = right;
    if (largest != root) {
        swap(a, root, largest);
        heapify(a, largest, length);
    }
}
Counting Sort
return sorted;
}
Searching
Binary Search
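A minimal iterative sketch over a sorted int array:

// Returns the index of key in the sorted array a, or -1 if not present.
static int binarySearch(int[] a, int key) {
    int lo = 0, hi = a.length - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;   // avoids overflow of (lo + hi)
        if (a[mid] == key) return mid;
        else if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}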
Selection
Quickselect
return pivot;
}
Graph Algorithms
Utility function:
BFS
static void bfs(Vertex root) {
    Queue<Vertex> q = new LinkedList<>();
    root.discovered = true;
    q.add(root);
    while (!q.isEmpty()) {
        Vertex from = q.remove();
        for (Vertex to : from.neighbors) {
            if (!to.discovered) {
                to.parent = from;
                to.discovered = true;
                q.add(to);
            }
        }
    }
}
DFS
static void dfs(Graph graph, Vertex v) {
    v.discovered = true;
    for (Vertex to : v.neighbors) {
        if (to.discovered) {
            // cycle found (back edge)! What should we do? depends on the application.
        } else {
            to.parent = v;
            dfs(graph, to);
        }
    }
    // TODO: insert application of DFS here. For example: if we're doing topological sorting
    // then add v to head of a linked list at this point
}
Dijkstra
while (!q.isEmpty()) {
Vertex from = q.remove();
Bit Operators
Leading zeros (i.e., to the left) are meaningless.
~ NOT
Invert a bit
Examples:
~0 = 1
~1 = 0
~0111 = 1000
~100 = 011
int b = 0b10;
~b == 11111111111111111111111111111101
& AND
Results in 1 in each position if the corresponding first bit and second bit are 1 ,
otherwise 0 .
Lets you test whether a certain bit in a number is 1 or 0 . Can be thought of as
multiplying the corresponding bits.
Examples:
10 & 11 = 10
| OR
Results in 1 in each position if the corresponding first bit or second bit are 1 ,
otherwise 0 .
Lets you set a specific bit to 1 .
Examples:
10 | 11 = 11
^ XOR
Exclusive OR – results in 1 in each position if the corresponding first bit or second bit
are 1 , but not both, otherwise 0 .
Lets you compare two bits – 1 means they are different, 0 means they are the
same.
Can be used to invert selected bits in a register. Any bit can be toggled by XOR-ing it
with 1 .
XOR-ing a value against itself yields zero.
Examples:
0101 ^ 0011 = 0110
See also Java section on Bit Arithmetics/Operations and Data Structures code examples for
BitSet.
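A few common single-bit manipulations in Java, tying the operators above together:

static int setBit(int x, int k)    { return x | (1 << k); }          // force bit k to 1
static int clearBit(int x, int k)  { return x & ~(1 << k); }         // force bit k to 0
static int toggleBit(int x, int k) { return x ^ (1 << k); }          // flip bit k
static boolean isSet(int x, int k) { return (x & (1 << k)) != 0; }   // test bit k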
Binary Numbers
0 = 0
1 = 1
10 = 2
11 = 3
100 = 4
101 = 5
110 = 6
111 = 7
1000 = 8
1001 = 9
1010 = 10
1011 = 11
1100 = 12
1101 = 13
1110 = 14
1111 = 15
10000 = 16
...
10010110
^      ^
|      |------- bit 0
|
|-------------- bit 7
Having a 1 in the k -th bit means that 2^k is part of the number's value. For
example, for the above number:
10010110 = 2^7 + 2^4 + 2^2 + 2^1 = 128 + 16 + 4 + 2 = 150
Numbers
https://gist.github.com/jboner/2841832
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Latency

Memory

| Operation | Time | Note |
|-----------|------|------|
| L1 ref    | 1ns  |      |
| L2 ref    | 4ns  |      |

| From | Time  |
|------|-------|
| RAM  | 20us  |
| SSD  | 300us |
| HDD  | 2ms   |

CPU

| Operation  | Time |
|------------|------|
| Mutex lock | 17ns |

Network

| Operation        | Time  |
|------------------|-------|
| Round-trip in DC | 500us |
Data Speeds
Operation Speed
Time

| Period | Seconds |
|--------|---------|
| 1 hour | 3.6K    |
| 1 day  | 86K     |
| 1 year | 30M     |
Powers of 2

| Power | Value |
|-------|-------|
| 2^3   | 8     |
| 2^4   | 16    |
| 2^5   | 32    |
| 2^6   | 64    |
| 2^7   | 128   |
| 2^8   | 256   |
| 2^9   | 512   |
Operating Systems
Memory
Byte is the smallest addressable unit of memory.
The stack is the memory set aside as scratch space for a thread of execution. When a
function is called, a block (also called stack frame) is reserved on the top of the stack for
local variables and some bookkeeping data. When that function returns, the block
becomes unused and can be used the next time a function is called. The stack is
always reserved in a LIFO order; the most recently reserved block is always the next
block to be freed. This makes it really simple to keep track of the stack; freeing a block
from the stack is nothing more than adjusting one pointer.
The heap is memory set aside for dynamic allocation. Unlike the stack, there's no
enforced pattern to the allocation and deallocation of blocks from the heap; you can
allocate a block at any time and free it at any time.
Each thread gets a stack, while there's typically only one heap for the application
(although it isn't uncommon to have multiple heaps for different types of allocation).
The stack is attached to a thread, so when the thread exits the stack is reclaimed. The
heap is typically allocated at application startup by the runtime, and is reclaimed when
the application (technically process) exits.
The size of the stack is set when a thread is created. The size of the heap is set on
application startup, but can grow as space is needed (the allocator requests more
memory from the operating system).
The stack is faster because the access pattern makes it trivial to allocate and deallocate
memory from it (a pointer/integer is simply incremented or decremented), while the
heap has much more complex bookkeeping involved in an allocation or memory freeing.
Also, each byte in the stack tends to be reused very frequently which means it tends to
be mapped to the processor's cache, making it very fast.
Virtual memory technique virtualizes the main storage available to a process or task, as
a contiguous address space which is unique to each running process, or virtualizes the
main storage available to all processes or tasks on the system as a contiguous global
address space.
Virtual memory illustrated
x64 is a 64-bit platform while x86 is a 32-bit platform; pointers are 64-bit vs 32-bit respectively.
Lock is a generic term for an object that works like a key and allows the access to a
target object only by one thread. Only the thread which owns the key can unlock the
"door".
Monitor is a cross-thread lock.
In Java every object is a monitor.
Lock allows only one thread to enter the part that's locked and the lock is not shared
with any other processes.
Mutex and Semaphore are kernel resources that provide synchronization services.
Mutex
Mutex is a cross-process lock, same as a lock but system wide.
Mutex is a locking mechanism used to synchronize access to a resource.
Mutex provides mutual exclusion, either producer or consumer can have the key
(mutex) and proceed with their work.
Java has no system-wide mutex support.
Semaphore
Semaphore is a signaling mechanism ("I am done, you can carry on" kind of signal).
A particular lock that allows more than one thread to access the target object. It's like
having more keys, so that many people can unlock the door.
Semaphore is a generalized mutex; if upper bound is set to one (1) it's the same as a
monitor.
Does the same as a lock but allows x number of threads to enter.
Processes
A process has a self-contained execution environment. A process generally has a
complete, private set of basic run-time resources; in particular, each process has its
own memory space.
Processes are often seen as synonymous with programs or applications. However, what
the user sees as a single application may in fact be a set of cooperating processes. To
facilitate communication between processes, most operating systems support Inter
Process Communication (IPC) resources, such as pipes and sockets. IPC is used not
just for communication between processes on the same system, but also between processes on
different systems.
Each process has its own memory space, and addresses memory inside that space.
Each process can contain many threads.
Threads share the memory space of their parent process.
Process resources:
Working directory
Environment variables
File handles
Sockets
Memory space
Process Signaling
Signals are a limited form of inter-process communication used in Unix, Unix-like, and
other POSIX-compliant operating systems.
A signal is an asynchronous notification sent to a process.
If the process has previously registered a signal handler, that routine is executed.
Otherwise, the default signal handler is executed.
Some signals may be ignored by an application (like SIGTERM ).
The kernel can generate signals to notify processes of events.
For example, SIGPIPE will be generated when a process writes to a pipe which has
been closed by the reader.
There are two signals which cannot be intercepted and handled: SIGKILL (terminate
immediately, no cleanup) and SIGSTOP .
Pressing Ctrl+C in the terminal sends SIGINT (interrupt) to the process; by default,
this causes the process to terminate.
$ kill -9 will send a SIGKILL .
Notable signals:
SIGINT (2) – Terminal interrupt
Threads
Threads are sometimes called lightweight processes. Both processes and threads
provide an execution environment, but creating a new thread requires fewer resources
than creating a new process.
Threads exist within a process – every process has at least one. Threads share the
process' resources, including memory and open files. This makes for efficient, but
potentially problematic, communication.
Thread creation is a costly operation, and resources need to be constructed and
obtained for a new thread.
Creating and tearing down threads isn't free: there'll be some CPU overhead each time
we do so.
There may be some moderate limit on the number of threads that can be created,
determined by the resources that a thread needs to have allocated (if a process has
2GB of address space, and each thread has 512KB of stack, that means a maximum of
a few thousand threads per process).
Solution: use a threadpool.
Thread Scheduler
Responsible for sharing out the available CPUs in some way among the competing
threads.
On multiprocessor systems, there is generally a scheduler per processor,
and these schedulers need to be coordinated in some way.
Most systems use priority-based round-robin scheduling to some extent:
A thread of higher priority (which is a function of base and local priorities) will
preempt a thread of lower priority.
Otherwise, threads of equal priority will essentially take turns at getting an allocated
slice or quantum of CPU.
A thread can be in states:
New – created and waiting for needed resources to be created
Runnable/Running – ready to run, or currently running on a CPU
Blocked/Waiting – waiting for a resource: I/O to complete, memory to be
paged, sleep to finish, etc.
Terminated – finished, waiting for its resources to be cleared
Each thread has a quantum, which is effectively how long it is allowed to keep hold of
the CPU.
Thread quanta are generally defined in terms of some number of clock ticks.
A clock tick is typically 1ms under Linux.
A quantum is usually a small number of clock ticks, between 10-200 clock ticks (i.e. 10-
200 ms) under Linux.
At key moments, the thread scheduler considers whether to switch the thread that is
currently running on a CPU. These key moments are usually: periodically, if a thread
ceases to be runnable, when some other attribute of the thread changes.
At these decision points, the scheduler's job is essentially to decide, of all the runnable
threads, which are the most appropriate to actually be running on the available CPUs.
Priorities differ between OS, and can be partially set by the user.
Great article on CPU Clocks and Clock Interrupts, and Their Effects on Schedulers.
Context Switching
This is the procedure that takes place when the system switches between threads
running on the available CPUs.
Every time the thread running on a CPU actually changes – often referred to as a
context switch – there'll be some negative impact due to, e.g., the interruption of the
instruction pipeline or the fact that the processor cache may no longer be relevant.
Switching between threads of different processes will carry a higher cost, since the
address-to-memory mappings must be changed, and the contents of the cache almost
certainly will be irrelevant to the next process.
Context switches appear to typically have a cost somewhere between 1 and 10
microseconds.
A modest number of fast-case switches — e.g. a thousand per second per CPU will
generally be much less than 1% of CPU usage for the context switch per se.
A few slower-case switches in a second, but where each switched-in thread can do, say,
a milliseconds or so of worth of real work (and ideally several milliseconds) once
switched in, where the more memory addresses the thread accesses (or the more
cache lines it hits), the more milliseconds we want it to run uninterrupted for.
So the worst case is generally where we have several "juggling" threads which each
time they are switched in only do a tiny amount of work (but do some work, thus hitting
memory and contending with one another for resources) before context switching.
The cause of excessive context switching comes from contention on shared resources,
particularly synchronized locks:
Rarely, a single object very frequently synchronized on could become a bottleneck.
More frequently, a complex application has several different objects that are each
synchronized on with moderate frequency, but overall, threads find it difficult to
make progress because they keep hitting different contended locks at regular
intervals.
Solutions include holding on to locks for less time and (as part of this), reducing the
"housekeeping" involved in managing a lock.
Livelock – threads are not blocked, but they keep reacting to each other's state changes
without making any real progress. Example:
Husband and wife eating dinner with 1 spoon, and keep passing the spoon to the
other spouse
How to avoid live-locking – try to add some randomness.
Starvation describes a situation where a thread is unable to gain regular access to
shared resources and is unable to make progress. This happens when shared
resources are made unavailable for long periods by "greedy" threads. For example,
suppose an object provides a synchronized method that often takes a long time to
return. If one thread invokes this method frequently, other threads that also need
frequent synchronized access to the same object will often be blocked.
Race Conditions
A race condition occurs when 2 or more threads are able to access shared data and
they try to change it at the same time. Because the thread scheduling algorithm can
swap between threads at any point, you don't know the order at which the threads will
attempt to access the shared data. Therefore, the result of the change in data is
dependent on the thread scheduling algorithm, i.e. both threads are 'racing' to
access/change the data.
Often problems occur when one thread does a "check-then-act" (e.g. "check" if the
value is X, and then "act" to do something that depends on the value being X) and
another thread does something to the value in between the "check" and the "act".
In order to prevent race conditions occurring, typically you would put a lock around the
shared data to ensure that only one thread can access the data at a time.
Sockets
A socket is one endpoint of a two-way communication link between two programs
running on the network, existing on the transport layer.
You can send and receive things on a socket, you can bind and listen to a socket.
A socket is specific to a protocol, machine, and port, and is addressed as such in the
header of a packet.
Socket types:
Datagram – Uses UDP for delivering packets. No guarantee of ordering or delivery.
Stream – Typically uses TCP or SCTP. Ordering and delivery of packets is
guaranteed.
Raw – No protocol specified on transport layer.
System calls:
bind – Assigns a name to an unnamed socket.
accept – Waits for an incoming connection from a client process; the server blocks in this
system call until a connection arrives.
close – Close a socket.
System Architecture
Elements of Scale: Composing and Scaling Data Platforms
Please stop calling databases CP or AP
What we talk about when we talk about distributed systems
A Comprehensive Guide to Building a Scalable Web App on Amazon Web Services
Notes on Google's Site Reliability Engineering Book or the book itself
HDD
Stores data on a series of constantly-spinning magnetic disks, called platters.
Actuator arm with read/write heads positions the heads over the correct area of the
drive to read/write (nanometers above platters).
The drive may need to read from multiple locations in order to launch a program or load
a file, which means it may have to wait for the platters to spin into the proper position
multiple times before it can complete the command.
If a drive is asleep or in a low-power state, it can take several seconds more for the disk
to spin up to full power.
Latency measured in milliseconds.
SAS – Serial Attached SCSI – faster than HDD but still with a spinning platter.
SSD
Don't rely on moving parts or spinning disks.
Data is saved to a pool of NAND flash.
NAND itself is made up of what are called floating gate transistors. Unlike the transistor
designs used in DRAM, which must be refreshed multiple times per second, NAND
flash is designed to retain its charge state even when not powered up.
NAND flash is organized in a grid. The entire grid layout is referred to as a block, while
the individual rows that make up the grid are called a page.
Common page sizes are 2K, 4K, 8K, or 16K, with 128 to 256 pages per block.
Block size therefore typically varies between 256KB and 4MB.
Data read/written at the page level (individual rows within the grid).
Data erased at the block level (requires a high amount of voltage).
Latency measured in microseconds.
SSD controllers have caches and a DDR3 memory pool to help with managing the
NAND.
RAID
Redundant Array of Independent Disks
Performance increase is a function of striping: data is spread across multiple disks to
allow reads and writes to use all the disks' IO queues simultaneously.
Redundancy is gained by creating special stripes that contain parity information, which
can be used to recreate any lost data in the event of a hardware failure.
Hardware RAID – a dedicated hardware controller with a processor dedicated to RAID
calculations and processing.
RAID 0 – Striping – splits blocks of data into as many pieces as disks are in the array,
and writes each piece to a separate disk.
Single disk failure means all data lost.
Increase in throughput: throughput of one disk * number of disks.
Total space is sum of all disks.
Requires min. 2 drives.
RAID 1 – Mirroring – duplicates data identically on all disks.
Data preserved until loss of last drive.
Reads faster or the same, writes slower.
Total space = size of smallest disk.
Requires min. 2 drives.
RAID 10 – combination of RAID 1 and 0 (in that order). Create arrays of RAID 1, and
then apply RAID 0 on top of them. Requires min. 4 disks, additional ones need to be
added in pairs. A single disk can be lost in each RAID 1 pair. Guaranteed high speed
and availability. Total space = 50% of disk capacity.
RAID 5 – uses a simple XOR operation to calculate parity. Upon single drive failure, the
information can be reconstructed from the remaining drives using the XOR operation on
the known data. Any single drive in the array can fail (requires a minimum of 3
drives). In case of drive failure the rebuilding process is IO intensive.
RAID 6 – like RAID 5, but two disks can fail and no data loss.
CAP Theorem
Shared-data systems can have at most two of the three following properties:
Consistency, Availability, and tolerance to network Partitions.
Consistency
There must exist a total order on all operations such that each operation looks as if
it were completed at a single instant.
A system is consistent if an update is applied to all relevant nodes at the same
logical time.
Standard database replication is not strongly consistent because special logic must
be introduced to handle replication lag.
Consistency which is both instantaneous and global is impossible.
Mitigate by trying to push the time resolutions at which the consistency breaks
down to a point where we no longer notice it.
Availability
Every request received by a non-failing node in the system must result in a
response (that contains the results of the requested work).
Even when severe network failures occur, every request must terminate.
Partition Tolerance
When a network is partitioned, all messages sent from nodes in one component of
the partition to nodes in another component are lost.
Any distributed system can experience a network partition.
For a distributed system to not require partition-tolerance it would have to run on a
network which is guaranteed to never drop messages (or even deliver them late) –
but such a network doesn't exist!
If one node is partitioned, then that node is either inconsistent or not available.
If a system chooses to provide Consistency over Availability in the presence of
partitions, it will preserve the guarantees of its atomic reads and writes by refusing to
respond to some requests (like refuse writes on all nodes).
If a system chooses to provide Availability over Consistency in the presence of
partitions, it will respond to all requests, potentially returning stale reads and accepting
conflicting writes (which can later be resolved with Vector Clocks, last-write wins, etc.).
Most real-world systems require substantially less in the way of consistency guarantees
than they do in the way of availability guarantees.
Failures of consistency are usually tolerated or even expected, but just about every
failure of availability means lost money.
The choice of availability over consistency is a business choice, not a technical one.
MapReduce
Programming model and an associated implementation for processing and generating
large data sets with a parallel, distributed algorithm on a cluster.
Map procedure performs filtering and sorting.
Reduce procedure performs a summary operation.
Data transferred between map and reduce servers is compressed. Because servers
aren't CPU bound it makes sense to spend on data compression and decompression in
order to save on bandwidth and I/O.
Advantages:
// Mapper
/*
 * Processes a key/value pair to generate a set of intermediate key/value pairs
 *
 * @param key id of document
 * @param value text of page
 * @returns pairs of word and document id
 */
public List<Tuple<String, String>> map(String key, String value) {
    List<Tuple<String, String>> pairs = new ArrayList<>();
    for (String word : tokenize(value)) {
        pairs.add(new Tuple<>(word, key));   // emit one pair per word occurrence
    }
    return pairs;
}

// Reducer
/*
 * Merges all intermediate values associated with the same intermediate key
 *
 * @param key word
 * @param value list of document ids
 * @returns word to document ids pair
 */
public Tuple<String, List<String>> reduce(String key, List<String> value) {
    return new Tuple<>(key, value);
}
Consistent Hashing
Special kind of hashing such that when a hash table is resized and consistent hashing
is used, only K/n keys need to be remapped on average, where K is number of keys
and n is number of buckets.
In contrast, in most traditional hash tables, a change in the number of array slots causes
nearly all keys to be remapped.
Originated as a way of distributing requests among a changing population of Web
servers.
Consistent hashing maps objects to the same cache machine, as long as possible.
When a cache machine is added, it takes its share of objects from all the other cache
machines and when it is removed, its objects are shared between the remaining
machines.
Consistent hashing is based on mapping each object to a point on the edge of a circle
(or equivalently, mapping each object to a real angle). The system maps each available
machine (or other storage bucket) to many pseudo-randomly distributed points on the
edge of the same circle.
Apache Cassandra, for example, uses consistent hashing for data partitioning in the
cluster.
An alternative is Rendezvous Hashing.
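A toy sketch of the ring using a sorted map (the hash function here is a placeholder, and a real implementation would also place many virtual nodes per machine to smooth the distribution):

import java.util.*;

class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();   // point on the circle -> node

    void addNode(String node)    { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); }

    // A key is served by the first node clockwise from the key's point (assumes at least one node).
    String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));
        return e != null ? e.getValue() : ring.firstEntry().getValue();   // wrap around the circle
    }

    private int hash(String s) {
        return s.hashCode() & 0x7fffffff;   // placeholder; use a stronger, well-distributed hash in practice
    }
}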
Controlled Delay (CoDel)
Instead of one fixed queuing timeout, use two: if the queue has not been empty for the
last N ms, cap time spent in the queue at M ms; otherwise allow up to N ms.
Prevents a standing queue (because the lastEmptyTime will be in the distant past,
causing an M -ms queuing timeout) while allowing short bursts of queuing for
reliability purposes.
Values of 5ms for M and 100ms for N tend to work well across a wide set of use
cases.
Adaptive LIFO
During normal operating conditions, requests are processed in FIFO order, but
when a queue is starting to form, the server switches to LIFO mode.
Places new requests at the front of the queue, maximizing the chance that they will
meet the deadline set by Controlled Delay.
Backup requests
Initiate a second ("backup") request to a different server N ms after the first
request was made, if it still hasn't returned.
Dramatic improvement for high percentiles.
Example flow:
1. Send request to first replica.
2. Wait 2 ms, and send to second replica.
3. Cancel request on other replica when starting to read second request's
response.
Load-balancing
High availability
MapReduce
CDN
Queues
Timeouts
Fail fast
Circuit breakers
Throttling
Consistent hashing for sharding/partitioning
Compression
Possible to work harder on writes in order to make reads easier
Platform Needs
Monitoring (app, resp. time, throughput, OS level, errors)
Alerting
Metrics
Distributed tracing + logging
Datastores
Caching
Queues
Service discovery
Configuration
Debugging, profiling
ALM and deployment
Auto scaling
Load balancing, traffic splitting/control
KPIs and actions based on them
Deployment (blue-green), rollback, canary/testbed
Cluster Membership
Determining who are the live peers out of a dynamic group of members.
Communicating over a network, assuming: packet loss, partitions.
Useful for discovery in distributed systems.
Protocols are measured by: completeness, speed of failure detection, accuracy (rate of
false positives), and network message load.
Heartbeating
Every interval T , notify peers of liveness.
If no update received from peer P after T * limit , mark as dead.
Provides membership + failure detection.
Broadcast a liveness message on every interval.
Can be implemented with IP broadcast/multicast (generally disabled) or gossip.
Network load is O(n^2) messages on each interval – becomes a bottleneck for large
clusters.
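A minimal failure-detector sketch along these lines (class name, interval and limit are illustrative; the transport that delivers heartbeats is assumed):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks the last time a liveness message was seen from each peer; a peer is
// considered dead if nothing arrived within T * LIMIT.
class HeartbeatDetector {
    private static final long T_MS = 1_000;
    private static final int LIMIT = 3;

    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    void onHeartbeat(String peer) {
        lastSeen.put(peer, System.currentTimeMillis());
    }

    boolean isAlive(String peer) {
        Long seen = lastSeen.get(peer);
        return seen != null && System.currentTimeMillis() - seen <= T_MS * LIMIT;
    }
}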
SWIM
Scalable Weakly-Consistent Infection-Style Process Group Membership Protocol.
Failure detection done by randomly pinging a single peer every interval and marking
failed nodes.
To improve accuracy, before marking a node as failed try an indirect ping.
An indirect ping asks a set of K random peers to ping the failing candidate. This
improves accuracy by attempting other network routes and reduces the chance of packet
loss.
Network load is O(n) , making it feasible for large clusters.
State updates on membership (joining/leaving the cluster) are achieved by piggybacking
on the ping messages for gossip (infection/epidemic).
Each ping/ack message sent between random nodes adds information about new and
left/failed nodes.
Because pings are sent randomly, nodes will learn about new/failing nodes at different
times => eventually (weakly) consistent.
Ping + piggybacking state updates = membership protocol.
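One protocol period might look roughly like this (the Node type, the ping / pingVia RPCs and the helper methods are assumptions for illustration):

import java.util.List;
import java.util.Random;

// One SWIM protocol period: direct-ping a random member, fall back to indirect
// pings through K random relays, and only then mark the target as failed.
void protocolPeriod(List<Node> members) {
    Random random = new Random();
    Node target = members.get(random.nextInt(members.size()));

    if (ping(target)) {
        return;                                   // direct ping acked
    }
    List<Node> relays = pickRandom(members, K);   // K random peers
    boolean anyAck = relays.stream().anyMatch(relay -> pingVia(relay, target));
    if (!anyAck) {
        markFailed(target);                       // spread to others via piggybacked gossip
    }
}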
System Architecture Examples
Search Engine
Separate the system into different parts:
Web crawler – retrieve new documents, update content of existing ones
Document store – contain document content and all metadata (URL, date, etc.)
Indexer – create the index
Querying – serving search requests
Ranking – sorting the results
Indexing
Define the corpus (document collection). How will we get it? Designing a web crawler is
a whole topic on its own (but might still be in the scope of such a question).
Likely want to ignore casing, grammatical tenses, "stop words" (most common words in
class DocumentOccurrences {
public String documentId;
public int[] positions;
}
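The inverted index itself can then be a map from each word to its occurrences (a sketch; the exact structure isn't spelled out in the text above):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// word -> the documents (and positions within them) where it appears
Map<String, List<DocumentOccurrences>> invertedIndex = new HashMap<>();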
Querying
Query types:
Single word queries
Free text queries – documents containing all keywords, regardless of order
Phrase queries – documents containing all keywords in exact query order
Read the inverted index from disk or datastore into the same data structure.
The same word transformations applied at indexing time (grammar, stop words, stemming)
are also applied to incoming queries (but maybe not for exact phrase queries).
For single word queries: get the list of documentId s that contain that word from the
index and return that list.
For free text queries:
Tokenize the query.
Get the list of documents for each token.
Take the intersection of all results (only documents containing every token), as in the sketch below.
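A sketch of the free-text lookup ( documentIdsFor is an assumed helper that reads the inverted index):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Intersect the posting lists of all query tokens.
Set<String> freeTextQuery(List<String> tokens) {
    Set<String> result = null;
    for (String token : tokens) {
        Set<String> docs = new HashSet<>(documentIdsFor(token));   // lookup in the inverted index
        if (result == null) {
            result = docs;
        } else {
            result.retainAll(docs);                                // keep only common documents
        }
    }
    return result == null ? new HashSet<>() : result;
}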
Ranking
Ranking algorithms options:
Scaling
A single word could appear in too many documents, and so maintaining a single key-value
pair for that word is not feasible.
Most likely need to shard documents based on URL (domain), and then query all shards
(fan-out) for each keyword in the search query.
Networking
OSI Model
Open Systems Interconnection Model
Layers
No | Layer        | Data unit          | Examples
1  | Physical     | bit                | Ethernet, USB, Wi-Fi, Bluetooth, DSL
2  | Data link    | Frame              | L2TP, PPP
3  | Network      | Packet             | IPv4, IPv6
4  | Transport    | Segment (Datagram) | TCP, UDP
5  | Session      | Data               | HTTP, FTP, SMTP, DNS, IMAP, SSH, TLS
6  | Presentation | Data               | ASCII, JPEG
7  | Application  | Data               | Chrome, Mail.app
Data Units
Frame
Structure: Frame header (e.g. Ethernet) + (Network header + Transport header +
data) + Frame footer
Source and destination MAC addresses, length, checksum
Packet
Structure: Network header (e.g. IP) + (Transport header + data)
Version (IPv4/IPv6), source and destination IP addresses, length, flags, TTL,
protocol, checksum
Segment
Structure: Transport header (e.g. TCP) + data
Source and destination ports, sequence number, ack number, data offsets, flags,
checksum
IP
Internet Protocol
No concept of connection.
Packets are passed from one computer to the next, until reaching the destination.
No delivery guarantee, no receiving ack.
Sometimes multiple copies of the same packet are passed, taking different paths, and
thus arriving at different times.
Designed to be able to route around connectivity problems.
TCP
Transmission Control Protocol
Built on-top of IP.
Connection-based.
Once a connection has been made between two parties, sending data between the two
is much like writing to a file on one side and reading from a file on the other.
Reliable and ordered, i.e., arrival and ordering are guaranteed.
Takes care of splitting your data into packets and sending those across the network, so
you can write bytes as a stream of data.
Makes sure it doesn't send data too fast for the Internet connection to handle (flow
control).
Hides all complexities of packets and unreliability.
Sends an ack for every packet received.
Queues up data until there's enough to send as a packet.
TCP tends to induce packet loss for UDP packets whenever they share a bottleneck
node (same LAN/WAN).
UDP
User Datagram Protocol
Built on-top of IP, very thin layer over it.
Unreliable protocol, usually around 1-5% packet loss.
No guarantee of ordering.
Minimizes transmission delay.
Send a packet to destination IP address and port; the packet will get passed from
computer to computer and arrive at destination or get lost.
Receiver simply listens on a specific port and gets notified when a packet arrives, with the
sender's address and the packet's data.
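A minimal UDP receiver sketch in Java (the port number and buffer size are illustrative):

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;

// Blocks until a single datagram arrives on port 9999.
void receiveOnce() throws IOException {
    try (DatagramSocket socket = new DatagramSocket(9999)) {
        byte[] buffer = new byte[1500];
        DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
        socket.receive(packet);   // packet now holds the sender's address and the data
    }
}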
Strings
Char Sets
Character sets translate characters to numbers.
ASCII
American Standard Code for Information Interchange
Encodes 128 characters in 7 bits.
Encoded are numbers 0 to 9, lowercase letters a to z, uppercase letters A to Z, basic
punctuation symbols, control codes and space.
ANSI standard – different "code pages" for characters 128-255 (the 1 extra bit) which
differ between countries and languages.
Unicode
Character set for most of the world's writing systems.
List of characters with unique numbers (code points).
There are more than 120,000 characters covering 129 "scripts" (a collection of letters),
there's no limit on number of letters.
Letters map to code points.
Every letter in every alphabet is assigned a number, for example the letter A = 41
( U+0041 ); the number is hexadecimal.
For example, the list of numbers represent the string "hello": 104 101 108 108
111 .
There are more than 65,526 (2^16) chars, so not every Unicode letter can be
represented by two Bytes.
Unicode character in Java: \u00fc
String s = "\u00fc";
Encoding
An encoding is a way to translate between Strings and Bytes.
Encoding is how these numbers are translated into binary numbers to be stored on disk
or in memory (Encoding translates numbers into binary).
It doesn't make sense to have a string without knowing what encoding it uses!
UTF-8
UTF-8 is a transmission format for Unicode, i.e., encoding.
Capable of encoding all 1,112,064 possible characters (code points) in Unicode.
Variable-length, code points are encoded with 8-bit code units.
Every code point from 0-127 is stored in a single Byte.
Code points 128 and above are stored using 2, 3, or 4 Bytes.
English text looks exactly the same in UTF-8 as it did in ASCII.
ASCII text is valid UTF-8-encoded Unicode.
byte[] however has an encoding.
To convert a string object to UTF-8 bytes, invoke getBytes(Charset charset) on the
string, passing the UTF-8 charset.
84.6% of all Web pages use UTF-8.
Java String uses UTF-16 encoding internally.
For example, UTF-8 encoding will store "hello" like this (binary): 01101000 01100101
01101100 01101100 01101111
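For example, a small sketch using the standard library:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] utf8 = "hello".getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(utf8));      // [104, 101, 108, 108, 111]
String back = new String(utf8, StandardCharsets.UTF_8);   // decode back to a String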
UTF-16
Capable of encoding all 1,112,064 possible characters in Unicode.
Variable-length, code points are encoded with one or two 16-bit code units.
The String class in Java uses UTF-16 encoding internally, and this internal encoding can't be changed.
Punycode
A way to represent Unicode with the limited character subset of ASCII supported by
DNS.
For example: "bücher" => "bcher-kva"
Communicating Encoding
Email:
Content-Type: text/plain; charset="UTF-8" header in the beginning of the
message.
Web page:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> meta tag,
Java
Ideally, you should have one mainstream programming language down very well, to use in
interviews. Java is mine.
Threads
An interrupt is an indication to a thread that it should stop what it is doing and do
something else. It's up to the programmer to decide exactly how a thread responds to
an interrupt, but it is very common for the thread to terminate.
A thread sends an interrupt by invoking interrupt() on the Thread object for the
thread to be interrupted. For the interrupt mechanism to work correctly, the interrupted
thread must support its own interruption.
The join method allows one thread to wait for the completion of another. If thread is a
Thread object whose thread is currently executing, then thread.join(); causes the
current thread to pause execution until thread 's thread terminates.
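A small sketch combining both (the worker's loop body is illustrative):

void interruptAndJoin() throws InterruptedException {
    Thread worker = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            // do a unit of work; return once interrupted
        }
    });
    worker.start();

    worker.interrupt();   // request the worker to stop
    worker.join();        // wait until it actually terminates
}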
When a JVM starts up, there is usually a single non-daemon thread (which typically
calls the method named main of some designated class). The JVM continues to
execute threads until either of the following occurs:
The exit method of class Runtime has been called and the security manager has
permitted the exit operation to take place.
All threads that are not daemon threads ( thread.setDaemon(true); wasn't called)
have died, either by returning from the call to the run method or by throwing an
exception that propagates beyond the run method.
Memory
In JDK 8, PermGen is replaced with Metaspace, which is very similar. The main
difference is that Metaspace can expand at runtime (fewer java.lang.OutOfMemoryError:
PermGen errors).
If you don't specify MaxMetaspaceSize , the Metaspace will dynamically resize at runtime
depending on the application's demand.
Garbage Collection
Objects are created on the heap.
Garbage collection is a mechanism provided by the JVM to reclaim heap space from
objects eligible for GC.
GC is done by the GC thread.
Before removing an object from memory, GC thread invokes the finalize() method of
the object, allowing for cleanup operations (but don't use finalizers – Effective Java 2nd
Ed. Item 7).
As a developer you can't force GC, it will only be triggered if the JVM thinks it needs to
based on the Java heap size.
You can request GC (by Runtime.gc() or System.gc() ) but it's not guaranteed to
happen.
If the JVM heap is full, and a new object can't be created, an OutOfMemoryError
exception is thrown.
An object becomes eligible for GC if it's not reachable from any live thread of any static
reference, meaning if all its references are null.
Cyclic references don't keep objects alive – a group of objects that only reference each
other will be GCed if no outside object holds a reference to any of them.
GC scenarios: all references to an object are set to null, object declared in a block and
now out of scope, object is a member and the parent is eligible for GC, object in a
WeakHashMap .
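For example (the System.gc() call is only a request, as noted above):

Object o = new Object();
o = null;        // the object is now unreachable and eligible for GC
System.gc();     // request (not force) a collection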
young/survivor - 2)
-Xss256k – Thread frame stack size
Bit Arithmetics/Operations
Can use java.util.BitSet to work with single bits.
Positive integers are stored as simple binary numbers:
int a = 0b101;
assertEquals(5, a); // true
Negative integers are stored as the two's complement of their absolute value. The two's
complement of a positive number is, when using this notation, a negative number.
To find the negative of a number ( x ) in two's complement:
Invert all the bits ( ~x )
Add 1 ( ~x + 1 )
int x = 4;
int minusX = ~x + 1;
assertEquals(-4, minusX); // true
>> – arithmetic (signed) right shift, preserves the sign bit.
>>> – logical right shift, fills the high bits with zeros.
<< – left shift.
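For example:

int x = -8;
assertEquals(-2, x >> 2);           // arithmetic shift keeps the sign
assertEquals(1073741822, x >>> 2);  // logical shift fills with zeros
assertEquals(-16, x << 1);          // shifting left by 1 doubles the value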
Primitives
Casting to int drops the decimal part, truncating toward zero (rounds down for positive numbers, up for negative ones).
Strings
Backed by char array, can get it by calling toCharArray() .
Strings are always immutable, and the class is final .
String concatenation with the + operator is translated to StringBuilder operations by
the compiler, but a new StringBuilder instance is created for every statement.
A single statement is fine (e.g., String s = a + b + c doesn't need an explicit
StringBuilder ), but for repeated concatenation across multiple statements or in a loop,
use StringBuilder directly to avoid intermediate strings.
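For example:

// Builds one string without creating an intermediate String per iteration.
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 10; i++) {
    sb.append(i).append(',');
}
String result = sb.toString();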
Interning
String literals, i.e. String s = "hello" , are interned by the compiler.
Interned strings are privately maintained in a pool, which is initially empty, by the class
String .
The public String intern() function on String places that instance in the pool and
returns the canonical representation for that string object.
When the intern method is invoked, if the pool already contains a string equal to this
String object as determined by the equals(Object) method, then the string from the
pool is returned.
Otherwise, this String object is added to the pool and a reference to this String
object is returned.
All literal strings and string-valued constant expressions are interned.
== tests for reference equality (whether they are the same object).
.equals() tests for value equality (whether they are logically equal).
You almost always want to use .equals() . In the rare situation where you know you're
comparing interned strings (or checking for the exact same instance), == can be used.
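For example:

String a = "hello";                 // literal, interned by the compiler
String b = new String("hello");     // a new object on the heap
assertTrue(a.equals(b));            // same value
assertFalse(a == b);                // different objects
assertTrue(a == b.intern());        // intern() returns the pooled instance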
Concurrency
ITC (Inter-Thread Communication)
Java includes an inter-thread communication mechanism via the wait() , notify() ,
and notifyAll() methods.
These methods are implemented as final methods in Object , so all classes have them.
All three methods can be called only from within a synchronized method or block.
wait() tells the calling thread to give up the monitor and go to sleep until some other
thread enters the same monitor and calls notify() or notifyAll() .
notify() wakes up one thread that called wait() on the same object.
notifyAll() wakes up all the threads that called wait() on the same object; they then
compete for the monitor.
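A minimal sketch (class and method names are illustrative):

class Latch {
    private boolean ready = false;

    public synchronized void waitUntilReady() throws InterruptedException {
        while (!ready) {
            wait();            // releases the monitor and sleeps
        }
    }

    public synchronized void markReady() {
        ready = true;
        notifyAll();           // wakes up all threads waiting on this object
    }
}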
Access Levels
Having no modifier is also called package-private.
Whether a class member is accessible to clients, for each modifier:

Modifier                      | Class | Package | Subclass | World
public                        | Y     | Y       | Y        | Y
protected                     | Y     | Y       | Y        | N
no modifier (package-private) | Y     | Y       | N        | N
private                       | Y     | N       | N        | N
Autoboxing-Unboxing
Autoboxing is the automatic conversion that the Java compiler makes between the
primitive types and their corresponding object wrapper classes (e.g., converting an
int to an Integer ).
If the conversion is the other way around (object to primitive) it's unboxing.
The Java compiler applies autoboxing/unboxing when a primitive/object is:
Passed as a parameter to a method that expects the corresponding wrapper/primitive type.
Assigned to a variable of the corresponding wrapper/primitive type.
Java Examples
Arrays
int[] a = new int[10];
int[] b = new int[]{34};
Threads
public static void main(String[] args) {
Thread thread = new Thread(new MyRunnable()); // MyRunnable implements Runnable
thread.start();
}
Threadpool ( ExecutorService )
import java.util.concurrent.*;

ExecutorService pool = Executors.newFixedThreadPool(4); // pool creation (size is illustrative)

pool.submit(new Runnable() {
@Override
public void run() {
// ...
}
});
// can also submit a Callable that returns a value (instead of Runnable's void):
Future<String> result = pool.submit(new Callable<String>() {
@Override
public String call() throws Exception {
return null;
}
});
Number Parsing
// java.lang.NumberFormatException thrown for bad formatting:
int i = Integer.parseInt("42");
float f = Float.parseFloat("1.2");
long l = Long.parseLong("1000");
double d = Double.parseDouble("3.14");
Random
Random rnd = new java.util.Random();
int i = rnd.nextInt(5); // exclusive of upper bound
Inheritance
interface Fruit {
String name();
}
interface Serializable {
String serialize();
}
@Override
public String serialize() {
return null;
}
}
Try/Catch/Finally
try {
// ...
} catch (IllegalArgumentException | NullPointerException e) {
// ...
} finally {
// ...
}
Enums
public enum Color {
Yellow,
Green
}
Regex
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// pattern and input are illustrative
Pattern pattern = Pattern.compile("a(b+)c");
Matcher matcher = pattern.matcher("abbc");

if (matcher.find()) {
    String wholeMatch = matcher.group();     // group(0) – the entire match
    String firstGroup = matcher.group(1);    // first capturing group
    // ...
} else {
    // no match found
}
String s = "abcdefg";
String s = String.valueOf(1);
String s = new String("abc") // or new String(chars) for char[]
char c = s.charAt(index);
int res = s.compareTo("abd"); // == -1 (compare lexicographically; also: compareToIgno
reCase)
boolean b1 = s.contains("abc"); // true
boolean b2 = s.endsWith("efg"); // true
boolean b3 = s.equalsIgnoreCase("AbCDeFg"); // true
byte[] bytes = s.getBytes(); // {97, 98, 99, ...} (array of 8-bit numbers [-128,127])
int i1 = s.indexOf('d'); // or with string "d"; can also use lastIndexOf()
int i2 = s.indexOf('f', startIndex);
boolean empty = s.isEmpty();
int len = s.length();
boolean b4 = s.matches("[a-z]*");
String s2 = s.replace('a', 'A');
String s3 = s.replace("ab", "AB");
String s4 = s.replaceAll("[a-z]", "x");
String[] arr = s.split("[0-9]"); // split around digits and remove them from output
boolean b5 = s.startsWith("abc"); // true
String sub = s.substring(startIncluding, endExcluding);
char[] chars = s.toCharArray();
String lower = s.toLowerCase(); // also: toUpperCase
String trimmed = s.trim(); // remove any leading/trailing whitespace
Generics
// generic classes:
class FruitWriter<S, T extends Fruit & Serializable> {
    public FruitWriter(S s) {
        // ...
    }
}

// generic methods:
public <T, E> T foo(T t, E e) {
    return t;
}

// bounded wildcard:
interface HasWord {}

class Baz<T> {}

public void foo(Baz<? extends HasWord> bazWithWord) {
    // method foo only accepts types that extend HasWord
}
Iterator
interface Iterator<T> {
boolean hasNext();
T next();
void remove();
}
Iterable
class MyCollection<T> implements Iterable<T> {
    public Iterator<T> iterator() {
        Iterator<T> it = new Iterator<T>() {
            public boolean hasNext() {
                return false; // TODO: implement
            }
            public T next() {
                return null; // TODO: implement
            }
        };
        return it;
    }
}
Comparator
import java.util.Comparator;
import java.util.Objects;
// example comparator (the class name is illustrative)
class IntComparator implements Comparator<Integer> {
    @Override
    public int compare(Integer o1, Integer o2) {
        if (o1 < o2) return -1;
        else if (Objects.equals(o1, o2)) return 0;
        else return 1;
    }
}
Comparable
// example class (the int field is illustrative)
class MyClass implements Comparable<MyClass> {
    private int value;

    @Override
    public int compareTo(MyClass o) {
        if (this.value < o.value) return -1;
        else if (this.value == o.value) return 0;
        else return 1;
    }
}
Collections
The following overview is greatly simplified and leaves out many
details/interfaces/methods
// Implementing this interface allows an object to be the target of the "foreach" stat
ement
interface Iterable<E> {
Iterator<E> iterator();
}
Instanceof
A well-designed object-oriented program should almost never use instanceof
Varargs
A shortcut to creating an array manually.
Inside the method, the varargs parameter is seen as an array.
The varargs parameter must be the last one, and there can be only one per method signature.

// declaration:
public void foo(String... args) {
    // args is a String[] inside the method
}

// invocation:
foo();
foo("a");
foo("a", "b");
foo(new String[]{"a", "b"});
Read/Write File
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
Read
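A minimal buffered-read sketch using the imports above (the file name and charset are illustrative):

Path path = Paths.get("file.txt");
Charset charset = Charset.forName("UTF-8");
try (BufferedReader reader = Files.newBufferedReader(path, charset)) {  // throws IOException
    String line;
    while ((line = reader.readLine()) != null) {
        // process line
    }
}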
// alternatively:
List<String> lines = Files.readAllLines(path, charset); // other overloads available
Write
String s = ...;
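A minimal buffered-write sketch (path and charset as above):

try (BufferedWriter writer = Files.newBufferedWriter(path, charset)) {  // throws IOException
    writer.write(s);
}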
// alternatively:
Files.write(path, lines, charset);
Synchronized
class MyClass {
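    // a minimal sketch of the two common forms (method names are illustrative)
    public synchronized void syncMethod() {
        // at most one thread at a time can run this method on a given instance
    }

    public void syncBlock() {
        synchronized (this) {
            // same guarantee, but only for this block
        }
    }
}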
Locks, Semaphores
Lock
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
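A typical tryLock sketch using these imports (the timeout value is illustrative):

Lock lock = new ReentrantLock();

// try to acquire the lock, but give up after 1 second
if (lock.tryLock(1, TimeUnit.SECONDS)) {   // throws InterruptedException
    try {
        // critical section
    } finally {
        lock.unlock();
    }
}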
Semaphore
import java.util.concurrent.Semaphore;
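A minimal Semaphore sketch (the permit count is illustrative):

Semaphore semaphore = new Semaphore(3);   // 3 permits

semaphore.acquire();                      // throws InterruptedException
try {
    // at most 3 threads can be in this section at once
} finally {
    semaphore.release();
}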
OOP
Encapsulation – an object contains (encapsulates) both (1) data and (2) the relevant
processing instructions, as we have seen. Once an object has been created, it can be
reused in other programs.
Inheritance – once you have created an object, you can use it as the foundation for
similar objects that have the same behavior and characteristics.
Polymorphism – generics, the presence of "many shapes." In object-oriented
programming, polymorphism means that a message (generalized request) produces
different results based on the object that it is sent to.
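For example (a small illustrative sketch):

import java.util.List;

interface Shape {
    double area();
}

class Circle implements Shape {
    double radius;
    public double area() { return Math.PI * radius * radius; }
}

class Square implements Shape {
    double side;
    public double area() { return side * side; }
}

// the same "message" (area()) produces different results depending on the object
double totalArea(List<Shape> shapes) {
    return shapes.stream().mapToDouble(Shape::area).sum();
}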
Composition (Structural)
Adapter
Facade
Decorator
Proxy
Behavioral
Chain of responsibility
Command
Iterator
Visitor
P, NP, NP-complete
A problem is in class P if its solution may be found in polynomial time.
A problem is in class NP if its solution may be verified in polynomial time.
A problem in P is in NP by definition, but the converse may not be the case.
NP-complete is a family of NP problems for which you know that if one of them has a
polynomial-time solution, then every problem in NP does (i.e., P = NP).