Algorithms & Data Structures - Introduction
When we devise an algorithm, we should be able to do the following three things:
- Identify what we're trying to achieve - formalize the problem.
- Analyze its correctness.
- Analyze its efficiency (time, memory usage, power consumption).
Here is a very simple piece of code. Can you guess what it's doing?
    #include <iostream>

    int what(int a, int b) {
        int x = a, y = b, z = 0;
        while (x > 0) {
            z = z + y;
            x = x - 1;
        }
        return z;
    }

    int main() {
        std::cout << what(10, 2) << std::endl;
        return 0;
    }
The answer is what(a,b)=a*b.
- Proof of correctness of the algorithm.
Claim: before and after each iteration of the while loop, ab = xy + z.
Base case: at the beginning of the code, x = a, y = b, z = 0, so xy + z = ab + 0 = ab, which is true.
Inductive step: if ab = xy + z, then for the new values the same should be true: ab=x'y'+ z'
Let's check the values within the while loop: x'=x-1, y'=y, z'=z+y
That is, the claim is that x'y' + z' should equal ab.
x'y'+z'=(x-1)(y)+(z+y)=xy-y+z+y=xy+z
Combining this with the hypothesis ab = xy + z gives x'y' + z' = ab. So, if ab = xy + z holds before an iteration, it also holds for the new values, which confirms the inductive step.
In other words, ab = xy + z holds from before the loop starts until after the while loop ends; it is a loop invariant.
We proved its correctness! But how does this show that what(a, b) = a*b?
When the loop terminates:
x = 0, and the invariant xy + z = ab still holds, so xy + z becomes 0*y + z = z.
So z = ab; in other words, what(a, b) returns the product a*b.
- Efficiency:
If we increase the inputs, say what(98765, 4321), how long will it take?
Answer: the running time is linear in a - the loop runs a times, so as the number gets bigger, the time increases proportionally.
"On the basis of the issues discussed here, I propose that members of SIGACT, and editors of computer science and mathematics journals, adopt O, Ω, and θ notations as defined above, unless a better alternative can be found reasonable soon."
-D.E. Knuth, "Big Omicron and Big Omega and Big Theta", SIGACT News, 1976. Reprinted in "Selected Papers on Analysis of Algorithms."
- T(n) = O(f(n)) (big-oh) means that the growth rate of T(n) is asymptotically less than or equal to the growth rate of f(n).
- T(n) = Ω(f(n)) (Omega) means that the growth rate of T(n) is asymptotically greater than or equal to the growth rate of f(n).
- T(n) = Θ(f(n)) (Theta) means that the growth rate of T(n) is asymptotically equal to the growth rate of f(n).
- So, if T(n) = n^2 + 5n, the following are true:
T(n) = Ω(n)
T(n) = Θ(n^2)
T(n) = O(n^3)
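For reference, these statements have standard formal definitions: T(n) = O(f(n)) means there exist constants c > 0 and n0 such that T(n) <= c*f(n) for all n >= n0; T(n) = Ω(f(n)) means T(n) >= c*f(n) for all n >= n0 for some constant c > 0; and T(n) = Θ(f(n)) means both hold. For example, with T(n) = n^2 + 5n we have n^2 <= n^2 + 5n <= 6n^2 for all n >= 5, which is exactly why T(n) = Θ(n^2).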
When we do an operation a million times, we don't really care about the worst-case or the best-case of that operation. What we care about is how much time is taken in total when we repeat the operation a million times.
So, while certain operations may be extremely costly, they cannot occur frequently enough to weigh down the entire program, because the less costly operations far outnumber the costly ones in the long run.
A good example is a dynamic array to which we repeatedly add new items. Normally, adding an item takes constant time (O(1)). But each time the array is full, we allocate twice as much space and copy the existing items over, which takes O(n) time where n is the current size of the array.
Though the array doubles in size whenever it is full, this does not happen on every addition, and the cost of each doubling is spread out among the additions. This means that in the long run, the total time for adding n items to the array is O(n), and so the amortized time (i.e. time per addition) is O(1).
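Here is a minimal sketch of such a doubling array (a toy class for illustration, with made-up names, not a replacement for std::vector):

    #include <cstring>
    #include <iostream>

    // Sketch of a dynamic array: push_back doubles the capacity when full,
    // so n pushes cost O(n) in total and O(1) amortized per push.
    class DynArray {
    public:
        DynArray() : data_(new int[1]), size_(0), capacity_(1) {}
        ~DynArray() { delete[] data_; }

        void push_back(int v) {
            if (size_ == capacity_) {                  // full: double capacity, O(n) copy
                int* bigger = new int[capacity_ * 2];
                std::memcpy(bigger, data_, size_ * sizeof(int));
                delete[] data_;
                data_ = bigger;
                capacity_ *= 2;
            }
            data_[size_++] = v;                        // usual case: O(1)
        }

        int size() const { return size_; }

    private:
        int* data_;
        int size_;
        int capacity_;
    };

    int main() {
        DynArray a;
        for (int i = 0; i < 1000000; ++i) a.push_back(i);  // O(1) amortized per push
        std::cout << a.size() << std::endl;                // prints 1000000
        return 0;
    }

(Copying and assignment are deliberately ignored here to keep the sketch short.)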
Why do data structures matter?
We need to organize data so that it can be accessed quickly and usefully. Examples of data structures are queues, stacks, lists, heaps, search trees, hash tables, bloom filters, union-find, etc.
Why do we have so many data structures?
That's because different data structures support different sets of operations. In other words, each of them is suitable for different types of tasks.
So, a programmer should choose the minimal data structure that supports all the operations that are needed.
- Stacks and Queues
Stacks and queues are dynamic sets in which the element removed by the delete operation is prespecified.
In a stack, the element deleted from the set is the one most recently inserted: the stack implements a last-in, first-out, or LIFO, policy.
Similarly, in a queue, the element deleted is always the one that has been in the set for the longest time:
the queue implements a first-in, first-out, or FIFO, policy.
Here are examples:
Stack with linked list data structure
Class Stack with linked list data structure
Generic Stack with array
Example 8 - Queue Struct : Using Linked List
Example 8B - Queue Class: Using Linked List
Queue: Using Vector
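For illustration, here is a minimal stack over a singly linked list (a toy class separate from the linked examples above), showing the LIFO policy with O(1) push and pop:

    #include <iostream>

    // Minimal LIFO stack backed by a singly linked list:
    // push and pop both work at the head, so each operation is O(1).
    struct Node {
        int data;
        Node* next;
    };

    class Stack {
    public:
        bool empty() const { return head == nullptr; }

        void push(int v) { head = new Node{v, head}; }

        int pop() {                    // precondition: !empty()
            Node* n = head;
            int v = n->data;
            head = n->next;
            delete n;
            return v;
        }

    private:
        Node* head = nullptr;
    };

    int main() {
        Stack s;
        s.push(1); s.push(2); s.push(3);
        while (!s.empty()) std::cout << s.pop() << " ";  // prints 3 2 1 (LIFO)
        std::cout << std::endl;
        return 0;
    }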
- Linked Lists
A linked list is a data structure in which the objects are arranged in a linear order.
Unlike an array, however, in which the linear order is determined by the array's indices, the order in a linked list is determined by a pointer in each object.
Linked lists provide a simple, flexible representation for dynamic sets, supporting several operations: search, insert, delete, minimum, maximum, successor, and predecessor.
A list may have one of several forms.
It may be either singly linked or doubly linked, it may be sorted or not, and it may be circular or not.
If a list is singly linked, we omit the prev pointer in each element.
If a list is sorted, the linear order of the list corresponds to the linear order of keys stored in elements of the list; the minimum element is then the head of the list, and the maximum element is the tail.
If the list is unsorted, the elements can appear in any order.
In a circular list, the prev pointer of the head of the list points to the tail, and the next pointer of the tail of the list points to the head. We can think of a circular list as a ring of elements. Here are some Linked List examples in C++.
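As a minimal sketch (separate from the linked examples), a singly linked list with head insertion and linear search might look like this in C++:

    #include <iostream>

    // Singly linked list sketch: the order is determined by the next pointers,
    // not by array indices; insertion at the head is O(1), search is O(n).
    struct Node {
        int key;
        Node* next;
    };

    Node* list_insert(Node* head, int key) {   // new element becomes the head
        return new Node{key, head};
    }

    Node* list_search(Node* head, int key) {   // follow next pointers until found
        for (Node* p = head; p != nullptr; p = p->next)
            if (p->key == key) return p;
        return nullptr;
    }

    int main() {
        Node* head = nullptr;
        int keys[] = {5, 3, 8};
        for (int k : keys) head = list_insert(head, k);
        std::cout << (list_search(head, 3) ? "found" : "not found") << std::endl;
        return 0;
    }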
- Binary Search Trees
The search tree data structure supports many dynamic-set operations, including search, minimum, maximum, predecessor, successor, insert, and delete.
Thus, we can use a search tree both as a dictionary and as a priority queue.
Basic operations on a binary search tree take time proportional to the height of the tree.
For a complete binary tree with n nodes, such operations run in O(log n) worst-case time.
If the tree is a linear chain of n nodes, however, the same operations take O(n) worst-case time.
In general, the expected height of a randomly built binary search tree is O(log n), so that basic dynamic-set operations on such a tree take O(log n) time on average.
In practice, we can't guarantee that binary search trees are built randomly, but we can design variations of binary search trees with good guaranteed worst-case performance. One such variation is the red-black tree, which guarantees height O(log n).
Here are some Binary Tree examples in C++.
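As a minimal sketch, an unbalanced binary search tree with insert and search looks like this; both operations follow a single root-to-leaf path, so their cost is proportional to the height of the tree:

    #include <iostream>

    // Unbalanced BST sketch: smaller keys go left, larger (or equal) keys go right;
    // insert and search each cost O(height).
    struct Node {
        int key;
        Node* left = nullptr;
        Node* right = nullptr;
        explicit Node(int k) : key(k) {}
    };

    Node* insert(Node* root, int key) {
        if (root == nullptr) return new Node(key);
        if (key < root->key) root->left = insert(root->left, key);
        else                 root->right = insert(root->right, key);
        return root;
    }

    bool search(const Node* root, int key) {
        while (root != nullptr) {
            if (key == root->key) return true;
            root = (key < root->key) ? root->left : root->right;
        }
        return false;
    }

    int main() {
        Node* root = nullptr;
        int keys[] = {8, 3, 10, 1, 6};
        for (int k : keys) root = insert(root, k);
        std::cout << std::boolalpha << search(root, 6) << " " << search(root, 7) << std::endl;  // true false
        return 0;
    }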
- B-Trees
From "Introduction to Algorithms" 3rd Ed., 2009.
B-trees are balanced search trees designed to work well on disks or other direct-access secondary storage devices. B-trees are similar to red-black trees, but they are better at minimizing disk I/O operations. Many database systems use B-trees to store information.
- B-trees differ from red-black trees in that B-tree nodes may have many children, from a few to thousands. The branching factor of a B-tree can be quite large, although it usually depends on characteristics of the disk unit used.
- B-trees are similar to red-black trees in that every n-node B-tree has height O(log n). The exact height of a B-tree can be considerably less than that of a red-black tree, however, because its branching factor, and hence the base of the logarithm that expresses its height, can be much larger. Therefore, we can also use B-trees to implement many dynamic-set operations in time O(log n).
- Red-Black Trees
A Red-Black tree is a binary search tree with one extra bit of storage per node: its color, which can be either RED or BLACK.
By constraining the node colors on any simple path from the root to a leaf, red-black trees ensure that no such path is more than twice as long as any other, so that the tree is approximately balanced.
Each node of the tree now contains the attributes color, key, left, right, and p.
If a child or the parent of a node does not exist, the corresponding pointer attribute of the node contains the value NIL.
We consider these NILs as being pointers to leaves (external nodes) of the binary search tree and the normal, key-bearing nodes as being internal nodes of the tree.
A red-black tree is a binary search tree that satisfies the following red-black properties:
- Every node is either red or black.
- The root is black.
- Every leaf (NIL) is black.
- If a node is red, then both its children are black.
- For each node, all simple paths from the node to descendant leaves contain the same number of black nodes.
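To make the "one extra bit of storage" concrete, a red-black node can be declared as follows (a sketch only; the rotations and recoloring performed on insert and delete are omitted):

    // A red-black tree node is a BST node plus one color bit.
    enum class Color { RED, BLACK };

    struct RBNode {
        int key;
        Color color;      // the extra bit of storage per node
        RBNode* left;
        RBNode* right;
        RBNode* p;        // parent; missing children/parent are the NIL sentinel
    };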
- AVL Trees
An AVL tree is a BST that is height balanced: for each node x, the heights of the left and right subtrees of x differ by at most 1.
- RB-tree vs AVL-tree
AVL trees are always at the optimal balance point, but inserts and deletes can be slower because they are more rigidly balanced than red-black trees. In exchange, they give the fastest lookup times.
Red-black trees are also self-balancing, but they can become slightly imbalanced to improve insert and delete times, with a small potential hit to search times.
- Hash Tables
Many applications require a dynamic set that supports only the dictionary operations insert, search, and delete.
A hash table is an effective data structure for implementing dictionaries. Although searching for an element in a hash table can take as long as searching for an element in a linked list - O(n) time in the worst case - in practice, hashing performs extremely well.
Under reasonable assumptions, the average time to search for an element in a hash table is O(1).
A hash table generalizes the simpler notion of an ordinary array.
Directly addressing into an ordinary array makes effective use of our ability to examine an arbitrary position in an array in O(1) time.
When the number of keys actually stored is small relative to the total number of possible keys, a hash table becomes an effective alternative to direct addressing into an array, since a hash table typically uses an array of size proportional to the number of keys actually stored. Instead of using the key as an array index directly, the array index is computed from the key.
Implementing a hash table involves a couple of ideas: chaining, a way to handle collisions in which more than one key maps to the same array index; open addressing, another way to deal with collisions; and perfect hashing, which gives us O(1) worst-case search time.
Check an example of a Hash Table from K&R.
More on hashing / Hash Map sample codes
Comparison: Hash Table vs Tree
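As a rough sketch of chaining (a toy string set, not production code), each bucket holds a list of the keys that hash to it; with a decent hash function and enough buckets, insert and search are O(1) on average:

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <list>
    #include <string>
    #include <vector>

    // Hash table with chaining: colliding keys share a bucket (a linked list).
    class HashSet {
    public:
        explicit HashSet(std::size_t buckets = 1024) : table(buckets) {}

        void insert(const std::string& key) {
            auto& bucket = table[index(key)];
            for (const auto& k : bucket) if (k == key) return;  // already present
            bucket.push_back(key);
        }

        bool contains(const std::string& key) const {
            const auto& bucket = table[index(key)];
            for (const auto& k : bucket) if (k == key) return true;
            return false;
        }

    private:
        std::size_t index(const std::string& key) const {
            return std::hash<std::string>{}(key) % table.size();
        }
        std::vector<std::list<std::string>> table;
    };

    int main() {
        HashSet s;
        s.insert("alice"); s.insert("bob");
        std::cout << s.contains("alice") << " " << s.contains("carol") << std::endl;  // 1 0
        return 0;
    }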
Examples of applications:
1. De-duplication
- Removing duplicates from a stream of objects, such as a linear scan through a huge file or objects arriving in real time (packets).
- Keep track of unique objects, for example to:
- report unique visitors to a web site
- avoid duplicates in search results
- How?
When a new object x arrives, look up x in the hash table. If it is not found, insert x into the hash table.
2. The 2-SUM problem
- Input: an unsorted array of n integers and a target sum t.
- Goal: determine whether or not there are two numbers x, y in the array with x + y = t (a sketch of applications 1 and 2 follows this list).
3. Symbol tables in compilers
4. Blocking network traffic
5. Search algorithms (e.g., game tree exploration)
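A sketch of the first two applications using the standard library's hash-based container std::unordered_set:

    #include <iostream>
    #include <unordered_set>
    #include <vector>

    // De-duplication: keep only the first occurrence of each object.
    std::vector<int> dedup(const std::vector<int>& stream) {
        std::unordered_set<int> seen;
        std::vector<int> unique;
        for (int x : stream)
            if (seen.insert(x).second)   // insert succeeds only for new objects
                unique.push_back(x);
        return unique;
    }

    // 2-SUM: is there a pair x, y in the array with x + y == t?
    // One linear scan with O(1) average-time hash lookups.
    bool two_sum(const std::vector<int>& a, int t) {
        std::unordered_set<int> seen;
        for (int x : a) {
            if (seen.count(t - x)) return true;
            seen.insert(x);
        }
        return false;
    }

    int main() {
        std::vector<int> a{3, 1, 3, 7, 1};
        std::cout << dedup(a).size() << std::endl;                   // 3 unique values
        std::cout << std::boolalpha << two_sum(a, 10) << std::endl;  // true (3 + 7)
        return 0;
    }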
- Heap
In a heap, the records are stored in an array so that each key is guaranteed to be larger than (or equal to) the keys of its two children. So, no node in a heap-ordered tree has a key larger than the key at the root.
The basic operations are:
- insert - adding a new object to the heap, O(log(n))
- extract - extracting max (or min) value, O(log(n))
- heapify - n batched inserts, O(n)
- delete - O(log(n))
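For example, the standard library's std::priority_queue is a binary max-heap, and its operations map directly onto the list above:

    #include <iostream>
    #include <queue>
    #include <vector>

    int main() {
        // Building the heap from n existing items is the heapify step, O(n).
        std::vector<int> data{5, 1, 9, 3, 7};
        std::priority_queue<int> heap(data.begin(), data.end());

        heap.push(4);                          // insert, O(log n)
        while (!heap.empty()) {
            std::cout << heap.top() << " ";    // current maximum, O(1)
            heap.pop();                        // extract-max, O(log n)
        }
        std::cout << std::endl;                // prints 9 7 5 4 3 1
        return 0;
    }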
- Spatial Data Structures
A spatial data structure is one that organizes geometry in two- or three-dimensional space into a hierarchical structure. So, the structure is nested and recursive in nature.
Sorting Algorithms
n : the number of items to be sorted
k : the size of each key
d : the digit size used by the implementation.
Name | Average | Worst | Memory | Stable |
---|---|---|---|---|
Bubble sort | n^2 | n^2 | 1 | Yes |
Selection sort | n^2 | n^2 | 1 | No |
Insertion sort | n^2 | n^2 | 1 | Yes |
Shell sort | - | n log^2 n | 1 | No |
Binary tree sort | n log n | n log n | n | Yes |
Merge sort | n log n | n log n | Depends | Yes |
Heap sort | n log n | n log n | 1 | No |
Quick sort | n log n | n^2 | log n | Depends |
Bucket sort | n*k | n^2*k | n*k | Yes |
Counting sort | n + k | n + k | n + 2^k | Yes |
LSD Radix sort | n*k/d | n*k/d | n | Yes |
MSD Radix sort | n*k/d | n*k/d | n + k/d*2^d | Yes |
- Bubble Sort
- Insertion Sort
- Quick Sort
- Selection Sort
- Heap Sort
- Bucket Sort
- Merge Sort
- Radix Sort
- Counting Sort
- Divide and Conquer
- Randomized Algorithms
- Greedy Algorithms
Greedy algorithms build up a solution piece by piece, always choosing the next piece that offers the most obvious and immediate benefit (local optimization). Although such an approach can be disastrous for some computational tasks, there are many for which it is optimal.
Greedy algorithms iteratively make myopic decisions, hoping everything works out in the end.
- A greedy algorithm always makes the choice that currently seems to give the highest gain, that is, it is as greedy as possible.
- It makes a locally optimal choice in the hope that the remaining unique subproblem leads to a globally optimal solution.
- For many problems, a greedy algorithm gives an optimal solution, but not for all problems.
- Even when a greedy algorithm does not construct an optimal solution, its solution can be used as a starting point from which to construct an optimal one.
Example: Dijkstra's shortest-path algorithm, which processes each destination once, irrevocably.
Greedy vs. Divide & Conquer
- Easy to propose multiple greedy algorithms for many problems
- Easy to analyze the running time
- Hard to establish correctness, in contrast with the straightforward inductive correctness proofs of divide-and-conquer algorithms
- Most greedy algorithms are not correct, even if our intuition says otherwise!
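One classic case where the greedy choice is provably optimal is interval scheduling (activity selection): repeatedly pick the compatible activity that finishes earliest. A sketch:

    #include <algorithm>
    #include <iostream>
    #include <limits>
    #include <vector>

    struct Interval { int start, finish; };

    // Greedy interval scheduling: sort by finish time and take every interval
    // that does not overlap the last one taken. This myopic choice is optimal.
    int max_non_overlapping(std::vector<Interval> xs) {
        std::sort(xs.begin(), xs.end(),
                  [](const Interval& a, const Interval& b) { return a.finish < b.finish; });
        int count = 0;
        int last_finish = std::numeric_limits<int>::min();
        for (const auto& it : xs) {
            if (it.start >= last_finish) {   // compatible with the schedule so far
                ++count;
                last_finish = it.finish;
            }
        }
        return count;
    }

    int main() {
        std::vector<Interval> xs{{1, 4}, {3, 5}, {0, 6}, {5, 7}, {3, 9}, {8, 9}};
        std::cout << max_non_overlapping(xs) << std::endl;  // 3: [1,4], [5,7], [8,9]
        return 0;
    }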
- Dynamic Programming (DP)
NP-complete (nondeterministic polynomial time complete) problems are problems for which no polynomial-time algorithm is known; they are incredibly hard to solve quickly, even with the fastest machine.
Here is a good one on the issue: Your Favorite NP-Complete Cheat.
- Depth-First Search (DFS)
- explore aggressively, only retreat when necessary
- computes a topological ordering of a directed acyclic graph and strongly connected components of directed graphs
- linear runtime (figure: Strongly Connected Components)
- Breadth-First Search (BFS)
- explore nodes in layers
- runtime - linear time
- application - shortest paths
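Here is a sketch of BFS over an adjacency-list graph, computing shortest-path distances (measured in edges) from a source vertex in linear O(V + E) time:

    #include <iostream>
    #include <queue>
    #include <vector>

    // BFS explores the graph layer by layer; dist[v] is the number of edges
    // on a shortest path from the source (-1 means unreachable).
    std::vector<int> bfs(const std::vector<std::vector<int>>& adj, int source) {
        std::vector<int> dist(adj.size(), -1);
        std::queue<int> q;
        dist[source] = 0;
        q.push(source);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int v : adj[u]) {
                if (dist[v] == -1) {        // first time v is reached
                    dist[v] = dist[u] + 1;
                    q.push(v);
                }
            }
        }
        return dist;
    }

    int main() {
        // Undirected graph with edges 0-1, 0-2, 1-3, 2-3.
        std::vector<std::vector<int>> adj{{1, 2}, {0, 3}, {0, 3}, {1, 2}};
        for (int d : bfs(adj, 0)) std::cout << d << " ";  // prints 0 1 1 2
        std::cout << std::endl;
        return 0;
    }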
Compression Algorithm - Huffman