# Data Structures and Algorithms
Learning resources and reference notes. Updated Jan 2025

# Learning DS&A

Resources for learning Data Structures and Algorithms (DS&A). If you are a beginner programmer, read the intro books. If you are job searching, read The Algorithm Design Manual. And if you already have some knowledge and want to take the next step, look at Algorithms (4th ed) and its [webpage at Princeton](https://algs4.cs.princeton.edu/home/) (MOOC available).

!!! WARNING: Prerequisites
    The intro-level resources need almost no math, but as you go deeper into data structures and algorithms it helps to have college-level math knowledge. Chapter one of TAOCP has a very rigorous mathematical introduction you might want to look at if you intend to get serious about DS&A.

- Intro
  - Grokking Algorithms 2nd ed (2024) - light, illustrated introduction
  - Grokking Data Structures (2024) - light, illustrated introduction
  - Problem Solving with Algorithms and Data Structures using Python - online and free
- Medium
  - The Algorithm Design Manual - good for job interview prep
  - Open Data Structures (Pat Morin) - online and free
- Advanced (more rigorous books)
  - Sedgewick: Algorithms (4th Edition) - good for long-term growth as an engineer
  - Introduction to Algorithms (CLRS) - reference
  - The Art of Computer Programming (TAOCP) - reference
- Other
  - [Big-O cheatsheet](https://www.bigocheatsheet.com/)

# Binary search

- Its input is a sorted list of elements.
- If the element you're looking for is in that list, binary search returns the position where it's located. Otherwise, binary search returns `null` (`None` in the Python version below).
- In general, for any list of n elements, binary search takes log2(n) steps (log time) in the worst case, whereas simple search takes n steps (linear time).
- For a list of 1,024 elements, log2(1,024) = 10, because 2^10 == 1,024. So for a list of 1,024 numbers, you'd have to check at most 10 numbers.

```python
# The binary_search function takes a sorted array and an item. If the
# item is in the array, the function returns its position.
def binary_search(lst, item):
    # keep track of the part of the list we're searching
    low = 0
    high = len(lst) - 1

    while low <= high:          # while the search range isn't empty...
        mid = (low + high) // 2 # ...check the middle element
        guess = lst[mid]
        if guess == item:       # found the item
            return mid
        elif guess > item:      # the guess was too high
            high = mid - 1
        else:                   # the guess was too low
            low = mid + 1
    return None                 # we didn't find the item

# let's test it
test = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(binary_search(test, 3))
print(binary_search(test, 8))
```

## Big O notation

Run time of algorithms is expressed in Big O notation.

```
| num elements  | linear search | logarithmic search |
| ------------- | ------------- | ------------------ |
| 100           | 100 ms        | 7 ms               |
| 10,000        | 10 seconds    | 14 ms              |
| 1,000,000,000 | 11 days       | 32 ms              |
```

Big O establishes a worst-case run time. Along with the worst-case run time, it's also important to look at the average-case run time.

**Common Big O run times sorted from fastest to slowest**

- O(log n), also known as log time. Example: binary search.
- O(n), also known as linear time. Example: simple search.
- O(n log n). A fast sorting algorithm, like quicksort.
- O(n^2). A slow sorting algorithm, like selection sort.
- O(n!). A really slow algorithm, like the traveling salesperson.

Algorithm speed isn't measured in seconds but in the growth of the number of operations: we talk about how quickly the run time of an algorithm increases as the size of the input increases.
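To make the table above concrete, here is a small sketch (my addition, not from any of the books) that prints the worst-case step counts for simple search versus binary search at each input size:

```python
# Worst-case comparison counts: simple search does n comparisons,
# binary search does about log2(n).
import math

for n in [100, 10_000, 1_000_000_000]:
    steps = math.ceil(math.log2(n))
    print(f"n = {n:>13,}: linear = {n:>13,} steps, binary = {steps:>2} steps")
```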
# Selection Sort

## Arrays and linked lists

Using an array means all your items are stored contiguously (right next to each other) in memory. With linked lists, your items can be anywhere in memory, since each item stores the address of the next item in the list.

Linked lists are great if you're going to read all the items one at a time: you can read one item, follow the address to the next item, and so on. But if you're going to keep jumping around, linked lists are terrible. Arrays are different: you know the address of every item in your array. The position of an element is called its index.

Run times for common operations on arrays and lists:

```
| operation | arrays | lists |
| --------- | ------ | ----- |
| reading   | O(1)   | O(n)  |
| insertion | O(n)   | O(1)  |
| deletion  | O(n)   | O(1)  |
```

O(n) = linear time, O(1) = constant time

Arrays see a lot of use because they allow random access. There are two different types of access: random access and sequential access. Sequential access means reading the elements one by one, starting at the first element. Linked lists can only do sequential access: if you want to read the 10th element of a linked list, you have to read the first 9 elements and follow the links to the 10th. Random access means you can jump directly to the 10th element.

## Selection sort

In computer science, selection sort is a sorting algorithm, specifically an in-place comparison sort. It has O(n^2) time complexity, making it inefficient on large lists, and it generally performs worse than the similar insertion sort. Selection sort is noted for its simplicity, and it has performance advantages over more complicated algorithms in certain situations, particularly where auxiliary memory is limited.

```
| Class                       | Sorting algorithm              |
| --------------------------- | ------------------------------ |
| Data structure              | Array                          |
| Worst-case performance      | O(n^2) comparisons, O(n) swaps |
| Best-case performance       | O(n^2) comparisons, O(n) swaps |
| Average performance         | O(n^2) comparisons, O(n) swaps |
| Worst-case space complexity | O(1) auxiliary                 |
```

### Algorithm

The algorithm divides the input list into two parts: the sublist of items already sorted, which is built up from left to right at the front (left) of the list, and the sublist of items remaining to be sorted that occupy the rest of the list. Initially, the sorted sublist is empty and the unsorted sublist is the entire input list. The algorithm proceeds by finding the smallest (or largest, depending on sorting order) element in the unsorted sublist, exchanging (swapping) it with the leftmost unsorted element (putting it in sorted order), and moving the sublist boundary one element to the right.

### Selection sort Implementation

```python
def findSmallest(arr):
    """Return the index of the smallest element."""
    smallest = arr[0]
    smallest_index = 0
    for i in range(1, len(arr)):
        if arr[i] < smallest:
            smallest = arr[i]   # remember the new smallest value...
            smallest_index = i  # ...and where it lives
    return smallest_index

def selectionSort(arr):
    """Build a new sorted array by repeatedly popping the smallest element."""
    newArr = []
    for i in range(len(arr)):
        smallest = findSmallest(arr)
        newArr.append(arr.pop(smallest))
    return newArr

print(selectionSort([5, 4, 6, 2, 10, 13, 9]))
```

**Recap:** Arrays allow fast reads. Linked lists allow fast inserts and deletes. All elements in an array should be the same type (all ints, all doubles, and so on).

# Recursion

Every recursive function has two parts: the base case and the recursive case. The recursive case is when the function calls itself. The base case is when the function doesn't call itself again ... so it doesn't go into an infinite loop.
```python
# countdown using recursion
def countdown(i):
    print(i)
    if i <= 0:   # Base case
        return
    else:        # Recursive case
        countdown(i - 1)
```

## The stack

In computer science, a stack is an abstract data type that serves as a collection of elements, with two principal operations:

- push, which adds an element to the collection, and
- pop, which removes the most recently added element that was not yet removed.

### The call stack

In computer science, a call stack is a stack data structure that stores information about the active subroutines of a computer program. This kind of stack is also known as an execution stack, program stack, control stack, run-time stack, or machine stack, and is often shortened to just "the stack".

All function calls go onto the call stack. The call stack can get very large, which takes up a lot of memory.

# Quicksort

## Divide & conquer

Divide and conquer is an algorithm design paradigm based on multi-branched recursion. A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem.

This divide and conquer technique is the basis of efficient algorithms for all kinds of problems, such as sorting (e.g., quicksort, merge sort), multiplying large numbers (e.g., the Karatsuba algorithm), finding the closest pair of points, syntactic analysis (e.g., top-down parsers), and computing the discrete Fourier transform (FFTs).

```python
# sum of numbers in an array using a loop
# (note: this shadows Python's built-in sum)
def sum(arr):
    total = 0
    for i in arr:
        total += i
    return total

# sum of numbers in an array using recursion
def sum2(arr):
    if arr == []:                  # base case: an empty list sums to 0
        return 0
    return arr[0] + sum2(arr[1:])  # recursive case

# recursive function to count the number of items in a list
def count(lst):
    if lst == []:
        return 0
    return 1 + count(lst[1:])

# find the maximum number in a list using recursion
def find_max(lst):
    if len(lst) == 0:
        return None
    if len(lst) == 1:
        return lst[0]
    sub_max = find_max(lst[1:])
    return lst[0] if lst[0] > sub_max else sub_max

print(sum([1, 2, 3, 4, 5]))
print(sum2([1, 2, 3, 4, 5]))
print(count([1, 2, 3, 'a', 4, 'b', 5]))
print(find_max([1, 2, 3, 4, 5]))
```

## Quicksort

Quicksort (sometimes called partition-exchange sort) is an efficient sorting algorithm, serving as a systematic method for placing the elements of an array in order. Quicksort is a comparison sort, meaning that it can sort items of any type for which a "less-than" relation (formally, a total order) is defined. In efficient implementations it is not a stable sort, meaning that the relative order of equal sort items is not preserved. Quicksort can operate in place on an array, requiring small additional amounts of memory to perform the sorting. It is very similar to selection sort, except that it does not always choose a worst-case partition. The C standard library has a function called `qsort`, which is its implementation of quicksort.

```
| Class                       | Sorting algorithm               |
| --------------------------- | ------------------------------- |
| Worst-case performance      | O(n^2)                          |
| Best-case performance       | O(n log n)                      |
| Average performance         | O(n log n)                      |
| Worst-case space complexity | O(log n) auxiliary (O(n) naive) |
```

### Algorithm

Quicksort is a divide and conquer algorithm. Quicksort first divides a large array into two smaller sub-arrays: the low elements and the high elements. Quicksort can then recursively sort the sub-arrays. The steps are:

- Pick an element, called a pivot, from the array.
- Partitioning: reorder the array so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
- Recursively apply the above steps to the sub-array of elements with smaller values and separately to the sub-array of elements with greater values.

The base case of the recursion is arrays of size zero or one, which are in order by definition, so they never need to be sorted.

### Quicksort Implementation

```python
def quicksort(array):
    if len(array) < 2:
        return array      # base case: arrays with 0 or 1 element are already "sorted"
    else:
        pivot = array[0]  # recursive case
        less = [i for i in array[1:] if i <= pivot]    # elements less than (or equal to) the pivot
        greater = [i for i in array[1:] if i > pivot]  # elements greater than the pivot
        return quicksort(less) + [pivot] + quicksort(greater)

print(quicksort([10, 5, 2, 3]))
```

## Merge sort

Merge sort (also commonly spelled mergesort) is an efficient, general-purpose, comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the implementation preserves the input order of equal elements in the sorted output. Merge sort is a divide and conquer algorithm.

```
| Class                       | Sorting algorithm                                                |
| --------------------------- | ---------------------------------------------------------------- |
| Data structure              | Array                                                            |
| Worst-case performance      | O(n log n)                                                       |
| Best-case performance       | O(n log n) typical, O(n) natural variant                         |
| Average performance         | O(n log n)                                                       |
| Worst-case space complexity | O(n) total with O(n) auxiliary, O(1) auxiliary with linked lists |
```

### Algorithm

Conceptually, a merge sort works as follows:

- Divide the unsorted list into n sublists, each containing 1 element (a list of 1 element is considered sorted).
- Repeatedly merge sublists to produce new sorted sublists until there is only 1 sublist remaining. This will be the sorted list.

### Merge sort Implementation

```python
def mergeSort(lst):
    """Sort lst in place by recursively splitting and merging."""
    if len(lst) > 1:
        mid = len(lst) // 2
        lefthalf = lst[:mid]
        righthalf = lst[mid:]

        mergeSort(lefthalf)
        mergeSort(righthalf)

        # merge the two sorted halves back into lst
        i = j = k = 0
        while i < len(lefthalf) and j < len(righthalf):
            if lefthalf[i] < righthalf[j]:
                lst[k] = lefthalf[i]
                i = i + 1
            else:
                lst[k] = righthalf[j]
                j = j + 1
            k = k + 1

        # copy any leftovers (only one of these loops ever runs)
        while i < len(lefthalf):
            lst[k] = lefthalf[i]
            i = i + 1
            k = k + 1

        while j < len(righthalf):
            lst[k] = righthalf[j]
            j = j + 1
            k = k + 1

lst = [54, 26, 93, 17, 77, 31, 44, 55, 20]
mergeSort(lst)  # sorts in place and returns None
print(lst)
```

# Hash Tables

Also known as hash maps, maps, dictionaries, and associative arrays.

In computing, a hash table (hash map) is a data structure that implements an associative array abstract data type, a structure that can map keys to values. A hash table uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.

Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash collisions where the hash function generates the same index for more than one key. Such collisions must be accommodated in some way.

```
| Algorithm | Average | Worst case |
| --------- | ------- | ---------- |
| Space     | O(n)    | O(n)       |
| Search    | O(1)    | O(n)       |
| Insert    | O(1)    | O(n)       |
| Delete    | O(1)    | O(n)       |
```

Python has hash tables; they're called dictionaries.
You can make a new hash table using the `dict` function: `book = dict()`

```python
>>> book["apple"] = 0.67   # an apple costs 67 cents
>>> book["milk"] = 1.49    # milk costs $1.49
>>> book["avocado"] = 1.49
>>> print(book)
{'avocado': 1.49, 'apple': 0.67, 'milk': 1.49}
>>> print(book["avocado"])
1.49
```

In C++, unordered associative containers are a group of class templates in the C++ Standard Library that implement hash table variants. Being templates, they can be used to store arbitrary elements, such as integers or custom classes. The following containers are defined in the current revision of the C++ standard: `unordered_set`, `unordered_map`, `unordered_multiset`, `unordered_multimap`. These containers differ only in the constraints placed on their elements.

The unordered associative containers are similar to the associative containers in the C++ Standard Library but have different constraints. As their name implies, the elements in the unordered associative containers are not ordered. This is due to the use of hashing to store objects. The containers can still be iterated through like a regular associative container.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> months;
    months["april"] = 30;
    months["september"] = 30;
    std::cout << "september -> " << months["september"] << std::endl;
    std::cout << "april -> " << months["april"] << std::endl;
    return 0;
}
```

**Recap**

To recap, hashes are good for:

- Modeling relationships from one thing to another thing
- Filtering out duplicates
- Caching/memoizing data instead of making your server do work

# Breadth-first Search

## Graph

In computer science, a graph is an abstract data type that is meant to implement the undirected graph and directed graph concepts from mathematics, specifically the field of graph theory. A graph data structure consists of a finite (and possibly mutable) set of vertices or nodes or points, together with a set of unordered pairs of these vertices for an undirected graph or a set of ordered pairs for a directed graph. These pairs are known as edges, arcs, or lines for an undirected graph and as arrows, directed edges, directed arcs, or directed lines for a directed graph. The vertices may be part of the graph structure, or may be external entities represented by integer indices or references. A graph data structure may also associate to each edge some edge value, such as a symbolic label or a numeric attribute (cost, capacity, length, etc.).

## Breadth-first search

Breadth-first search (BFS) is an algorithm for traversing or searching tree or graph data structures. It starts at the tree root (or some arbitrary node of a graph, sometimes referred to as a 'search key'), and explores all of the neighbor nodes at the present depth prior to moving on to the nodes at the next depth level. It uses the opposite strategy of depth-first search, which instead explores the deepest nodes first before being forced to backtrack and expand shallower nodes.

## Queue

In computer science, a queue is a particular kind of abstract data type or collection in which the entities in the collection are kept in order; the principal (or only) operations on the collection are the addition of entities to the rear terminal position, known as enqueue, and the removal of entities from the front terminal position, known as dequeue. This makes the queue a First-In-First-Out (FIFO) data structure: the first element added to the queue will be the first one to be removed. Equivalently, once a new element is added, all elements that were added before it have to be removed before the new element can be. Often a peek or front operation is also included, returning the value of the front element without dequeuing it.

A queue is an example of a linear data structure, or more abstractly a sequential collection. Queues provide services in computer science, transport, and operations research where various entities such as data, objects, persons, or events are stored and held to be processed later; in these contexts, the queue performs the function of a buffer. Queues are common in computer programs, where they are implemented as data structures coupled with access routines, as an abstract data structure or, in object-oriented languages, as classes. Common implementations are circular buffers and linked lists.
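As a minimal sketch, Python's `collections.deque` (used again in the BFS example below) provides exactly these operations:

```python
# FIFO queue sketch using the standard library's double-ended queue.
from collections import deque

queue = deque()
queue.append("first")   # enqueue at the rear
queue.append("second")
print(queue[0])         # peek at the front without removing -> "first"
print(queue.popleft())  # dequeue from the front -> "first"
print(queue.popleft())  # -> "second"
```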
```
| Algorithm | Average | Worst case |
| --------- | ------- | ---------- |
| Space     | O(n)    | O(n)       |
| Search    | O(n)    | O(n)       |
| Insert    | O(1)    | O(1)       |
| Delete    | O(1)    | O(1)       |
```

## Implementing the graph

```python
# find a mango seller among friends / mutual friends
from collections import deque

graph = {}
graph["you"] = ["alice", "bob", "claire"]
graph["bob"] = ["anuj", "peggy"]
graph["alice"] = ["peggy"]
graph["claire"] = ["thom", "jonny"]
graph["anuj"] = []
graph["peggy"] = []
graph["thom"] = []
graph["jonny"] = []

def person_is_seller(name):
    return name[-1] == 'm'  # if the name ends with 'm', they're a mango seller

def find_mango_seller(name):
    search_queue = deque()               # create a new queue
    search_queue += graph[name]          # add all of your neighbors to the search queue
    searched = []                        # keep track of which people you've searched before
    while search_queue:                  # while the queue isn't empty ...
        person = search_queue.popleft()  # ... grab the first person off the queue
        if person not in searched:       # only check people you haven't searched yet
            if person_is_seller(person): # check whether the person is a mango seller
                print(person + " is a mango seller!")
                return True
            else:
                # not a seller: add all of this person's friends to the search queue
                search_queue += graph[person]
                searched.append(person)  # mark this person as searched
    return False                         # if you reach here, no one in the queue was a mango seller

find_mango_seller("you")
```

If you search your entire network for a mango seller, you'll follow each edge (remember, an edge is the arrow or connection from one person to another), so the running time is at least O(number of edges). You also keep a queue of every person to search. Adding one person to the queue takes constant time, O(1); doing this for every person takes O(number of people) total. Breadth-first search therefore takes O(number of people + number of edges), which is more commonly written as O(V+E) (V for number of vertices, E for number of edges).

**Recap**

- Breadth-first search tells you if there's a path from A to B.
- If there's a path, breadth-first search will find the shortest path.
- If you have a problem like "find the shortest X," try modeling your problem as a graph, and use breadth-first search to solve it.
- A directed graph has arrows, and the relationship follows the direction of the arrow (rama -> adit means "rama owes adit money").
- Undirected graphs don't have arrows, and the relationship goes both ways (ross - rachel means "ross dated rachel and rachel dated ross").
- Queues are FIFO (First In, First Out).
- Stacks are LIFO (Last In, First Out).
- You need to check people in the order they were added to the search list, so the search list needs to be a queue. Otherwise, you won't get the shortest path.
- Once you check someone, make sure you don't check them again. Otherwise, you might end up in an infinite loop.
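The mango-seller function only reports whether a seller exists. As a sketch of the recap's shortest-path claim (my addition, reusing the `graph` and `deque` from above), you can record each person's parent while searching and then walk the parents back to recover the path itself:

```python
# BFS that returns the shortest path (fewest edges) from start to goal.
def shortest_path(graph, start, goal):
    search_queue = deque([start])
    parents = {start: None}            # doubles as the "already seen" set
    while search_queue:
        person = search_queue.popleft()
        if person == goal:
            path = []                  # walk the parents back to the start
            while person is not None:
                path.append(person)
                person = parents[person]
            return path[::-1]
        for friend in graph[person]:
            if friend not in parents:  # enqueue each person only once
                parents[friend] = person
                search_queue.append(friend)
    return None                        # no path exists

print(shortest_path(graph, "you", "thom"))  # ['you', 'claire', 'thom']
```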
# Dijkstra's algorithm

Dijkstra's algorithm is an algorithm for finding the shortest paths between nodes in a graph. Dijkstra's algorithm has four steps:

1. Find the cheapest node. This is the node you can get to in the least amount of time.
2. Check whether there's a cheaper path to the neighbors of this node. If so, update their costs.
3. Repeat until you've done this for every node in the graph.
4. Calculate the final path.

When you work with Dijkstra's algorithm, each edge in the graph has a number associated with it. These are called weights. A graph with weights is called a weighted graph; a graph without weights is called an unweighted graph. To calculate the shortest path in an unweighted graph, use breadth-first search. To calculate the shortest path in a weighted graph, use Dijkstra's algorithm.

Graphs can also have cycles: you can start at a node, travel around, and end up at the same node. An undirected graph with only 2 nodes means that both nodes point to each other, which is a cycle; with an undirected graph, each edge adds another cycle. Dijkstra's algorithm handles cycles fine as long as all the weights are positive. What you can't use with Dijkstra's algorithm is negative-weight edges; if you want to find the shortest path in a graph that has negative-weight edges, use the Bellman-Ford algorithm.

In some fields, artificial intelligence in particular, Dijkstra's algorithm or a variant of it is known as uniform cost search and formulated as an instance of the more general idea of best-first search.

## Dijkstra's algorithm Implementation

```python
# the graph hash table
graph = {}
graph["start"] = {}
graph["start"]["a"] = 6
graph["start"]["b"] = 2
graph["a"] = {}
graph["a"]["fin"] = 1
graph["b"] = {}
graph["b"]["a"] = 3
graph["b"]["fin"] = 5
graph["fin"] = {}

# the costs table
infinity = float("inf")
costs = {}
costs["a"] = 6
costs["b"] = 2
costs["fin"] = infinity

# the parents table
parents = {}
parents["a"] = "start"
parents["b"] = "start"
parents["fin"] = None

processed = []  # keep track of processed nodes

# while we have nodes to process: grab the node closest to start ->
# update the costs of its neighbors -> if any neighbor's cost was updated,
# update its parent too -> mark this node as processed -> loop

def find_lowest_cost_node(costs):
    lowest_cost = float("inf")
    lowest_cost_node = None
    for node in costs:  # go through each node
        cost = costs[node]
        # if it's the lowest cost so far and hasn't been processed yet...
        if cost < lowest_cost and node not in processed:
            # ... set it as the new lowest-cost node
            lowest_cost = cost
            lowest_cost_node = node
    return lowest_cost_node

# find the lowest-cost node that you haven't processed yet
node = find_lowest_cost_node(costs)
# if you've processed all the nodes, this while loop is done
while node is not None:
    cost = costs[node]
    neighbors = graph[node]
    for n in neighbors.keys():  # go through all the neighbors of this node
        new_cost = cost + neighbors[n]
        # if it's cheaper to get to this neighbor by going through this node...
        if costs[n] > new_cost:
            costs[n] = new_cost  # ... update the cost for this node
            parents[n] = node    # this node becomes the new parent for this neighbor
    processed.append(node)       # mark the node as processed
    node = find_lowest_cost_node(costs)  # find the next node to process, and loop
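# Added sketch (not part of the original notes): once the loop finishes,
# the parents table lets us recover the cheapest path itself by walking
# backward from "fin" to "start".
path = ["fin"]
while path[-1] != "start":
    path.append(parents[path[-1]])
path.reverse()
print("Cheapest path: " + " -> ".join(path))  # start -> b -> a -> fin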
print("Cost from the start to each node: " + costs) ``` **Recap** - Breadth-first search is used to calculate the shortest path for an unweighted graph. - Dijkstra’s algorithm is used to calculate the shortest path for a weighted graph. - Dijkstra’s algorithm works when all the weights are positive. - If you have negative weights, use the Bellman-Ford algorithm. # Greedy algorithms A greedy algorithm is an algorithmic paradigm that follows the problem solving heuristic of making the locally optimal choice at each stage with the intent of finding a global optimum. In many problems, a greedy strategy does not usually produce an optimal solution, but nonetheless a greedy heuristic may yield locally optimal solutions that approximate a globally optimal solution in a reasonable amount of time. For example, a greedy strategy for the traveling salesman problem (which is of a high computational complexity) is the following heuristic: "At each step of the journey, visit the nearest unvisited city." This heuristic doesn't intend to find a best solution, but it terminates in a reasonable number of steps; finding an optimal solution to such a complex problem typically requires unreasonably many steps. In general, greedy algorithms have five components: - A candidate set, from which a solution is created - A selection function, which chooses the best candidate to be added to the solution - A feasibility function, that is used to determine if a candidate can be used to contribute to a solution - An objective function, which assigns a value to a solution, or a partial solution, and - A solution function, which will indicate when we have discovered a complete solution ## Approximation algorithms When calculating the exact solution will take too much time, an approximation algorithm will work. Approximation algorithms are judged by - How fast they are - How close they are to the optimal solution ```python # covering most states with least radio stations states_needed = set(["mt", "wa", "or", "id", "nv", "ut", "ca", "az"]) stations = {} stations["kone"] = set(["id", "nv", "ut"]) stations["ktwo"] = set(["wa", "id", "mt"]) stations["kthree"] = set(["or", "nv", "ca"]) stations["kfour"] = set(["nv", "ut"]) stations["kfive"] = set(["ca", "az"]) final_stations = set() while states_needed: best_station = None states_covered = set() for station, states in stations.items(): covered = states_needed & states # intersection if len(covered) > len(states_covered): best_station = station states_covered = covered states_needed -= states_covered final_stations.add(best_station) print(final_stations) ``` **Recap** - Greedy algorithms optimize locally, hoping to end up with a global optimum. - NP-complete problems have no known fast solution. - If you have an NP-complete problem, your best bet is to use an approximation algorithm. - Greedy algorithms are easy to write and fast to run, so they make good approximation algorithms. # Dynamic programming In computer science, a problem is said to have optimal substructure if an optimal solution can be constructed from optimal solutions of its subproblems. This property is used to determine the usefulness of dynamic programming and greedy algorithms for a problem. Dynamic programming is both a mathematical optimization method and a computer programming method. In both contexts it refers to simplifying a complicated problem by breaking it down into simpler sub-problems in a recursive manner. 
While some decision problems cannot be taken apart this way, decisions that span several points in time do often break apart recursively. Likewise, in computer science, if a problem can be solved optimally by breaking it into sub-problems and then recursively finding the optimal solutions to the sub-problems, it is said to have optimal substructure.

```python
# Longest common substring.
# Alex typed "hish". Which word did Alex mean to type: "fish" or "vista"?
# cell[i][j] holds the length of the longest common substring ending at
# word_a[i] and word_b[j]; these are the grid-update rules used inside a
# double loop over the grid (which starts filled with zeros).
if word_a[i] == word_b[j]:  # the letters match
    cell[i][j] = cell[i-1][j-1] + 1
else:                       # the letters don't match
    cell[i][j] = 0

# Longest common subsequence.
# Suppose Alex accidentally searched for "fosh". Which word did he mean:
# "fish" or "fort"?
if word_a[i] == word_b[j]:  # the letters match
    cell[i][j] = cell[i-1][j-1] + 1
else:                       # the letters don't match
    cell[i][j] = max(cell[i-1][j], cell[i][j-1])
```

**Recap**

- Dynamic programming is useful when you're trying to optimize something given a constraint.
- You can use dynamic programming when the problem can be broken into discrete subproblems.
- Every dynamic-programming solution involves a grid.
- The values in the cells are usually what you're trying to optimize.
- Each cell is a subproblem, so think about how you can divide your problem into subproblems.
- There's no single formula for calculating a dynamic-programming solution.
- `diff`s use dynamic programming

# K-nearest neighbors

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.). Classification is an example of pattern recognition.

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied while the other independent variables are held fixed.

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

## OCR

Optical character recognition (also optical character reader, OCR) is the mechanical or electronic conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, or a scene photo. OCR is a field of research in pattern recognition, artificial intelligence, and computer vision.

Feature extraction, a basic type of core OCR algorithm, decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections. Extracting features reduces the dimensionality of the representation and makes the recognition process computationally efficient. These features are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and, indeed, most modern OCR software. Nearest-neighbour classifiers such as the k-nearest neighbors algorithm are used to compare image features with stored glyph features and choose the nearest match.
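As a toy sketch of the idea (my own made-up fruit data, not from the book): represent each item by a couple of numeric features, measure Euclidean distance, and take a majority vote among the k nearest labeled neighbors:

```python
# Minimal k-NN classifier sketch: toy data, Euclidean distance, majority vote.
import math
from collections import Counter

# (size, redness) feature vectors with known labels
training = [((6.0, 3.0), "orange"), ((7.0, 2.5), "orange"),
            ((9.0, 8.0), "grapefruit"), ((10.0, 7.0), "grapefruit")]

def classify(point, data, k=3):
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # sort the training items by distance to the query point, keep the k nearest
    nearest = sorted(data, key=lambda item: dist(point, item[0]))[:k]
    # majority vote among the nearest labels
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(classify((7.0, 3.0), training))  # -> orange
```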
## Naive Bayes classifier

In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. It is a popular (baseline) method for text categorization: the problem of judging documents as belonging to one category or another (such as spam or legitimate, sports or politics, etc.) with word frequencies as the features. With appropriate pre-processing, it is competitive in this domain with more advanced methods, including support vector machines. It also finds application in automatic medical diagnosis.

**Recap**

- KNN is used for classification and regression and involves looking at the k nearest neighbors.
- Classification = categorization into a group.
- Regression = predicting a response (like a number).
- Feature extraction means converting an item (like a fruit or a user) into a list of numbers that can be compared.
- Picking good features is an important part of a successful KNN algorithm.

# Other

## Trees

In computer science, a tree is a widely used abstract data type (ADT), or data structure implementing this ADT, that simulates a hierarchical tree structure, with a root value and subtrees of children with a parent node, represented as a set of linked nodes.

A tree data structure can be defined recursively (locally) as a collection of nodes (starting at a root node), where each node is a data structure consisting of a value together with a list of references to nodes (the "children"), with the constraints that no reference is duplicated and none points to the root. Alternatively, a tree can be defined abstractly as a whole (globally) as an ordered tree, with a value assigned to each node.

Both these perspectives are useful: while a tree can be analyzed mathematically as a whole, when actually represented as a data structure it is usually represented and worked with node by node (rather than as a set of nodes and an adjacency list of edges between nodes, as one may represent a digraph, for instance). For example, looking at a tree as a whole, one can talk about "the parent node" of a given node, but in general, as a data structure, a given node only contains the list of its children and does not contain a reference to its parent (if any).

### B-trees

A B-tree is a self-balancing tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children. Unlike self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as disks. It is commonly used in databases and filesystems.
```
| Algorithm | Average  | Worst case |
| --------- | -------- | ---------- |
| Space     | O(n)     | O(n)       |
| Search    | O(log n) | O(log n)   |
| Insert    | O(log n) | O(log n)   |
| Delete    | O(log n) | O(log n)   |
```

### Red-black trees

A red–black tree is a kind of self-balancing binary search tree. Each node of the binary tree has an extra bit, and that bit is often interpreted as the color (red or black) of the node. These color bits are used to ensure the tree remains approximately balanced during insertions and deletions. The tree does not contain any other data specific to its being a red–black tree, so its memory footprint is almost identical to that of a classic (uncolored) binary search tree. In many cases, the additional bit of information can be stored at no additional memory cost.

```
| Algorithm | Average  | Worst case |
| --------- | -------- | ---------- |
| Space     | O(n)     | O(n)       |
| Search    | O(log n) | O(log n)   |
| Insert    | O(log n) | O(log n)   |
| Delete    | O(log n) | O(log n)   |
```

### Splay trees

A splay tree is a self-adjusting binary search tree with the additional property that recently accessed elements are quick to access again. It performs basic operations such as insertion, look-up, and removal in O(log n) amortized time. For many sequences of non-random operations, splay trees perform better than other search trees, even when the specific pattern of the sequence is unknown.

```
| Algorithm | Average  | Worst case         |
| --------- | -------- | ------------------ |
| Space     | O(n)     | O(n)               |
| Search    | O(log n) | amortized O(log n) |
| Insert    | O(log n) | amortized O(log n) |
| Delete    | O(log n) | amortized O(log n) |
```

### Heaps

A heap is a specialized tree-based data structure that satisfies the heap property: if P is a parent node of C, then the key (the value) of P is either greater than or equal to (in a max heap) or less than or equal to (in a min heap) the key of C. The node at the "top" of the heap (with no parent) is called the root node.

The heap is one maximally efficient implementation of an abstract data type called a priority queue; in fact, priority queues are often referred to as "heaps", regardless of how they may be implemented. A common implementation of a heap is the binary heap, in which the tree is a binary tree.

Python: `heapq`, C++: `std::make_heap` etc. in `<algorithm>`, Java: `java.util.PriorityQueue`

## Inverted indexes

An inverted index (also referred to as a postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents (named in contrast to a forward index, which maps from documents to content). The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale, for example, in search engines. Additionally, several significant general-purpose mainframe-based database management systems have used inverted list architectures.
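A minimal sketch of the idea (hypothetical two-word documents, my addition): a plain dictionary mapping each word to the set of documents that contain it:

```python
# Inverted index sketch: word -> set of document ids that contain it.
documents = ["hi there", "hi adit", "there we go"]

index = {}
for doc_id, text in enumerate(documents):
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

print(index["hi"])     # {0, 1}: found without scanning every document
print(index["there"])  # {0, 2}
```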
## The Fourier transform

- betterexplained.com/articles/an-interactive-guide-to-the-fourier-transform/
- Use case: Shazam

## Parallel algorithms

A parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm which can be executed a piece at a time on many different processing devices, and then combined together again at the end to get the correct result. Many parallel algorithms are executed concurrently, though in general concurrent algorithms are a distinct concept, and thus these concepts are often conflated, with which aspect of an algorithm is parallel and which is concurrent not being clearly distinguished. Further, non-parallel, non-concurrent algorithms are often referred to as "sequential algorithms", by contrast with concurrent algorithms.

## MapReduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a map procedure (or method), which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

## Probabilistic data structures

Probabilistic data structures can't give you a definite answer; instead they provide a reasonable approximation of the answer, together with a way to estimate how good that approximation is. They are extremely useful for big-data and streaming applications because they dramatically decrease the amount of memory needed (in comparison to data structures that give you exact answers). In the majority of cases these data structures use hash functions to randomize the items. Because they ignore collisions, they keep their size constant; this is also the reason why they can't give you exact values. The advantages they bring:

- they use a small amount of memory (you can control how much)
- they are easily parallelizable (the hashes are independent)
- they have constant query time (not even amortized constant, as with a dictionary)

Also see: highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining

### Bloom filters

A Bloom filter is a space-efficient **probabilistic data structure** that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.

Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses.
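A toy sketch of the technique (my addition; it salts Python's built-in `hash` to fake k independent hash functions, so it's illustrative rather than production-quality):

```python
# Toy Bloom filter: k salted hashes set/check k bits in a fixed-size array.
# "might contain" can be a false positive; "not in set" is always correct.
class BloomFilter:
    def __init__(self, size=1000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # simulate independent hash functions by salting the input
        return [hash((salt, item)) % self.size for salt in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("dogs.com")
print(bf.might_contain("dogs.com"))  # True: definitely added
print(bf.might_contain("cats.com"))  # False with high probability
```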
### Count–min sketch

In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table it uses only sub-linear space, at the expense of overcounting some events due to collisions.

Count–min sketches are essentially the same data structure as the counting Bloom filters introduced in 1998 by Fan et al. However, they are used differently and therefore sized differently: a count–min sketch typically has a sublinear number of cells, related to the desired approximation quality of the sketch, while a counting Bloom filter is more typically sized to match the number of elements in the set.

Also see: stackoverflow.com/q/6811351

### HyperLogLog

HyperLogLog is an algorithm for the count-distinct problem: approximating the number of distinct elements in a multiset. Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy of 2%, using 1.5 kB of memory. Also a probabilistic data structure.

## Cryptographic hash functions

A cryptographic hash function is a special class of hash function that has certain properties which make it suitable for use in cryptography. It is a mathematical algorithm that maps data of arbitrary size to a bit string of a fixed size (a hash) and is designed to be a one-way function, that is, a function which is infeasible to invert. The only way to recreate the input data from an ideal cryptographic hash function's output is to attempt a brute-force search of possible inputs to see if they produce a match, or to use a rainbow table of matched hashes. Bruce Schneier has called one-way hash functions "the workhorses of modern cryptography". The input data is often called the message, and the output (the hash value or hash) is often called the message digest or simply the digest.

The ideal cryptographic hash function has five main properties:

- it is deterministic, so the same message always results in the same hash
- it is quick to compute the hash value for any given message
- it is infeasible to generate a message from its hash value except by trying all possible messages
- a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value
- it is infeasible to find two different messages with the same hash value

## Diffie-Hellman key exchange

Diffie–Hellman key exchange (DH) is a method of securely exchanging cryptographic keys over a public channel. Diffie–Hellman is used to secure a variety of Internet services. However, research published in October 2015 suggests that the parameters in use for many DH Internet applications at that time were not strong enough to prevent compromise by very well-funded attackers, such as the security services of large governments.

Also see https://weakdh.org/

## Simplex algorithm and Linear programming

In mathematical optimization, Dantzig's simplex algorithm (or simplex method) is a popular algorithm for linear programming. All the graph algorithms in these notes can be done through linear programming instead: linear programming is a much more general framework, and graph problems are a subset of it.
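As a small sketch of what that looks like in practice (assuming SciPy is available; `scipy.optimize.linprog` solves linear programs, defaulting to the HiGHS solvers rather than the classic simplex method):

```python
# Linear programming sketch: maximize 3x + 2y
# subject to x + y <= 4, x + 3y <= 6, x >= 0, y >= 0.
# linprog minimizes, so we negate the objective coefficients.
from scipy.optimize import linprog

result = linprog(c=[-3, -2],
                 A_ub=[[1, 1], [1, 3]],
                 b_ub=[4, 6],
                 bounds=[(0, None), (0, None)])
print(result.x)     # optimal (x, y), here [4. 0.]
print(-result.fun)  # optimal objective value, here 12.0
```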