This document accompanies and was written with the help of this video.
I'll add the timestamps for the snippets too.
A data structure is a way to store and organize data in a computer so that it can be used effectively.
We can talk about them as:
- Abstract Data Types (ADTs): Mathematical/Logical models
- Implementation
Whenever we talk about a data structure, we talk about:
- Logical view
- Operations
- Cost of operations
- Implementation
- Store a given number of elements of a given data type.
- We should be able to write/modify element at a position
- Read element at a position
A data structure that gives us an implementation for this is the array.
- Empty when size is 0
- Insert elements in a list
- Remove elements
- Count the number of elements
- Read/Modify element at a position
- Specify the data-type of the list (int, char etc.)
These features can also be implemented with an array, but we need to define a max size, and we can always exhaust it.
To add more elements when the array is full, we create a new, larger array, copy the previous elements into it, and free the memory of the previous array.
Now, what is the best size to increase the array by?
A common strategy: create a new array with double the size of the previous one and copy the previous items into it.
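As a minimal sketch in C (the struct layout and names such as `DynArray` and `push_back` are illustrative, not from the video), appending with the doubling strategy might look like this:

```c
#include <stdlib.h>

/* Illustrative dynamic-array sketch. */
typedef struct {
    int *data;
    int size;      /* number of elements in use */
    int capacity;  /* number of allocated slots */
} DynArray;

/* Append an element, doubling the capacity when the array is full. */
void push_back(DynArray *a, int value) {
    if (a->size == a->capacity) {
        int new_capacity = a->capacity ? a->capacity * 2 : 1;
        int *bigger = malloc(new_capacity * sizeof(int));
        for (int i = 0; i < a->size; i++)
            bigger[i] = a->data[i];   /* copy previous elements */
        free(a->data);                /* free the memory of the previous array */
        a->data = bigger;
        a->capacity = new_capacity;
    }
    a->data[a->size++] = value;
}
```

Doubling keeps the amortized cost of an append constant, since each resize pays for the appends that preceded it.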
- Access - read/write the element at an index.
  This takes constant time, O(1), because the address can be computed directly:
  Address of A[i] = base_address + i * (size of one element in bytes)
- Insert:
  - When adding at the end, it takes constant time, O(1); when the array is full, resizing makes it O(n).
  - When adding at a position, the worst case is adding at the start and shifting all the elements, which makes the insert operation proportional to the length of the list, so the time taken is O(n) (a sketch of the shifting follows this list).
- Remove - similar to insert, so O(n).
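A sketch of the shifting that makes positional insert O(n), assuming the array still has spare capacity (names are illustrative, with no bounds checking):

```c
/* Insert `value` at index `pos`, shifting later elements one slot right.
   Assumes *size < capacity. */
void insert_at(int *arr, int *size, int pos, int value) {
    for (int i = *size; i > pos; i--)
        arr[i] = arr[i - 1];   /* shift right, starting from the end */
    arr[pos] = value;
    (*size)++;
}
```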
Though good enough for many uses, arrays aren't the best implementation of a dynamic list.
A linked list is a sequence of nodes, where each node contains an item and a link to the next node.
Head Node: the first element of a linked list.
The address of the head node gives us access to the complete list.
- Access: unlike an array, we cannot access elements in constant time. The time taken to access an element is proportional to the size of the list, O(n).
- Insertion: O(n) to reach the position, but we don't have to shift elements like in an array.
While traversing the elements of a linked list, it's better to iterate over them than to traverse them recursively (a sketch follows), because:
- When we print iteratively, we only use one temporary variable.
- When we print using recursion, we use memory on the call stack for every node.
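A minimal iterative traversal in C (the `Node` struct is the usual singly linked list node; `print_list` is an illustrative name):

```c
#include <stdio.h>

struct Node {
    int data;
    struct Node *next;   /* link to the next node */
};

/* Iterative traversal: a single temporary pointer walks the list. */
void print_list(struct Node *head) {
    for (struct Node *cur = head; cur != NULL; cur = cur->next)
        printf("%d ", cur->data);
    printf("\n");
}
```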
A doubly linked list is a linked data structure that consists of a set of sequentially linked records called nodes. Each node contains three fields: two link fields and one data field.
It is just like a singly linked list, except each node also has a pointer to the previous node.
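A node of a doubly linked list might be declared like this in C (a minimal sketch):

```c
struct DNode {
    struct DNode *prev;   /* extra pointer to the previous node */
    int data;
    struct DNode *next;   /* pointer to the next node */
};
```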
- Reverse lookup becomes possible
- Extra memory is used for the pointer to the previous node
A list with the restriction that insertion and deletion can be performed only at one end, called the top.
- Push - this is basically insertion at the top.
- Pop - this removes the most recent item from the stack.
- Top - returns the element at the top of the stack.
- isEmpty - checks if the stack is empty.
All of the operations on a stack take constant time; their big O notation is O(1).
Stack is called a Last In First Out (LIFO) structure
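A minimal array-based stack sketch in C (the fixed `MAX_SIZE` and the absence of overflow/underflow checks are simplifying assumptions):

```c
#define MAX_SIZE 100

/* Array-based stack of ints; every operation below is O(1). */
typedef struct {
    int items[MAX_SIZE];
    int top;                             /* index of top element, -1 when empty */
} Stack;

void stack_init(Stack *s)               { s->top = -1; }
int  stack_is_empty(const Stack *s)     { return s->top == -1; }
void stack_push(Stack *s, int x)        { s->items[++s->top] = x; }
int  stack_pop(Stack *s)                { return s->items[s->top--]; }
int  stack_top(const Stack *s)          { return s->items[s->top]; }
```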
- Function calls/Recursion
- Perform undo in an editor
- Balanced parentheses checking in a compiler
<operand> <operator> <operand>
Infix notation is the notation commonly used in arithmetical and logical formulae and statements. It is characterized by the placement of operators between operands
Example:
2 + 2
3 * 4 + a
We can't evaluate this directly while programming, because the notation is ambiguous: without precedence rules there is doubt about which operation to apply first. We have better options - prefix and postfix.
Order of operations:
- Parenthesis
- Exponents (in case of multiple - right to left)
- Multiplication and Division
- Addition and Subtraction
<operator> <operand> <operand>
Also known as Polish notation; it was introduced in 1924.
Here, the operator is placed before the operands.
Example:
+ 2 2
+ * 3 4 a
In this notation, an operand is associated with exactly one operator, so there is no ambiguity.
<operand> <operand> <operator>
Also known as Reverse Polish Notation.
Programmatically, this notation is the easiest to parse and the least time- and memory-consuming to evaluate.
Prefix can be evaluated with similar complexity, but the algorithm to evaluate postfix expressions is particularly intuitive and simple.
Example:
2 2 +
3 4 * a +
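A sketch of the intuitive postfix evaluation algorithm in C, using a stack: scan left to right, push operands, and on an operator pop two operands, apply it, and push the result. This sketch assumes single-digit operands and no spaces (e.g. "34*2+" for 3 4 * 2 +), and has no error handling:

```c
#include <ctype.h>

/* Evaluate a postfix expression over single-digit operands. */
int eval_postfix(const char *expr) {
    int stack[64];
    int top = -1;
    for (const char *p = expr; *p != '\0'; p++) {
        if (isdigit((unsigned char)*p)) {
            stack[++top] = *p - '0';   /* operand: push its numeric value */
        } else {
            int b = stack[top--];      /* right operand */
            int a = stack[top--];      /* left operand */
            switch (*p) {
                case '+': stack[++top] = a + b; break;
                case '-': stack[++top] = a - b; break;
                case '*': stack[++top] = a * b; break;
                case '/': stack[++top] = a / b; break;
            }
        }
    }
    return stack[top];                 /* the final result */
}
```

For example, eval_postfix("34*2+") computes (3 * 4) + 2 = 14.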
A list with the restriction that insertion can be performed only at one end, called the rear, and deletion only at the other end, called the front.
- Enqueue / Push - this is basically insertion, at the rear.
- Dequeue / Pop - this removes the first item from the queue, at the front.
- Front / Peek - returns the element at the front of the queue.
- isEmpty - checks if the queue is empty.
- isFull - checks if the queue is full.
All of the operations on a queue take constant time; their big O notation is O(1).
Queue is called a First In First Out (FIFO) structure
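A circular array-based queue sketch in C (fixed `CAPACITY` and no error handling on full/empty, both simplifying assumptions):

```c
#define CAPACITY 100

/* Circular array-based queue of ints; every operation below is O(1). */
typedef struct {
    int items[CAPACITY];
    int front;   /* index of the first element */
    int count;   /* number of elements currently stored */
} Queue;

void q_init(Queue *q)           { q->front = 0; q->count = 0; }
int  q_is_empty(const Queue *q) { return q->count == 0; }
int  q_is_full(const Queue *q)  { return q->count == CAPACITY; }

void enqueue(Queue *q, int x) {  /* insert at the rear */
    q->items[(q->front + q->count) % CAPACITY] = x;
    q->count++;
}

int dequeue(Queue *q) {          /* remove from the front */
    int x = q->items[q->front];
    q->front = (q->front + 1) % CAPACITY;
    q->count--;
    return x;
}
```

The modulo arithmetic wraps the indices around, so the array is reused instead of running out at the right end.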
- We have a shared resource that can only serve one request at a time
- Process Scheduling
A collection of entities called nodes, linked together to simulate a hierarchy.
- It is a non-linear, hierarchical data structure
- Top-most node is called the root of the tree; it is the only node without a parent.
- Each node contains data and links/references to other nodes, called its children
- A node which has children is called a Parent Node.
- Nodes with the same parent are called Sibling Nodes
- When a node has a direct link to another node, we have a parent-child relationship
- Nodes that do not have children are called Leaf Nodes. All other nodes are called Internal Nodes
- We can only move from parent -> Child in a tree
- If we can go from A to B, A is the Ancestor of B and B is the Descendant of A.
It can be called a recursive data structure: it consists of a distinguished node called the root and some sub-trees, arranged so that the root contains links to the roots of all its sub-trees.
Recursion is basically reducing something in a self-similar manner, and this property of trees is used everywhere.
In a tree with N nodes, there will be exactly (N-1) links/edges.
Every node has exactly one incoming edge/link, except the root, which has none.
- Depth of x: the length of the path from the root to x, i.e., the number of edges in the path from the root to x.
  The depth of the root node is 0.
- Height of x: the number of edges in the longest path from x to a leaf.
  The height of a tree is the height of its root node.
  The height of an empty tree is -1; the height of a tree with 1 node is 0.
- Storing naturally hierarchical data, like a file system.
- Organize data for quick search, insertion and deletion.
- A special kind of tree, the Trie, is used to store a dictionary and for dynamic spell checking.
- Used in Network Routing Algorithms.
A type of tree where each node can have at most 2 children: one is called the left child, the other the right child.
A binary tree where each node has either 2 or 0 child nodes.
A binary tree where all levels, except possibly the last, are completely filled, and all nodes are as far left as possible.
Max number of nodes at level i = 2^i
A binary tree in which all levels are completely filled.
Maximum number of nodes in a binary tree with height h
= 2^0 + 2^1 + 2^2 + .... + 2^h
= 2^(h+1) - 1
= 2^(number of levels) - 1
Height of a complete binary tree with n nodes = floor(log2(n))
Min height possible for a binary tree with n nodes: floor(log2(n))
Max height possible: n - 1
A binary tree is balanced if, for every node, the difference between the heights of the left and right subtrees is not more than some k (usually 1):
difference = h(left) - h(right)
We can implement binary trees in the following ways:
- Dynamically created nodes linked to each other using pointers or references (a node sketch follows this list).
- In some special cases, we can use arrays - this is how heaps are implemented, for example.
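A pointer-based binary tree node might be declared like this in C (a minimal sketch, reused by the tree code below):

```c
struct TreeNode {
    int data;
    struct TreeNode *left;    /* link to the left child */
    struct TreeNode *right;   /* link to the right child */
};
```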
For a node at index i in a complete binary tree (0-indexed array):
left-child-index = 2i + 1
right-child-index = 2i + 2
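In code (the parent formula follows from the two child formulas):

```c
/* Index arithmetic for a complete binary tree stored in an array. */
int left_child(int i)  { return 2 * i + 1; }
int right_child(int i) { return 2 * i + 2; }
int parent(int i)      { return (i - 1) / 2; }   /* valid for i > 0 */
```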
A type of binary tree in which, for each node, the values of all nodes in the left subtree are less than or equal to the node's value (to handle duplicates), and the values of all nodes in the right subtree are greater.
The search operation has a complexity of O(log2(n)) for a balanced tree.
The search we perform in this tree is a binary search and hence, the name.
To prevent the worst case, we can keep the tree balanced.
To insert data, we search for the position with logarithmic complexity, and simply add the node at the place where the search ends (a sketch follows).
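A recursive search-and-insert sketch in C, assuming the `TreeNode` struct from earlier (no balancing, no error handling on `malloc`):

```c
#include <stdlib.h>

/* Search: follow left/right according to the BST ordering. */
struct TreeNode *bst_search(struct TreeNode *root, int key) {
    if (root == NULL || root->data == key) return root;
    if (key <= root->data) return bst_search(root->left, key);
    return bst_search(root->right, key);
}

/* Insert: search for the spot, then attach a new node there. */
struct TreeNode *bst_insert(struct TreeNode *root, int key) {
    if (root == NULL) {                       /* found the place: add here */
        struct TreeNode *n = malloc(sizeof *n);
        n->data = key;
        n->left = n->right = NULL;
        return n;
    }
    if (key <= root->data) root->left = bst_insert(root->left, key);
    else                   root->right = bst_insert(root->right, key);
    return root;
}
```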
During insertion and deletion, the balance of the tree is disturbed and there are ways of restoring it.
It is the process of visiting (reading/processing the data in the node) each node in the tree exactly once, in some order.
Based on the order in which nodes are visited, they can be classified into:
- Breadth-first
  This is called level-order traversal; here we traverse the tree level by level.
- Depth-first
  The core idea is that when we visit a child, we visit its whole sub-tree before moving on. The code here is relatively simple, and very intuitive with recursion.
Time complexity of all these algorithms is O(n).
Space complexity of all these algorithms is O(h), where h is the height of the binary tree:
- in the worst case, O(n)
- in the best/average case, O(log n)
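A level-order traversal sketch in C, using a fixed-size array as the queue (the `MAX_NODES` bound is an illustrative simplification; a real implementation would use a growable queue):

```c
#include <stdio.h>

#define MAX_NODES 1024

void level_order(struct TreeNode *root) {
    struct TreeNode *queue[MAX_NODES];
    int front = 0, rear = 0;
    if (root != NULL) queue[rear++] = root;     /* enqueue the root */
    while (front < rear) {
        struct TreeNode *cur = queue[front++];  /* dequeue */
        printf("%d ", cur->data);               /* visit the node */
        if (cur->left)  queue[rear++] = cur->left;
        if (cur->right) queue[rear++] = cur->right;
    }
}
```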
Based on the relative order of left sub-tree, right sub-tree and root, we have different depth-first approaches (sketches follow this list):
- Preorder Traversal: root -> left -> right
- Inorder Traversal: left -> root -> right
  Inorder traversal of a binary search tree gives us a sorted list. This is because everything in the left subtree is smaller, and inorder traversal visits it first, yielding ascending order.
- Postorder Traversal: left -> right -> root
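The three depth-first orders as recursive C sketches (same `TreeNode` as before; "visit" here just prints the data):

```c
#include <stdio.h>

void preorder(struct TreeNode *root) {
    if (root == NULL) return;
    printf("%d ", root->data);   /* root */
    preorder(root->left);        /* left */
    preorder(root->right);       /* right */
}

void inorder(struct TreeNode *root) {
    if (root == NULL) return;
    inorder(root->left);         /* left */
    printf("%d ", root->data);   /* root: for a BST this prints sorted order */
    inorder(root->right);        /* right */
}

void postorder(struct TreeNode *root) {
    if (root == NULL) return;
    postorder(root->left);       /* left */
    postorder(root->right);      /* right */
    printf("%d ", root->data);   /* root */
}
```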
There can be three cases when we try to delete a node, and we handle them all separately (a combined sketch follows):
- The node has no child
  In this case, we simply delete the link to the node and free its memory.
- The node has one child
  In this case, we delete the node and free its memory, and in addition link the parent to the child node.
- The node has 2 children
  This is the most complex case and has 2 solutions:
  - Find the minimum value in the right subtree and copy it into the targeted node. Then delete the duplicate node with a recursive call, reducing the problem to one of the two cases above.
  - Find the maximum value in the left subtree and copy it into the targeted node. Then delete the duplicate node with a recursive call, reducing the problem to one of the two cases above.
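A deletion sketch in C covering all three cases, using the min-of-right-subtree solution for the two-children case (same `TreeNode` struct; no balancing):

```c
#include <stdlib.h>

/* Minimum of a subtree: keep going left. */
struct TreeNode *find_min(struct TreeNode *root) {
    while (root->left != NULL) root = root->left;
    return root;
}

/* Delete `key` from the BST rooted at `root`; returns the new root. */
struct TreeNode *bst_delete(struct TreeNode *root, int key) {
    if (root == NULL) return NULL;
    if (key < root->data) {
        root->left = bst_delete(root->left, key);
    } else if (key > root->data) {
        root->right = bst_delete(root->right, key);
    } else if (root->left == NULL && root->right == NULL) {
        free(root);                       /* case 1: no child */
        return NULL;
    } else if (root->left == NULL) {
        struct TreeNode *child = root->right;
        free(root);                       /* case 2: one child */
        return child;
    } else if (root->right == NULL) {
        struct TreeNode *child = root->left;
        free(root);                       /* case 2: one child */
        return child;
    } else {                              /* case 3: two children */
        struct TreeNode *min = find_min(root->right);
        root->data = min->data;           /* copy the successor's value */
        root->right = bst_delete(root->right, min->data);  /* delete duplicate */
    }
    return root;
}
```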
The inorder successor of a node is the next element we get when performing inorder traversal of the tree, i.e., the next greatest element in the tree.
There are 2 possibilities when finding the inorder successor (a sketch follows):
- The target node has a right subtree
  To find the next element, go as far left as possible in the right sub-tree, i.e., find the minimum value in the right sub-tree.
- There is no right subtree
  In this case, the parent node might already have been visited, so we go to the nearest ancestor for which the given node is in the left subtree.
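A sketch in C, reusing `find_min` from the deletion sketch; the second case walks down from the root, remembering the last ancestor whose left subtree contains the node (assumes distinct keys):

```c
struct TreeNode *inorder_successor(struct TreeNode *root,
                                   struct TreeNode *node) {
    if (node->right != NULL)             /* case 1: right subtree exists */
        return find_min(node->right);    /* min of the right subtree */

    /* case 2: nearest ancestor for which `node` is in the left subtree */
    struct TreeNode *successor = NULL;
    struct TreeNode *cur = root;
    while (cur != NULL && cur != node) {
        if (node->data < cur->data) {
            successor = cur;             /* candidate; try to get closer */
            cur = cur->left;
        } else {
            cur = cur->right;
        }
    }
    return successor;                    /* NULL if `node` is the maximum */
}
```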
This is a non-linear data structure, just like a tree: a collection of objects or entities, called nodes/vertices, that are connected by edges.
In a tree, there are rules dictating the connections between nodes; for instance, a tree with N nodes must have exactly (N-1) edges, and each parent-child pair is connected by exactly one edge.
Contrary to this, a graph has no rules dictating the connections among the nodes: it contains nodes and edges which can be connected in any possible way.
So, a tree is just a special kind of graph.
A graph G is an ordered pair of a set V of vertices and a set E of edges.
G = (V, E)
Ordered pair: a pair of mathematical objects, where order matters, for example,
(a, b) != (b, a) if (a != b)
Unordered pair: simply a set of 2 mathematical objects with no specific order, for example,
{a, b} = {b, a}
How is an edge represented?
An edge is uniquely identified by its 2 endpoints.
Edges can be of two types:
- Directed: the connection is one-way.
- Undirected: the connection is two-way.
A graph with all directed edges is called a directed graph or a Digraph
A graph with all undirected edges is called an undirected graph
Sometimes, not all connections in a graph can be treated as equal; some connections are preferable to others, for example in a road network.
In such cases, we associate a weight or cost with each edge of the graph. These graphs are called Weighted Graphs.
We can actually view all graphs as weighted graphs: an unweighted graph can be seen as a weighted graph in which all edges have the same weight, typically assumed to be 1 unit.
Any kind of real life application can be modelled using a graph, and they can be used to store all kinds of data.
Once data is modelled using a graph, a lot of problems can easily be solved by using standard graph theory algorithms.
Examples:
- A Social Network
- Interlinked webpages - WWW (World wide web)
- Web crawling - this is basically graph traversal, used to recommend good search results.
- Intercity road networks - Weighted undirected graph.
- Intracity road networks - Weighted directed graphs: This is because there can be one-way roads in a city.
- Self loop:
  An edge is called a self loop, or a self edge, if it involves only one vertex. Self loops complicate algorithms and make working with graphs more difficult.
  These can be seen in webpages, where a webpage links to itself.
- Multiedge / Parallel edge:
  An edge is called a multiedge if it occurs more than once in a graph.
  These can be seen when modelling a flight network using graphs.
If a graph has neither self loops nor multiedges, it is called a simple graph.
Number of edges:
The minimum number of edges in a graph is 0.
A maximum can only be given for a simple graph, because if self loops or multiedges are involved, the number of edges is unbounded.
- Directed
  Here, at most, there is a directed edge from every node to every other node.
  So, in general, if |V| = n, then for the number of edges E:
  0 <= E <= n(n-1)
- Undirected
  Here, there is at most one edge between any pair of nodes.
  So, similar to a directed graph, if |V| = n, then for the number of edges E:
  0 <= E <= n(n-1) / 2
- Dense graph: the number of edges is close to the square of the number of nodes, i.e., close to the maximum number of edges.
- Sparse graph: the number of edges is comparatively small, typically close to the number of vertices.
- Path - a sequence of vertices where each adjacent pair is connected by an edge.
  In a directed graph, all edges must also be aligned in one direction, the direction of the path.
- Simple path - a path in which no vertices, and hence no edges, are repeated.
  Most of the time when we say path, we mean a simple path. For the unrestricted notion we use the term walk, and so the definition of a simple path becomes: a walk in which no vertices, and hence no edges, are repeated.
- Trail - a walk in which no edges are repeated.
  If a trail is possible between two vertices, then certainly a simple path, or just a path, is also possible.
- Strongly connected graph - a graph in which there is a path from any vertex to any other vertex.
  - If it's an undirected graph, we simply call it connected.
  - If it's a directed graph, we call it strongly connected.
- Weakly connected graph - a directed graph that is not strongly connected, but becomes connected if all edges are treated as undirected.
- Closed walk - a walk that starts and ends at the same vertex, with length (number of edges) > 0.
  Some use the term cycle for a closed walk, but the term cycle generally refers to a simple cycle.
- Simple cycle - a closed walk in which there is no repetition other than the start and end vertices.
- Acyclic graph - a graph with no cycle. A tree is an example of an acyclic graph.
We can store the graph in the computer's memory in different ways. Here we analyse the time and space complexity of different methods:
- Edge List
  We create 2 lists:
  - one to store vertices
  - one to store edges
  Vertices can be simple strings, and in a realistic scenario the strings won't be very long, so the space used is proportional to the number of vertices.
  Edges can be stored in the form:
  struct Edge { int start_vertex; int end_vertex; int weight; };
  Here, the space used is proportional to the number of edges.
  Hence, the total space complexity of storing a graph this way is O(|V| + |E|). We cannot do much better if we want to store a graph in memory, so we are good on memory usage.
  For time complexity, the most frequently performed operations are:
  - Find the nodes adjacent to a given node - for this, we have to scan the whole edge list, i.e., perform a linear search. The time complexity of this operation is O(|E|).
  - Find if two given nodes are connected - similar to the previous case, we have to perform a linear search, so the time complexity is also O(|E|).
  In terms of the number of vertices, the time complexity is O(|V| * |V|), because the maximum number of edges is n(n-1). That is costly.
- Adjacency Matrix
  Here, instead of having an edge list, we store the data in a matrix (2-D array), called the adjacency matrix.
  We fill in the values as:
  A[i][j] = 1 if an edge exists from i to j, and 0 otherwise.
  If we introduce weights, then instead of 1 we store the weight of each edge, and we fill all the empty entries with infinity or -infinity.
  Here, the time complexity is really good for both operations:
  - Find the nodes adjacent to a given node - O(|V|) (scan one row).
  - Find if two given nodes are connected - O(1) (one lookup).
  Another advantage is that for an undirected graph the matrix is symmetric, so we only have to look at half of it.
  The drawback is that it requires a lot of memory: the space complexity of this representation is O(|V|^2). Thus, we only use an adjacency matrix if the graph is dense; for a sparse graph, which most real-life graphs are, an adjacency matrix is wasteful.
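A minimal adjacency-matrix sketch in C (the vertex count is fixed for illustration; the global array is zero-initialised, meaning "no edge"):

```c
#define NUM_VERTICES 5

/* adj[i][j] is 1 if an edge exists from i to j, 0 otherwise. */
int adj[NUM_VERTICES][NUM_VERTICES];

void add_edge_directed(int u, int v)   { adj[u][v] = 1; }
void add_edge_undirected(int u, int v) { adj[u][v] = adj[v][u] = 1; }
int  has_edge(int u, int v)            { return adj[u][v]; }   /* O(1) */
```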
- Adjacency List
  The problem with the adjacency matrix is that it requires a lot of space; the adjacency list largely solves this issue.
  We know that most graphs are sparse. So instead of storing a value for every possible pair, for each vertex we directly store a list of its neighbours, where the indexes of the list don't mean anything.
  This solves the space problem, and the time complexity remains almost the same:
  - Find the nodes adjacent to a given node - O(|V|) in the worst case.
  - Find if two given nodes are connected - O(|V|) in the worst case.
  To improve performance further, instead of storing the neighbours in an array, we can store them in a linked list or even a binary search tree. A sketch follows.
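A linked-list-based adjacency list sketch in C (fixed vertex count and no `malloc` error handling, both simplifications):

```c
#include <stdlib.h>

#define NUM_VERTICES 5

struct AdjNode {
    int vertex;              /* the neighbour this entry points to */
    struct AdjNode *next;
};

struct AdjNode *adj_list[NUM_VERTICES];   /* one list head per vertex */

void add_edge(int u, int v) {             /* directed edge u -> v */
    struct AdjNode *n = malloc(sizeof *n);
    n->vertex = v;
    n->next = adj_list[u];                /* prepend to u's list: O(1) */
    adj_list[u] = n;
}

int is_connected(int u, int v) {          /* O(degree(u)), at most O(|V|) */
    for (struct AdjNode *cur = adj_list[u]; cur != NULL; cur = cur->next)
        if (cur->vertex == v) return 1;
    return 0;
}
```

Total space is O(|V| + |E|): one head pointer per vertex plus one list node per edge.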