## 0. A project of your choice that might be useful for your research!

In this case, get in touch with the instructor and the assistant before you start developing your project.

---
## 1. (Complex) High-Performance eigenvalue solver

### Problem statement

Solve the **eigenvalue problem** efficiently for a large sparse matrix:

$$
Ax = \lambda x
$$

where $A$ is a symmetric matrix.

### Implementation steps

1. Generate a large random sparse matrix using `scipy.sparse`.
2. Compare three approaches:
   - Baseline: NumPy's `numpy.linalg.eig`.
   - Optimized: SciPy's `scipy.sparse.linalg.eigs`.
   - High-performance: custom Numba-based iterative solver.
3. Profile runtime and memory usage.
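
A minimal benchmarking sketch for the first two approaches, assuming a random symmetric sparse matrix (the Numba-based solver is left to the project); `eigsh` is used here as the symmetric variant of `scipy.sparse.linalg.eigs`:

```python
import time
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh  # symmetric counterpart of eigs

# Illustrative size/density; scale up for the actual benchmarks.
n = 2000
A = sp.random(n, n, density=1e-3, format="csr", random_state=0)
A = 0.5 * (A + A.T)  # symmetrize

# Baseline: dense eigendecomposition with NumPy.
t0 = time.perf_counter()
dense_vals = np.linalg.eig(A.toarray())[0]
t_dense = time.perf_counter() - t0

# Optimized: a few largest-magnitude eigenvalues from the sparse solver.
t0 = time.perf_counter()
sparse_vals = eigsh(A, k=5, which="LM", return_eigenvectors=False)
t_sparse = time.perf_counter() - t0

print(f"dense eig: {t_dense:.3f} s, sparse eigsh (k=5): {t_sparse:.3f} s")
```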
### Expected output

- Performance comparison graphs (runtime vs. matrix size).
- Trade-off analysis between accuracy and efficiency.

---
## 2. (Complex) Large-Scale data processing: profiling and optimization

### Problem statement

Optimize **large-scale data processing** operations for a dataset with $10^8$ rows.

### Implementation steps

1. Load and analyze a dataset (e.g., financial data, sensor logs).
2. Profile performance using `cProfile`, `line_profiler`, and `memory_profiler`.
3. Optimize bottlenecks:
   - Replace loops with NumPy vectorization.
   - Use Numba for fast computations.
   - Optimize storage: compare CSV, HDF5, Parquet.
4. Benchmark before and after optimizations.
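
A minimal `cProfile` sketch of the loop-vs.-vectorization comparison in steps 2-3; the toy workload is illustrative, not part of the assignment:

```python
import cProfile
import pstats
import numpy as np

def slow_sum_of_squares(values):
    # Python-level loop: the kind of bottleneck the profiler should expose.
    total = 0.0
    for v in values:
        total += v * v
    return total

def fast_sum_of_squares(values):
    # Vectorized replacement using NumPy.
    return float(np.dot(values, values))

data = np.random.rand(10**6)

profiler = cProfile.Profile()
profiler.enable()
slow_sum_of_squares(data)
fast_sum_of_squares(data)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```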
### Expected output

- Performance graphs (before vs. after optimization).
- Report on bottleneck analysis and applied optimizations.

---
## 3. (Complex) Parallel K-Means clustering on HPC

### Problem statement

Optimize **K-Means clustering** for large datasets using parallelization.

### Mathematical formulation

Given data points $x_1, x_2, ..., x_N$, partition them into $K$ clusters by minimizing:

$$
J = \sum_{i=1}^{N} \min_{k} \| x_i - \mu_k \|^2
$$

where $\mu_k$ are the cluster centroids.

### Implementation steps

1. Baseline: Naïve K-Means with NumPy.
2. Parallelized version using Numba (`prange`).
3. GPU-accelerated version using CuPy or PyTorch.
4. Test on large datasets (e.g., MNIST, synthetic Gaussian blobs).
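
A minimal NumPy sketch of the baseline (Lloyd's algorithm); the Numba and GPU versions would parallelize the assignment step across points:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Naive K-Means with NumPy broadcasting."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

X = np.random.rand(10_000, 2)          # stand-in for Gaussian blobs / MNIST
centroids, labels = kmeans(X, k=5)
print(centroids)
```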
### Expected output

- Speedup graphs (serial vs. CPU parallel vs. GPU).
- Clustering performance comparison (runtime vs. dataset size).

---
## 4. Parallel sorting algorithm benchmark

### Problem statement

Compare different **parallel sorting algorithms**.

### Implementation steps

1. Implement different sorting algorithms:
   - Merge Sort (NumPy baseline)
   - Parallel Merge Sort (Numba `prange`)
   - Quicksort with parallel partitioning
2. Compare performance for large random datasets.
3. Profile memory usage and scalability.
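
A minimal serial sketch of the chunked merge-sort structure; a Numba `prange` version would parallelize the per-chunk sorting loop:

```python
import numpy as np

def merge(a, b):
    """Merge two sorted arrays into one sorted array."""
    out = np.empty(len(a) + len(b), dtype=a.dtype)
    i = j = k = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out[k] = a[i]
            i += 1
        else:
            out[k] = b[j]
            j += 1
        k += 1
    out[k:] = a[i:] if i < len(a) else b[j:]
    return out

def chunked_merge_sort(x, n_chunks=8):
    # Sort chunks independently (the part a parallel version distributes),
    # then merge the sorted chunks pairwise until one array remains.
    chunks = [np.sort(c) for c in np.array_split(x, n_chunks)]
    while len(chunks) > 1:
        chunks = [merge(chunks[i], chunks[i + 1]) if i + 1 < len(chunks)
                  else chunks[i] for i in range(0, len(chunks), 2)]
    return chunks[0]

x = np.random.rand(10**6)
assert np.array_equal(chunked_merge_sort(x), np.sort(x))
```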
### Expected output

- Performance benchmarks (runtime vs. input size).
- Profiling report on efficiency.

---
## 5. (Complex) Parallel matrix multiplication using MPI

### Problem statement

Implement a **parallel matrix multiplication algorithm** to compute $C = A \times B$ efficiently for large matrices.

### Implementation steps

- Implement a serial matrix multiplication.
- Distribute rows of $A$ across MPI processes.
- Use `mpi4py` to communicate required portions of $B$.
- Gather results into the final matrix $C$.
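
A minimal `mpi4py` sketch of the row-distribution scheme; for simplicity it broadcasts all of $B$ and assumes the matrix size is divisible by the number of processes (run with e.g. `mpirun -n 4 python script.py`):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1024  # assumed divisible by the number of processes
if rank == 0:
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)
else:
    A = None
    B = np.empty((n, n))

# Scatter row blocks of A; broadcast the full B to every process.
local_A = np.empty((n // size, n))
comm.Scatter(A, local_A, root=0)
comm.Bcast(B, root=0)

# Each process computes its block of rows of C.
local_C = local_A @ B

# Gather the row blocks back into the full result on rank 0.
C = np.empty((n, n)) if rank == 0 else None
comm.Gather(local_C, C, root=0)

if rank == 0:
    print("max abs error:", np.abs(C - A @ B).max())
```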
### Expected output

- Performance benchmarks (serial vs. parallel) for increasing matrix sizes (e.g., $256 \times 256$ to $4096 \times 4096$).
- Memory usage analysis.

---
## 6. (Complex) Parallelized PageRank Algorithm

### Problem statement

Implement a **parallelized version of the PageRank algorithm** to rank web pages in a large graph.

### Mathematical formulation

Given a graph $G = (V, E)$, the PageRank $PR(v)$ of a node $v$ is computed iteratively as:

$$
PR(v) = \frac{1-d}{N} + d \sum_{u \in M(v)} \frac{PR(u)}{L(u)}
$$

where $d$ is the damping factor, $N$ is the total number of nodes, $M(v)$ is the set of nodes pointing to $v$, and $L(u)$ is the number of outbound links from $u$.

### Implementation steps

1. Implement a serial version using NumPy.
2. Parallelize the iterative updates using Numba or MPI.
3. Test on synthetic graphs or real-world datasets (e.g., Stanford Web Graph).
4. Compare convergence rates and runtime.
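
A minimal serial sketch of the power-iteration update on a dense adjacency matrix (dangling nodes are handled crudely by pretending they have one outbound link; the toy ring graph is illustrative):

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-8, max_iter=100):
    """Serial PageRank; adj[i, j] = 1 if node i links to node j."""
    n = adj.shape[0]
    out_degree = adj.sum(axis=1)
    out_degree[out_degree == 0] = 1.0  # avoid division by zero
    # Column-stochastic transition matrix: M[j, i] = adj[i, j] / L(i).
    M = (adj / out_degree[:, None]).T
    pr = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_pr = (1 - d) / n + d * (M @ pr)
        if np.abs(new_pr - pr).sum() < tol:
            return new_pr
        pr = new_pr
    return pr

# Tiny synthetic example: 4 nodes linked in a ring.
adj = np.roll(np.eye(4), 1, axis=1)
print(pagerank(adj))
```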
### Expected output

- Performance benchmarks (serial vs. parallel).
- Convergence analysis for different graph sizes.

---
## 7. Parallelized gradient descent for machine learning

### Problem statement

Implement a **parallelized gradient descent algorithm** for optimizing a machine learning model (e.g., linear regression) with $> 5\cdot 10^4$ parameters.

### Mathematical formulation

Given a loss function $L(\theta)$, update the parameters $\theta$ iteratively as:

$$
\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)
$$

where $\eta$ is the learning rate.

### Implementation steps

1. Implement a serial gradient descent using NumPy.
2. Parallelize the gradient computation using Numba or MPI.
3. Test on synthetic datasets or real-world datasets (e.g., Boston Housing).
4. Compare convergence rates and runtime.
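
A minimal serial sketch for linear regression with the mean-squared-error loss; the parallel versions would split the gradient computation over rows of $X$ (sizes and learning rate are illustrative):

```python
import numpy as np

def gradient_descent(X, y, eta=0.05, n_iter=1000):
    """Serial gradient descent for linear regression with MSE loss."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iter):
        residual = X @ theta - y
        grad = (2.0 / n) * (X.T @ residual)   # gradient of the MSE loss
        theta -= eta * grad
    return theta

# Synthetic regression problem with known ground-truth weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
true_theta = rng.normal(size=50)
y = X @ true_theta + 0.01 * rng.normal(size=10_000)

theta = gradient_descent(X, y)
print("max parameter error:", np.abs(theta - true_theta).max())
```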
### Expected output

- Performance benchmarks (serial vs. parallel).
- Convergence analysis for different dataset sizes.

---
## 8. Parallelized sparse matrix-vector multiplication

### Problem statement

Implement a **parallelized sparse matrix-vector multiplication** for large sparse matrices.

### Mathematical formulation

Given a sparse matrix $A$ and a vector $x$, compute $y = A \cdot x$.

### Implementation steps

1. Generate a large sparse matrix using `scipy.sparse`.
2. Implement a serial matrix-vector multiplication.
3. Parallelize the computation using Numba or MPI.
4. Compare performance for different matrix sizes.
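
A minimal Numba sketch of a CSR matrix-vector product parallelized over rows with `prange` (matrix size and density are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from numba import njit, prange

@njit(parallel=True)
def csr_matvec(data, indices, indptr, x, n_rows):
    """CSR sparse matrix-vector product, parallelized over rows."""
    y = np.zeros(n_rows)
    for i in prange(n_rows):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] = acc
    return y

A = sp.random(100_000, 100_000, density=1e-4, format="csr", random_state=0)
x = np.random.rand(A.shape[1])

y = csr_matvec(A.data, A.indices, A.indptr, x, A.shape[0])
print("max abs error vs. SciPy:", np.abs(y - A @ x).max())
```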
### Expected output

- Speedup plots (serial vs. parallel).
- Memory usage analysis.

---
## 9. Efficient Causal Convolutions for Time-Series Forecasting

### Problem statement

Implement efficient **causal convolutions** for time-series forecasting, optimizing for large input sequences. Causal convolutions ensure that the model only uses past information for predictions, making them suitable for tasks where future data points cannot be accessed. See [WaveNet](https://arxiv.org/abs/1609.03499) for a possible application.

### Mathematical formulation

In a **causal convolution**, the output at time $t$ depends only on inputs from time $t$ and earlier. For a 1D causal convolution with kernel $k$, the output $y_t$ is computed as:

$$
y_t = \sum_{i=0}^{d} k_i \cdot x_{t-i}
$$

where $d$ is the filter size (or receptive field) and $x_{t-i}$ are the past inputs. This ensures no information from the future is used when predicting $y_t$.

### Implementation steps

1. Implement a basic 1D/2D/3D causal convolution layer in PyTorch.
2. Extend to dilated causal convolutions to increase the receptive field without increasing the computational complexity.
3. Benchmark the model's performance on long input sequences (training time), such as time-series data or raw audio as in [WaveNet](https://arxiv.org/abs/1609.03499).
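
A minimal PyTorch sketch of a 1D causal convolution implemented by left-padding a standard `nn.Conv1d` (channel counts and kernel size are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D causal convolution: left-pad so y_t never sees inputs after time t."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size, dilation=dilation)

    def forward(self, x):
        # x: (batch, channels, time); pad only on the left (the past side).
        x = F.pad(x, (self.left_pad, 0))
        return self.conv(x)

layer = CausalConv1d(in_channels=1, out_channels=8, kernel_size=3, dilation=2)
y = layer(torch.randn(4, 1, 100))
print(y.shape)  # torch.Size([4, 8, 100]) -- same length as the input
```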
---
## 10. (Complex) Efficient LSTM/GRU Implementations

### Problem statement

Implement efficient versions of **Long Short-Term Memory (LSTM)** and **Gated Recurrent Unit (GRU)**, focusing on reducing computational cost and improving runtime performance for sequence modelling tasks.

### Mathematical formulation

- **LSTM** has the following key equations:

$$
i_t = \sigma(W_{ii}x_t + W_{hi}h_{t-1} + b_i)
$$
$$
f_t = \sigma(W_{if}x_t + W_{hf}h_{t-1} + b_f)
$$
$$
o_t = \sigma(W_{io}x_t + W_{ho}h_{t-1} + b_o)
$$
$$
g_t = \tanh(W_{ig}x_t + W_{hg}h_{t-1} + b_g)
$$
$$
c_t = f_t \cdot c_{t-1} + i_t \cdot g_t
$$
$$
h_t = o_t \cdot \tanh(c_t)
$$

- **GRU** simplifies the gates as:

$$
z_t = \sigma(W_{iz}x_t + W_{hz}h_{t-1} + b_z)
$$
$$
r_t = \sigma(W_{ir}x_t + W_{hr}h_{t-1} + b_r)
$$
$$
n_t = \tanh(W_{in}x_t + r_t \cdot (W_{hn}h_{t-1}) + b_n)
$$
$$
h_t = (1 - z_t) \cdot n_t + z_t \cdot h_{t-1}
$$
### Implementation steps

1. Implement baseline LSTM and GRU using PyTorch.
2. Optimize the LSTM/GRU by reducing the number of matrix multiplications and sharing weights, e.g., by fusing the gate projections into single vectorized operations.
3. Investigate fused operations and scripting.
4. Compare the optimized versions in terms of runtime, memory usage, and training stability. You can train on your preferred language task, or consider the experiments in [this paper](https://arxiv.org/abs/2410.01201v1) as suggestions.
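
A minimal PyTorch sketch of the fused-gate idea in step 2: all four LSTM gate pre-activations come from one input projection and one hidden projection (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class FusedLSTMCell(nn.Module):
    """LSTM cell with the four gate projections fused into single matmuls."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One projection produces the i, f, g, o pre-activations at once.
        self.x_proj = nn.Linear(input_size, 4 * hidden_size)
        self.h_proj = nn.Linear(hidden_size, 4 * hidden_size, bias=False)

    def forward(self, x, state):
        h, c = state
        gates = self.x_proj(x) + self.h_proj(h)
        i, f, g, o = gates.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

cell = FusedLSTMCell(input_size=32, hidden_size=64)
h = c = torch.zeros(8, 64)
for x_t in torch.randn(20, 8, 32):   # loop over a length-20 sequence
    h, c = cell(x_t, (h, c))
print(h.shape)  # torch.Size([8, 64])
```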
### Expected output

- Performance comparison (runtime vs. sequence length).
- Memory usage profile for different implementations.

---
## 11. Efficient minLSTM and minGRU Implementations

### Problem statement

Implement efficient **minLSTM** and **minGRU**, which are minimalistic versions of LSTM and GRU designed to reduce computational complexity while maintaining similar performance for sequence modelling.

### Mathematical formulation

- The mathematical formulations of **minLSTM** and **minGRU** are reported in [this paper](https://arxiv.org/abs/2410.01201v1).

### Implementation steps

1. Implement the minLSTM and minGRU architectures in PyTorch.
2. Optimize their implementations with a vectorized, efficient parallel scan.
3. Compare minLSTM/minGRU against standard LSTM/GRU in terms of training speed, convergence, and accuracy on datasets like sequential MNIST, time-series data, or NLP tasks.
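
A minimal PyTorch sketch of the scan in step 2 for a recurrence of the form $h_t = a_t \cdot h_{t-1} + b_t$. The vectorized version uses the cumulative-product trick as a simple stand-in for the log-space parallel scan used in the paper:

```python
import torch

def sequential_scan(a, b):
    """Reference loop: h_t = a_t * h_{t-1} + b_t, with h_0 = 0."""
    h = torch.zeros_like(b[0])
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return torch.stack(out)

def vectorized_scan(a, b):
    """Same recurrence without a Python loop, via cumulative products.

    Fine as a sketch; for long sequences the log-space scan is more stable.
    """
    A = torch.cumprod(a, dim=0)               # prod_{s<=t} a_s
    return A * torch.cumsum(b / A, dim=0)     # h_t = A_t * sum_{s<=t} b_s / A_s

T, batch, hidden = 32, 4, 16
a = torch.sigmoid(torch.randn(T, batch, hidden))   # gate values in (0, 1)
b = torch.randn(T, batch, hidden)
print(torch.allclose(sequential_scan(a, b), vectorized_scan(a, b), atol=1e-4))
```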
### Expected output

- Performance comparison (runtime vs. sequence length).
- Memory usage profile for different implementations.

---
## 12. (Regular 1-3, Complex 1-4) Implementing Structured State Space Models (S4)

### Problem statement

Implement the **Structured State Space (S4/S6) model**, which is an efficient and scalable variant of SSMs designed for modelling long-range dependencies in sequential data. Refer to [this paper](https://arxiv.org/abs/2312.00752) for an overview.

### Mathematical formulation

The S4/S6 model can be expressed as a specific type of SSM that uses an efficient parameterization of the state-space matrices to reduce computational complexity while maintaining the ability to model long-range sequences.

- **State evolution (base)**:

$$
h_t = A h_{t-1} + B x_t
$$

### Implementation steps

1. Implement the baseline version of S4 by following the mathematical formulation, using PyTorch or TensorFlow.
2. Implement optimizations such as low-rank matrix approximations and efficient convolutional operations to handle long-range dependencies.
3. Compare the performance (speed, memory, accuracy) of S4 with vanilla SSMs and other sequence models like LSTMs and Transformers.
4. (Complex) Implement the Mamba, H3, and Gated MLP architectures using S4/S6 layers. Test them on the selective copy task.
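
A minimal PyTorch sketch of the baseline recurrence in step 1, together with the usual readout $y_t = C h_t$, evaluated both sequentially and through the equivalent convolution kernel $K_i = C A^i B$ (a scalar input/output channel and a tiny state are assumed for illustration):

```python
import torch
import torch.nn.functional as F

def ssm_recurrent(A, B, C, x):
    """Sequential evaluation: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t
        ys.append(C @ h)
    return torch.stack(ys)

def ssm_convolutional(A, B, C, x):
    """Equivalent causal convolution with kernel K_i = C A^i B."""
    T = len(x)
    kernel = torch.stack([C @ torch.matrix_power(A, i) @ B for i in range(T)])
    padded = F.pad(x, (T - 1, 0)).view(1, 1, -1)      # left-pad: causality
    y = F.conv1d(padded, kernel.flip(0).view(1, 1, -1))
    return y.flatten()

N, T = 4, 64                                          # state size, seq length
A = 0.9 * torch.eye(N) + 0.01 * torch.randn(N, N)     # a stable state matrix
B, C = torch.randn(N), torch.randn(N)
x = torch.randn(T)
print(torch.allclose(ssm_recurrent(A, B, C, x),
                     ssm_convolutional(A, B, C, x), atol=1e-4))
```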
### Expected output

- Performance comparison charts (accuracy, training time, memory usage) between S4, vanilla SSMs, and other models.
- Analysis of how well the model handles long-range dependencies.
