From 2177cd70b92b9c1610e8dd68cbd0b1d44f5754ca Mon Sep 17 00:00:00 2001 From: Matthew Rocklin Date: Fri, 5 Jul 2019 10:59:48 +0100 Subject: [PATCH 01/14] add large svd draft --- _posts/2019-07-02-very-large-svds.md | 253 +++++++++++++++++++++++++++ 1 file changed, 253 insertions(+) create mode 100644 _posts/2019-07-02-very-large-svds.md diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md new file mode 100644 index 00000000..bf957d9a --- /dev/null +++ b/_posts/2019-07-02-very-large-svds.md @@ -0,0 +1,253 @@ +--- +layout: post +title: Very Large SVDs +tagline: Dask + CuPy + Zarr + Genomics +author: Matthew Rocklin +tags: [GPU, array, cupy] +draft: true +theme: twitter +--- +{% include JB/setup %} + +Summary +------- + +We perform Singular Value Decomposition (SVD) calculations on large datasets. + +We modify the computation both by using fully precise and approximate methods, +and by using both CPUs and GPUs. + +In the end we compute the approximate SVD of an 80TB dataset in a few minutes +using a mutli-GPU machine. Then we run this from a dataset stored on disk and +found that disk I/O is, predictably, a major bottleneck. + + +SVD - The simple case +--------------------- + +Dask array contains a relatively sophisticated SVD algorithm that works in the +tall-and-skinny or short-and-fat cases, but not so well in the roughly-square +case. It works by taking QR decompositions of each block of the array, +combining the R matrices somehow, doing another smaller SVD on those, and then +performing some matrix multiplies to get back to the full result. It's +numerically stable and decently fast, assumming that the intermediate R +matrices of the QR decompositions mostly fit in RAM. + +The memory constraints here are that if you have an `n` by `m` tall and +skinny array (`n >> m`) cut into `k` blocks then you need to have about `m**2 * +k` space. This is true in many cases, including typical PCA machine learning +workloads, where you have tabular data with a few columns (hundreds at most) +and many rows. + +It's easy to use and quite robust. + +```python +import dask.array as da + +x = da.random.random((10000000, 20)) +x +``` + + + + + + +
+ + + + + + + +
Array Chunk
Bytes 1.60 GB 100.00 MB
Shape (10000000, 20) (625000, 20)
Count 16 Tasks 16 Chunks
Type float64 numpy.ndarray
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 20 + 10000000 + +
+ + +```python +u, s, v = da.linalg.svd(x) +``` + +\n\n\n\n\n
\n\n \n \n \n \n \n \n
Array Chunk
Bytes 1.60 GB 100.00 MB
Shape (10000000, 20) (625000, 20)
Count 120 Tasks 16 Chunks
Type float64 numpy.ndarray
\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n\n \n \n\n \n 20\n 10000000\n\n
+\n\n\n\n\n
\n\n \n \n \n \n \n \n
Array Chunk
Bytes 1.60 GB 100.00 MB
Shape (10000000, 20) (625000, 20)
Count 120 Tasks 16 Chunks
Type float64 numpy.ndarray
\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n\n \n \n\n \n 20\n 10000000\n\n
+\n\n\n\n\n
\n\n \n \n \n \n \n \n
Array Chunk
Bytes 160 B 160 B
Shape (20,) (20,)
Count 120 Tasks 1 Chunks
Type float64 numpy.ndarray
\n
\n\n\n \n \n \n\n \n \n \n\n \n \n\n \n 20\n 1\n\n
+\n\n\n\n\n
\n\n \n \n \n \n \n \n
Array Chunk
Bytes 3.20 kB 3.20 kB
Shape (20, 20) (20, 20)
Count 120 Tasks 1 Chunks
Type float64 numpy.ndarray
\n
\n\n\n \n \n \n\n \n \n \n\n \n \n\n \n 20\n 20\n\n
+ + +This works fine in the short and fat case too (when you have far more columns +than rows) but we're always going to assume that one of your dimensions is +unchunked, and that the other dimension has chunks that are quite a bit +longer, otherwise, things might not fit into RAM. + + +Approximate SVD +--------------- + +If your dataset is large in both dimensions then the algorithm above won't work +as is. However, if you don't need exact results, or if you only need a few of +the components, then there are a number of excellent approximation algorithms. + +Dask array has one of these approximation algorithms implemented in the +[da.linalg.svd_compressed](https://docs.dask.org/en/latest/array-api.html#dask.array.linalg.svd_compressed) +function. And with it we can compute the approximate SVD of some very large +matrices, at least in theory. + +I was recently working on a problem (explained below) and found that I was +still running out of memory when dealing with this algorithm. There were two +challenges that I ran into: + +1. The algorithm requires multiple passes over the data, but the Dask task + scheduler was keeping the input matrix in memory after it had been loaded once + in order to avoid recomputation. + Things still worked, but Dask had to move the data to disk and back + repeatedly, which reduced performance significantly. + + We resolved this by including explicit persist/wait calls in the algorithm. + +2. Related chunks of data would be loaded at different times, and so would + need to stick around longer than necessary to wait for their associated + chunks. + + We resolved this by engaging task fusion as an optimization pass. + +Before diving further into the technical solution +I'm going to quickly provide the use case that was motivating this work. + + +Application - Genomics +---------------------- + +Alistair Miles from XXX motivates the problem as follows: + +TODO Alistair + +We'll simulate the data with the following random array: + + +Multiple passes over the data and Fusion +---------------------------------------- + +To stop Dask from holding onto the data we intentionally trigger computation as +we build up the graph. This is a bit atypical in Dask calculations (we prefer +to have as much of the computation at once before computing) but useful given +the multiple-pass nature of this problem. This was a fairly easy change, and +is available in [dask/dask #5041](https://github.com/dask/dask/pull/5041). + +Additionally, we found that it was helpful to turn on moderately wide task +fusion. + +```python +import dask +dask.config.set(fuse_ave_width=5) +``` + +Then things work fine +--------------------- + +I can happily perform an SVD on a terabyte array on my Macbook Pro + +```python +import dask.array as da + +x = da.random.random(size=(1_000_000, 100_000), chunks=(1_000, 20_000)) + +u, s, v = da.linalg.svd_compressed(x, k=10, compute=True) +v.compute() +``` + +This call is no longer entirely lazy, and it recomputes `x` a couple times, but +it works, and it works using only a few GB of RAM on my comsumer laptop. + +It takes around TODO time to compute that on my laptop. + + +Adding GPUs +----------- + +We can increase performance considerably by using a multi-GPU machine. +Here is almost the same code, running in roughly the same time, except for the +following changes: + +1. We increase the size of the array by a factor of TODO +2. We switch out NumPy for CuPy, a GPU NumPy implementation +3. We use an eight-GPU DGX-1 machine + +To see this run, we recommend looking at +[this point in the attached screencast](TODO) + + +Read dataset from Disk +---------------------- + +While impressive, the computation above is mostly bound by generating random +data and then performing matrix multiplies. GPUs are good at both of these +things. + +In practice though, our input array won't be randomly generated, it will be +coming from some dataset stored on disk (or some other storage system). +So to make things more realistic we generate our dataset, and then store it to +disk efficiently in the [Zarr format](https://zarr.readthedocs.io/en/stable/) +(which is what Alistair, our genomics collaborator, uses). + +```python +import dask.array as da + +x = da.random.randint(0, 4, size=..., chunks=...) +compressor=... +x.to_zarr(...) +``` + +When we rerun the computation (again on a smaller size) we find that we're +largely bound by reading from disk. + + +```python +x = da.from_zarr(...) +x = x.map_blocks(cupy.asarray) + +u, s, v = da.linalg.svd_compressed(x, k=10, compute=True, seed=rs) +v.compute() +``` + +And so we come back to a common lesson of high performance computing: + +*High performance computing isn't about doing one thing exceedingly well, it's +about doing nothing poorly*. + +In this case GPUs made our computation fast enough that we now need to focus on +other components of our system, notably disk bandwidth, and a direct reader for +Zarr data to GPU memory. From cc6ac269c5735261f4d5052b09a71013e7e5ade1 Mon Sep 17 00:00:00 2001 From: Benjamin Zaitlen Date: Tue, 5 May 2020 11:02:53 -0400 Subject: [PATCH 02/14] update genomic intro, add video on svd on a dgx2, add notebooks --- _posts/2019-07-02-very-large-svds.md | 83 ++++++++++++---------------- 1 file changed, 34 insertions(+), 49 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index bf957d9a..19d461e8 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -17,21 +17,19 @@ We perform Singular Value Decomposition (SVD) calculations on large datasets. We modify the computation both by using fully precise and approximate methods, and by using both CPUs and GPUs. -In the end we compute the approximate SVD of an 80TB dataset in a few minutes -using a mutli-GPU machine. Then we run this from a dataset stored on disk and -found that disk I/O is, predictably, a major bottleneck. +In the end we compute an approximate SVD of 200GB of simulated data and using a mutli-GPU machine in 15-20 seconds. Then we run this from a dataset stored in the cloud I/O is, predictably, a major bottleneck. SVD - The simple case --------------------- -Dask array contains a relatively sophisticated SVD algorithm that works in the +Dask arrays contain a relatively sophisticated SVD algorithm that works in the tall-and-skinny or short-and-fat cases, but not so well in the roughly-square case. It works by taking QR decompositions of each block of the array, combining the R matrices somehow, doing another smaller SVD on those, and then -performing some matrix multiplies to get back to the full result. It's -numerically stable and decently fast, assumming that the intermediate R -matrices of the QR decompositions mostly fit in RAM. +performing some matrix multiplication to get back to the full result. It's +numerically stable and decently fast, assuming that the intermediate R +matrices of the QR decompositions mostly fit in memory. The memory constraints here are that if you have an `n` by `m` tall and skinny array (`n >> m`) cut into `k` blocks then you need to have about `m**2 * @@ -102,16 +100,11 @@ x u, s, v = da.linalg.svd(x) ``` -\n\n\n\n\n
\n\n \n \n \n \n \n \n
Array Chunk
Bytes 1.60 GB 100.00 MB
Shape (10000000, 20) (625000, 20)
Count 120 Tasks 16 Chunks
Type float64 numpy.ndarray
\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n\n \n \n\n \n 20\n 10000000\n\n
-\n\n\n\n\n
\n\n \n \n \n \n \n \n
Array Chunk
Bytes 1.60 GB 100.00 MB
Shape (10000000, 20) (625000, 20)
Count 120 Tasks 16 Chunks
Type float64 numpy.ndarray
\n
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n\n \n \n \n\n \n \n\n \n 20\n 10000000\n\n
-\n\n\n\n\n
\n\n \n \n \n \n \n \n
Array Chunk
Bytes 160 B 160 B
Shape (20,) (20,)
Count 120 Tasks 1 Chunks
Type float64 numpy.ndarray
\n
\n\n\n \n \n \n\n \n \n \n\n \n \n\n \n 20\n 1\n\n
-\n\n\n\n\n
\n\n \n \n \n \n \n \n
Array Chunk
Bytes 3.20 kB 3.20 kB
Shape (20, 20) (20, 20)
Count 120 Tasks 1 Chunks
Type float64 numpy.ndarray
\n
\n\n\n \n \n \n\n \n \n \n\n \n \n\n \n 20\n 20\n\n
- This works fine in the short and fat case too (when you have far more columns than rows) but we're always going to assume that one of your dimensions is unchunked, and that the other dimension has chunks that are quite a bit -longer, otherwise, things might not fit into RAM. +longer, otherwise, things might not fit into memory. Approximate SVD @@ -151,15 +144,26 @@ I'm going to quickly provide the use case that was motivating this work. Application - Genomics ---------------------- -Alistair Miles from XXX motivates the problem as follows: +In humans, mosquitos, mice -- organisms which have two parents (diploids) -- they inherit two [alleles](https://en.wikipedia.org/wiki/Allele), +one from each parent. For those not biologist, recall the [Punnett square](https://en.wikipedia.org/wiki/Punnett_square), where +each box is filled letter from each column/row combination. + + +punnet square + + +In the above there are three types of genotypes: AA, Aa, and aa. For computational genomics, these variants are typically encoded 0, 1, or 2. -TODO Alistair +Humans have 20,000-25,000 genes and experiments across populations can vary between 10s of thousand of individuals to 10 Million individuals. For example, [biobank](https://www.ukbiobank.ac.uk/) SOMETHING ABOUT THE SCIENCE -We'll simulate the data with the following random array: +Common questions we want to answer of this data include: How do genes vary across populations ? Can we classify a set of genes for a particular trait ? [Performing an SVD](https://www.pnas.org/content/97/18/10101) helps scientists understand what is happening, not at an individual gene level, but across the entire genome. +Reducing the time computing, like all science, allows for testing more hypotheses in less time. +Practically, this means not simply a fast SVD but an accelerated pipeline end-to-end, from data loading to analysis, to understanding. -Multiple passes over the data and Fusion ----------------------------------------- + +Performant SVDs w/ Dask +----------------------- To stop Dask from holding onto the data we intentionally trigger computation as we build up the graph. This is a bit atypical in Dask calculations (we prefer @@ -178,37 +182,38 @@ dask.config.set(fuse_ave_width=5) Then things work fine --------------------- -I can happily perform an SVD on a terabyte array on my Macbook Pro +I can happily perform an SVD on a 20GB array on my Macbook Pro ```python import dask.array as da -x = da.random.random(size=(1_000_000, 100_000), chunks=(1_000, 20_000)) +x = da.random.random(size=(1_000_000, 20_000), chunks=(20_000, 5_000)) u, s, v = da.linalg.svd_compressed(x, k=10, compute=True) v.compute() ``` This call is no longer entirely lazy, and it recomputes `x` a couple times, but -it works, and it works using only a few GB of RAM on my comsumer laptop. +it works, and it works using only a few GB of RAM on my consumer laptop. -It takes around TODO time to compute that on my laptop. +It takes around 2min 30s time to compute that on my laptop. That's great! But we can do better Adding GPUs ----------- We can increase performance considerably by using a multi-GPU machine. -Here is almost the same code, running in roughly the same time, except for the +Here is almost the same code, running in significantly less same time but we make the following changes: -1. We increase the size of the array by a factor of TODO +1. We increase the size of the array by a factor of 10x 2. We switch out NumPy for CuPy, a GPU NumPy implementation -3. We use an eight-GPU DGX-1 machine +3. We use a sixteen-GPU DGX-2 machine with NVLink interconnects between GPUs (this will dramatically decrease transfer time between workers) To see this run, we recommend looking at -[this point in the attached screencast](TODO) +[this point in the attached screencast](https://youtu.be/4X5yky2lvEw) +On A DGX2 calculating an SVD takes 10-15 seconds: [SVD GPU Notebook](https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4) Read dataset from Disk ---------------------- @@ -218,30 +223,10 @@ data and then performing matrix multiplies. GPUs are good at both of these things. In practice though, our input array won't be randomly generated, it will be -coming from some dataset stored on disk (or some other storage system). -So to make things more realistic we generate our dataset, and then store it to -disk efficiently in the [Zarr format](https://zarr.readthedocs.io/en/stable/) -(which is what Alistair, our genomics collaborator, uses). - -```python -import dask.array as da +coming from some dataset stored on disk or increasingly more common, stored in the cloud. +To make things more realistic we perform a similiar calculation with data stored in a [Zarr format](https://zarr.readthedocs.io/en/stable/) (which is what Alistair, our genomics collaborator, uses) in [GCS](https://cloud.google.com/storage) -x = da.random.randint(0, 4, size=..., chunks=...) -compressor=... -x.to_zarr(...) -``` - -When we rerun the computation (again on a smaller size) we find that we're -largely bound by reading from disk. - - -```python -x = da.from_zarr(...) -x = x.map_blocks(cupy.asarray) - -u, s, v = da.linalg.svd_compressed(x, k=10, compute=True, seed=rs) -v.compute() -``` +[Zarr SVD Example](https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078) And so we come back to a common lesson of high performance computing: From 86d127b2a7a185caf817e35eec73615f853905af Mon Sep 17 00:00:00 2001 From: Benjamin Zaitlen Date: Tue, 5 May 2020 12:03:55 -0400 Subject: [PATCH 03/14] add Alistair's genomic background --- _posts/2019-07-02-very-large-svds.md | 52 +++++++++++++++++++++++----- 1 file changed, 43 insertions(+), 9 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index 19d461e8..e0272c6f 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -144,23 +144,57 @@ I'm going to quickly provide the use case that was motivating this work. Application - Genomics ---------------------- -In humans, mosquitos, mice -- organisms which have two parents (diploids) -- they inherit two [alleles](https://en.wikipedia.org/wiki/Allele), -one from each parent. For those not biologist, recall the [Punnett square](https://en.wikipedia.org/wiki/Punnett_square), where -each box is filled letter from each column/row combination. - +Many studies are using genome sequencing to study genetic variation +one from each parent. For those not biologist, recall the [Punnett square](https://en.wikipedia.org/wiki/Punnett_square), where between different individuals within a species. These includes +each box is filled letter from each column/row combination. studies of human populations, but also other species such as mice, +mosquitoes or disease-causing parasites. These studies will, in +general, find a large number of sites in the genome sequence where +individuals differ from each other. For example, humans have more +than 100 million variable sites in the genome, and modern studies +like the [UK BioBank](https://www.ukbiobank.ac.uk/) are working towards +sequencing the genomes of 1 million individuals or more. + +In diploid species like humans, mice or mosquitoes, each individual +carries two genome sequences, one inherited from each parent. At each +of the 100 million variable genome sites there will be two or more +"alleles" that a single genome might carry. One way to think about +this is via the [Punnett square](https://en.wikipedia.org/wiki/Punnett_square), which +represents the different possible genotypes that one individual might +carry at one of these variable sites punnet square -In the above there are three types of genotypes: AA, Aa, and aa. For computational genomics, these variants are typically encoded 0, 1, or 2. +In the above there are three possible genotypes: AA, Aa, and aa. For + +computational genomics, these genotypes can be encoded as 0, 1, or 2. +Humans have 20,000-25,000 genes and experiments across populations can vary between 10s of thousand of individuals to 10 Million individuals. For example, [biobank](https://www.ukbiobank.ac.uk/) SOMETHING ABOUT THE SCIENCE In a study of a species with M genetic variants assayed in N + +individual samples, we can represent these genotypes as an (N x M) +Common questions we want to answer of this data include: How do genes vary across populations ? Can we classify a set of genes for a particular trait ? [Performing an SVD](https://www.pnas.org/content/97/18/10101) helps scientists understand what is happening, not at an individual gene level, but across the entire genome. array of integers. For a modern human genetics study, the scale of -Humans have 20,000-25,000 genes and experiments across populations can vary between 10s of thousand of individuals to 10 Million individuals. For example, [biobank](https://www.ukbiobank.ac.uk/) SOMETHING ABOUT THE SCIENCE +this array might approach (100 million x 1 million). (Although in +Reducing the time computing, like all science, allows for testing more hypotheses in less time. practice, the size of the first dimension (number of variants) can be +Practically, this means not simply a fast SVD but an accelerated pipeline end-to-end, from data loading to analysis, to understanding. reduced somewhat, by at least an order of magnitude, because many -Common questions we want to answer of this data include: How do genes vary across populations ? Can we classify a set of genes for a particular trait ? [Performing an SVD](https://www.pnas.org/content/97/18/10101) helps scientists understand what is happening, not at an individual gene level, but across the entire genome. +genetic variants will carry little information and/or be correlated +with each other.) -Reducing the time computing, like all science, allows for testing more hypotheses in less time. -Practically, this means not simply a fast SVD but an accelerated pipeline end-to-end, from data loading to analysis, to understanding. +These genetic differences are not random, but carry information about +patterns of genetic similarity and shared ancestry between +individuals, because of the way they have been inherited through many +generations. A common task is to perform a dimensionality reduction +analysis on these data, such as a [principal components analysis](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190) +(SVD), to identify genetic structure reflecting these differencies in +degree of shared ancestry. This is an essential part of discovering +genetic variants associated with different diseases, and for learning +more about the genetic history of populations and species. +Reducing the time taken to compute an analysis such as SVD, like all +science, allows for exploring larger datasets and testing more +hypotheses in less time. Practically, this means not simply a fast +SVD but an accelerated pipeline end-to-end, from data loading to +analysis, to understanding. Performant SVDs w/ Dask ----------------------- From f6259de261f59aed1a05e542e66d2d32036bbb20 Mon Sep 17 00:00:00 2001 From: Benjamin Zaitlen Date: Tue, 5 May 2020 12:09:36 -0400 Subject: [PATCH 04/14] cleaned up previous commit --- _posts/2019-07-02-very-large-svds.md | 27 +++++++++++++-------------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index e0272c6f..de43dfdf 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -2,7 +2,7 @@ layout: post title: Very Large SVDs tagline: Dask + CuPy + Zarr + Genomics -author: Matthew Rocklin +author: Matthew Rocklin & Alistair Miles tags: [GPU, array, cupy] draft: true theme: twitter @@ -145,8 +145,8 @@ Application - Genomics ---------------------- Many studies are using genome sequencing to study genetic variation -one from each parent. For those not biologist, recall the [Punnett square](https://en.wikipedia.org/wiki/Punnett_square), where between different individuals within a species. These includes -each box is filled letter from each column/row combination. studies of human populations, but also other species such as mice, +between different individuals within a species. These includes +studies of human populations, but also other species such as mice, mosquitoes or disease-causing parasites. These studies will, in general, find a large number of sites in the genome sequence where individuals differ from each other. For example, humans have more @@ -158,25 +158,23 @@ In diploid species like humans, mice or mosquitoes, each individual carries two genome sequences, one inherited from each parent. At each of the 100 million variable genome sites there will be two or more "alleles" that a single genome might carry. One way to think about -this is via the [Punnett square](https://en.wikipedia.org/wiki/Punnett_square), which +this is via the [Punnett +square](https://en.wikipedia.org/wiki/Punnett_square), which represents the different possible genotypes that one individual might -carry at one of these variable sites +carry at one of these variable sites: + punnet square In the above there are three possible genotypes: AA, Aa, and aa. For - computational genomics, these genotypes can be encoded as 0, 1, or 2. -Humans have 20,000-25,000 genes and experiments across populations can vary between 10s of thousand of individuals to 10 Million individuals. For example, [biobank](https://www.ukbiobank.ac.uk/) SOMETHING ABOUT THE SCIENCE In a study of a species with M genetic variants assayed in N - +In a study of a species with M genetic variants assayed in N individual samples, we can represent these genotypes as an (N x M) -Common questions we want to answer of this data include: How do genes vary across populations ? Can we classify a set of genes for a particular trait ? [Performing an SVD](https://www.pnas.org/content/97/18/10101) helps scientists understand what is happening, not at an individual gene level, but across the entire genome. array of integers. For a modern human genetics study, the scale of - +array of integers. For a modern human genetics study, the scale of this array might approach (100 million x 1 million). (Although in -Reducing the time computing, like all science, allows for testing more hypotheses in less time. practice, the size of the first dimension (number of variants) can be -Practically, this means not simply a fast SVD but an accelerated pipeline end-to-end, from data loading to analysis, to understanding. reduced somewhat, by at least an order of magnitude, because many - +practice, the size of the first dimension (number of variants) can be +reduced somewhat, by at least an order of magnitude, because many genetic variants will carry little information and/or be correlated with each other.) @@ -184,7 +182,8 @@ These genetic differences are not random, but carry information about patterns of genetic similarity and shared ancestry between individuals, because of the way they have been inherited through many generations. A common task is to perform a dimensionality reduction -analysis on these data, such as a [principal components analysis](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190) +analysis on these data, such as a [principal components +analysis](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0020190) (SVD), to identify genetic structure reflecting these differencies in degree of shared ancestry. This is an essential part of discovering genetic variants associated with different diseases, and for learning From 8526d1a930ac5b3d1cc00ac44602a5332e8b935f Mon Sep 17 00:00:00 2001 From: Benjamin Zaitlen Date: Tue, 5 May 2020 12:20:44 -0400 Subject: [PATCH 05/14] Update _posts/2019-07-02-very-large-svds.md Co-authored-by: Alistair Miles --- _posts/2019-07-02-very-large-svds.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index de43dfdf..310109d5 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -170,7 +170,7 @@ carry at one of these variable sites: In the above there are three possible genotypes: AA, Aa, and aa. For computational genomics, these genotypes can be encoded as 0, 1, or 2. In a study of a species with M genetic variants assayed in N -individual samples, we can represent these genotypes as an (N x M) +individual samples, we can represent these genotypes as an (M x N) array of integers. For a modern human genetics study, the scale of this array might approach (100 million x 1 million). (Although in practice, the size of the first dimension (number of variants) can be From 38ca9dcaf90962a587f345185eeb3ceb25098fd0 Mon Sep 17 00:00:00 2001 From: Benjamin Zaitlen Date: Tue, 5 May 2020 12:21:26 -0400 Subject: [PATCH 06/14] checkpoint --- _posts/2019-07-02-very-large-svds.md | 23 ++++++++++++++--------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index de43dfdf..abaa23db 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -195,6 +195,8 @@ hypotheses in less time. Practically, this means not simply a fast SVD but an accelerated pipeline end-to-end, from data loading to analysis, to understanding. +*We want to run an experiment in less time than it take to make a cup of tea* + Performant SVDs w/ Dask ----------------------- @@ -232,34 +234,37 @@ it works, and it works using only a few GB of RAM on my consumer laptop. It takes around 2min 30s time to compute that on my laptop. That's great! But we can do better -Adding GPUs ------------ +Adding GPUs (a 15 second SVD) +----------------------------- We can increase performance considerably by using a multi-GPU machine. Here is almost the same code, running in significantly less same time but we make the following changes: -1. We increase the size of the array by a factor of 10x +1. We increase the size of the array by a factor of **10x** 2. We switch out NumPy for CuPy, a GPU NumPy implementation 3. We use a sixteen-GPU DGX-2 machine with NVLink interconnects between GPUs (this will dramatically decrease transfer time between workers) -To see this run, we recommend looking at -[this point in the attached screencast](https://youtu.be/4X5yky2lvEw) +On A DGX2 we can calculate an SVD on a 200GB Dask array between 10 and15 seconds: [SVD Multi-GPU Notebook](https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4) + +To see this run, we recommend viewing +[the attached screencast](https://youtu.be/4X5yky2lvEw) -On A DGX2 calculating an SVD takes 10-15 seconds: [SVD GPU Notebook](https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4) Read dataset from Disk ---------------------- While impressive, the computation above is mostly bound by generating random -data and then performing matrix multiplies. GPUs are good at both of these +data and then performing matrix calculations. GPUs are good at both of these things. In practice though, our input array won't be randomly generated, it will be coming from some dataset stored on disk or increasingly more common, stored in the cloud. -To make things more realistic we perform a similiar calculation with data stored in a [Zarr format](https://zarr.readthedocs.io/en/stable/) (which is what Alistair, our genomics collaborator, uses) in [GCS](https://cloud.google.com/storage) +To make things more realistic we perform a similar calculation with data stored in a [Zarr format](https://zarr.readthedocs.io/en/stable/) (which is what Alistair, our genomics collaborator, uses) in [GCS](https://cloud.google.com/storage) + +In this [Zarr SVD example](https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078), we load a 25GB GCS backed data set onto a DGX2, -[Zarr SVD Example](https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078) +Again, on a DGX2, from data loading to SVD we are running in time less than it take to make a cup of tea. And so we come back to a common lesson of high performance computing: From 0c3944d506dc90bd489a91e0661bd56955e4fd95 Mon Sep 17 00:00:00 2001 From: Benjamin Zaitlen Date: Tue, 5 May 2020 12:27:41 -0400 Subject: [PATCH 07/14] flush out svd with zarr and conclusion --- _posts/2019-07-02-very-large-svds.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index d2052fba..703e46bb 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -262,9 +262,9 @@ In practice though, our input array won't be randomly generated, it will be coming from some dataset stored on disk or increasingly more common, stored in the cloud. To make things more realistic we perform a similar calculation with data stored in a [Zarr format](https://zarr.readthedocs.io/en/stable/) (which is what Alistair, our genomics collaborator, uses) in [GCS](https://cloud.google.com/storage) -In this [Zarr SVD example](https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078), we load a 25GB GCS backed data set onto a DGX2, +In this [Zarr SVD example](https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078), we load a 25GB GCS backed data set onto a DGX2, run a few processing steps, then perform an SVD. The combination of preprocessing and SVD calculations ran in 18.7 sec and the data loading took 14.3 seconds. -Again, on a DGX2, from data loading to SVD we are running in time less than it take to make a cup of tea. +Again, on a DGX2, from data loading to SVD we are running in time less than it would take to make a cup of tea. However, the data loading can be accelerated. From GCS we are reading into host memory, uncompressing the zarr bits, then moving the data from host memory to device memory. This is be improved if we cut out the host entirely and read directly into the GPU. And so we come back to a common lesson of high performance computing: From 8ff0ce55eb05a899c20d9498a4fd65689470154d Mon Sep 17 00:00:00 2001 From: Matthew Rocklin Date: Tue, 5 May 2020 17:05:08 -0700 Subject: [PATCH 08/14] some minor edits --- _posts/2019-07-02-very-large-svds.md | 51 +++++++++++++++++----------- 1 file changed, 32 insertions(+), 19 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index 703e46bb..081f5e1d 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -2,7 +2,7 @@ layout: post title: Very Large SVDs tagline: Dask + CuPy + Zarr + Genomics -author: Matthew Rocklin & Alistair Miles +author: Alistair Miles (Oxford Big Data Institute), Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled) tags: [GPU, array, cupy] draft: true theme: twitter @@ -17,7 +17,9 @@ We perform Singular Value Decomposition (SVD) calculations on large datasets. We modify the computation both by using fully precise and approximate methods, and by using both CPUs and GPUs. -In the end we compute an approximate SVD of 200GB of simulated data and using a mutli-GPU machine in 15-20 seconds. Then we run this from a dataset stored in the cloud I/O is, predictably, a major bottleneck. +In the end we compute an approximate SVD of 200GB of simulated data and using a mutli-GPU machine in 15-20 seconds. +Then we run this from a dataset stored in the cloud +where we find that I/O is, predictably, a major bottleneck. SVD - The simple case @@ -26,7 +28,7 @@ SVD - The simple case Dask arrays contain a relatively sophisticated SVD algorithm that works in the tall-and-skinny or short-and-fat cases, but not so well in the roughly-square case. It works by taking QR decompositions of each block of the array, -combining the R matrices somehow, doing another smaller SVD on those, and then +combining the R matrices, doing another smaller SVD on those, and then performing some matrix multiplication to get back to the full result. It's numerically stable and decently fast, assuming that the intermediate R matrices of the QR decompositions mostly fit in memory. @@ -116,12 +118,12 @@ the components, then there are a number of excellent approximation algorithms. Dask array has one of these approximation algorithms implemented in the [da.linalg.svd_compressed](https://docs.dask.org/en/latest/array-api.html#dask.array.linalg.svd_compressed) -function. And with it we can compute the approximate SVD of some very large -matrices, at least in theory. +function. And with it we can compute the approximate SVD of very large +matrices. -I was recently working on a problem (explained below) and found that I was +We were recently working on a problem (explained below) and found that I was still running out of memory when dealing with this algorithm. There were two -challenges that I ran into: +challenges that we ran into: 1. The algorithm requires multiple passes over the data, but the Dask task scheduler was keeping the input matrix in memory after it had been loaded once @@ -129,7 +131,7 @@ challenges that I ran into: Things still worked, but Dask had to move the data to disk and back repeatedly, which reduced performance significantly. - We resolved this by including explicit persist/wait calls in the algorithm. + We resolved this by including explicit recomputation steps in the algorithm. 2. Related chunks of data would be loaded at different times, and so would need to stick around longer than necessary to wait for their associated @@ -138,7 +140,7 @@ challenges that I ran into: We resolved this by engaging task fusion as an optimization pass. Before diving further into the technical solution -I'm going to quickly provide the use case that was motivating this work. +we quickly provide the use case that was motivating this work. Application - Genomics @@ -195,11 +197,13 @@ hypotheses in less time. Practically, this means not simply a fast SVD but an accelerated pipeline end-to-end, from data loading to analysis, to understanding. -*We want to run an experiment in less time than it take to make a cup of tea* +*We want to run an experiment in less time than it takes to make a cup of tea* Performant SVDs w/ Dask ----------------------- +Now that we have that scientific background, let's transition back to talking about computation. + To stop Dask from holding onto the data we intentionally trigger computation as we build up the graph. This is a bit atypical in Dask calculations (we prefer to have as much of the computation at once before computing) but useful given @@ -211,13 +215,13 @@ fusion. ```python import dask -dask.config.set(fuse_ave_width=5) +dask.config.set({"optimization.fuse.ave-width": 5}) ``` Then things work fine --------------------- -I can happily perform an SVD on a 20GB array on my Macbook Pro +We can happily perform an SVD on a 20GB array on a Macbook Pro ```python import dask.array as da @@ -229,9 +233,9 @@ v.compute() ``` This call is no longer entirely lazy, and it recomputes `x` a couple times, but -it works, and it works using only a few GB of RAM on my consumer laptop. +it works, and it works using only a few GB of RAM on a consumer laptop. -It takes around 2min 30s time to compute that on my laptop. That's great! But we can do better +It takes around 2min 30s time to compute that on a laptop. That's great! But we can do better Adding GPUs (a 15 second SVD) @@ -260,11 +264,20 @@ things. In practice though, our input array won't be randomly generated, it will be coming from some dataset stored on disk or increasingly more common, stored in the cloud. -To make things more realistic we perform a similar calculation with data stored in a [Zarr format](https://zarr.readthedocs.io/en/stable/) (which is what Alistair, our genomics collaborator, uses) in [GCS](https://cloud.google.com/storage) - -In this [Zarr SVD example](https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078), we load a 25GB GCS backed data set onto a DGX2, run a few processing steps, then perform an SVD. The combination of preprocessing and SVD calculations ran in 18.7 sec and the data loading took 14.3 seconds. - -Again, on a DGX2, from data loading to SVD we are running in time less than it would take to make a cup of tea. However, the data loading can be accelerated. From GCS we are reading into host memory, uncompressing the zarr bits, then moving the data from host memory to device memory. This is be improved if we cut out the host entirely and read directly into the GPU. +To make things more realistic we perform a similar calculation with data +stored in a [Zarr format](https://zarr.readthedocs.io/en/stable/) +in [GCS](https://cloud.google.com/storage) + +In this [Zarr SVD example](https://gist.github.com/quasiben/e52bc740ae22ae321f30987c65998078), +we load a 25GB GCS backed data set onto a DGX2, +run a few processing steps, then perform an SVD. +The combination of preprocessing and SVD calculations ran in 18.7 sec and the data loading took 14.3 seconds. + +Again, on a DGX2, from data loading to SVD we are running in time less than it would take to make a cup of tea. +However, the data loading can be accelerated. +From GCS we are reading into host memory, uncompressing the zarr bits, +then moving the data from host memory to device memory. +This is be improved if we cut out the host entirely and read directly into the GPU. And so we come back to a common lesson of high performance computing: From 7abdf3a258c6601990b9221f9bf622b05be44994 Mon Sep 17 00:00:00 2001 From: Benjamin Zaitlen Date: Tue, 5 May 2020 22:27:12 -0400 Subject: [PATCH 09/14] update gpu section -- add disclaimer --- _posts/2019-07-02-very-large-svds.md | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index 081f5e1d..1aa9ba54 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -121,7 +121,7 @@ Dask array has one of these approximation algorithms implemented in the function. And with it we can compute the approximate SVD of very large matrices. -We were recently working on a problem (explained below) and found that I was +We were recently working on a problem (explained below) and found that we were still running out of memory when dealing with this algorithm. There were two challenges that we ran into: @@ -241,18 +241,22 @@ It takes around 2min 30s time to compute that on a laptop. That's great! But w Adding GPUs (a 15 second SVD) ----------------------------- -We can increase performance considerably by using a multi-GPU machine. -Here is almost the same code, running in significantly less same time but we make the +*Disclaimer: one of the authors (Ben Zaitlen) works for NVIDIA* + +We can dramatically increase performance by using a multi-GPU machine. NVIDIA and other manufactures now make machines with multiple GPUs co-located in the same physical box. In the following section, we will run the calculations on a **DGX2**, a machine with 16 GPUs. + + +Below is almost the same code, running in significantly less same time but we make the following changes: 1. We increase the size of the array by a factor of **10x** 2. We switch out NumPy for CuPy, a GPU NumPy implementation -3. We use a sixteen-GPU DGX-2 machine with NVLink interconnects between GPUs (this will dramatically decrease transfer time between workers) +3. We use a sixteen-GPU DGX-2 machine with NVLink interconnects between GPUs (NVLink will dramatically decrease transfer time between workers) On A DGX2 we can calculate an SVD on a 200GB Dask array between 10 and15 seconds: [SVD Multi-GPU Notebook](https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4) To see this run, we recommend viewing -[the attached screencast](https://youtu.be/4X5yky2lvEw) +[the attached screencast](https://www.youtube.com/watch?v=6hmt1gARqp0) Read dataset from Disk @@ -275,9 +279,8 @@ The combination of preprocessing and SVD calculations ran in 18.7 sec and the da Again, on a DGX2, from data loading to SVD we are running in time less than it would take to make a cup of tea. However, the data loading can be accelerated. -From GCS we are reading into host memory, uncompressing the zarr bits, -then moving the data from host memory to device memory. -This is be improved if we cut out the host entirely and read directly into the GPU. +From GCS we are reading into data into the main memory of the machine (host memory), uncompressing the zarr bits, +then moving the data from host memory to the GPU (device memory). Passing data back and forth between host and device memory can significantly decrease performance. Reading directly into the GPU, bypassing host memory, would improve the overall pipeline. And so we come back to a common lesson of high performance computing: From 1016ae77a2cbb40a4809b508f386c072f53fe396 Mon Sep 17 00:00:00 2001 From: Benjamin Zaitlen Date: Thu, 7 May 2020 16:20:24 -0400 Subject: [PATCH 10/14] change author order --- _posts/2019-07-02-very-large-svds.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index 1aa9ba54..fe7d9751 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -2,8 +2,8 @@ layout: post title: Very Large SVDs tagline: Dask + CuPy + Zarr + Genomics -author: Alistair Miles (Oxford Big Data Institute), Ben Zaitlen (NVIDIA), Matthew Rocklin (Coiled) -tags: [GPU, array, cupy] +author: Matthew Rocklin (Coiled), Ben Zaitlen (NVIDIA), Alistair Miles (Oxford Big Data Institute) +tags: [GPU, array, CuPy] draft: true theme: twitter --- From edc8994b2e1509b890e95540373e8a83f52643be Mon Sep 17 00:00:00 2001 From: Matthew Rocklin Date: Mon, 11 May 2020 11:50:31 -0700 Subject: [PATCH 11/14] add cloud, compression, and final thoughts to post --- _posts/2019-07-02-very-large-svds.md | 223 +++++++++++++++++++++++++-- 1 file changed, 213 insertions(+), 10 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index 1aa9ba54..1982bfc0 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -221,6 +221,15 @@ dask.config.set({"optimization.fuse.ave-width": 5}) Then things work fine --------------------- +We're going to try this SVD on a few different choices of hardware including: + +1. A MacBook Pro +2. A DGX-2, an NVIDIA worksation with 16 high-end GPUs and fast network +3. A twenty-node cluster on AWS + + +### Macbook Pro + We can happily perform an SVD on a 20GB array on a Macbook Pro ```python @@ -235,16 +244,21 @@ v.compute() This call is no longer entirely lazy, and it recomputes `x` a couple times, but it works, and it works using only a few GB of RAM on a consumer laptop. -It takes around 2min 30s time to compute that on a laptop. That's great! But we can do better +It takes around 2min 30s time to compute that on a laptop. +That's great! It was super easy to try out, didn't require any special +hardware or setup, and in many cases is fast enough. +By working locally we can iterate quickly. +Now that things work, we can experiment with different hardware. -Adding GPUs (a 15 second SVD) ------------------------------ -*Disclaimer: one of the authors (Ben Zaitlen) works for NVIDIA* +### Adding GPUs (a 15 second SVD) -We can dramatically increase performance by using a multi-GPU machine. NVIDIA and other manufactures now make machines with multiple GPUs co-located in the same physical box. In the following section, we will run the calculations on a **DGX2**, a machine with 16 GPUs. +*Disclaimer: one of the authors (Ben Zaitlen) works for NVIDIA* +We can dramatically increase performance by using a multi-GPU machine. +NVIDIA and other manufactures now make machines with multiple GPUs co-located in the same physical box. +In the following section, we will run the calculations on a **DGX2**, a machine with 16 GPUs and fast network connection. Below is almost the same code, running in significantly less same time but we make the following changes: @@ -253,14 +267,37 @@ following changes: 2. We switch out NumPy for CuPy, a GPU NumPy implementation 3. We use a sixteen-GPU DGX-2 machine with NVLink interconnects between GPUs (NVLink will dramatically decrease transfer time between workers) -On A DGX2 we can calculate an SVD on a 200GB Dask array between 10 and15 seconds: [SVD Multi-GPU Notebook](https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4) +On A DGX2 we can calculate an SVD on a 200GB Dask array between 10 to 15 seconds. -To see this run, we recommend viewing -[the attached screencast](https://www.youtube.com/watch?v=6hmt1gARqp0) +The [full notebook is here](https://gist.github.com/quasiben/98ee254920837313946f621e103d41f4), +but the relevant code snippets are below: +```python +# Some GPU specific setup +from dask_cuda import LocalCluster -Read dataset from Disk ----------------------- +cluster = LocalCluster(...) +client = Client(cluster) + +import cupy +import dask.array as da +rs = da.random.RandomState(RandomState=cupy.random.RandomState) + +# Create the data and run the SVD as normal +x = rs.randint(0, 3, size=(10_000_000, 20_000), + chunks=(20_000, 5_000), dtype="uint8") +x = x.persist() + +u, s, v = da.linalg.svd_compressed(x, k=10, seed=rs) +v.compute() +``` + +To see this run, we recommend viewing this screencast: + + + + +### Read dataset from Disk While impressive, the computation above is mostly bound by generating random data and then performing matrix calculations. GPUs are good at both of these @@ -290,3 +327,169 @@ about doing nothing poorly*. In this case GPUs made our computation fast enough that we now need to focus on other components of our system, notably disk bandwidth, and a direct reader for Zarr data to GPU memory. + + +### Cloud + +*Diclaimer: one of the authoer (Matthew Rocklin) works for Coiled Computing* + +We can also run this on the cloud with any number of frameworks. +In this case we used the [Coiled Cloud](https://coiled.io) service to deploy on AWS + +```python +from coiled_cloud import Cloud, Cluster +cloud = Cloud() + +cloud.create_cluster_type( + organization="friends", + name="genomics", + worker_cpu=4, + worker_memory="16 GiB", + worker_environment={ + "OMP_NUM_THREADS": 1, + "OPENBLAS_NUM_THREADS": 1, + # "EXTRA_PIP_PACKAGES": "zarr" + }, +) + +cluster = Cluster( + organization="friends", + typename="genomics", + n_workers=20, +) + +from dask.distributed import Client +client = Client(cluster) + +# then proceed as normal +``` + +Using 20 machines with a total of 80 CPU cores on a dataset that was 10x larger +than the MacBook pro example we were able to run in about the same amount of +time. This shows near optimal scaling for this problem, which is nice to see +given how complex the SVD calculation is. + +A screencast of this problem is viewable here + + + + +Compression +----------- + +One of the easiest ways for us to improve performance is to reduce the size of +this data through compression. +This data is highly compressible for two reasons: + +1. The real-world data itself has structure and repetition + (although the random play data does not) +2. We're storing entries that take on only four values. + We're using eight-bit integers when we only needed two-bit integers + +Let's solve the second problem first. + +### Compression with bit twiddling + +Ideally Numpy would have a two-bit integer datatype. +Unfortunately it doesn't, and this is hard because memory in computers is +generally thought of in full bytes. + +To work around this we can use bit arithmetic to shove four values into a single value +Here are functions that do that, assuming that our array is square, +and the last dimension is divisible by four. + +```python +import numpy as np + +def compress(x: np.ndarray) -> np.ndarray: + out = np.zeros_like(x, shape=(x.shape[0], x.shape[1] // 4)) + out += x[:, 0::4] + out += x[:, 1::4] << 2 + out += x[:, 2::4] << 4 + out += x[:, 3::4] << 6 + return out + + +def decompress(out: np.ndarray) -> np.ndarray: + back = np.zeros_like(out, shape=(out.shape[0], out.shape[1] * 4)) + back[:, 0::4] = out & 0b00000011 + back[:, 1::4] = (out & 0b00001100) >> 2 + back[:, 2::4] = (out & 0b00110000) >> 4 + back[:, 3::4] = (out & 0b11000000) >> 6 + return back +``` + +Then, we can use these functions along with Dask to store our data in a +compressed state, but lazily decompress on-demand. + +```python +x = x.map_blocks(compress).persist().map_blocks(decompress) +``` + +That's it. We compress each block our data and store that in memory. +However the output variable that we have, `x` will decompress each chunk before +we operate on it, so we don't need to worry about handling compressed blocks. + +### Compression with Zarr + +A slightly more general but probably less efficient route would be to compress +our arrays with a proper compression library like Zarr. + +The example below shows how we do this in practice. + +```python +import zarr +import numpy as np +from numcodecs import Blosc +compressor = Blosc(cname='lz4', clevel=3, shuffle=Blosc.BITSHUFFLE) + + +x = x.map_blocks(zarr.array, compressor=compressor).persist().map_blocks(np.array) +``` + +Additionally, if we're using the dask-distributed scheduler then we want to +make sure that the Blosc compression library doesn't use additional threads. +That way we don't have parallel calls of a parallel library, which can cause +some contention + +```python +def set_no_threads_blosc(): + """ Stop blosc from using multiple threads """ + import numcodecs + numcodecs.blosc.use_threads = False + +# Run on all workers +client.register_worker_plugin(set_no_threads_blosc) +``` + +This approach is more general, and probably a good trick to have up ones' +sleeve, but it also doesn't work on GPUs, which in the end is why we ended up +going with the bit-twiddling approach one section above, which uses API that +was uniformly accessible within the Numpy and CuPy libraries. + + +Final Thoughts +============== + +In this post we did a few things, all around a single important problems in +genomics. + +1. We learned a bit of science +2. We translated a science problem into a computational problem, + and in particular into a request to perform large singular value decompositions +3. We used a canned algorithm in dask.array that performed pretty well, + assuming that we're comfortable going over the array in a few passes +4. We then tried that algorithm on three architectures quickly + 1. A Macbook Pro + 2. A multi-GPU machine + 3. A cluster in the cloud +5. Finally we talked about some tricks to pack more data into the same memory + with compression + +This problem was nice in that we got to dive deep into a technical science +problem, and yet also try a bunch of architecture quickly to investigate +hardware choices that we might make in the future. + +We used several technologies here today, made by several different communities +and companies. It was great to see how they all worked together seamlessly to +provide a flexible-yet-consistent experience. From 3fb42483645aa7168a91581aa07bc3b0afdcf77e Mon Sep 17 00:00:00 2001 From: Matthew Rocklin Date: Tue, 12 May 2020 07:43:18 -0700 Subject: [PATCH 12/14] Apply suggestions from code review Co-authored-by: Benjamin Zaitlen --- _posts/2019-07-02-very-large-svds.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2019-07-02-very-large-svds.md index 1a3503ee..59c2abbd 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2019-07-02-very-large-svds.md @@ -224,7 +224,7 @@ Then things work fine We're going to try this SVD on a few different choices of hardware including: 1. A MacBook Pro -2. A DGX-2, an NVIDIA worksation with 16 high-end GPUs and fast network +2. A DGX-2, an NVIDIA worksation with 16 high-end GPUs and fast interconnect 3. A twenty-node cluster on AWS @@ -258,7 +258,7 @@ Now that things work, we can experiment with different hardware. We can dramatically increase performance by using a multi-GPU machine. NVIDIA and other manufactures now make machines with multiple GPUs co-located in the same physical box. -In the following section, we will run the calculations on a **DGX2**, a machine with 16 GPUs and fast network connection. +In the following section, we will run the calculations on a **DGX2**, a machine with 16 GPUs and fast interconnect between the GPUs. Below is almost the same code, running in significantly less same time but we make the following changes: From b6fbb516e36bd559c551190ec77e25aa32dd4712 Mon Sep 17 00:00:00 2001 From: Matthew Rocklin Date: Tue, 12 May 2020 07:58:42 -0700 Subject: [PATCH 13/14] Move to today's date --- .../{2019-07-02-very-large-svds.md => 2020-05-13-large-svds.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename _posts/{2019-07-02-very-large-svds.md => 2020-05-13-large-svds.md} (99%) diff --git a/_posts/2019-07-02-very-large-svds.md b/_posts/2020-05-13-large-svds.md similarity index 99% rename from _posts/2019-07-02-very-large-svds.md rename to _posts/2020-05-13-large-svds.md index 59c2abbd..b8c748c8 100644 --- a/_posts/2019-07-02-very-large-svds.md +++ b/_posts/2020-05-13-large-svds.md @@ -2,7 +2,7 @@ layout: post title: Very Large SVDs tagline: Dask + CuPy + Zarr + Genomics -author: Matthew Rocklin (Coiled), Ben Zaitlen (NVIDIA), Alistair Miles (Oxford Big Data Institute) +author: Matthew Rocklin (Coiled), Ben Zaitlen (NVIDIA), Alistair Miles (Oxford University) tags: [GPU, array, CuPy] draft: true theme: twitter From 049a965f7e2babcae72aab3ad98faf03e94fbfb5 Mon Sep 17 00:00:00 2001 From: Matthew Rocklin Date: Wed, 13 May 2020 07:08:03 -0700 Subject: [PATCH 14/14] push --- _posts/2020-05-13-large-svds.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/_posts/2020-05-13-large-svds.md b/_posts/2020-05-13-large-svds.md index b8c748c8..ff96f237 100644 --- a/_posts/2020-05-13-large-svds.md +++ b/_posts/2020-05-13-large-svds.md @@ -1,10 +1,9 @@ --- layout: post -title: Very Large SVDs +title: Large SVDs tagline: Dask + CuPy + Zarr + Genomics author: Matthew Rocklin (Coiled), Ben Zaitlen (NVIDIA), Alistair Miles (Oxford University) tags: [GPU, array, CuPy] -draft: true theme: twitter --- {% include JB/setup %}