Concept Summary - taken directly from references
- Spark RDD/DF
- Spark spawns a single Task for a single partition, which will run inside the executor JVM.
- A single Spark partition is located on a single machine
- A Stage is a combination of transformations that does not cause any shuffling; Spark pipelines as many narrow transformations (e.g. map, filter) as possible into one stage.
- Each stage contains as many tasks as partitions of the RDD
- Based on other resources, this seems to mean that one Spark task can contain multiple chained transformations, e.g. filter -> map (see the sketch after this list)
- Spark book author on Stack Overflow
- Says: if you use ...filter().map(), they will be executed in the same task for each partition, analogous to chaining "mappers" in MapReduce.
- RDDs are not mapped to executors. Multiple partitions from an RDD may be served by multiple tasks which may belong to multiple executors.
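
A minimal sketch of the point above (assuming spark-shell, where a SparkSession named `spark` is predefined; the partition count and numbers are made up): chained narrow transformations like filter and map are pipelined into one stage, and that stage runs one task per partition.

```scala
// Paste into spark-shell, where a SparkSession named `spark` is predefined.
// 8 partitions -> the stage triggered below consists of 8 tasks.
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

// filter and map are narrow transformations: no shuffle is needed, so both
// run back-to-back inside the same task for each partition.
val pipelined = rdd.filter(_ % 2 == 0).map(_ * 10)

println(pipelined.getNumPartitions) // 8
println(pipelined.count())          // the action triggers a single stage of 8 tasks
```

In the Spark UI this shows up as a single stage with 8 tasks; filter and map do not appear as separate stages.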
My draw.io diagram of how DataFrames and RDDs are composed of partitions, which are the unit of parallelization in Spark
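
To go with the diagram, a short sketch (same spark-shell assumption as above) showing that a DataFrame is likewise just a set of partitions, and that repartitioning changes how many tasks can run in parallel:

```scala
// Paste into spark-shell, reusing the predefined `spark` session.
val df = spark.range(0L, 1000000L)           // a Dataset split into partitions at creation time
println(df.rdd.getNumPartitions)             // default parallelism decides the initial count

val repartitioned = df.repartition(16)       // wide operation: shuffles the data into 16 partitions
println(repartitioned.rdd.getNumPartitions)  // 16 -> up to 16 tasks can run in parallel per stage
```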
References
- From my understanding, both sort merge join and shuffle hash join (1) shuffle/partition the data by the join keys, but in step (2) shuffle hash join builds a hash table per partition (from the smaller side), whereas sort merge join relies on sorting both sides and merging the sorted data. Sort merge join is preferred in most cases (see the sketch after these references)
- Databricks video explanation on Spark join optimization
- Medium articles about Spark joins
- Spark join examples 1
- Spark join examples 2
- Spark join optimization
- Sort merge join, more thorough explanation
- Shuffle Hash join, More thorough explanation
- Broadcast Hash join
- Different join strategies
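
To make the join strategies above concrete, here is a hedged sketch (Spark 3.x join-hint syntax; the `orders`/`customers` tables and their columns are invented for illustration) that steers the planner toward each strategy and prints the chosen physical plan with explain():

```scala
// Paste into spark-shell; `orders`/`customers` are made-up example tables.
import org.apache.spark.sql.functions.broadcast
import spark.implicits._

val orders    = Seq((1, "book"), (2, "pen"), (1, "lamp")).toDF("customer_id", "item")
val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")

// Sort merge join (the usual default for large equi-joins): shuffle both
// sides by customer_id, sort each partition, then merge the sorted runs.
orders.join(customers.hint("merge"), "customer_id").explain()

// Shuffle hash join: shuffle both sides, then build a hash table per
// partition from the smaller side instead of sorting it.
orders.join(customers.hint("shuffle_hash"), "customer_id").explain()

// Broadcast hash join: ship the small side to every executor so the large
// side is joined in place, with no shuffle at all.
orders.join(broadcast(customers), "customer_id").explain()
```

The hints are only requests: explain() shows the strategy the planner actually picked, and it can still fall back to sort merge join when a hint does not apply.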
