Spark Repartition & Coalesce - Explained

All data processed by Spark is stored in partitions. This post covers what partitions are, how partitioning works in Spark (PySpark), why it matters, and how you can control partitions manually with repartition and coalesce for effective distributed computing.

Introduction

Spark is a framework for parallel and distributed computing on big data. To process data in parallel, Spark splits it into smaller chunks (i.e. partitions) and distributes them across the nodes in the cluster, so that each node can work on its share of the data at the same time.
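To make the idea of partitions concrete, here is a minimal plain-Python sketch of the same pattern: split a dataset into chunks, process each chunk independently, then combine the results. It is a toy model, not how Spark is implemented. The helper names (split_into_partitions, process_partition, parallel_sum) are hypothetical, and the threads only stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element each.
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def process_partition(partition):
    # Stand-in for the work a single node would do on its partition.
    return sum(partition)


def parallel_sum(data, num_partitions):
    partitions = split_into_partitions(data, num_partitions)
    # Assumption: one worker thread per partition plays the role of a node.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_sums = list(pool.map(process_partition, partitions))
    # Combine the per-partition results into the final answer.
    return sum(partial_sums)


data = list(range(1, 101))
partitions = split_into_partitions(data, 4)
total = parallel_sum(data, 4)
```

In this sketch the number of partitions caps how much work can run at once, which is the motivation for controlling the partition count manually.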
