pyspark split load uniformly across all executors - apache-spark

I have a 5-node cluster. I am loading a 100k-record CSV file into a DataFrame using PySpark, performing some ETL operations, and writing the output to a Parquet file.
When I load the DataFrame, how can I divide the dataset uniformly across all executors so that each executor processes 20k records?

If possible, make sure that the input data is split into smaller files; that way each executor will read and process a single file.
If you can't modify the input files, you can call df.repartition(5), but keep in mind that it will cause an expensive shuffle operation.
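A minimal sketch of that suggestion, assuming a single 100k-record CSV as input (the paths and the ETL step are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uniform-load").getOrCreate()

# Read the 100k-record CSV; "input.csv" is a placeholder path.
df = spark.read.csv("input.csv", header=True)

# Spread the rows evenly over 5 partitions (roughly 20k records each).
# Note that repartition() triggers a full shuffle.
df = df.repartition(5)

# ... ETL transformations go here ...

df.write.mode("overwrite").parquet("output.parquet")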

Related

Why is spark dataframe repartition faster than coalesce when reducing number of partitions?

I have a df with 100 partitions, and before saving to HDFS as .parquet I want to reduce the number of partitions because the parquet files would be too small (<1MB).
I've added coalesce before writing:
df.coalesce(3).write.mode("append").parquet(OUTPUT_LOC)
It works but slows down the process from 2-3s per file to 10-20s per file.
When I try repartition:
df.repartition(3).write.mode("append").parquet(OUTPUT_LOC)
The process does not slow down at all, 2-3s per file.
Why? Shouldn't coalesce always be faster when reducing the number of partitions because it avoids a full shuffle?
Background:
I'm importing files from local storage into a Spark cluster and saving the resulting dataframes as Parquet files. Each file is approx 100-200MB.
Files are located on the "spark-driver" machine, and I'm running spark-submit in client deploy mode.
I'm reading files one by one in driver:
data = read_lines(file_name)
rdd = sc.parallelize(data,100)
rdd2 = rdd.flatMap(lambda j: myfunc(j))
df = rdd2.toDF(mySchema)
df.repartition(3).write.mode("append").parquet(OUTPUT_LOC)
Spark version is 3.1.1
Spark/HDFS cluster has 5 workers with 8 CPUs and 32 GB RAM each.
Each executor has 4 cores and 15 GB RAM, which makes 10 executors in total.
EDIT:
When I use coalesce(1) I get a spark.rpc.message.maxSize limit breached error, but not when I use repartition(1). Could that be a clue?
Attaching DAG visualizations: it looks like the WholeStageCodegen part is taking too long in the coalesce DAGs?
This can happen if your data is not evenly distributed. When you call coalesce, it reduces the number of partitions by combining small partitions in order to avoid a full shuffle, but there can still be data skew in one of the partitions, and that single partition ends up taking most of the time.
When you use repartition, the data gets distributed almost evenly across all partitions because it does a full shuffle, so all tasks finish at roughly the same time.
You could use the Spark UI to see what is happening in terms of tasks when you run the coalesce version, and check whether any single task runs for a long time.
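A quick way to check for the skew described above, assuming the df from the question (this collects per-partition record counts, so it is expensive on large data):
# Record counts per partition; a very uneven list under coalesce() indicates skew.
print(df.coalesce(3).rdd.glom().map(len).collect())
print(df.repartition(3).rdd.glom().map(len).collect())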

Huge Multiline Json file is being processed by single Executor

I have a huge JSON file, 35-40 GB in size. It is a MULTILINE JSON on HDFS. I am reading it with PySpark using .option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50).
I have bumped up to 60 executors with 16 cores and 16 GB memory each, and set the memory overhead parameters.
On every run the executors were being lost.
It works perfectly for smaller files, but not for files > 15 GB.
I have enough cluster resources.
From the Spark UI, what I have seen is that every time, the data is processed by a single executor while all other executors are idle.
I have seen the stages (0/2) and tasks (0/51).
I have re-partitioned the data as well.
Code:
df = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json').repartition(50)
df.count()
df.rdd.glom().map(len).collect()
df.write.... (HDFSLOCATION, format='csv')
Goal: My goal is to apply UDF function on each of the column and clean the data and write to CSV format.
Size of dataframe is 8 million rows with 210 columns
As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only one file (MULTILINE_JSONFILE_.json), so Spark will use one CPU to process the following code
spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')
even if you have 16 cores.
I would recommend that you split the JSON file into many files.
More precisely, parallelism is based on the number of file blocks when the files are stored on HDFS. If MULTILINE_JSONFILE_.json is 40 GB, it has roughly 320 blocks with a 128 MB block size, so Spark tasks should run in parallel if the file is located on HDFS. If you are stuck without parallelism, I think this is because option("multiline", "true") is specified.
In the Databricks documentation, you can see the following sentence:
Files will be loaded as a whole entity and cannot be split.
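A small sketch of the difference, assuming the data can be converted to JSON Lines (one object per line); 'jsonlines_dir/' is a placeholder for a converted copy of the data:
# Multiline JSON: the whole file is read by a single task because it cannot be split.
df_single = spark.read.option('multiline', 'true').json('MULTILINE_JSONFILE_.json')

# JSON Lines is splittable, so each HDFS block can be read by a separate task.
df_parallel = spark.read.json('jsonlines_dir/')
print(df_parallel.rdd.getNumPartitions())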

Does Spark distribute a dataframe across nodes internally?

I am trying to use Spark to process a CSV file on a cluster. I want to understand whether I need to explicitly read the file on each of the worker nodes to do the processing in parallel, or whether the driver node reads the file and distributes the data across the cluster for processing internally. (I am working with Spark 2.3.2 and Python.)
I know RDDs can be parallelized using SparkContext.parallelize(), but what about Spark DataFrames?
if __name__ == "__main__":
    spark = SparkSession.builder.appName('myApp').getOrCreate()
    df = spark.read.csv('dataFile.csv', header=True)
    df = df.filter("date>'2010-12-01' AND date<='2010-12-02' AND town=='Madrid'")
So if I run the above code on a cluster, will the entire operation be done by the driver node, or will Spark distribute df across the cluster and have each worker process its own data partition?
To be strict, if you run the above code it will not read or process any data. DataFrames are basically an abstraction implemented on top of RDDs. As with RDDs, you have to distinguish transformations and actions. As your code only consists of one filter(...) transformation, nothing will happen in terms of reading or processing data. Spark will only create the DataFrame, which is an execution plan. You have to perform an action like count() or write.csv(...) to actually trigger processing of the CSV file.
If you do so, the data will then be read and processed by 1..n worker nodes. It is never read or processed by the driver node. How many of your worker nodes are actually involved depends -- in your code -- on the number of partitions of your source file. Each partition of the source file can be processed in parallel by one worker node. In your example it is probably a single CSV file, so when you call df.rdd.getNumPartitions() after you read the file, it should return 1. Hence, only one worker node will read the data. The same is true if you check the number of partitions after your filter(...) operation.
Here are two ways in which the processing of your single CSV file can be parallelized; a short sketch of the first one follows below:
You can manually repartition your source DataFrame by calling df.repartition(n), with n the number of partitions you want to have. But -- and this is a significant but -- this means that all data is potentially sent over the network (aka shuffle)!
You perform aggregations or joins on the DataFrame. These operations have to trigger a shuffle. Spark then uses the number of partitions specified in spark.sql.shuffle.partitions (default: 200) to partition the resulting DataFrame.
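A short sketch of the first option, assuming the single-file example from the question (the partition count of 8 and the output path are arbitrary placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myApp').getOrCreate()
df = spark.read.csv('dataFile.csv', header=True)

# A small single CSV file typically comes back as one partition.
print(df.rdd.getNumPartitions())

# Repartition explicitly (this triggers a shuffle) so several workers can share the data.
df = df.repartition(8)

df = df.filter("date>'2010-12-01' AND date<='2010-12-02' AND town=='Madrid'")

# Nothing is read or processed until an action such as this write is called.
df.write.mode('overwrite').csv('filtered_output')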

Breaking lineage of an RDD without relying on HDFS

I'm running a Spark application on Amazon spot instances. At the end, I export my results to Parquet files on S3. The tasks are memory intensive, so I have to run the initial calculations using a large number of partitions (hundreds of thousands). In the end, I would like to coalesce the partitions into a few large partitions and save them to big Parquet files. This is where I get into trouble:
- If I'm using .coalesce(), which is a narrow transformation, the entire lineage that precedes the coalesce will be executed on a small number of partitions, which will cause OOMs.
- If I'm using .repartition(), I rely on HDFS for the shuffle files.
This is a problem when using spot instances, which may be decommissioned, leaving corrupt/missing HDFS blocks.
- Checkpointing also relies on HDFS, so I can't use that.
- Converting to a DataFrame and back didn't actually break the lineage (rdd.toDF.rdd; am I missing something?).
To conclude, I'm looking for a way to coalesce to a smaller number of partitions only when persisting the data to S3 - I would like the calculation itself to happen using the original partitions.
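To make the trade-off concrete, here is a rough sketch of the pattern described above; heavy_transform, the partition counts, and the S3 paths are placeholders, and the two writes are shown as alternatives rather than a proposed fix:
from pyspark.sql import Row

rdd = sc.textFile('s3://bucket/input/', minPartitions=200000)
rows = rdd.map(lambda line: Row(value=heavy_transform(line)))

# Alternative A: coalesce() is narrow, so the map above also runs on only 10 partitions -> OOM risk.
rows.toDF().coalesce(10).write.parquet('s3://bucket/output/')

# Alternative B: repartition() keeps the upstream work on the original partitions,
# but its shuffle files can be lost when spot instances are decommissioned.
rows.toDF().repartition(10).write.parquet('s3://bucket/output/')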

Location of HadoopPartition

I have a dataset in a CSV file that occupies two blocks in HDFS and is replicated on two nodes, A and B. Each node has a copy of the dataset.
When Spark starts processing the data, I have seen two ways in which Spark loads the dataset as input. It either loads the entire dataset into memory on one node and performs most of the tasks on it, or it loads the dataset onto two nodes and spreads the tasks across both nodes (based on what I observed on the history server). In both cases, there is sufficient capacity to keep the whole dataset in memory.
I repeated the same experiment multiple times and Spark seemed to alternate between these two ways. Supposedly Spark inherits the input split locations as in a MapReduce job. From my understanding, MapReduce should be able to take advantage of both nodes. I don't understand why Spark or MapReduce would alternate between the two cases.
When only one node is used for processing, the performance is worse.
When you're loading the data in Spark you can specify the minimum number of splits, and this will force Spark to load the data on multiple machines (with the textFile API you would add minPartitions=2 to your call).
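For example, with the RDD API (the path is a placeholder):
# Ask Spark for at least two input splits so both nodes can read the file in parallel.
rdd = sc.textFile('hdfs:///path/to/dataset.csv', minPartitions=2)
print(rdd.getNumPartitions())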
