Partitioning in Spark 3.1 with Java - apache-spark

I am using Spark 3.1 with Java. In my code I write the final result dataset to GCP Cloud Storage, and because the dataset is large it creates multiple files. I am running the Spark job on a GCP Dataproc cluster configured with 250 worker nodes (each has 8 vCPUs). The Spark command is configured to run 2 executors per node and 3 cores per executor. When the Spark job is triggered, the YARN ResourceManager shows only 25% of the worker cores per node being allocated to containers. I also configured the shuffle partition count as 5500 (spark.sql.shuffle.partitions=5500), and I used
mydataset.coalesce(4500) to reduce the number of result files created in Cloud Storage. But it creates 5499 files for one dataset, which has nearly 45000 rows, and 3500 files for another dataset, which has nearly 85000 rows. It's really confusing on what basis it decides how to partition the files. Can't I control that? Is there a default value? If yes, can I get that default value in Java code?
Thanks in Advance
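For reference, a minimal Java sketch of the two points in question, with hypothetical bucket paths, app name, and Parquet as a stand-in output format: the effective (or default, 200) value of spark.sql.shuffle.partitions can be read from the session's runtime config, and repartition(n), unlike coalesce(n), shuffles into exactly n partitions, which makes the number of output files predictable.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class OutputFileControl {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("output-file-control")   // hypothetical app name
                    .getOrCreate();

            // Read the effective shuffle-partition setting (200 if never overridden).
            String shufflePartitions = spark.conf().get("spark.sql.shuffle.partitions", "200");
            System.out.println("spark.sql.shuffle.partitions = " + shufflePartitions);

            Dataset<Row> mydataset = spark.read().parquet("gs://my-bucket/input/"); // hypothetical path

            // coalesce(n) only merges the partitions the dataset already has, so the
            // resulting file count can be anything up to n. repartition(n) shuffles
            // into exactly n partitions, so the output file count is predictable.
            mydataset.repartition(4500)
                    .write()
                    .mode("overwrite")
                    .parquet("gs://my-bucket/output/"); // hypothetical path
        }
    }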

Related

Spark based processing of data stored on SSD

We are currently using a Spark 2.1 based application which analyses and processes a huge number of records to generate stats used for report generation. We are using 150 executors, 2 cores per executor and 10 GB per executor for our Spark jobs, and the size of the data is ~3 TB stored in Parquet format. Processing 12 months of data takes ~15 minutes.
Now, to improve performance, we want to try full SSD-based nodes to store the data in HDFS. The question is: are there any special configurations/optimisations to be done for SSDs? Are there any studies of Spark processing performance on SSD-based HDFS vs HDD-based HDFS?
http://spark.apache.org/docs/latest/hardware-provisioning.html#local-disks
SPARK_LOCAL_DIRS is the config that you need to change.
https://www.slideshare.net/databricks/optimizing-apache-spark-throughput-using-intel-optane-and-intel-memory-drive-technology-with-ravikanth-durgavajhala
The use case there is the K-means algorithm, but it should still help.
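If it helps, here is a minimal sketch of pointing Spark's scratch space at SSD mounts; the mount points and app name are hypothetical. Note that when the SPARK_LOCAL_DIRS (standalone) or LOCAL_DIRS (YARN) environment variables are set, they override spark.local.dir.

    import org.apache.spark.sql.SparkSession;

    public class SsdScratchDirs {
        public static void main(String[] args) {
            // Shuffle and spill files go to these directories, so putting them on
            // SSD is the main Spark-side change for SSD-backed nodes.
            SparkSession spark = SparkSession.builder()
                    .appName("ssd-local-dirs")                        // hypothetical app name
                    .config("spark.local.dir", "/mnt/ssd1,/mnt/ssd2") // hypothetical SSD mount points
                    .getOrCreate();

            // ... job logic ...

            spark.stop();
        }
    }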

Spark Sql Job optimization

I have a job which consists of around 9 SQL statements to pull data from Hive and write back to a Hive DB. It currently runs for 3 hours, which seems too long considering Spark's ability to process data. The application launches a total of 11 stages.
I did some analysis using the Spark UI and found the grey areas below that can be improved:
Stage 8 in Job 5 has shuffle output of 1.5 TB.
The time gap between job 4 and job 5 is 20 minutes. I read about this time gap and found that Spark performs I/O outside of the Spark jobs, which shows up as the gap between two jobs and can be seen in the driver logs.
We have a cluster of 800 nodes with restricted resources for each queue, and I am using the conf below to submit the job:
--num-executors 200
--executor-cores 1
--executor-memory 6G
--deploy-mode client
Attaching an image of the UI as well.
Now my questions are:
Where can I find the driver log for this job?
In the image, I see a long list of "Executor added" entries which sum to more than 200, but in the Executors tab the number is exactly 200. Any explanation for this?
Out of all the stages, only one stage has around 35000 tasks, while the rest of the stages have only 200 tasks. Should I increase the number of executors, or should I go for Spark's dynamic allocation facility?
Below are some thoughts that may guide you to some extent:
Is it necessary to have one core per executor? Executors do not always need to be fat, but you can have more cores in one executor; it is a trade-off between creating slim vs. fat executors.
Configure the shuffle partition parameter spark.sql.shuffle.partitions to match your shuffle volume.
Ensure that while reading data from Hive you are using a SparkSession with Hive support (essentially the old HiveContext). This pulls the data into Spark memory from HDFS and the schema information from the Hive Metastore.
Yes, dynamic allocation of resources is a feature that helps in allocating the right set of resources, and it is better than a fixed allocation. A sketch of wiring these settings together is shown below.
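A rough Java sketch of combining these three suggestions, with hypothetical table names and an illustrative shuffle-partition value; the same config keys can equally be passed as --conf flags to spark-submit.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class HiveJobTuning {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hive-sql-job")                           // hypothetical app name
                    .enableHiveSupport()                               // schema comes from the Hive Metastore
                    .config("spark.sql.shuffle.partitions", "2000")    // illustrative value; tune to shuffle volume
                    .config("spark.dynamicAllocation.enabled", "true") // scale executors with load
                    .config("spark.shuffle.service.enabled", "true")   // external shuffle service, needed for dynamic allocation on YARN
                    .getOrCreate();

            // Reading through the Hive-enabled SparkSession pulls data from HDFS
            // and the table schema from the Metastore.
            Dataset<Row> orders = spark.sql("SELECT * FROM mydb.orders"); // hypothetical table
            orders.createOrReplaceTempView("orders_tmp");

            spark.sql("INSERT OVERWRITE TABLE mydb.orders_agg "           // hypothetical target table
                    + "SELECT order_date, count(*) FROM orders_tmp GROUP BY order_date");

            spark.stop();
        }
    }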

unable to launch more tasks in spark cluster

I have a 6-node cluster with 8 cores and 32 GB RAM each. I am reading a simple CSV file from Azure Blob Storage and writing it to a Hive table.
When the job runs I see only a single task getting launched and a single executor working, while all the other executors and instances sit idle/dead.
How can I increase the number of tasks so the job can run faster?
Any help appreciated.
I'm guessing that your CSV file is in one block. Therefore your data is on only one partition, and since Spark "only" creates one task per partition, you only have one.
You can call repartition(X) on your dataframe/RDD just after reading it to increase the number of partitions. Reading won't be faster, but all your transformations and the writing will be parallelized, as in the sketch below.
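A minimal sketch of that idea in the Java API, with a hypothetical blob path and Hive table, and 48 partitions (6 nodes x 8 cores) as an illustrative target:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CsvRepartitionExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("csv-to-hive") // hypothetical app name
                    .enableHiveSupport()
                    .getOrCreate();

            Dataset<Row> csv = spark.read()
                    .option("header", "true")
                    .csv("wasbs://container@account.blob.core.windows.net/data/input.csv"); // hypothetical path

            // The single-block CSV arrives as one partition; repartitioning spreads the
            // rows across the cluster so the transformations and the write run in parallel.
            Dataset<Row> distributed = csv.repartition(48); // e.g. 6 nodes * 8 cores

            distributed.write().mode("overwrite").saveAsTable("mydb.my_table"); // hypothetical table
        }
    }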

How many partitions when reading parquet data from Spark

I'm using Spark 1.6.0 and the DataFrame API for reading partitioned Parquet data.
I'm wondering how many partitions will be used.
Here are some figures on my data:
2182 files
196 partitions
2 GB
It seems that Spark uses 2182 partitions, because when I perform a count the job is split into 2182 tasks.
That seems to be confirmed by df.rdd.partitions.length.
Is that correct? In all cases?
If yes, is it too high regarding the volume of data (i.e. should I use df.repartition to reduce it)?
Yes, you can use the repartition method to reduce the number of tasks so that it is in balance with the available resources. You also need to define the number of executors per node, the number of nodes, and the memory per node while submitting the app, so that the tasks execute in parallel and utilise the maximum resources.
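A small Java sketch against the Spark 1.6 API (SQLContext/DataFrame), with a hypothetical input path and an illustrative target partition count; coalesce is used here because it avoids a full shuffle when only reducing the partition count.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class ParquetPartitionCheck {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("parquet-partition-check"); // hypothetical app name
            JavaSparkContext sc = new JavaSparkContext(conf);
            SQLContext sqlContext = new SQLContext(sc);

            DataFrame df = sqlContext.read().parquet("/data/partitioned_table"); // hypothetical path

            // Java equivalent of df.rdd.partitions.length: one task per partition.
            System.out.println("partitions after read: " + df.rdd().partitions().length);

            // ~2 GB of data rarely needs 2182 partitions; coalesce merges them
            // without triggering a full shuffle.
            DataFrame reduced = df.coalesce(32); // illustrative target
            System.out.println("partitions after coalesce: " + reduced.rdd().partitions().length);

            sc.stop();
        }
    }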

How does Apache Spark assign partition-ids to its executors

I have a long-running Spark Streaming job which uses 16 executors with only one core each.
I use the default partitioner (HashPartitioner) to distribute the data equally across 16 partitions. Inside the updateStateByKey function, I checked the partition id from TaskContext.getPartitionId() across multiple batches and found that the partition id handled by an executor is quite consistent, but it still changes to another id after a long run.
I'm planning to do some optimization on Spark's updateStateByKey API, but it can't be achieved if the partition id keeps changing between batches.
So when does Spark change the partition id of an executor?
Most probably, a task has failed and been restarted, so the TaskContext has changed, and so has the partition id seen on that executor.
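If it helps to verify this, here is a small batch-mode probe (not the streaming job itself; the sample data and app name are hypothetical) that logs which executor handles which partition. The same calls can be dropped inside an updateStateByKey function; the mapping should only change when a task is rescheduled onto a different executor.

    import java.util.Arrays;

    import org.apache.spark.SparkEnv;
    import org.apache.spark.TaskContext;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;

    public class PartitionPlacementProbe {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("partition-placement-probe") // hypothetical app name
                    .getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            JavaRDD<Integer> data = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);

            // Prints to each executor's stdout log, showing where every partition ran.
            data.foreachPartition(rows -> {
                TaskContext tc = TaskContext.get();
                System.out.println("partition " + tc.partitionId()
                        + " attempt " + tc.attemptNumber()
                        + " on executor " + SparkEnv.get().executorId());
            });

            spark.stop();
        }
    }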
