Spark: repartition and number of tasks (skipped) - apache-spark

I am repartitioning the data frame after reading the data from ORC.
Available cores: 6
df = spark.read.orc("filePath")
df.rdd.getNumPartitions()
This gives 12 partitions, which is expected since the job runs locally (cores * 2, in my case 6 * 2 = 12).
Now I increase the number of partitions:
df = df.repartition(50)
df.rdd.getNumPartitions() ---- returns 50 partitions
When observed in the Spark UI, the job is still executing 12 tasks, whereas the 50-task stage was skipped.
How do I tell Spark to use 50 tasks instead of the 12 default tasks?
Even after forcing a repartition to 50, why is Spark still using 12 tasks and not 50? Could someone please help me here?
As seen in the Spark UI screenshot below.
[Spark UI screenshot]

I grant you that reading the Spark UI is not always clear; I do not fully understand it at times either. I looked at a similar case on Databricks Community Edition, but it is not a fair comparison as that is a driver with 8 cores. You also do not state your Spark version.
Nonetheless: repartition(n) performs round-robin partitioning. The source is 12 partitions and the target 50 partitions, so the 12 tasks are fine in all ways: you need to go from 12 to 50 partitions, from the existing partitions to the new ones. 50 + 12 = 62; the skipping, I think, is the previous partitions not yet returned to the system.
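As a minimal check you could run in isolation (same placeholder path as in the question), the count() below is only there to trigger an action so the post-shuffle stage shows up in the UI with its own task count:
df = spark.read.orc("filePath")
print(df.rdd.getNumPartitions())    # 12 in the question, driven by the input
df50 = df.repartition(50)
print(df50.rdd.getNumPartitions())  # 50
df50.count()                        # action: inspect the task count per stage in the Spark UI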
Few responses; SO is less popular now, I believe.

Related

Do multiple Spark sessions querying the same partition of a Hadoop table make the query slower?

I have multiple Spark sessions on different driver nodes querying the same HDFS file, let's say table T1. Below is the T1 structure:
Partitioned by date
Then partitioned by the first 2 digits of the user id
User id field
Other related information of the user
partition date | partition first 2 digit | User id | other info
17             | 01                      | 01234   | ...
18             | 01                      | 01234   | ...
18             | 02                      | 02345   | ...
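For illustration, a Hive-style partitioned write along these lines could produce that layout; the output path and file format are assumptions, and the column names follow the query further below:
(df.write
   .partitionBy("partition_date", "partition_first_2_digit")
   .mode("append")
   .parquet("hdfs:///warehouse/T1"))   # assumed path; yields directories like .../partition_date=18/partition_first_2_digit=02/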
Now I tried two scenarios.
Create 8 Spark sessions on different driver nodes (using cluster mode) and concurrently run the exact same query on the same partition and user id. Every session acquires all the resources limited in the code; for instance, each session takes
1 driver / 1 CPU / 4 GB
4 executors / 1 CPU per executor / 4 GB per executor
And below is the sample query.
SELECT * FROM T1
WHERE user_id = '01234' AND partition_date IN ('17', '18') AND partition_first_2_digit IN ('01', '02')
Create only 1 Spark session and query using the same string as above. This session also acquires all the resources limited in the code, which is equal to the multiple-sessions case.
1 driver / 1 CPU / 4 GB
4 executors / 1 CPU per executor / 4 GB per executor
The results surprised me: the query time per application in the multiple-session case is 10 minutes, which is much higher than the single session at 5 minutes!
I am curious why multiple Spark sessions make the query much slower. Does anyone have the same problem?
Thanks in advance everyone!
In theory the performance should be the same in both cases, but there are other factors that may be influencing the scheduling of the jobs. From my experience, you could be hitting any of these constraints:
The job pool in which you are firing your Spark jobs has fewer resources than 8 instances of your job require. Let's say each instance/run requires 4 executors; running 8 instances of the job means you are requesting 32 executors in total. If your job queue is configured to support only 20 executors (executors can be limited by the number of cores, the memory allocated to the queue, or both), your 8 jobs cannot achieve maximum parallelism and hence appear slow compared to a single instance running at the same time, which gets all the resources it needs and achieves maximum parallelism.
Your cluster is running at full capacity and using most of its resources for other prod jobs, so your 8 instances of the same job are all waiting for resources at the same time.
If you ran a single test for 8 instances, there could also have been a write job happening on the parent directory or Hive table, which could have put your jobs into a wait state.
These are some of the problems I have seen in my career; there could be other factors as well. I would suggest getting access to the YARN scheduler or queues from your sysadmin and checking whether one of these is the culprit.
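For reference, a per-session setup matching the resources you describe would look roughly like this in PySpark; the app name and queue name are placeholders, but it makes the arithmetic visible: 8 such sessions ask YARN for 8 drivers plus 32 executors at once.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("t1-query")                    # placeholder name
         .config("spark.driver.cores", "1")
         .config("spark.driver.memory", "4g")
         .config("spark.executor.instances", "4")
         .config("spark.executor.cores", "1")
         .config("spark.executor.memory", "4g")
         .config("spark.yarn.queue", "default")  # placeholder; check this queue's capacity in the YARN scheduler
         .getOrCreate())
If the queue those 8 submissions land in cannot hold 8 drivers plus 32 executors at once, some sessions simply wait, which would match the 10-minute vs 5-minute observation.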

Spark creating a number of partitions in an RDD greater than the data size

I am a noob and learning PySpark now. My question about RDDs is: what happens when we try to create more partitions than there are data elements? E.g.,
data = sc.parallelize(range(5), numSlices=8)
I understand the intention of partitions is to use the CPU cores of a cluster effectively, and that making partitions too small incurs more scheduling overhead than benefit from distributed computing. What I am curious about is: does Spark still create 8 partitions here, or optimize it down to the number of cores? If it does create 8 partitions, is the data replicated in each partition?
My question about RDD is what happens when we try to create more
partitions than the data size
You can easily see how many partitions a given RDD has by using data.getNumPartitions. I tried creating the RDD you mentioned, ran this command, and it shows me there are 8 partitions. 4 partitions had one number each and the remaining 4 were empty.
If it's creating 8 partitions then there is data replication in each
partition?
You can try the following code and check the executor output to see how many records are in each partition. Note the first print statement in the code below. The API requires the function to return an iterator, so I return each element multiplied by 2 (materialized with toList first, since reading the iterator's length would otherwise consume it).
data.mapPartitionsWithIndex((x, y) => {val rows = y.toList; println(s"partitions $x has ${rows.length} records"); rows.map(a => a * 2).iterator}).collect.foreach(println)
I got the following output for the above code:
partitions 0 has 0 records
partitions 1 has 1 records
partitions 2 has 0 records
partitions 3 has 1 records
partitions 4 has 0 records
partitions 5 has 1 records
partitions 6 has 0 records
partitions 7 has 1 records
I am curious about is does spark still create 8 partitions here or
optimize it to the number of cores?
The number of partitions defines how much data you want Spark to process in one task. If there are 8 partitions and 4 virtual cores, Spark will start running 4 tasks (corresponding to 4 partitions) at once. As these tasks finish, it will schedule the remaining ones on those cores.
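If you want to check the same thing from PySpark rather than Scala, glom() is a quick way to see each partition's contents; this is just a sketch using the RDD from the question (note range(5) has five elements):
data = sc.parallelize(range(5), numSlices=8)
print(data.getNumPartitions())                   # 8
print([len(p) for p in data.glom().collect()])   # typically five slices with one element each, three empty
# no element appears in more than one partition, so there is no replication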

AWS Glue Spark job does not scale when partitioning DataFrame

I am developing a Glue Spark job script using a Glue development endpoint which has 4 DPUs allocated. According to the Glue documentation, 1 DPU equals 2 executors and each executor can run 4 tasks; 1 DPU is reserved for the master and 1 executor is for the driver. So with 4 DPUs on my development endpoint I expect to have 5 executors and 20 tasks.
The script I am developing loads 1 million rows using a JDBC connection. Then I coalesce the one-million-row partition into 5 partitions and write it to an S3 bucket using the option maxRecordsPerFile = 100000. The whole process takes 34 seconds. Then I change the number of partitions to 10 and the job again runs for 34 seconds. So if I have 20 tasks available, why does the script take the same amount of time to complete with more partitions?
Edit: I started executing the script as an actual job, not on the development endpoint. I set the number of workers to 10 and the worker type to standard. Looking at the metrics I can see that I have only 9 executors instead of 17, and only 1 executor is doing anything while the rest are idle.
Code:
...
df = (spark.read.format("jdbc").option("driver", job_config["jdbcDriver"])
      .option("url", jdbc_config["url"]).option("user", jdbc_config["user"])
      .option("password", jdbc_config["password"]).option("dbtable", query)
      .option("fetchSize", 50000).load())
df.coalesce(17)  # note: coalesce returns a new DataFrame; the result is not assigned here
(df.write.mode("overwrite").format("csv").option("compression", "gzip")
 .option("maxRecordsPerFile", 1000000).save(job_config["s3Path"]))
...
This is very likely a limitation of the connections being opened to your JDBC data source: too few connections reduce parallelism, while too many may burden your database. Increase the degree of parallelism by tuning the options here.
Since you are reading as a data frame, you can set the upper and lower bounds and the partition column. More can be found here.
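A rough sketch of such a bounded read; the partition column name and the bounds are assumptions you would replace with a numeric column and range from your own table:
df = (spark.read.format("jdbc")
      .option("driver", job_config["jdbcDriver"])
      .option("url", jdbc_config["url"])
      .option("user", jdbc_config["user"])
      .option("password", jdbc_config["password"])
      .option("dbtable", query)
      .option("partitionColumn", "id")   # assumed numeric column
      .option("lowerBound", 1)
      .option("upperBound", 1000000)     # roughly the row count mentioned in the question
      .option("numPartitions", 17)       # up to 17 parallel connections / read tasks
      .load())
Each partition then issues its own bounded query, so the read itself becomes parallel instead of landing in a single task.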
To size your DPUs correctly, I would suggest linking the Spark UI; it can help narrow down where all the time is spent and show the actual distribution of your tasks when you look at the DAG.

Why is Drill fastest with one partition?

My cluster has 6 nodes, each with 2 cores. I have a Spark job saving a ~150 MB Parquet file to HDFS. If I repartition my dataframe to 6 partitions before saving, Drill queries are actually 30-40% slower than when I repartition it to 1 partition. Why is that? Is it expected? Can it indicate an issue with my setup?
Update
Results of the same SQL query in seconds (3 runs per number of partitions):
1 partition: 1.238, 1.29, 1.404
2 partitions: 1.286, 1.175, 1.259
3 partitions: 1.699, 1.8, 1.7
6 partitions: 2.223, 1.96, 1.772
12 partitions: 1.311, 1.335, 1.339
24 partitions: 1.261, 1.302, 1.235
48 partitions: 1.664, 1.757, 2.133
As you can see 1, 2, 12 and 24 partitions are fast. 3, 6 and 48 partitions are very clearly slower. What could be causing that?
When saving the Parquet file in Spark using a single partition, you save the file locally to a partition on a single node. Once that happens, replication kicks in and distributes the file over the different nodes.
When saving the Parquet file in Spark using multiple partitions, Spark saves the file already distributed, though maybe not exactly the way HDFS needs it. Replication and redistribution still need to kick in, but now in a much more complex situation.
Then, depending on your Spark process, your data might already be sorted differently (1 vs. multiple partitions), potentially making it more suitable for the next process in line (Drill).
I really cannot pinpoint a reason, but with such a small difference in time (you are talking seconds), I am not sure the difference is even distinctive enough.
We might also need to question the test method: Java garbage collection, background processes running (including the replication process), and so on.
One suggestion I would make is to leave your HDFS cluster at rest for a while, to make sure replication and other processes quiet down, before you start with the Drill measurements.
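For what it is worth, a sketch of how the test files could be generated so the runs stay comparable; the HDFS path is a placeholder and df is assumed to be the ~150 MB dataframe from the question:
for n in (1, 2, 3, 6, 12, 24, 48):
    df.repartition(n).write.mode("overwrite").parquet("hdfs:///bench/parquet_%d" % n)
# run the same Drill query several times against each path and compare medians,
# ideally after letting HDFS replication settle down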

Spark 12 GB data load with Window function Performance issue

I am using Spark SQL to transform 12 GB of data. My transformation applies a row_number function partitioned by one of the fields, then divides the data into two sets: the first set where row number is 1, and a second set with the rest of the data. It then writes the data to the target location in 30 partitions.
My job currently takes approximately 1 hour. I want to run it in less than 10 minutes.
I am running this job on a 3-node cluster with specs (16 cores & 32 GB RAM).
Node 1: YARN master node.
Node 2: two executors, 1 for the driver and 1 other.
Node 3: two executors, both for processing.
Each executor is assigned 5 cores and 10 GB of memory.
Is my hardware enough, or do I need more powerful hardware?
Is the executor configuration right?
If both the hardware and the configuration are good, then I definitely need to improve my code.
My code is as follows.
from pyspark.sql import SQLContext, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, desc, lit, when

sqlContext = SQLContext(sc)
SalesDf = sqlContext.read.options(header='true').load(path, format='csv')
SalesDf.cache()
SalesDf_Version = SalesDf.withColumn('row_number', F.row_number().over(Window.partitionBy("id").orderBy(desc("recorddate"))))
initialLoad = SalesDf_Version.withColumn('partition', SalesDf_Version.year).withColumn('isActive', when(col('row_number') == 1, lit('Y')).when(col('row_number') != 1, lit('N')))
initialLoad = initialLoad.withColumn('version_flag', col('isActive')).withColumn('partition', col('city'))
initialLoad = initialLoad.drop('row_number')
initialLoad.coalesce(1).write.partitionBy('isActive', 'partition').option("header", "true").mode('overwrite').csv(path + 'Temp/target/')
initialLoad.coalesce(1).write.partitionBy('isActive', 'partition').option("header", "true").mode('overwrite').csv(path + 'Temp/target/')
sc.stop()
Thanks in advance for your help
You have a coalesce(1) before writing; what is the reason for that? Coalesce reduces the parallelization of that stage, which in your case will cause the windowing query to run on 1 core, so you're losing the benefit of the 16 cores per node.
Remove the coalesce and that should start improving things.
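Purely as a sketch of that suggestion (the repartition(50) figure is taken from the follow-up answer below; the column and path names are the ones from the question):
(initialLoad
 .repartition(50)
 .write.partitionBy('isActive', 'partition')
 .option("header", "true")
 .mode('overwrite')
 .csv(path + 'Temp/target/'))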
The following are the changes we implemented to improve the performance of our code.
We removed coalesce and used repartition(50). We tried higher and lower numbers in the brackets, but 50 was the optimal number in our case.
We were using S3 as our target, but it was costing us a lot because of the rename step in Spark, so we used HDFS instead, and our job time was reduced to half of what it was before.
Overall, with the above changes our code ran in 12 minutes; previously it was 50 minutes.
Thanks
Ammar
