Spark 12 GB data load with Window function Performance issue - apache-spark

I am using Spark SQL to transform 12 GB of data. My transformation applies the row_number function with a partition by on one of the fields, then divides the data into two sets: the first set where the row number is 1, and the second set containing the rest of the data. It then writes the data to the target location in 30 partitions.
My job currently takes approximately 1 hour. I want to run it in less than 10 minutes.
I am running this job on a 3-node cluster with these specs: 16 cores and 32 GB RAM.
Node 1: YARN master node.
Node 2: two executors (1 driver and 1 other).
Node 3: two executors, both for processing.
Each executor is assigned 5 cores and 10 GB memory.
Is my hardware enough, or do I need more powerful hardware?
Is the executor configuration right?
If both hardware and configuration are good, then I definitely need to improve my code.
My code is as follows.
from pyspark.sql import SQLContext, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, desc, lit, when

sqlContext = SQLContext(sc)
SalesDf = sqlContext.read.options(header='true').load(path, format='csv')
SalesDf.cache()
SalesDf_Version = SalesDf.withColumn('row_number', F.row_number().over(Window.partitionBy("id").orderBy(desc("recorddate"))))
initialLoad = SalesDf_Version.withColumn('partition', SalesDf_Version.year).withColumn('isActive', when(col('row_number') == 1, lit('Y')).when(col('row_number') != 1, lit('N')))
initialLoad = initialLoad.withColumn('version_flag', col('isActive')).withColumn('partition', col('city'))
initialLoad = initialLoad.drop('row_number')
initialLoad.coalesce(1).write.partitionBy('isActive','partition').option("header", "true").mode('overwrite').csv(path +'Temp/target/')
initialLoad.coalesce(1).write.partitionBy('isActive','partition').option("header", "true").mode('overwrite').csv(path +'Temp/target/')
sc.stop()
Thanks in advance for your help

You have a coalesce(1) before writing; what is the reason for that? coalesce reduces the parallelism of that stage, which in your case will cause the windowing query to run on 1 core, so you're losing the benefit of the 16 cores per node.
Remove the coalesce and that should start improving things.
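As a rough sketch of what the job could look like without coalesce(1): the function name and the num_partitions default are assumptions for illustration, not the asker's final code. It also swaps the second when(...) for otherwise(...) and drops the intermediate 'partition' assignment from year, since that column is overwritten with city anyway. The PySpark imports sit inside the function only so that the snippet can be loaded on a machine without Spark installed.

```python
def flag_latest_and_write(sales_df, target_path, num_partitions=30):
    """Flag the newest record per id as active and write the result in parallel.

    sales_df is expected to have the columns id, recorddate and city,
    as in the question. num_partitions is a tuning knob, not a known optimum.
    """
    # Imported here so the module loads even where PySpark is not installed.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Newest record first within each id.
    latest_first = Window.partitionBy("id").orderBy(F.desc("recorddate"))

    flagged = (
        sales_df
        .withColumn("row_number", F.row_number().over(latest_first))
        .withColumn(
            "isActive",
            F.when(F.col("row_number") == 1, F.lit("Y")).otherwise(F.lit("N")),
        )
        .withColumn("version_flag", F.col("isActive"))
        .withColumn("partition", F.col("city"))
        .drop("row_number")
    )

    # repartition keeps several tasks busy during the write,
    # where coalesce(1) would squeeze the whole stage into one task.
    (
        flagged
        .repartition(num_partitions)
        .write.partitionBy("isActive", "partition")
        .option("header", "true")
        .mode("overwrite")
        .csv(target_path)
    )
```

Note that with partitionBy on the writer, each output directory can receive up to num_partitions files, so the value is worth tuning against the number of distinct cities.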

The following changes improved the performance of our code.
We removed coalesce and used repartition(50). We tried higher and lower numbers, but 50 was the best value in our case.
We were using S3 as our target, but it was costing us a lot because of the rename step in Spark, so we used HDFS instead and our job time was reduced to half of what it was before.
Overall, with the above changes our code ran in 12 minutes; previously it took 50 minutes.
Thanks
Ammar

Related

decide no of partition in spark (running on YARN) based on executer ,cores and memory

How do I decide the number of partitions in Spark (running on YARN) based on executors, cores and memory?
As I am new to Spark, I don't have much hands-on experience with real scenarios.
I know there are many things to consider when deciding the number of partitions, but a detailed explanation of a general production scenario would be very helpful.
Thanks in advance
One important parameter for parallel collections is the number of
partitions to cut the dataset into. Spark will run one task for each
partition of the cluster. Typically you want 2-4 partitions for each
CPU in your cluster
The recommended number of partitions is 2 to 4 times the number of cores.
So if you have 7 executors with 5 cores each, you can repartition to between 7*5*2 = 70 and 7*5*4 = 140 partitions.
https://spark.apache.org/docs/latest/rdd-programming-guide.html
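The arithmetic above can be written as a small helper. This is only a sketch of the 2-4x rule of thumb; the function name and parameters are made up for illustration.

```python
def partition_range(executors, cores_per_executor, low=2, high=4):
    """Return (min, max) partition counts using the 2-4 partitions per core rule."""
    total_cores = executors * cores_per_executor
    return total_cores * low, total_cores * high


# 7 executors with 5 cores each, as in the example above.
print(partition_range(7, 5))  # (70, 140)
```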
IMO, with Spark 3.0 and AWS EMR 2.4.x with adaptive query execution, you're often better off letting Spark handle it. If you do want to hand-tune it, the answer can often be complicated. One good option is to have 2 or 4 times as many partitions as CPUs available. While this is useful for most data sizes, it becomes problematic with very large and very small datasets. In those cases it's useful to aim for ~128 MB per partition.
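For the size-based approach, a minimal sketch of the calculation, assuming the input size is known up front. The helper name is hypothetical and the 128 MB target is just the figure mentioned above, not a hard limit.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def partitions_for_size(total_bytes, target_bytes=128 * MIB):
    """Number of partitions needed so each holds roughly target_bytes of data."""
    return max(1, math.ceil(total_bytes / target_bytes))


# A 12 GiB dataset at ~128 MiB per partition.
print(partitions_for_size(12 * GIB))  # 96
```

The result could then be passed to repartition, or compared against the core-based estimate to see which constraint dominates.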

Spark Job Internals

I tried looking through the various posts but did not get an answer. Let's say my Spark job has 1000 input partitions but I only have 8 executor cores. The job has 2 stages. Can someone help me understand exactly how Spark processes this? If you can help answer the questions below, I'd really appreciate it.
As there are only 8 executor cores, will Spark process Stage 1 of my job 8 partitions at a time?
If the above is true, after the first set of 8 partitions is processed, where is this data stored while Spark is running the second set of 8 partitions?
If I don't have any wide transformations, will this cause a spill to disk?
For a Spark job, what is the optimal file size? I mean, is Spark better at processing 1 MB files with 1000 Spark partitions or, say, 10 MB files with 100 Spark partitions?
Sorry, if these questions are vague. This is not a real use case but as I am learning about spark I am trying to understand the internal details of how the different partitions get processed.
Thank You!
Spark will run all tasks for the first stage before starting the second. This does not mean that it will start 8 partitions, wait for them all to complete, and then start another 8 partitions. Instead, each time an executor core finishes a partition, it starts another partition from the first stage until all partitions of the first stage have been started. Spark then waits until all tasks in the first stage are complete before starting the second stage.
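To make the scheduling behaviour described above concrete, here is a toy simulation in plain Python. It is not Spark's scheduler, just a greedy model in which each core picks up the next pending task as soon as it is free; task durations are invented.

```python
import heapq


def simulate_stage(task_durations, cores):
    """Greedy model of one stage: a free core immediately takes the next task.

    Returns (stage_finish_time, start_time_of_each_task).
    """
    free_at = [0.0] * cores  # time at which each core next becomes free
    heapq.heapify(free_at)
    start_times = []
    for duration in task_durations:
        start = heapq.heappop(free_at)  # earliest available core
        start_times.append(start)
        heapq.heappush(free_at, start + duration)
    return max(free_at), start_times


# 1000 equal tasks on 8 cores: 125 rounds' worth of work.
finish, _ = simulate_stage([1.0] * 1000, 8)
print(finish)  # 125.0

# Uneven tasks on 2 cores: the short tasks do not wait for the long one.
finish, starts = simulate_stage([4.0, 1.0, 1.0, 1.0], 2)
print(finish, starts)  # 4.0 [0.0, 0.0, 1.0, 2.0]
```

In the second example a strict "8 at a time, then wait" model would take longer, because the short tasks would be held back until the long one finished. The second stage would begin only at the returned finish time.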
The data is stored in memory or, if not enough memory is available, spilled to disk on the executor. Whether a spill happens depends on exactly how much memory is available and how much intermediate data results.
The optimal file size varies and is best measured, but some key factors to consider:
The total number of files limits total parallelism, so it should be greater than the number of cores.
The amount of memory used processing a partition should be less than the amount available to the executor. (~4 GB for AWS Glue)
There is overhead per file read, so you don't want too many small files.
I would be inclined towards 10 MB files or larger if you only have 8 cores.

AWS Glue Spark job does not scale when partitioning DataFrame

I am developing a Glue Spark job script using a Glue development endpoint which has 4 DPUs allocated. According to the Glue documentation, 1 DPU equals 2 executors and each executor can run 4 tasks. 1 DPU is reserved for the master and 1 executor is for the driver. So with 4 DPUs on my development endpoint, I expect to have 5 executors and 20 tasks.
The script I am developing loads 1 million rows using a JDBC connection. Then I coalesce the one-million-row partition into 5 partitions and write it to an S3 bucket using the option maxRecordsPerFile = 100000. The whole process takes 34 seconds. Then I change the number of partitions to 10 and the job runs for 34 seconds again. So if I have 20 tasks available, why is the script taking the same amount of time to complete with more partitions?
Edit: I started executing the script with an actual job, not the development endpoint. I set the number of workers to 10 and the worker type to Standard. Looking at the metrics I can see that I have only 9 executors instead of 17, and only 1 executor is doing something while the rest are idle.
Code:
...
df = (spark.read.format("jdbc")
      .option("driver", job_config["jdbcDriver"]).option("url", jdbc_config["url"])
      .option("user", jdbc_config["user"]).option("password", jdbc_config["password"])
      .option("dbtable", query).option("fetchSize", 50000)
      .load())
df.coalesce(17)
(df.write.mode("overwrite").format("csv")
   .option("compression", "gzip").option("maxRecordsPerFile", 1000000)
   .save(job_config["s3Path"]))
...
This is most likely a limitation of the number of connections being opened to your JDBC data source: too few connections reduce parallelism, while too many may burden your database. Increase the degree of parallelism by tuning the options here.
Since you are reading as a data frame, you can set the upper and lower bounds and the partition column. More can be found here.
To size your DPUs correctly, I would suggest linking the Spark UI; it could help narrow down where all the time is spent and show the actual distribution of your tasks when you look at the DAG.
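A sketch of the partitioned JDBC read suggested above, using Spark's partitionColumn, lowerBound, upperBound and numPartitions options. The helper names, the column name "id" and the bounds in the usage comment are assumptions for illustration; the partition column must be a numeric, date or timestamp column in your table.

```python
def partitioned_jdbc_options(url, table, column, lower, upper,
                             num_partitions, fetch_size=50000):
    """Build JDBC reader options that split the read into parallel range queries."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,             # column used to split the read
        "lowerBound": str(lower),              # bounds set the stride only,
        "upperBound": str(upper),              # they do not filter rows
        "numPartitions": str(num_partitions),  # also caps concurrent connections
        "fetchsize": str(fetch_size),
    }


def read_partitioned(spark, credentials, **kwargs):
    """Read a table through JDBC using the options above.

    spark is an existing SparkSession; credentials is a dict holding
    the driver, user and password options.
    """
    options = partitioned_jdbc_options(**kwargs)
    options.update(credentials)
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical usage, assuming an integer column "id" ranging over 1..1000000:
# df = read_partitioned(spark, creds, url=jdbc_url, table=query, column="id",
#                       lower=1, upper=1000000, num_partitions=10)
```

Separately, DataFrames are immutable, so a call such as df.coalesce(17) on its own line has no effect; the returned DataFrame has to be assigned or chained into the write.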

Spark: understanding partitioning - cores

I'd like to understand partitioning in Spark.
I am running spark in local mode on windows 10.
My laptop has 2 physical cores and 4 logical cores.
1/ Terminology : to me, a core in spark = a thread. So a core in Spark is different than a physical core, right? A Spark core is associated to a task, right?
If so, since you need a thread for a partition, if my sparksql dataframe has 4 partitions, it needs 4 threads right?
2/ If I have 4 logical cores, does it mean that I can only run 4 concurrent threads at the same time on my laptop? So 4 in Spark?
3/ Setting the number of partitions : how to choose the number of partitions of my dataframe, so that further transformations and actions run as fast as possible?
-Should it have 4 partitions since my laptop has 4 logical cores?
-Is the number of partitions related to physical cores or logical cores?
-In the Spark documentation, it's written that you need 2-3 tasks per CPU. Since I have two physical cores, should the number of partitions be equal to 4 or 6?
(I know that number of partitions will not have much effect on local mode, but this is just to understand)
There's no such thing as a "spark core". If you are referring to options like --executor-cores then yes, that refers to how many tasks each executor will run concurrently.
You can set the number of concurrent tasks to whatever you want, but more than the number of logical cores you have probably won't give an advantage.
The number of partitions to use is situational. Without knowing the data or the transformations you are doing, it's hard to give a number. Typical advice is to use just below a multiple of your total cores. For example, if you have 16 cores, numbers such as 47, 79 and 127, which sit just under a multiple of 16, are good to use. The reason is that you want all cores to be working (with resources idle, waiting for others to finish, for as little time as possible), while leaving a little extra room for speculative execution (Spark may decide to run the same task twice if it is running slowly, to see if it will go faster on a second try).
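The "just below a multiple" numbers quoted above come from simple arithmetic, sketched here. The helper name and the choice of multipliers are only for illustration.

```python
def just_below_multiples(total_cores, multipliers=(3, 5, 8)):
    """Partition counts that sit one below a multiple of the core count."""
    return [total_cores * m - 1 for m in multipliers]


print(just_below_multiples(16))  # [47, 79, 127]
```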
Picking the number is a bit of trial and error, though. Take advantage of the Spark job server to monitor how your tasks are running. Having few tasks with many records each means you should probably increase the number of partitions. On the other hand, many partitions with only a few records each is also bad, and you should try to reduce the partitioning in these cases.

Processing time for my Spark program does not decrease on increasing the number of nodes in the cluster

I have a Cloudera cluster with 3 nodes on which Apache Spark is installed. I am running a Spark program which reads data from HBase tables, transforms the data and stores it in a different HBase table. With 3 nodes the time taken is approximately 1 minute 10 seconds for 5 million rows of HBase data. On decreasing or increasing the number of nodes, the time taken stayed about the same, whereas I expected it to fall after increasing the number of nodes and rise after decreasing it. Below are the times taken:
1) With 3 nodes: Approximately 1 minute 10 seconds for 5 million rows.
2) With 1 node: Approximately 1 minute 10 seconds for 5 million rows.
3) With 6 nodes: Approximately 1 minute 10 seconds for 5 million rows.
What can be the reason for same time taken despite increasing or decreasing the number of nodes?
Thank You.
By default, HBase will probably read the 5 million rows from a single region or maybe 2 regions (the degree of parallelism). The write will go to a single region or maybe 2, based on the scale of the data.
Is Spark your bottleneck? Allocating more or fewer resources (cores or memory) will only change the overall job time if the computation in the job is the bottleneck.
If your computation (the transform) is relatively simple, the bottleneck might be reading from HBase or writing to HBase. In that case, irrespective of how many nodes/cores you give it, the run time will be constant.
From the runtimes you have mentioned, it seems that's the issue.
The bottleneck may be on the HBase side, the Spark side, or both. On the HBase side, check the number of region servers serving your tables; this determines the read and write parallelism of the data. The more the better, usually. You should also watch out for hotspotting.
The Spark-side parallelism can be checked via the number of RDD partitions for your data. Maybe you should repartition your data. In addition, cluster resource utilization may be your problem. To check this, you can monitor the Spark master web interface: number of nodes, number of workers per node, number of jobs and tasks per worker, etc. You should also check CPU and RAM usage per worker in this interface.
For details here
