Processing time for my Spark program does not decrease on increasing the number of nodes in the cluster - apache-spark

I have a Cloudera cluster with 3 nodes on which Apache Spark is installed. I am running a Spark program which reads data from HBase tables, transforms the data and stores it in a different HBase table. With 3 nodes the time taken is approximately 1 minute 10 seconds for 5 million rows of HBase data. On decreasing or increasing the number of nodes, the time taken stayed about the same, whereas I expected it to decrease after increasing the number of nodes and to increase after decreasing the number of nodes. Below are the times taken:
1) With 3 nodes: Approximately 1 minute 10 seconds for 5 million rows.
2) With 1 node: Approximately 1 minute 10 seconds for 5 million rows.
3) With 6 nodes: Approximately 1 minute 10 seconds for 5 million rows.
What can be the reason for the same time being taken despite increasing or decreasing the number of nodes?
Thank You.

By default, HBase will probably read the 5 million rows from a single region, or maybe 2 regions, and the region count determines the degree of parallelism. The write will likewise go to a single region, or maybe 2, depending on the scale of the data.
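To make the point about regions concrete, here is a minimal sketch. It assumes the common case where Spark gets one input partition per HBase region, so the number of read tasks running at the same time cannot exceed the region count. The function name, the 2-region table and the 4 cores per node are illustrative assumptions, not values from the question.

```python
import math


def read_waves(num_regions, total_cores):
    """Return (concurrent_tasks, waves) for a scan with one task per region.

    Assumption: one Spark task per HBase region, one core per task.
    """
    concurrent = min(num_regions, total_cores)
    waves = math.ceil(num_regions / concurrent)
    return concurrent, waves


# Hypothetical setup: the table lives in 2 regions, each node offers 4 cores.
for nodes in (1, 3, 6):
    concurrent, waves = read_waves(num_regions=2, total_cores=nodes * 4)
    print(f"{nodes} node(s): {concurrent} concurrent read task(s), {waves} wave(s)")
```

Under these assumptions the number of concurrent read tasks is the same for 1, 3 and 6 nodes, which would match the constant runtimes reported in the question.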

Is Spark your bottleneck? Allocating more or fewer resources (cores or memory) will only change the overall job time if the computation in the job is the bottleneck.
If your computation (the transform) is relatively simple, the bottleneck might be reading from HBase or writing to HBase. In that case, the run time will stay constant irrespective of how many nodes or cores you give it.
From the runtimes you have mentioned, that seems to be the issue.
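The reasoning above can be sketched as a simple model in which the HBase I/O time is fixed and only the transformation is divided across nodes. The 66 s / 4 s split of the observed 70 s is a made-up assumption for illustration; the real split would have to be measured.

```python
def estimated_runtime(io_seconds, compute_seconds, nodes):
    """Fixed I/O time plus compute time divided evenly across nodes.

    Assumption: I/O does not speed up with more nodes, compute scales perfectly.
    """
    return io_seconds + compute_seconds / nodes


# Hypothetical split: 66 s of HBase I/O, 4 s of transformation on a single node.
for nodes in (1, 3, 6):
    print(nodes, "node(s):", round(estimated_runtime(66, 4, nodes), 1), "s")
```

With an I/O share this large, going from 1 to 6 nodes changes the estimate by only a few seconds, which is easy to miss in an "approximately 1 minute 10 seconds" measurement.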

The bottleneck may be on the HBase side, the Spark side, or both. On the HBase side, check the number of regions and region servers for your tables, since this determines the read and write parallelism of the data. More is usually better. You should also watch out for hotspotting.
On the Spark side, parallelism can be checked through the number of RDD partitions for your data. Maybe you should repartition your data. In addition, cluster resource utilization may be your problem. To check this, you can monitor the Spark master web interface: number of nodes, number of workers per node, number of jobs and tasks per worker, and so on. You should also check the CPU count and the amount of RAM used per worker in this interface.
For details here

Related

AWS Glue Spark job does not scale when partitioning DataFrame

I am developing a Glue Spark job script using a Glue development endpoint which has 4 DPUs allocated. According to the Glue documentation, 1 DPU equals 2 executors and each executor can run 4 tasks. 1 DPU is reserved for the master and 1 executor is for the driver. So when my development endpoint has 4 DPUs, I expect to have 5 executors and 20 tasks.
The script I am developing loads 1 million rows using JDBC connection. Then I coalesce the one million row partition into 5 partitions and write it to S3 bucket using the option maxRecordsPerFile = 100000. The whole process takes 34 seconds. Then I change the number of partitions to 10 and the job runs for 34 seconds again. So if I have 20 tasks available why is the script taking the same amount of time to complete with more partitions?
Edit: I started executing the script with an actual job, not development endpoint. I set the amount of workers to 10 and worker type to standard. Looking at metrics I can see that I have only 9 executors instead of 17 and only 1 executor is doing something and the rest are idle.
Code:
...
df = spark.read.format("jdbc").option("driver", job_config["jdbcDriver"]).option("url", jdbc_config["url"]).option(
"user", jdbc_config["user"]).option("password", jdbc_config["password"]).option("dbtable", query).option("fetchSize", 50000).load()
# coalesce() returns a new DataFrame; without assigning the result the call has no effect
df = df.coalesce(17)
df.write.mode("overwrite").format("csv").option(
"compression", "gzip").option("maxRecordsPerFile", 1000000).save(job_config["s3Path"])
...
This is most likely a limitation of the number of connections being opened to your JDBC data source: too few connections reduce parallelism, while too many may burden your database. Increase the degree of parallelism by tuning the options here.
Since you are reading as a DataFrame, you can set the upper and lower bounds and the partition column. More can be found here.
To size your DPUs correctly, I would suggest enabling the Spark UI; it could help narrow down where all the time is spent and show the actual distribution of your tasks when you look at the DAG.
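As a rough illustration of what the bounds and the partition column do: Spark's JDBC reader accepts the options `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions`, and turns them into one range query per partition, each on its own connection. The sketch below is a simplified approximation of that splitting, not Spark's actual implementation; the column name `id` and the bounds are hypothetical.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE predicate per partition (simplified illustration)."""
    if num_partitions <= 1 or lower_bound >= upper_bound:
        return [None]  # a single, unpartitioned query
    stride = max(1, (upper_bound - lower_bound) // num_partitions)
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        if i == 0:
            # The first range is open-ended below and also collects NULL keys.
            predicates.append(f"{column} < {current + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # The last range is open-ended above.
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates


# Hypothetical numeric key "id" ranging over roughly 1 million rows, 5 partitions.
for predicate in jdbc_partition_predicates("id", 0, 1_000_000, 5):
    print(predicate)
```

Without these options the read is a single query, so the whole table arrives in one partition on one executor, which would be consistent with only 1 executor doing any work.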

Does the number of partitions of all joined tables affect the maximum number of executors?

I have a Spark SQL like
select ...
from A
join B on A.k = B.k
join C on A.k = C.k
A has 2k partitions; B has 7 partitions; while C is not partitioned.
I set the maximum number of dynamically allocated executors to 50. However, the application got fewer than 20 executors. When two stages run in parallel, one stage uses 5 executors, while the other uses 6.
Should I increase the partition numbers of B and C to parallelize the query more?
Definitely, uneven partitioning has an impact:
Less concurrency: you are not taking advantage of parallelism. There could be worker nodes sitting idle.
Data skew and improper resource utilization: your data might be skewed toward one partition, so one worker might be doing more than the others and might run into resource issues.
There is a trade-off in the partition count: with too many partitions, task scheduling may take more time than the actual execution.
You should usually have between 100 and 10K partitions, depending on cluster size and data.
Lower bound: 2 x the number of cores in the cluster available to the application.
Upper bound: a task should take 100+ ms to execute. If it takes less time than that, your partitioned data is too small and your application might be spending more time scheduling tasks than running them.
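The two bounds above can be turned into a small calculation. This is a sketch under stated assumptions: the 4 cores per executor and the 10 minutes of total single-core work are invented for illustration, and the function name is hypothetical.

```python
def partition_count_bounds(total_cores, estimated_total_task_ms, min_task_ms=100):
    """Return (lower, upper) partition counts from the two rules of thumb.

    lower: 2 x the cores available to the application.
    upper: the largest count that still leaves each task at least min_task_ms of work.
    """
    lower = 2 * total_cores
    upper = max(lower, estimated_total_task_ms // min_task_ms)
    return lower, upper


# Hypothetical: 50 executors x 4 cores, and a stage needing 10 minutes of single-core work.
lower, upper = partition_count_bounds(total_cores=50 * 4, estimated_total_task_ms=10 * 60 * 1000)
print(f"between {lower} and {upper} partitions")
```

Under these assumptions, a table with 7 partitions or a single partition falls far below the lower bound, so the stages that read it cannot keep more than a handful of executors busy.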

Spark 12 GB data load with Window function Performance issue

I am using Spark SQL to transform 12 GB of data. My transformation applies the row_number function with a partition by on one of the fields, then divides the data into two sets: the first set where the row number is 1, and the second set with the rest of the data. It then writes the data to the target location in 30 partitions.
My job currently takes approximately 1 hour. I want to run it in less than 10 minutes.
I am running this job on a 3-node cluster with specs (16 cores and 32 GB RAM).
Node 1: YARN master node.
Node 2: two executors, 1 for the driver and 1 other.
Node 3: two executors, both for processing.
Each executor is assigned 5 cores and 10 GB memory.
Is my hardware enough, or do I need more powerful hardware?
Is the executor configuration right?
If both the hardware and the configuration are good, then I definitely need to improve my code.
My code is as follows.
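from pyspark.sql import SQLContext, Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, desc, lit, when
sqlContext=SQLContext(sc)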
SalesDf = sqlContext.read.options(header='true').load(path, format='csv')
SalesDf.cache()
SalesDf_Version=SalesDf.withColumn('row_number',F.row_number().over(Window.partitionBy("id").orderBy(desc("recorddate"))))
initialLoad = SalesDf_Version.withColumn('partition',SalesDf_Version.year).withColumn('isActive', when(col('row_number') == 1, lit('Y')).when(col('row_number') != 1, lit('N')))
initialLoad = initialLoad.withColumn('version_flag',col ('isActive')).withColumn('partition',col('city'))
initialLoad = initialLoad.drop('row_number')
initialLoad.coalesce(1).write.partitionBy('isActive','partition').option("header", "true").mode('overwrite').csv(path +'Temp/target/')
initialLoad.coalesce(1).write.partitionBy('isActive','partition').option("header", "true").mode('overwrite').csv(path +'Temp/target/')
sc.stop()
Thanks in advance for your help
You have a coalesce(1) before writing, what is the reason for that? Coalesce reduces the parallelization of that stage which in your case will cause the windowing query to run on 1 core, so you're losing the benefit of the 16 cores per node.
Remove the coalesce and that should start improving things.
The following changes improved the performance of our code.
We removed coalesce and used repartition(50). We tried higher and lower numbers, but 50 was the best value in our case.
We were using S3 as our target, but it was costing us a lot because of the way Spark renames output files on S3, so we used HDFS instead and our job time was cut in half.
Overall, with the above changes our code ran in 12 minutes; previously it took 50 minutes.
Thanks
Ammar

Hadoop on Azure - file processing on larger number of nodes takes the same amount of time

I ran a wordcount program in python on HDInsight clusters of different size and every time it took the same amount of time. The file size is 600 MB and I ran it on 2, 4 and 8 nodes - every time the same amount of time (not to the second but very close).
I expected the time to change since the file is processed by larger number of nodes as the cluster grows in size... I am wondering if this is the case with a file which is relatively small? Or is there a way to define number of nodes on which the job should be done? - I personally don't think so since the cluster size is set in advance.
Or is it the nature of the wordcount application and the fact that the reducer does the same amount of work?
Or is it because it's Python? I read somewhere that it is said to be slower than Java (or Scala on Spark).
The same thing happens on Spark clusters - although the nodes number goes up the time does not go down.
In my experience, 600 MB is a small amount of data to process on Hadoop. Not all of the time goes to processing the file, because Hadoop needs some time to start up the M/R job and prepare the data on HDFS.
For a small dataset, there is no need to use many compute nodes. A single computer may even outperform the Hadoop cluster, as with the Hadoop wordcount sample on several small text files.
As far as I know, the dataset generally needs to be in the hundreds of GB before Hadoop shows a performance advantage, and at that scale performance increases with the number of nodes.
For reference, see the SO thread "Why submitting job to mapreduce takes so much time in General?".

Setting the right number of partitions on an RDD

I read some comments saying that a good number of partitions for an RDD is 2-3 times the number of cores. I have 8 nodes, each with two 12-core processors, so I have 192 cores. I set the number of partitions between 384 and 576, but it doesn't seem to work efficiently; I tried 8 partitions and got the same result. Maybe I have to set other parameters so that my job works better on the cluster than on my machine. I should add that the file I analyse has 150k lines.
val data = sc.textFile("/img.csv",384)
Performance suffers mainly when you specify too few partitions or far too many.
Too few partitions: you will not utilize all of the cores available in the cluster.
Too many partitions: there will be excessive overhead in managing many small tasks.
Of the two, the first has by far the greater impact on performance. Scheduling too many small tasks has a relatively small impact for partition counts below 1000. If you have on the order of tens of thousands of partitions, then Spark gets very slow.
Now, considering your case, you are getting the same results with 8 partitions and with 384-576 partitions. The general rule of thumb is:
NoOfPartitions = (NumberOfWorkerNodes*NoOfCoresPerWorkerNode)-1
The reasoning is that tasks are processed by CPU cores, so the number of partitions should equal the total number of cores in the cluster minus 1 (reserved for the Application Master or driver). That way each core processes one partition at a time.
So 191 partitions may improve the performance. Otherwise, the impact of setting too few or too many partitions is as explained at the beginning.
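For the cluster described in the question, the arithmetic behind both rules of thumb works out as follows. This is only a worked calculation; the function name is hypothetical.

```python
def total_cores(nodes, processors_per_node, cores_per_processor):
    """Total CPU cores in the cluster."""
    return nodes * processors_per_node * cores_per_processor


cores = total_cores(nodes=8, processors_per_node=2, cores_per_processor=12)

# Rule from this answer: one partition per core, minus one core for the driver.
one_per_core = cores - 1

# Rule quoted in the question: 2-3 partitions per core.
two_to_three_per_core = (2 * cores, 3 * cores)

print(cores, one_per_core, two_to_three_per_core)
```

Both rules only set the partition count; they say nothing about whether each partition carries enough work to be worth a task.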
Hope this will help!!!

Resources