Apache Spark Delay Between Jobs - apache-spark

As you can see, my small application has 4 jobs which run for a total duration of 20.2 seconds; however, there is a big delay between job 1 and 2, causing the total time to be over a minute. Job number 1, runJob at SparkHadoopMapReduceWriter.scala:88, is performing a bulk upload of HFiles into an HBase table. Here is the code I used to load the files:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{KeyValue, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// Configure the MapReduce job that writes the HFiles
val outputDir = new Path(HBaseUtils.getHFilesStorageLocation(resolvedTableName))
val job = Job.getInstance(hBaseConf)
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, resolvedTableName)
job.setOutputFormatClass(classOf[HFileOutputFormat2])
job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
job.setMapOutputValueClass(classOf[KeyValue])
// Bulk-load the generated HFiles into the target table
val connection = ConnectionFactory.createConnection(job.getConfiguration)
val hBaseAdmin = connection.getAdmin
val table = TableName.valueOf(Bytes.toBytes(resolvedTableName))
val tab = connection.getTable(table).asInstanceOf[HTable]
val bulkLoader = new LoadIncrementalHFiles(job.getConfiguration)
preBulkUploadCallback.map(callback => callback())
bulkLoader.doBulkLoad(outputDir, hBaseAdmin, tab, tab.getRegionLocator)
If anyone has any ideas, I would be very grateful.

I can see there are 26 tasks in job 1, which is based on the number of HFiles created. Even though job 2 shows as completed in 2s, it takes some time to copy these files to the target location, and that's why you are getting a delay between job 2 and 3. This can be avoided by reducing the number of tasks in job 1.

Decrease the number of regions for the output table in HBase, which will reduce the number of tasks for your second job.
TableOutputFormat determines the splits based on the number of regions for a given table in HBase.

Job number 1, runJob at SparkHadoopMapReduceWriter.scala:88, is performing a bulk upload
This is not quite true. That job merely creates the HFiles outside of HBase. The gap you see between this job and the next one could be explained by the actual bulk loading at bulkLoader.doBulkLoad. This operation involves only a metadata transfer and is usually fast (in my experience), so you should check the driver logs to see where it hangs.

Thanks for your input, guys. I lowered the number of HFiles created in task 0, which has decreased the lag by about 20%. I used
HFileOutputFormat2.configureIncrementalLoad(job, tab, tab.getRegionLocator)
which automatically calculates the number of reduce tasks to match the current number of regions for the table. I will say that we are using HBase backed by S3 on AWS EMR instead of classic HDFS. I'm going to investigate now whether this could be contributing to the lag.

Related

Spark limit + write is too slow

I have a dataset of 8 billion records stored in parquet files in Azure Data Lake Gen 2.
I wanted to separate out a sample dataset of 2 billion records in a different location for some benchmarking needs, so I did the following:
df = spark.read.option('inferSchema', 'true').format('parquet').option('badRecordsPath', f'/tmp/badRecords/').load(read_path)
df.limit(2000000000).write.option('badRecordsPath', f'/tmp/badRecords/').format('parquet').save(f'{write_path}/advertiser/2B_parquet')
This job is running on 8 nodes of 8-core, 28 GB RAM machines [ 8 worker nodes + 1 master node ]. It's been running for over an hour without a single file written yet. The load did finish within 2s, so I know the limit + write action is what's causing the bottleneck [ although the load just infers the schema and creates a list of files, without actually reading the data ].
So I started inspecting the Spark UI for some clues and here are my observations
2 Jobs have been created by Spark
The first job took 35 mins. Here's the DAG
The second job has been running for about an hour now with no progress at all; it has two stages in it.
If you notice, stage 3 has one running task, but if I open the stages panel, I can't see any details of the task. I also don't understand why it's trying to do a shuffle when all I have is a limit on my DF. Does limit really need a shuffle? Even if it's shuffling, it seems like 1hr is awfully long to shuffle data around.
Also if this is what's really performing the limit, what did the first job really do? Just read the data? 35mins for that also seems too long, but for now I'd just settle on the job being completed.
Stage 4 is just stuck; I believe it is the actual writing stage and is waiting for this shuffle to end.
I am new to spark and I'm kinda clueless about what's happening here. Any insights on what I'm doing wrong will be very useful.
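One way to confirm where the single long-running task comes from (a diagnostic sketch, not a fix from the thread; it reuses the read_path variable from the snippet above) is to print the physical plan for the limited DataFrame before writing. A global limit typically shows up as a LocalLimit/GlobalLimit pair with an exchange that funnels all rows through a single partition, which would explain the lone running task in stage 3:
# Sketch: inspect the plan of the limit before the write
df = spark.read.format('parquet').load(read_path)
limited = df.limit(2000000000)
# extended=True prints the parsed, analyzed, optimized and physical plans;
# look for a GlobalLimit sitting over an exchange to a single partition
limited.explain(extended=True)
If an exact row cutoff is not required, a fraction-based sample (df.sample(fraction=0.25)) stays fully parallel, though that is outside what the question tried.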

Is there a more systematic way to resolve a slow AWS Glue + PySpark execution stage?

I have this code snippet that I ran locally in standalone mode using 100 records only:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
glue_df = glue_context.create_dynamic_frame.from_catalog(database=db, table_name=table)
df = glue_df.toDF()
print(df.count())
The schema contains 89 columns, all of string data type except for 5 columns that have an array-of-struct data type. The data size is 3.1 MB.
Also, here is some info about the environment used to run the code:
spark.executor.cores: 2
spark.executor.id: driver
spark.driver.memory: 1000M
The problem is that I can't figure out why stage 1 took 12 minutes to finish when it only has to count 100 records. I also can't find out what the "Scan parquet" and "Exchange" tasks mean, as shown in this image:
Stage 1 DAG Visualization
My question is: is there a more systematic way to understand what those tasks mean? As a beginner, I have relied heavily on the Spark UI, but it doesn't give much information about the tasks it has executed. I was able to find which task took the most time, but I have no idea why that is the case or how to systematically resolve it.
The running time of Spark code is made up of the cluster kick-off time, the DAG scheduler optimisation time, and the time spent running stages. In your case, the issue could be due to the following:
The number of parquet files. To test this easily, read the table and write it back as a single parquet file. You are calling a table, but behind the scenes it reads the physical parquet files, so the number of files is something to consider.
The number of partitions relative to the computing resources you have. For example, in your case you have 2 cores and a small table, so it's more efficient to have just a few partitions instead of the default number of shuffle partitions, which is 200.
To get more clarity on the Spark stages, use the explain function and read the resulting plans (a short sketch follows below). With it you can see and compare the Analyzed Logical Plan, the Optimized Logical Plan, and the Physical Plan calculated by the internal optimiser.
To find a more detailed description of the explain function, please visit this LINK.
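As a concrete illustration of the explain suggestion (a minimal sketch continuing from the question's snippet; glue_context, db and table are the names used there, and the shuffle-partition value is only an example):
# Continuing from the question's snippet
spark = glue_context.spark_session

# extended=True prints the Analyzed, Optimized and Physical plans mentioned above.
# In the physical plan, "Scan parquet" is the read of the underlying parquet files
# and "Exchange" is a shuffle of rows between partitions.
df.explain(extended=True)

# Point 2 above: for a 3.1 MB table on 2 cores, far fewer than the default
# 200 shuffle partitions is usually enough.
spark.conf.set("spark.sql.shuffle.partitions", "4")
print(df.count())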

Why my shuffle partition is not 200(default) during group by operation? (Spark 2.4.5)

I am new to Spark and trying to understand its internals.
I am reading a small 50 MB parquet file from S3, performing a group by, and then saving back to S3.
When I observe the Spark UI, I can see 3 stages created for this,
stage 0 : load (1 tasks)
stage 1 : shufflequerystage for grouping (12 tasks)
stage 2: save (coalescedshufflereader) (26 tasks)
Code Sample:
df = spark.read.format("parquet").load(src_loc)
df_agg = df.groupby(grp_attribute)\
.agg(F.sum("no_of_launches").alias("no_of_launchesGroup")
df_agg.write.mode("overwrite").parquet(target_loc)
I am using an EMR cluster with 1 master and 3 core nodes (each with 4 vCores), so the default parallelism is 12. I am not changing any config at runtime, but I am not able to understand why 26 tasks are created in the final stage. As I understand it, the default number of shuffle partitions should be 200. Screenshot of the UI attached.
I tried a similar logic on Databricks with Spark 2.4.5.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'true'), the final number of my partitions is 2.
I observe that with spark.conf.set('spark.sql.adaptive.enabled', 'false') and spark.conf.set('spark.sql.shuffle.partitions', 75), the final number of my partitions is 75.
Using print(df_agg.rdd.getNumPartitions()) reveals this.
So the job output in the Spark UI does not reflect this. Maybe a repartition occurs at the end. Interesting, but not really an issue.
In Spark SQL, the number of shuffle partitions is set using spark.sql.shuffle.partitions, which defaults to 200. In most cases this number is too high for smaller data and too small for bigger data. Selecting the right value is always tricky for the developer.
So we need the ability to coalesce the shuffle partitions by looking at the mapper output. If the map output is small, we want to reduce the overall number of shuffle partitions to improve performance.
In the latest version, Spark 3.0 with Adaptive Query Execution, this reduction of tasks is automated.
http://blog.madhukaraphatak.com/spark-aqe-part-2/
Considering this, in Spark 2.4.5 the Catalyst optimiser or EMR might also have enabled this feature internally, reducing the number of tasks rather than using 200 tasks.
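For reference, a minimal sketch of the experiment described above (assuming a platform where the adaptive flag takes effect, as in the Databricks run; src_loc and grp_attribute are the names from the question):
from pyspark.sql import functions as F

# With AQE disabled, the aggregation uses the configured shuffle partition count
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "75")
df = spark.read.format("parquet").load(src_loc)
df_agg = df.groupBy(grp_attribute).agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))
print(df_agg.rdd.getNumPartitions())   # 75 in the run described above

# With AQE enabled, Spark coalesces the shuffle partitions based on the map output size
spark.conf.set("spark.sql.adaptive.enabled", "true")
df_agg = df.groupBy(grp_attribute).agg(F.sum("no_of_launches").alias("no_of_launchesGroup"))
print(df_agg.rdd.getNumPartitions())   # 2 in the run described above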

AWS Glue Spark job does not scale when partitioning DataFrame

I am developing a Glue Spark job script using a Glue development endpoint which has 4 DPUs allocated. According to the Glue documentation, 1 DPU equals 2 executors and each executor can run 4 tasks; 1 DPU is reserved for the master and 1 executor for the driver. So with 4 DPUs on my development endpoint I expect to have 5 executors and 20 tasks.
The script I am developing loads 1 million rows using a JDBC connection. Then I coalesce the one-million-row partition into 5 partitions and write it to an S3 bucket using the option maxRecordsPerFile = 100000. The whole process takes 34 seconds. Then I change the number of partitions to 10 and the job runs for 34 seconds again. So if I have 20 tasks available, why does the script take the same amount of time to complete with more partitions?
Edit: I started executing the script with an actual job instead of the development endpoint. I set the number of workers to 10 and the worker type to standard. Looking at the metrics I can see that I have only 9 executors instead of 17, only 1 executor is doing something, and the rest are idle.
Code:
...
df = spark.read.format("jdbc").option("driver", job_config["jdbcDriver"]).option("url", jdbc_config["url"]).option(
"user", jdbc_config["user"]).option("password", jdbc_config["password"]).option("dbtable", query).option("fetchSize", 50000).load()
df.coalesce(17)
df.write.mode("overwrite").format("csv").option(
"compression", "gzip").option("maxRecordsPerFile", 1000000).save(job_config["s3Path"])
...
This is very likely a limitation of the connections being opened to your JDBC data source: too few connections reduce parallelism, while too many may burden your database. Increase the degree of parallelism by tuning the options here.
Since you are reading into a DataFrame, you can set the upper and lower bounds and the partition column. More can be found here.
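For illustration, a hedged sketch of what a partitioned JDBC read can look like (the partition column and bounds are made up; the other options and the job_config/jdbc_config dictionaries mirror the question's snippet):
# Spark opens one connection per partition and splits the query on the
# partition column between the lower and upper bounds
df = (spark.read.format("jdbc")
      .option("driver", job_config["jdbcDriver"])
      .option("url", jdbc_config["url"])
      .option("user", jdbc_config["user"])
      .option("password", jdbc_config["password"])
      .option("dbtable", query)
      .option("partitionColumn", "id")   # hypothetical numeric column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", 20)       # matches the 20 tasks expected from 4 DPUs
      .option("fetchSize", 50000)
      .load())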
To size your DPUs correctly, I would suggest enabling the Spark UI; it could help narrow down where all the time is spent, and the DAG will show you the actual distribution of your tasks.

Why is the processing time the same for 2 and 4 cores and different partitions in Spark Streaming?

I'm trying to run some tests regarding processing times for a Spark Streaming application, in local mode on my 4-core machine.
Here is my code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("sparkstreaminggetjson");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
JavaReceiverInputDStream<String> streamData1 = ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
    StorageLevels.MEMORY_AND_DISK_SER);
streamData1.print();
I am receiving 1 JSON message per second.
So, I test this for 4 different scenarios:
1) setMaster(...local[2]) and 1 partition
2) setMaster(...local[*]) and 1 partition
3) setMaster(...local[2]) and 4 partitions (using streamData1.repartition(4))
4) setMaster(...local[*]) and 4 partitions (using streamData1.repartition(4))
When I check the average processing times in the UI, this is what I get for each scenario:
1) 30 ms
2) 28 ms
3) 72 ms
4) 75 ms
My question is: why are the processing times pretty much the same for 1 and 2, and 3 and 4?
I realize that the increase from 2 to 4, for example, is normal, because repartition is a shuffle operation. What I don't get is, for example in 4), why is the processing time so similar to 3)? Shouldn't it be much smaller, since I am increasing the level of parallelization and have more cores to distribute the tasks to?
I hope that wasn't confusing.
Thank you so much in advance.
Some of this depends on what your JSON message looks like, I'll assume each message is a single string without line breaks. In that case, with 1 message per second and batch interval of 1 second, at each batch you will get an RDD with just a single item. You can't split that up into multiple partitions, so when you repartition you still have the same situation data-wise, but with the overhead of the repartition step.
Even with larger amounts of data, I would not expect much of a difference when all you do with the data is print() it: this takes the first 10 items, and if they can come from just one partition, I would expect Spark to compute only that partition. In any case you will get more representative numbers if you significantly increase the amount of data per batch and do some actual processing on the whole set, at a minimum something like streamData1.count().print() (a PySpark sketch of the same idea follows after this answer).
To get a better understanding of what happens, it is also useful to dig into the other parts of Spark's UI, like the Stages tab, which can tell you how much of the execution time is spent on shuffling, serialization, etc. rather than actual execution, and things that affect performance, like DAGs that show you which bits may be cached and which tasks Spark was able to skip.
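For completeness, a minimal PySpark sketch of the count()-based measurement suggested above (the question's code is Java, but the idea is the same; the host and port are placeholders):
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setMaster("local[2]").setAppName("sparkstreaminggetjson")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)   # 1-second batch interval, as in the question

# Count every record in the batch instead of printing the first few items,
# so the whole batch is actually processed
stream_data = ssc.socketTextStream("localhost", 9999)
stream_data.count().pprint()

ssc.start()
ssc.awaitTermination()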
