Spark: understanding the interval between jobs - apache-spark

In the Spark UI, I am wondering what is going on between jobs and I am looking for ways to reduce these gaps, especially the one after collect and before writing parquet.
I see a really long break before the parquet job is submitted, almost 1 minute. Considering the whole application takes 2 minutes, that is a large proportion. Does this break usually mean Spark is going over all the workers and collecting data? Even so, the interval before the parquet write is quite a bit longer than before other actions such as collect or first.
Thanks
Here is the image

In my experience, that delay is generally present when the driver portion of your job is busy doing work. For instance, if you do a .collect(), and then iterate over the resulting Array, that work is being done sequentially on the driver, and would result in no tasks being assigned to the executors during that time.
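As a rough sketch of that pattern (the names df, expensive_python_fn, output_path and spark are all illustrative), the driver-side loop below is exactly the kind of work that shows up as a gap between jobs in the UI:

# Hypothetical sketch: job 1 ends when collect() finishes; the list comprehension runs only on
# the driver, so no tasks are scheduled until the parquet write kicks off job 2.
rows = df.collect()
summaries = [expensive_python_fn(r) for r in rows]    # sequential, driver-only work
result_df = spark.createDataFrame(summaries)          # assumes expensive_python_fn returns Rows/tuples
result_df.write.parquet(output_path)

Keeping that transformation on the executors (for example with a map over the distributed data) removes the idle interval, because the work becomes tasks instead of driver code.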

Related

HBase batch loading with speed control because of a slow consumer

We need to load a large part of our data from HBase using Spark.
We then put it into Kafka, where it is read by a consumer, but the consumer is too slow.
At the same time, Kafka does not have enough memory to hold the whole scan result.
Our key contains ...yyyy.MM.dd, and right now we load 30 days in one Spark job using a filter operator.
But we can't split the job into many jobs (30 jobs, each filtering one day), because then each job would have to scan all of HBase, which would make the overall scan too slow.
Right now we launch the Spark job with 100 threads, but we can't slow it down by setting fewer threads (for example 7), because Kafka is used by third-party developers, which sometimes makes Kafka too busy to accept any data. So we need to control the HBase scan speed, checking all the time whether there is room in Kafka to store our data.
We tried saving the scan result somewhere before loading it into Kafka, for example as ORC files in HDFS, but the scan produces many small files, and it is a problem to group them by memory (or is there a way? if you know one, please tell me how); storing lots of small files in HDFS is bad, and merging such files is a very expensive operation that takes so long it would make the total time too slow.
Suggested solutions:
Maybe it is possible to store the scan result in HDFS with Spark, by setting some special flag in the filter operator, and then run 30 Spark jobs that select data from the saved result and put each result into Kafka when possible (see the sketch after this list)
Maybe there is some existing mechanism in Spark to stop and continue launched jobs
Maybe there is some existing mechanism in Spark to split the result into batches (without the ability to stop and continue loading)
Maybe there is some existing mechanism in Spark to split the result into batches (with the ability to stop and continue loading based on an external condition)
Maybe, when Kafka throws an exception (because there is no room to store data), there is some backpressure mechanism in Spark that will pause the scan for a while when exceptions appear during execution (but I guess retries of the operator are limited; is it possible to make it retry forever, if that is a real solution?). Still, it would be better to keep some free space in Kafka rather than wait until it is overloaded
Maybe use PageFilter in HBase (though I guess it is hard to implement), or some other solution? And I guess there would be too many objects in memory to use PageFilter
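A minimal sketch of the first suggestion above: stage the scan once in HDFS, then feed Kafka in day-sized batches whenever there is room. All paths, the topic, the column names and the date-extraction offsets are illustrative, and the Kafka batch sink assumes the spark-sql-kafka package is available:

from pyspark.sql.functions import col, substring

# One scan job persists the result partitioned by day; repartitioning by the partition column
# keeps the number of ORC files per day small.
scanned = hbase_scan_df.withColumn("day", substring("rowkey", 1, 10))   # illustrative offsets for the yyyy.MM.dd part of the key
scanned.repartition("day").write.partitionBy("day").orc("hdfs:///staging/hbase_scan")

# Later, small follow-up jobs read one day at a time and push it to Kafka, driven by an
# external check that Kafka has free space.
for day in days_to_load:
    batch = spark.read.orc("hdfs:///staging/hbase_scan").where(col("day") == day)
    (batch.selectExpr("CAST(rowkey AS STRING) AS key", "to_json(struct(*)) AS value")
          .write.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("topic", "scan-results")
          .save())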
P.S. This https://github.com/hortonworks-spark/shc/issues/108 will not help; we already use a filter.
Any ideas would be helpful

Spark limit + write is too slow

I have a dataset of 8 billion records stored in parquet files in Azure Data Lake Gen 2.
I wanted to separate out a sample dataset of 2 billion records into a different location for some benchmarking needs, so I did the following:
df = spark.read.option('inferSchema', 'true').format('parquet').option('badRecordsPath', f'/tmp/badRecords/').load(read_path)
df.limit(2000000000).write.option('badRecordsPath', f'/tmp/badRecords/').format('parquet').save(f'{write_path}/advertiser/2B_parquet')
This job is running on 8 nodes of 8-core, 28 GB RAM machines [ 8 worker nodes + 1 master node ]. It has been running for over an hour and not a single file has been written yet. The load did finish within 2 s, so I know the limit + write action is what's causing the bottleneck [ although load just infers the schema and creates a list of files, it does not actually read the data ].
So I started inspecting the Spark UI for some clues and here are my observations:
2 jobs have been created by Spark
The first job took 35 minutes. Here's the DAG
The second job has been running for about an hour now with no progress at all. It has two stages.
If you notice, stage 3 has one running task, but if I open the stages panel I can't see any details of that task. I also don't understand why it's trying to do a shuffle when all I have is a limit on my DataFrame. Does limit really need a shuffle? Even if it's shuffling, an hour seems awfully long to shuffle the data around.
Also, if this is what's really performing the limit, what did the first job do? Just read the data? 35 minutes for that also seems too long, but for now I'd just settle for the job being completed.
Stage 4 is just stuck; I believe it is the actual writing stage and is waiting for this shuffle to end.
I am new to Spark and I'm kind of clueless about what's happening here. Any insights into what I'm doing wrong would be very useful.
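Not a confirmed fix, but possibly relevant: a plain limit is typically satisfied by shuffling the locally limited rows into a single partition, which would explain both the shuffle and the single running task. If an exact 2-billion-row cut is not required for the benchmark, a fraction-based sample stays fully distributed. A hedged sketch (row counts taken from the question, output path illustrative):

# Approximate sample instead of an exact global limit; avoids funnelling everything through one task.
fraction = 2_000_000_000 / 8_000_000_000
sample_df = df.sample(withReplacement=False, fraction=fraction, seed=42)
sample_df.write.format('parquet').save(f'{write_path}/advertiser/2B_parquet_sample')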

Apache Spark: is it possible to get dataset counts in a spark job?

Sometimes a Spark job that runs in our cluster takes too long, not because of bad optimization but because of bad logic in the algorithm. In most cases this is a consequence of unnecessary joins that produce too many rows. Normally we spot such jobs by looking at the Spark execution plan, where we can find such joins by looking for "number of output rows: xxx" in the blue stage labels.
I want to understand: is it possible to optimize this procedure and somehow automatically notify the programmer that the job produced too many rows in some dataset (after execution)?
Maybe we can print this in the logs (without manually counting the dataset's size in code)?
Maybe after running the job we can get the output of the execution plan somehow and save it for further investigation?
No, it's not an option. Spark will do its best to optimize the query plan, so manual interaction with the lower execution level is pretty limited. However, you can "control" the rows for each job/task by changing some configurations (like spark.sql.shuffle.partitions or spark.sql.files.maxPartitionBytes), or by repartitioning the data, which will cause the data to be shuffled and redistributed nearly equally between the executors.
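A minimal sketch of the configuration/repartition approach described above (the values, the dataframe and the key column are illustrative):

# Tune how many partitions shuffles produce and how large the input splits are, or
# repartition explicitly so rows are spread more evenly across the executors.
spark.conf.set("spark.sql.shuffle.partitions", "400")
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
balanced = joined_df.repartition(400, "customer_id")   # hypothetical dataframe and key column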

How to get detailed information about Spark tasks

Looking at the Spark UI timeline, I find that the last task of a specific stage in my Spark application always takes too much time. It seems the task never finishes; I have even waited six times longer than a normal task takes.
I want to get more information about the last task, but I don't know how to debug this specific task. Can anyone give me some suggestions?
Thanks for your help!
The data has been partitioned well, so the last task doesn't have too much data.
Check the explain plan of the resulting dataframe to understand what operations are happening. Are there any shuffles? Sometimes when operations such as joins are performed on a dataframe, intermediate dataframes can end up mapped to a smaller number of partitions, and this can cause slower performance because the data isn't as well distributed as it could be.
Check if there are a lot of shuffles and repeated calls to such dataframes, and try to cache the dataframe that comes right after a shuffle.
Check in the Spark UI (the driver's address on port 4040 by default) what the data volume of cached dataframes is, what the processes are, and whether there are any other overheads such as GC, or whether it is pure processing time.
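As a rough sketch of those checks (the dataframes are illustrative):

# Look for Exchange (shuffle) nodes in the physical plan, then cache the post-shuffle
# dataframe that gets reused so the shuffle is not recomputed on every reuse.
joined = orders.join(customers, "customer_id")
joined.explain()       # physical plan; Exchange nodes mark shuffles
joined.cache()
joined.count()         # materializes the cache before the repeated downstream use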
Hope that helps.

How are tasks distributed in Spark

I am trying to understand how the work is distributed in Spark when a job is submitted with spark-submit and I have a Spark deployment with 4 nodes. If there is a large dataset to operate on, I want to understand exactly how many stages the tasks are divided into and how many executors run for the job, and how this is decided for every stage.
It's hard to answer this question exactly, because there are many uncertainties.
The number of stages depends only on the described workflow, which includes different kinds of maps, reduces, joins, etc. If you understand it, you can basically read that right from the code. Most importantly, that helps you write more performant algorithms, because it is generally known that one has to avoid shuffles. For example, when you do a join, it requires a shuffle, so it is a stage boundary. This is pretty simple to see: print rdd.toDebugString() and then look at the indentation (look here), because each indentation marks a shuffle.
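For example (a small illustrative sketch; sc is assumed to be an existing SparkContext):

# A join on pair RDDs forces a shuffle, i.e. a stage boundary, which shows up as a new
# indentation level in the debug string.
left = sc.parallelize([(1, "a"), (2, "b")])
right = sc.parallelize([(1, "x"), (2, "y")])
joined = left.join(right)
print(joined.toDebugString().decode())   # each indentation level marks a shuffle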
But the number of executors is a completely different story, because it depends on the number of partitions. For example, 2 partitions need only 2 executors, while 40 partitions use all 4, since you only have 4. Additionally, the number of partitions may depend on a few properties you can provide at spark-submit:
the spark.default.parallelism parameter, or
the data source you use (e.g. it is different for HDFS and Cassandra)
It'd be good to keep all of the cores in the cluster busy, but no more than that (meaning a single process handles just one partition), because processing each partition carries a bit of overhead. On the other hand, if your data is skewed, then some cores will need more time than others to process the bigger partitions; in this case it helps to split the data into more partitions so that all cores stay busy for roughly the same amount of time. This helps with balancing the cluster and with throughput at the same time.
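A small illustrative sketch of checking and adjusting the partition count (the path and the numbers are made up):

from pyspark.sql import SparkSession

# The data source decides the initial partition count (e.g. HDFS splits); spark.default.parallelism
# sets the default for shuffles; repartition() can split skewed data into more, evenly sized pieces.
spark = (SparkSession.builder
         .config("spark.default.parallelism", "8")   # e.g. 4 nodes x 2 cores, purely illustrative
         .getOrCreate())
rdd = spark.sparkContext.textFile("hdfs:///data/input")
print(rdd.getNumPartitions())     # driven by the input splits
balanced = rdd.repartition(16)    # more partitions so skewed data keeps all cores busy
print(balanced.getNumPartitions())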
