Monitor Spark actual work time vs. communication time - apache-spark

On a Spark cluster, if the jobs are very small, I assume that the clustering will be inefficient since most of the time will be spent on communication between nodes, rather than utilizing the processors on the nodes.
Is there a way to monitor how much time out of a job submitted with spark-submit is wasted on communication, and how much on actual computation?
I could then monitor this ratio to check how efficient my file aggregation scheme or processing algorithm is in terms of distribution efficiency.
I looked through the Spark docs, and couldn't find anything relevant, though I'm sure I'm missing something. Ideas anyone?

You can see this information in the Spark UI, assuming you are running Spark 1.4.1 or higher (sorry, but I don't know how to do this for earlier versions of Spark).
A brief summary: you can view a timeline of all the events happening in your Spark job within the Spark UI. From there, you can zoom in on each individual job and each individual task. Each task is broken down into scheduler delay, serialization/deserialization, computation, shuffle, etc.
Now, this is obviously a very pretty UI, but you might want something more robust so that you can check this information programmatically. It appears that you can use Spark's monitoring REST API to export the same metrics as JSON.
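If you go the programmatic route, here is a minimal sketch of pulling stage-level metrics from the REST API, assuming the driver UI is reachable on localhost:4040 and the application id is passed in (both are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StageMetricsDump {
    public static void main(String[] args) throws Exception {
        // Assumptions: the driver UI listens on localhost:4040 and the app id is args[0].
        String appId = args[0];
        String url = "http://localhost:4040/api/v1/applications/" + appId + "/stages";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).GET().build();

        // The response is a JSON array of stages; fields such as executorRunTime and
        // executorDeserializeTime let you compare computation time against overhead.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}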

Related

Spark Structured Streaming state management with RocksDB

For a particular use case we are using Spark Structured Streaming, but the process is neither efficient nor stable. The stateful aggregation is the most time-consuming and memory-hungry stage of the whole job. Spark Structured Streaming provides a RocksDB implementation to manage state. It helped us gain some stability but added time overhead, so we are looking to optimise the RocksDB implementation. While exploring the logs we noticed that the memtable hit count is always zero and the block cache hit count is very low. It would be very helpful if someone could throw light on this.
RocksDB itself provides various tuning parameters like write_buffer_size and min_buffer_to_merge. We tried exposing these parameters to Spark and then set their values high to increase our chances of hitting the memtable, but that didn't help.
RocksDB is mostly a backup store for state (the other option is HDFS), or is used after a shuffle when the local in-memory cache for a partition key is not on the same executor.
You can check the stateful operator metrics provided in the Spark UI to see how the memory (cache) is being used before it hits RocksDB.
Maybe the article below can help with getting more info:
https://medium.com/@vndhya/stateful-processing-in-spark-structured-streaming-memory-aspects-964bc6414346 (disclosure: it's written by me)
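If you are on Spark 3.2 or later, the built-in RocksDB state store provider can be enabled roughly like this. A minimal sketch, where the block-cache size is an assumed tuning value you would adjust for your workload:

import org.apache.spark.sql.SparkSession;

public class RocksDbStateStoreSketch {
    public static void main(String[] args) {
        // A sketch, assuming Spark 3.2+, where the RocksDB state store provider ships with Spark.
        SparkSession spark = SparkSession.builder()
                .appName("rocksdb-state-store-sketch")
                // Keep streaming state in RocksDB instead of the default HDFS-backed in-memory store.
                .config("spark.sql.streaming.stateStore.providerClass",
                        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
                // Assumed tuning value: RocksDB block cache size per state store instance, in MB.
                .config("spark.sql.streaming.stateStore.rocksdb.blockCacheSizeMB", "256")
                .getOrCreate();

        // Stateful operations (aggregations, deduplication, stream-stream joins) now keep
        // their state in RocksDB under the query's checkpoint location.
        spark.stop();
    }
}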

How are the task results being processed on Spark?

I am new to Spark and I am currently trying to understand its architecture.
As far as I know, the Spark cluster manager assigns tasks to worker nodes and sends them partitions of the data. Once there, each worker node performs the transformations (like mapping, etc.) on its own partition of the data.
What I don't understand is: where do all the results of these transformations from the various workers go? Are they sent back to the cluster manager / driver and, once there, reduced (e.g. summing the values of each unique key)? If yes, is there a specific way this happens?
It would be nice if someone could enlighten me; neither the Spark docs nor other resources on the architecture have been able to do so.
Good question. I think you are asking how a shuffle works...
Here is a good explanation.
When does shuffling occur in Apache Spark?
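To make the flow concrete, here is a small hedged sketch in Java (the input path and column names are made up): the map-side tasks pre-aggregate within their own partitions, the shuffle redistributes rows by key across executors, the reduce-side tasks finish the per-key sums, and only the small final result reaches the driver when an action runs.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ShuffleSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("shuffle-sketch").getOrCreate();

        // Hypothetical input path and columns.
        Dataset<Row> events = spark.read().json("/data/events.json");

        // Map-side tasks partially aggregate within each partition; the shuffle then moves
        // rows with the same key to the same reduce-side task, which finishes the sum.
        Dataset<Row> totals = events
                .groupBy(col("userId"))
                .agg(sum(col("amount")).as("total"));

        // Only the final aggregated result is brought back to the driver here.
        totals.show();
    }
}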

Deadlock when many spark jobs are concurrently scheduled

Using spark 2.4.4 running in YARN cluster mode with the spark FIFO scheduler.
I'm submitting multiple spark dataframe operations (i.e. writing data to S3) using a thread pool executor with a variable number of threads. This works fine if I have ~10 threads, but if I use hundreds of threads, there appears to be a deadlock, with no jobs being scheduled according to the Spark UI.
What factors control how many jobs can be scheduled concurrently? Driver resources (e.g. memory/cores)? Some other spark configuration settings?
EDIT:
Here's a brief synopsis of my code
ExecutorService pool = Executors.newFixedThreadPool(nThreads);
ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(pool);

Dataset<Row> aHugeDf = spark.read().json(hundredsOfPaths);

// Submit one write job per thing; each filters the big DataFrame and writes to Hudi.
List<Future<Void>> futures = listOfSeveralHundredThings
    .stream()
    .map(aThing -> ecs.submit(() -> {
        aHugeDf
            .filter(col("some_column").equalTo(aThing))
            .write()
            .format("org.apache.hudi")
            .options(writeOptions)
            .save(outputPathFor(aThing));
        return null;
    }))
    .collect(Collectors.toList());

// Wait (up to 30 minutes each) for the submitted jobs to complete.
IntStream.range(0, futures.size()).forEach(i -> {
    try {
        ecs.poll(30, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
});
pool.shutdownNow();
At some point, as nThreads increases, spark no longer seems to be scheduling any jobs as evidenced by:
ecs.poll(...) timing out eventually
The Spark UI jobs tab showing no active jobs
The Spark UI executors tab showing no active tasks for any executor
The Spark UI SQL tab showing nThreads running queries with no running job IDs
My execution environment is
AWS EMR 5.28.1
Spark 2.4.4
Master node = m5.4xlarge
Core nodes = 3x rd5.24xlarge
spark.driver.cores=24
spark.driver.memory=32g
spark.executor.memory=21g
spark.scheduler.mode=FIFO
If possible, write the output of the jobs to AWS EMR's local HDFS (to leverage the almost instantaneous renames and better file IO of local HDFS) and add a DistCp step to move the files to S3, to save yourself all the trouble of handling the innards of an object store trying to be a filesystem. Also, writing to local HDFS will allow you to enable speculation to control runaway tasks without falling into the deadlock traps associated with DirectOutputCommitter.
If you must use S3 as the output directory, ensure that the following Spark configurations are set:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.speculation false
Note: DirectParquetOutputCommitter was removed in Spark 2.0 due to the chance of data loss. Unfortunately, until we have improved consistency from S3A, we have to work with the workarounds. Things are improving with Hadoop 2.8.
Avoid key names in lexicographic order. One could use hashing/random prefixes or reverse date-time to get around this. The trick is to name your keys hierarchically, putting the most common things you filter by on the left side of your key. And never have underscores in bucket names, due to DNS issues.
Enabling fs.s3a.fast.upload uploads the parts of a single file to Amazon S3 in parallel.
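A minimal sketch of setting the values above when building the session (the app name is a placeholder; the fs.s3a property is passed through with the spark.hadoop. prefix):

import org.apache.spark.sql.SparkSession;

public class S3OutputConfigSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("s3-output-sketch")
                // Use the v2 file output committer algorithm for S3 output.
                .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
                // Disable speculative execution when writing directly to S3.
                .config("spark.speculation", "false")
                // Upload the parts of a single file to S3 in parallel (S3A fast upload).
                .config("spark.hadoop.fs.s3a.fast.upload", "true")
                .getOrCreate();
    }
}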
Refer to these articles for more detail:
Setting spark.speculation in Spark 2.1.0 while writing to s3
https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98
IMO you're likely approaching this problem the wrong way. Unless you can guarantee that the number of tasks per job is very low, you're not going to get much performance improvement by parallelizing hundreds of jobs at once. Your cluster can only support 300 tasks at once; assuming you're using the default parallelism of 200, that's only 1.5 jobs. I'd suggest rewriting your code to cap the maximum number of concurrent queries at 10. I strongly suspect that you have 300 queries with only a single task out of several hundred actually running. Most OLAP data processing systems intentionally keep the level of concurrent queries fairly low compared to more traditional RDS systems for this reason.
Also, FYI: Apache Hudi has a default parallelism of several hundred.
Why don't you just partition based on your filter column?
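For illustration, here is a hedged sketch continuing from the snippet in the question (aHugeDf and some_column are the question's names; the output path is a placeholder): one partitioned write replaces the one-job-per-filter-value loop. Shown with Spark's generic partitionBy and Parquet; Hudi has its own partition-path option that you would set instead.

// One partitioned write instead of one job per distinct value of the filter column.
// Shown with Parquet for clarity; for Hudi, configure its partition-path field instead.
aHugeDf
    .write()
    .partitionBy("some_column")      // one output directory per distinct filter value
    .mode("overwrite")
    .parquet("s3://bucket/output/"); // placeholder output path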
I would start by eliminating possible causes. Are you sure it's Spark that is not able to submit many jobs? Is it Spark, or is it YARN? If it is the latter, you might need to play with the YARN scheduler settings. Could it be something to do with the ExecutorService implementation having some limitation at the scale you are trying to achieve? Could it be Hudi? With that snippet it's hard to determine.
How does the problem manifest itself other than no jobs starting up? Do you see any metrics/monitoring on the cluster, or any logs, that point to the problem?
If it is to do with scaling, is it possible for you to autoscale with EMR flex and see if that works for you?
How many executor cores?
Looking into these might help you narrow down or perhaps confirm the issue - unless you have already looked into these things.
(I meant to add this as comment rather than answer but text too long for comment)
Using threads or thread pools is always problematic and error-prone.
I had a similar problem processing Spark jobs in an Internet of Things application. I resolved it using fair scheduling.
Suggestions:
Use fair scheduling (fairscheduler.xml) instead of the YARN capacity scheduler.
How? Use dedicated resource pools, one per module; when used, the separate pools show up in the Spark UI (a sketch of the configuration follows below).
Check that the unit of parallelism (the number of partitions) is correct for the DataFrames you use, by looking at the Spark UI. This is Spark's native way of using parallelism.
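A hedged sketch of what the fair-scheduling suggestion looks like in code; the allocation file path and pool name are made up:

import org.apache.spark.sql.SparkSession;

public class FairSchedulingSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("fair-scheduling-sketch")
                // Switch from FIFO to FAIR scheduling within the application.
                .config("spark.scheduler.mode", "FAIR")
                // Placeholder path to a fairscheduler.xml that defines the pools.
                .config("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")
                .getOrCreate();

        // Jobs submitted from this thread go to the (made-up) pool "moduleA";
        // each module/thread can set its own pool name the same way.
        spark.sparkContext().setLocalProperty("spark.scheduler.pool", "moduleA");

        // ... run this module's queries/writes here ...
    }
}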

Why are accumulators sent directly to the driver?

With Spark, if I've already defined my accumulators to be associative and reducible, why does each worker send them directly to the driver rather than reducing them incrementally along with my actual job? It seems a bit goofy to me.
Each task in Spark maintains its own accumulator, and its value is sent back to the driver when that particular task has finished.
Since accumulators in Spark are mostly a diagnostic and monitoring tool, sharing accumulators between tasks would make them almost useless. Not to mention that a worker failure after a particular task has finished would result in a loss of data and make accumulators even less reliable than they are right now.
Moreover, this mechanism is pretty much the same as a standard RDD reduce, where task results are continuously sent to the driver and merged there locally.
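A small hedged sketch of those mechanics in Java (the input path and accumulator name are placeholders): each task adds to its own copy of the accumulator, and the merged value is only readable on the driver once the action has finished.

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.util.LongAccumulator;

public class AccumulatorSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("accumulator-sketch").getOrCreate();

        // Registered on the driver; each task works against its own local copy.
        LongAccumulator rowsSeen = spark.sparkContext().longAccumulator("rowsSeen");

        // Hypothetical input path.
        Dataset<Row> df = spark.read().json("/data/events.json");

        // Each task increments its local accumulator; the per-task values are sent back
        // to the driver with the task result and merged there.
        df.foreach((ForeachFunction<Row>) row -> rowsSeen.add(1));

        // Only meaningful on the driver, after the action above has completed.
        System.out.println("rows seen: " + rowsSeen.value());
    }
}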

Spark Decision tree fit runs in 1 task

I am trying to "train" a DecisionTreeClassifier using Apache Spark running in a cluster on Amazon EMR. I can see that around 50 executors are added, and that the features are created by querying a Postgres database using Spark SQL and are stored in a DataFrame.
The DecisionTree fit method takes many hours even though the dataset is not that big (10,000 DB entries with a couple of hundred bytes in each row). I can see that there is only one task for this, so I assume this is the reason it's so slow.
Where should I look for the reason that this is running in one task?
Is it the way that I retrieve the data?
I am sorry if this is a bit vague but I don't know if the code that retrieves the data is relevant, or is it a parameter in the algorithm (although I didn't find anything online), or is it just Spark tuning?
I would appreciate any direction!
Thanks in advance.
Spark relies on data locality. It seems that all the data is located in a single place, hence Spark uses a single partition to process it. You could apply a repartition, or state the number of partitions you would like to use at load time. I would also look into the decision tree API and see if you can set the number of partitions for it specifically.
Basically, partitions are your level of parallelism.
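A hedged sketch of both options, assuming the data comes in over JDBC (the connection details, table, column names, and partition counts are all placeholders): either parallelize the read itself with the JDBC partitioning options, or repartition before calling fit.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedJdbcReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("partitioned-read-sketch").getOrCreate();

        // Option 1: make the JDBC read itself parallel. All option values are placeholders;
        // partitionColumn must be a numeric/date column with known bounds.
        Dataset<Row> features = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://db-host:5432/mydb")
                .option("dbtable", "features")
                .option("user", "user")
                .option("password", "secret")
                .option("partitionColumn", "id")
                .option("lowerBound", "1")
                .option("upperBound", "10000")
                .option("numPartitions", "48")
                .load();

        // Option 2: if the source only yields one partition, spread it out before training.
        Dataset<Row> repartitioned = features.repartition(48);

        System.out.println("partitions: " + repartitioned.rdd().getNumPartitions());
    }
}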
