If a Spark stage has completed, is the computation done? - apache-spark

I'm viewing my job in the Spark Application Master console and I can see in real-time the various stages completing as Spark eats its way through my application's DAG. It all goes reasonably fast. Some stages take less than a second, others take a minute or two.
The final stage, at the top of the list, is rdd.saveAsTextFile(path, classOf[GzipCodec]). This stage takes a very long time.
I understand that transformations are executed zero, one or many times depending upon the execution plan created as a result of actions like saveAsTextFile or count.
As the job progresses, I can see the execution plan in the App Manager. Some stages are not present. Some are present more than once. This is expected. I can see the progress of each stage in realtime (as long as I keep hitting F5 to refresh the page). The execution time is roughly commensurate with the data input size for each stage. Because of this, I'm certain that what the App Manager is showing me is the progress of the actual transformations, and not some meta-activity on the DAG.
So if the transformations are occurring in each of those stages, why is the final stage - a simple write to S3 from EMR - so slow?
If, as my colleague suggests, the transformation stages shown in the App Manager are not doing actual computation, what are they doing that consumes so much memory, CPU and time?

In Spark, lazy evaluation is a key concept, and one you should get familiar with if you want to work with Spark.
The stages you see completing very quickly are not doing any significant computation.
If they are not doing actual computation, what are they doing?
They are updating the DAG.
When an action is triggered, Spark has the chance to consult the DAG and optimize the computation (something that wouldn't be possible without lazy evaluation).
For more, read Spark Transformation - Why its lazy and what is the advantage?
Moreover, I think your colleague rushed to give you an answer, and mistakenly said:
transformation are cheap
The truth lies in the documentation's section on RDD operations:
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
Cheap is not the right word.
That explains why, at the end of the day, your final stage (the one that actually asks for data and triggers the action) is so slow compared with the other stages.
What I mean is that none of the stages you mention seems to trigger an action on its own. As a result, the final stage has to take all of the prior stages into account and do all of the work needed, although from an optimized, Spark point of view.

I guess the real confusion is here:
transformation are cheap
Transformations are lazy (most of the time), but nowhere near cheap. Laziness means a transformation won't be applied unless an eager descendant (an action) depends on it; it tells you nothing about its cost.
In general, transformations are where the real work happens. Output actions, excluding storage / network IO, are usually cheap compared to the logic executed in transformations.
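As a minimal sketch of that distinction (the paths and logic below are made up for illustration, not taken from the question): the transformation chain returns almost immediately because it only records lineage in the DAG, while all of the parsing, filtering and shuffling actually runs when the action at the end is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

// These lines return almost instantly: they only record lineage in the DAG.
val raw    = sc.textFile("/tmp/input")                    // hypothetical input path
val parsed = raw.map(_.split(",")).filter(_.length > 1)
val byKey  = parsed.map(a => (a(0), 1L)).reduceByKey(_ + _)

// Only now does Spark build stages from the DAG and run every transformation
// above, which is why the stage for the action appears to carry all the cost.
byKey.saveAsTextFile("/tmp/output")                       // hypothetical output path
```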

Related

How to find out which transformation is taking a long time in Spark UI?

You may have Spark code that joins, filters, then groupBys something, and at the end does take(1), for example. But when you look at the Spark UI, it only shows that take(1) is taking a long time, as an action that contains all those transformations. And it seems like there's no way to see which transformation is taking a long time.
So, how do I find out which transformation is taking a long time in Spark UI?
You can use the Stages tab in the Spark UI. The Stages tab displays a summary page that shows the current state of all stages of all jobs in the Spark application.
At the beginning of the page is a summary with the count of all stages by status (active, pending, completed, skipped, and failed).
You need to identify your transformation operation there. If you are using the same transformation multiple times, you can tell the stages apart by clicking on 'details', which shows the exact line number in your code from which each one is called.
Check the time spent, and if you are still not satisfied, visit the Storage tab to check whether you are persisting your datasets correctly. If a dataset is not persisted, Spark may calculate the same things many times.
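As a small illustration of that last point (the dataset, path and column names are assumptions, not from the question): persisting a dataset that several actions reuse keeps Spark from recomputing the whole chain for each of them, and the cached data then shows up in the Storage tab.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical expensive chain that two actions below reuse.
val enriched = spark.read.parquet("/tmp/events")
  .filter($"status" === "ok")
  .groupBy($"userId").count()
  .persist(StorageLevel.MEMORY_AND_DISK)   // this dataset will appear in the Storage tab

enriched.count()                           // first action: computes the chain and caches the result
enriched.orderBy($"count".desc).show(10)   // second action: served from the cache, no recomputation
```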
Good Luck!

Hanging Foundry job; why does it seem stuck on a stage?

I see from my job overview page that my job appears stuck on one of the stages (most others have taken a reasonable amount of time, one of them is much slower).
What does it mean when one of my stages is taking so long to finish?
The most likely thing you're suffering from is skew.
Skew is defined as an imbalance of work done by a Spark stage, namely that certain tasks for whatever reason take much longer to compute than others.
It's important to verify that your job actually has skew and not just assume this is the culprit.
One of the most common reasons for skew is an imbalanced distribution of keys for a shuffle. An example of this is when a join has a very large count of rows for certain keys on both sides of the join. There are some ways you can verify this distribution problem.
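One way to check, as a sketch (the DataFrames, paths and the joinKey column are assumptions, not taken from the question): count rows per join key on each side of the join and look at the heaviest keys.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("skew-check").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative inputs; in practice these would be the two sides of the slow join.
val left  = spark.read.parquet("/tmp/left")
val right = spark.read.parquet("/tmp/right")

// Show the most frequent join keys on each side. A handful of keys with row counts
// orders of magnitude above the rest is a strong sign of skew.
left.groupBy($"joinKey").count().orderBy($"count".desc).show(20)
right.groupBy($"joinKey").count().orderBy($"count".desc).show(20)
```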
You might get unlucky sometimes and have a task that is both longer-running and kicked off at the very end of your stage. When this happens, you'll observe particularly slow stage execution times; sometimes you get lucky and it gets kicked off first. In this example, the slower 5 sec task is the skewed task.

How to trigger catalyst optimizer before spark-submit to reduce execution time?

The organisation I am working at is moving to the public cloud from its old, traditional way of execution. We have to pay for every execution that takes place on the cloud. To reduce this execution cost, we are doing two things:
We are trying to avoid all bad executions.
We are trying to reduce the execution time further.
As a big data engineer, my work mostly depends on Spark SQL, and I am trying to reduce SQL query execution time. What Catalyst does at execution time, I want to do before the execution: for example, reading the logical plan, optimizing the logical plan, generating the physical plan, etc.
I also want to add my own custom optimization rules to Catalyst, which would likewise be triggered at build time.
Is there any way to do all this before execution?
You can actually get the execution plan for your query by creating the DataFrame and not performing any action.
Suppose you have a DataFrame df; you can access its plans through df.queryExecution (for example df.queryExecution.logical and df.queryExecution.optimizedPlan) and traverse them. This might answer your first requirement of avoiding bad executions, if you have some heuristic method to detect them.
As for custom optimizations, you can add your own optimization rules (see https://www.waitingforcode.com/apache-spark-sql/introduction-custom-optimization-apache-spark-sql/read).
This does not trigger at build time but at execution time (like all Catalyst optimizations).
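A minimal sketch of both points, under the assumption of a plain Spark session (the rule, data and names are illustrative, not from the question): the plans can be inspected via df.queryExecution without running any action, and a custom rule can be registered through spark.experimental.extraOptimizations so Catalyst applies it whenever it optimizes a plan.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical no-op rule, just to show where a custom optimization would plug in.
object MyCustomRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan  // inspect or rewrite the plan here
}

val spark = SparkSession.builder().appName("plan-inspection").master("local[*]").getOrCreate()

// Register the custom rule; Catalyst applies it whenever it optimizes a plan.
spark.experimental.extraOptimizations = Seq(MyCustomRule)

import spark.implicits._
val df = Seq((1, "a"), (2, "b")).toDF("id", "label").filter($"id" > 1)

// No action has been triggered yet; these plans come from analysis and optimization only.
println(df.queryExecution.logical)        // parsed logical plan
println(df.queryExecution.optimizedPlan)  // plan after Catalyst optimization (custom rule included)
println(df.queryExecution.executedPlan)   // physical plan
```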

Can Spark automatically detect nondeterministic results and adjust failure recovery accordingly?

If nondeterministic code runs on Spark, this can cause a problem when recovery from failure of a node is necessary, because the new output may not be exactly the same as the old output. My interpretation is that the entire job might need to be rerun in this case, because otherwise the output data could be inconsistent with itself (as different data was produced at different times). At the very least any nodes that are downstream from the recovered node would probably need to be restarted from scratch, because they have processed data that may now change. That's my understanding of the situation anyway, please correct me if I am wrong.
My question is whether Spark can somehow automatically detect if code is nondeterministic (for example by comparing the old output to the new output) and adjust the failure recovery accordingly. If this were possible, it would relieve application developers of the requirement to write deterministic code, which can sometimes be challenging and, in any case, is easy to forget.
No, Spark will not be able to handle non-deterministic code in case of failures. The fundamental data structure of Spark, the RDD, is not only immutable but should also be a deterministic function of its input. This is necessary because otherwise the Spark framework would not be able to recompute a partial RDD (a partition) in case of failure. If the recomputed partition is not deterministic, it would have to re-run the transformation again on the full RDDs in the lineage. I don't think that Spark is the right framework for non-deterministic code.
If Spark has to be used for such a use case, the application developer has to take care of keeping the output consistent by writing code carefully. It can be done by using RDDs only (no DataFrames or Datasets) and persisting the output after every transformation that executes non-deterministic code. If performance is a concern, the intermediate RDDs can be persisted on Alluxio.
A long-term approach would be to open a feature request in the Apache Spark JIRA, though I am not too positive about the feature being accepted: a small hint in the syntax declaring whether code is deterministic or not, so that the framework can switch between recovering an RDD partially or fully.
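A minimal sketch of the "persist the output of non-deterministic transformations" idea (the data and the random tagging are made up for illustration): checkpointing to reliable storage is one way to pin the output so a recomputation cannot change it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

val sc = new SparkContext(new SparkConf().setAppName("pin-nondeterministic").setMaster("local[*]"))
sc.setCheckpointDir("/tmp/checkpoints")   // reliable storage in a real job (e.g. HDFS)

// Non-deterministic transformation: the random tag differs on every recomputation.
val tagged = sc.parallelize(1 to 1000).map(x => (x, Random.nextLong()))

tagged.checkpoint()   // truncate the lineage and materialize the output
tagged.count()        // the first action forces the checkpoint to be written

// Downstream stages now read the checkpointed data instead of re-running the map,
// so a recovered partition cannot end up with different random tags.
println(tagged.filter(_._2 % 2 == 0).count())
```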
Non-deterministic results are not detected and accounted for in failure recovery (at least in Spark 2.4.1, which I'm using).
I have encountered issues with this a few times on Spark. For example, let's say I use a window function:
first_value(field_1) over (partition by field_2 order by field_3)
If field_3 is not unique, the result is non-deterministic and can differ each time that function is run. If a Spark executor dies and restarts while calculating this window function, you can actually end up with two different first_value results output for the same field_2 partition.
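A minimal sketch of that window-function case (the data is made up; the column names follow the answer above): because two rows tie on field_3 within the same field_2 partition, first() may legitimately return either of them.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().appName("nondeterministic-window").master("local[*]").getOrCreate()
import spark.implicits._

// Two rows in partition "p1" tie on field_3, so the ordering between them is undefined.
val df = Seq(
  ("x", "p1", 1),
  ("y", "p1", 1),
  ("z", "p2", 2)
).toDF("field_1", "field_2", "field_3")

val w = Window.partitionBy("field_2").orderBy("field_3")

// first("field_1") may return "x" or "y" for partition p1, depending on how the
// data happens to be partitioned and on any task retries.
df.withColumn("first_val", first("field_1").over(w)).show()
```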

Concurrent operations in spark streaming

I wanted to understand something about the internals of spark streaming executions.
If I have a stream X, and in my program I send stream X to function A and function B:
In function A, I do a few transform/filter operations etc. on X->Y->Z to create stream Z. Now I do a forEach Operation on Z and print the output to a file.
Then in function B, I reduce stream X -> X2 (say min value of each RDD), and print the output to file
Are both functions being executed for each RDD in parallel? How does it work?
Thanks
--- Comments from Spark Community ----
I am adding comments from the spark community -
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the driver then both of them will be executed in parallel. Whichever gets submitted to Spark first gets executed first - you can use a semaphore if you need to ensure the ordering of execution, though I would assume that the ordering wouldn't matter.
#Eswara's answer seems right, but it does not apply to your use case, as your separate transformation DAGs (X->Y->Z and X->X2) have a common DStream ancestor in X. This means that when the actions are run to trigger each of these flows, the transformation X->Y and the transformation X->X2 cannot happen at the same time. What will happen is that the partitions of RDD X will be either computed or loaded from memory (if cached) for each of these transformations separately, in a non-parallel manner.
Ideally what would happen is that the transformation X->Y would resolve and then the transformations Y->Z and X->X2 would finish in parallel as there is no shared state between them. I believe Spark's pipelining architecture would optimize for this. You can ensure faster computation on X->X2 by persisting DStream X so that it can be loaded from memory rather than being recomputed or being loaded from disk. See here for more information on persistence.
What would be interesting is if you could provide the replication storage levels *_2 (e.g. MEMORY_ONLY_2 or MEMORY_AND_DISK_2) to be able to run transformations concurrently on the same source. I think those storage levels are currently only useful against lost partitions right now, as the duplicate partition will be processed in place of the lost one.
Yes.
It's similar to spark's execution model which uses DAGs and lazy evaluation except that streaming runs the DAG repeatedly on each fresh batch of data.
In your case, since the DAGs (or sub-DAGs of the larger DAG, if one prefers to call them that) required to finish each action (each of the two foreachs you have) do not have common links all the way back to the source, they run completely in parallel. The streaming application as a whole gets X executors (JVMs) and Y cores (threads) per executor, allotted at the time the application is submitted to the resource manager. At any time, a given task (i.e., thread) out of the X*Y tasks will be executing a part or the whole of one of these DAGs. Note that any two given threads of an application, whether in the same executor or not, can execute different actions of the same application at the same time.
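A minimal sketch of the setup the question describes (host, port, paths and the exact logic are placeholders, not from the question): one source DStream feeding two independent output operations, with the source persisted so its partitions are not recomputed for each flow.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("two-flows").setMaster("local[4]")
val ssc  = new StreamingContext(conf, Seconds(10))

// Stream X; host and port are placeholders.
val x = ssc.socketTextStream("localhost", 9999)
x.persist()  // avoid recomputing X's partitions for each of the two flows below

// Function A: X -> Y -> Z, then an output operation writing each batch to a file.
val z = x.filter(_.nonEmpty).map(_.toUpperCase)
z.foreachRDD(rdd => rdd.saveAsTextFile(s"/tmp/flow-a/${System.currentTimeMillis()}"))

// Function B: X -> X2 (min value of each RDD), printed by the driver.
val lengths = x.map(_.length)
lengths.foreachRDD { rdd =>
  if (!rdd.isEmpty()) println(s"min length in this batch: ${rdd.min()}")
}

ssc.start()
ssc.awaitTermination()
```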
