Structuring logic in Spark - apache-spark

Hey folks, this may seem like a very basic question, but I'm really in need of some clarity of thought here. Thanks in advance.
Which of these 2 scenarios is a better way of structuring Spark logic? The 2 scenarios in my mind differ in where the GroupBy sits relative to the 2 left joins (in Scenario A it sits between them). I can think of these 3 possibilities:
Scenario A is better
Scenario B is better
It doesn't matter, as the Spark engine creates the same DAG in both scenarios, since there is only 1 action: the Write step at the end.
Also, could Scenario A be more optimal than Scenario B, since the GroupBy between the 2 left joins reduces the number of records, avoiding a massive explosion of my input data and the resulting stress on worker memory?
More context:
I'm using AWS Databricks, cluster version 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12) with Photon
The performance of this logic is my primary concern
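For concreteness, here is a minimal PySpark sketch of one plausible reading of the two scenarios. The actual scenario diagrams are not shown here, so the table names (fact, dim1, dim2), the join keys, and the aggregation are hypothetical; Scenario A places the GroupBy between the joins (as described above), and Scenario B is assumed to place it after both joins.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Hypothetical input tables standing in for the real ones.
fact, dim1, dim2 = (spark.table(t) for t in ("fact", "dim1", "dim2"))

# Scenario A (as described in the question): GroupBy between the two left joins,
# shrinking the record count before the second join.
a = (fact.join(dim1, "key1", "left")
         .groupBy("key2").agg(F.sum("amount").alias("amount"))
         .join(dim2, "key2", "left"))

# Scenario B (assumed): both left joins first, GroupBy at the end.
b = (fact.join(dim1, "key1", "left")
         .join(dim2, "key2", "left")
         .groupBy("key2").agg(F.sum("amount").alias("amount")))

# The Write is the single action in both cases.
a.write.mode("overwrite").saveAsTable("output_table")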

Related

What exactly are the benefits of Spark over MapReduce if I'm doing batch processing?

I know Spark has an in-memory capability that is very useful for iterative jobs. But what if my requirement is traditional batch-processing ETL? Does Spark provide me any benefit there? Please give all the pointers related to this; it will help me a lot.
How does Spark help me when there is no iterative work and it's a batch process?
Is there any scenario where MapReduce would perform better than Spark?
Assuming you know MapReduce, then consider:
writing word counting in MR when you need to list the top N words: far more work over multiple steps in MR vs. 7 or 8 lines in Spark (see the sketch after this list).
for those doing dimension processing a la the dimensional model, it's a lot easier in Spark.
Spark Structured Streaming use cases...
certain tasks with extremely high amounts of data may well be better off using MR if you cannot acquire enough hardware or cloud compute resources, i.e. writing to disk and processing per functional step.
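As a rough illustration of the first point, a top-N word count really does fit in a handful of PySpark lines; the input path and N below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
N = 10
top_n = (spark.sparkContext.textFile("/data/corpus.txt")   # hypothetical input path
         .flatMap(lambda line: line.split())               # one record per word
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)                  # count per word
         .takeOrdered(N, key=lambda kv: -kv[1]))           # top N by count
print(top_n)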

How are the task results processed in Spark?

I am new to Spark and I am currently trying to understand its architecture.
As far as I know, the Spark cluster manager assigns tasks to worker nodes and sends them partitions of the data. Once there, each worker node performs the transformations (like mapping etc.) on its own partition of the data.
What I don't understand is: where do the results of these transformations from the various workers go? Are they sent back to the cluster manager / driver and reduced there (e.g. summing the values of each unique key)? If yes, is there a specific way this happens?
It would be nice if someone could enlighten me; neither the Spark docs nor other resources on the architecture have been able to do so.
Good question; I think you are asking how a shuffle works...
Here is a good explanation.
When does shuffling occur in Apache Spark?
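To make that concrete, here is a tiny sketch: the wide transformation below triggers a shuffle between executors, and the aggregated partitions stay on the executors rather than going to the driver, until an action such as collect() explicitly asks for them.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
sums = rdd.reduceByKey(lambda x, y: x + y)  # wide transformation: the shuffle moves equal keys to the same executor
print(sums.collect())                       # only this action ships the reduced results back to the driver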

How to achieve vertical parallelism in spark?

Is it possible to run multiple calculations in parallel using spark?
Example cases that could benefit from that:
running column-wise tasks for large columns. Applying StringIndexer to 10K columns can benefit from keeping each column's calculation on a single worker and having as many workers as possible, each working on a single column.
running numerous atomic tasks for small datasets. For example:
for in_path, out_path in long_ds_list:
    spark.read.load(in_path).select('column').distinct().write.save(out_path)
The closest equivalents I can think of would be SparkR.lapply() or .NET Parallel.ForEach(), but for a cluster environment rather than the simpler multi-threading case.
I'd say that Spark is good at scheduling distributed computing tasks and could handle your cases with ease, but you'd have to develop the solutions yourself. I'm not saying it'd take ages, but it would require quite a lot of effort, since it sits below the developer-facing APIs of Spark SQL, Spark MLlib, Structured Streaming and such.
You'd have to use Spark Core API and create a custom RDD that would know how to describe such computations.
Let's discuss the first idea.
running column-wise tasks for large columns. Applying StringIndexer to 10K columns can benefit from keeping each column's calculation on a single worker and having as many workers as possible, each working on a single column.
"column-wise tasks for large columns" seems to suggest that you think about Spark SQL's DataFrames and Spark MLlib's StringIndexer Transformer. They are higher-level APIs that don't offer such features. You're not supposed to deal with the problem using them. It's an optimization feature so you have to go deeper into Spark.
I think you'd have to rewrite the higher-level APIs in Spark SQL and Spark MLlib to use your own optimized custom code where you'd have the feature implemented.
Same with the other requirement, but this time you'd have to be concerned with Spark SQL only (leaving Spark MLlib aside).
Wrapping up, I think both are possible with some development (i.e. not available today).
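For what it's worth, a common driver-side pattern for the second case (many small, independent jobs) is to submit them from separate threads against the same SparkSession; Spark does run jobs submitted from different threads concurrently. This is a workaround rather than the custom-RDD route described above, and long_ds_list and the load/save calls below are placeholders taken from the question.

from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
long_ds_list = [("/in/a", "/out/a"), ("/in/b", "/out/b")]  # placeholder (input, output) pairs

def run_one(paths):
    in_path, out_path = paths
    # one small, independent Spark job per dataset, as in the question above
    spark.read.load(in_path).select("column").distinct().write.save(out_path)

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(run_one, long_ds_list))  # jobs from separate threads run concurrently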

Spark Decision tree fit runs in 1 task

I am trying to "train" a DecisionTreeClassifier using Apache Spark running on a cluster in Amazon EMR. I can see that around 50 executors are added, and that the features are created by querying a Postgres database using Spark SQL and are stored in a DataFrame.
The DecisionTree fit method takes many hours even though the dataset is not that big (10,000 db entries with a couple of hundred bytes in each row). I can see that there is only one task for this, so I assume that is the reason it's so slow.
Where should I look for the reason that this is running in one task?
Is it the way that I retrieve the data?
I am sorry if this is a bit vague, but I don't know whether the code that retrieves the data is relevant, whether it's a parameter of the algorithm (although I didn't find anything online), or whether it's just Spark tuning.
I would appreciate any direction!
Thanks in advance.
Spark relies on data locality, and it seems that all the data is located in a single place, hence Spark uses a single partition to process it. You could apply a repartition, or state the number of partitions you would like to use at load time (see the sketch below). I would also look into the decision tree API and see if you can set the number of partitions for it specifically.
Basically, partitions are your level of parallelism.
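As a rough sketch of both suggestions, assuming the features come from the Postgres read mentioned in the question; the connection details, table and column names below are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option 1: ask for parallelism at load time by splitting the JDBC read
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")   # hypothetical connection
      .option("dbtable", "features")                     # hypothetical table
      .option("partitionColumn", "id")                   # numeric column to split on
      .option("lowerBound", "1")
      .option("upperBound", "10000")
      .option("numPartitions", "50")                     # roughly one partition per executor
      .load())

# Option 2: repartition after loading, before calling fit()
df = df.repartition(50)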

Why is Spark fast compared to other data analytics tools like R, Python, etc.?

I am looking for a quick answer to a very basic question related to Spark. I really don't understand how Spark works and why it is fast.
The question is: "Is Spark fast because it divides a job into, say, 100 parts and runs all the parts at the same time, or is it fast because its processing speed is super fast (in this case I am assuming that Spark does not divide a job into 100 parts but just processes the job in one go), or can it do both?"
Another question: "Is Spark a cluster of different physical machines, or a cluster of different environments on a single machine?"
Thanks,
The question is probably going to be closed, but anyway:
Spark may or may not partition the job, or more precisely the data, depending on configuration. It's correct that partitioning helps with parallelism, which provides a major performance gain. This is either non-existent or very limited in Python libraries or R.
A reasonably accurate explanation would be: Spark is a cluster of processes, which may or may not be on a single machine.
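A tiny illustration of the partitioning point, purely as a sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1_000_000, numPartitions=100)  # the same data split into 100 partitions
print(df.rdd.getNumPartitions())                   # 100 chunks that executors can work on in parallel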
