How to achieve vertical parallelism in spark? - apache-spark

Is it possible to run multiple calculations in parallel using spark?
Example cases that could benefit from that:
running column-wise tasks for wide datasets. Applying StringIndexer to 10K columns could benefit from running each column's calculation on a single worker, with as many workers as possible working on separate columns.
running numerous atomic tasks for small datasets. For example:
for in_path, out_path in long_ds_list:
    spark.read.load(in_path).select('column').distinct().write.save(out_path)
The closest equivalents I can think of would be SparkR.lapply() or .NET's Parallel.ForEach(), but for a cluster environment rather than a simpler multi-threading case.

I'd say that Spark is good at scheduling distributed computing tasks and could handle your cases with ease, but you'd have to develop the solutions yourself. I'm not saying it'd take ages, but it would require quite a lot of effort, since it's below the developer-facing API of Spark SQL, Spark MLlib, Structured Streaming and such.
You'd have to use Spark Core API and create a custom RDD that would know how to describe such computations.
Let's discuss the first idea.
running column-wise tasks for wide datasets. Applying StringIndexer to 10K columns could benefit from running each column's calculation on a single worker, with as many workers as possible working on separate columns.
"column-wise tasks for large columns" seems to suggest that you think about Spark SQL's DataFrames and Spark MLlib's StringIndexer Transformer. They are higher-level APIs that don't offer such features. You're not supposed to deal with the problem using them. It's an optimization feature so you have to go deeper into Spark.
I think you'd have to rewrite the higher-level APIs in Spark SQL and Spark MLlib to use your own optimized custom code where you'd have the feature implemented.
Same with the other requirement, but this time you'd have to be concerned with Spark SQL only (leaving Spark MLlib aside).
Wrapping up, I think both are possible with some development (i.e. not available today).
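That said, a common practical workaround for the second case (many small independent jobs) is to submit them from multiple driver threads: Spark's scheduler is thread-safe, and concurrently submitted jobs can run in parallel on the cluster. A minimal sketch of that pattern, where a plain Python callable (the hypothetical `process_one`) stands in for the actual Spark read/write pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-dataset job; in a real app the body would be e.g.
#   spark.read.load(in_path).select('column').distinct().write.save(out_path)
def process_one(paths):
    in_path, out_path = paths
    return f"{in_path} -> {out_path}"  # stand-in for the Spark action

long_ds_list = [("in/a", "out/a"), ("in/b", "out/b"), ("in/c", "out/c")]

# Each submitted callable triggers its own Spark job; the (thread-safe)
# scheduler runs them concurrently, subject to available executors.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_one, long_ds_list))

print(results)
```

With Spark's FAIR scheduler enabled, such concurrently submitted jobs share cluster resources instead of queuing FIFO.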

Related

What exactly are the benefits of Spark over Mapreduce if I'm doing a batch processing?

I know Spark has in memory capability that is very useful for iterative jobs. But what if my requirement is traditional batch processing ETL. Does Spark provide me any benefit there? Please give all the pointers related to this, it will help me a lot.
How does Spark help me in case there are no iterative work and it's a batch process?
Is there any scenario where MapReduce would perform better than Spark?
Assuming you know Map Reduce, then consider:
writing Word Count in MR when you need to list the top N words: far more work over multiple steps in MR vs. 7 or 8 lines in Spark.
for those with dimension processing a la dimensional model, a lot easier to do in Spark.
Spark Structured Streaming use cases...
Certain tasks with extremely high amounts of data may well be better off using MR if you cannot acquire enough hardware or Cloud compute resources, i.e. writing to disk and processing per functional step.
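For a sense of the word-count comparison above: in Spark the whole top-N pipeline is a handful of chained operators (roughly textFile → flatMap → map → reduceByKey → takeOrdered), whereas MR needs a second job just for the ordering step. A pure-Python analogue of that pipeline, mirroring each operator (the input lines are invented for illustration):

```python
from collections import defaultdict

lines = ["to be or not to be", "to be is to do"]
N = 2

# flatMap: split lines into words; map: pair each word with a count of 1
pairs = [(w, 1) for line in lines for w in line.split()]

# reduceByKey: sum counts per word (what a shuffle + reduce does in Spark)
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

# takeOrdered(N, key=...): top N words by descending count
top_n = sorted(counts.items(), key=lambda kv: -kv[1])[:N]
print(top_n)
```

Each line here corresponds one-to-one to a Spark RDD operator, which is why the Spark version stays so short.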

What is optimal number of Stages in Spark App?

Is there some rule of thumb or best practice regarding number of stages in Spark job?
When do you consider breaking job in smaller pieces?
I found smaller jobs easier to analyze and optimize, but on the other hand loading/extracting data between each job comes with a cost.
There is no hard rule about optimal number of Stages for a Spark App.
It depends on what your functionality is that dictates the number of Stages.
Certain aspects result in Stages due to the Spark Architecture - which makes sense.
But Catalyst & Tungsten optimize and fuse code; they cannot eliminate "shuffle boundaries", each of which means a new Stage. That is also not their task: the DAG Scheduler (under the hood for DataFrames) handles that.
You can .cache things to reduce re-computation for subsequent Actions in a Spark App, but that has a certain cost as well.
You can use things that reduce "shuffling", e.g. reduceByKey for legacy RDDs.
For DataFrames and Datasets, Spark will generate better Execution Plans (in general) and indeed some extra Stages (e.g. for computing pivot values when using pivot).
You sort of partially answer your own question with the aspect of writing and loading, but bucketBy can help with such an approach. However, I am not sure why the complexity is greater with a larger Spark App - unless you mean using intermediate tables with fewer JOINs and UNIONs in smaller pieces. But the number of Stages is then only a consequence, and not so much a deciding factor.
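To illustrate the reduceByKey point above: it pre-aggregates within each partition before the shuffle, so far fewer records cross the stage boundary than with groupByKey. A pure-Python sketch of that map-side combine (partition contents invented for illustration):

```python
from collections import Counter

# Two partitions of (key, value) records
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey shuffles every record as-is
shuffled_without_combine = sum(len(p) for p in partitions)

# reduceByKey first combines within each partition (map-side combine) ...
combined = []
for part in partitions:
    local = Counter()
    for key, value in part:
        local[key] += value
    combined.append(list(local.items()))

# ... so only one record per (partition, key) crosses the stage boundary
shuffled_with_combine = sum(len(p) for p in combined)

# Final reduce after the shuffle
totals = Counter()
for part in combined:
    for key, value in part:
        totals[key] += value

print(shuffled_without_combine, shuffled_with_combine, dict(totals))
```

Here 7 records shrink to 4 before the shuffle; with many duplicate keys per partition the saving grows accordingly.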

When should we go for Spark-sql and when should we go for Spark RDD

In which scenarios should we prefer spark RDD to write a solution, and in which scenarios should we choose spark-sql? I know spark-sql gives better performance and works best with structured and semi-structured data. But what other factors do we need to take into consideration while choosing between spark RDD and spark-sql?
I don't see much reasons to still use RDDs.
Assuming you are using a JVM-based language, you can use Dataset, which is a mix of Spark SQL + RDD (DataFrame == Dataset[Row]); according to the Spark documentation:
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The problem is that Python does not support Dataset, so you will fall back to RDDs and lose the Spark SQL optimizations when you work with non-structured data.
I found DFs easier to use than DSs - the latter are still subject to development imho. The comment on pyspark is indeed still relevant.
RDDs are still handy for zipWithIndex to put ascending, contiguous sequence numbers on items.
DFs / DSs have a columnar store and have a better Catalyst (Optimizer) support.
Also, many things with RDDs are painful, like a JOIN requiring (key, value) pairs and a multi-step join if you need to JOIN more than 2 tables. They are legacy. Problem is the internet is full of legacy, and thus RDD jazz.
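On the zipWithIndex point: it assigns ascending, contiguous indices by first counting each partition's elements, then offsetting every partition by the sizes of the ones before it. A pure-Python sketch of that two-pass scheme (partition data invented for illustration):

```python
# Elements spread over three partitions
partitions = [["x", "y"], ["z"], ["p", "q", "r"]]

# Pass 1: count elements per partition (a small separate job in Spark)
sizes = [len(p) for p in partitions]

# Each partition's starting offset: cumulative sum of the earlier sizes
offsets = [sum(sizes[:i]) for i in range(len(sizes))]

# Pass 2: index locally within each partition and add its offset
indexed = [
    (elem, offsets[i] + j)
    for i, part in enumerate(partitions)
    for j, elem in enumerate(part)
]
print(indexed)
```

This is why zipWithIndex triggers an extra job in Spark: the counts must be known before any index can be assigned.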
RDD
RDD is a collection of data distributed across the cluster, and it handles both unstructured and structured data. Processing is typically expressed through functions applied to the data.
DF
Data frames are basically two-dimensional arrays of objects defining the data in rows and columns. It's similar to relational tables in a database. Data frames handle only structured data.

Which query to use for better performance, join in SQL or using Dataset API?

While fetching and manipulating data from HBASE using spark, *Spark sql join* vs *spark dataframe join* - which one is faster?
RDDs can outperform DataFrames and Spark SQL in some cases, but in my experience DataFrame functions perform well compared to Spark SQL. The link below gives some insights on this.
Spark RDDs vs DataFrames vs SparkSQL
As far as I can tell, they should behave the same with regard to performance. SQL is internally compiled to the same plan as the DataFrame operations.
I don't have access to a cluster to properly test but I imagine that the Spark SQL just compiles down to the native data frame code.
The rule of thumb I've heard is that the SQL code should be used for exploration and dataframe operations for production code.
Spark SQL brings a powerful new optimization framework called Catalyst. Using Catalyst, Spark can automatically transform SQL queries so that they execute more efficiently.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations, that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The execution speed will be the same, because they use same optimization algorithms.
If the join might be shared across queries, a carefully implemented join with RDDs might be a good option. However, if this is not the case, let Spark/Catalyst do its job and join within Spark SQL. It will do all the optimization, so you wouldn't have to maintain your join logic etc.
Spark SQL join and Spark DataFrame join are almost the same thing. The join is actually delegated to RDD operations under the hood. On top of the RDD operations we have convenience layers like Spark SQL, DataFrames or Datasets. In the case of Spark SQL, a tiny amount of extra time is spent parsing the SQL.
It should be evaluated more in terms of good programming practice. I like Datasets because you can catch syntax errors while compiling, and the encoders behind the scenes take care of compacting the data and executing the query.
I did some performance analysis for sql vs dataframe on Cassandra using spark, I think it will be the same for HBASE also.
In my tests, SQL worked faster than the dataframe approach. The reason might be that in the dataframe approach there are a lot of Java objects involved, whereas in the SQL approach everything is done in-memory.
Attaching results.
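Whichever surface syntax is used, SQL or the DataFrame API, the physical join Catalyst eventually picks is the same; when one side is small it is typically a broadcast hash join, which builds a hash map from the small side and probes it with the large side. A pure-Python sketch of that idea (tables invented for illustration):

```python
# Small dimension table (would be broadcast to every executor)
dim = [(1, "red"), (2, "blue")]
# Large fact table (stays partitioned across executors)
fact = [(10, 1), (11, 2), (12, 1), (13, 3)]

# Build phase: hash map over the broadcast side's join key
build = {key: value for key, value in dim}

# Probe phase: stream the large side, emit matches (inner join)
joined = [
    (fact_id, key, build[key])
    for fact_id, key in fact
    if key in build
]
print(joined)
```

Because the probe side never moves, no shuffle is needed, which is why Spark prefers this plan regardless of whether you wrote SQL or DataFrame code.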

How mapping/reducing phases work in Spark

I'm coming from a MapReduce background and I'm quite new to Spark. I could not find an article explaining the architectural difference between MapReduce and Spark. My understanding so far is that the only difference between MapReduce and Spark is the notion of 'in-memory' processing. That is, Spark has mapping/reducing phases and they might run on two different nodes within the cluster. Pairs with the same keys are transferred to the same reducer and there is a shuffling phase involved. Am I correct? Or is there some difference in the way mapping and reducing stages are done and...
I think it's directly on point, so I don't mind pointing you to a blog post I wrote:
http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
Spark is a large superset of MapReduce, in the sense that you can express MapReduce with Spark operators, but a lot of other things too. It has a large set of small operations from which you construct pipelines. So there's no 1:1 mapping, but you can identify how a lot of MapReduce elements correspond to Spark. Put differently: MapReduce actually gives you two operations that do a lot more than 'map' and 'reduce', which may not have been obvious so far.