How to find partitions that go to same node? - apache-spark

Say that I have a HashPartitioner and I use it to partition two RDDs. Now, if those two RDDs have some common values, they will end up on the same node because they are partitioned by the same partitioner. What I'd like to do is find those partitions.
In other words, how can I find the partitions of two RDDs that end up on the same node when partitioned by the same partitioner?

I do two things. First, one trick I like to use--particularly when I am experimenting--is glom. This is a method on RDD that expresses it as an Array[Array[T]], where each inner array represents a partition. So when I am in the Spark shell or writing a quick driver program to experiment, I find glom helpful for reasoning about the effect of my partitioning strategy and how it is maintained or changed over the course of my transformations.
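If it helps, here is a minimal spark-shell style sketch of the glom approach; the sample data and the partition count of 4 are just assumptions for illustration.

import org.apache.spark.HashPartitioner

val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3), ("a", 4)))
  .partitionBy(new HashPartitioner(4))

// glom turns each partition into an Array, giving an Array[Array[(String, Int)]]
rdd.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: ${part.mkString(", ")}")
}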
Then, if I care to know which node has which partition(s), I consult my resource manager--typically Mesos, YARN, or Spark Standalone--to see those details.

The method I was looking for was zipPartitions().
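For reference, a hedged sketch of how zipPartitions() can walk two co-partitioned RDDs partition by partition; the data, the partitioner, and the partition count are illustrative only.

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(4)
val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(part)
val rdd2 = sc.parallelize(Seq(("a", 10), ("c", 30))).partitionBy(part)

// Because both RDDs use the same partitioner, partition i of rdd1 and
// partition i of rdd2 hold the same keys and can be processed together.
val common = rdd1.zipPartitions(rdd2) { (it1, it2) =>
  val keys2 = it2.map(_._1).toSet
  it1.filter { case (k, _) => keys2.contains(k) }
}
common.collect().foreach(println)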

Related

How to spark partitionBy/bucketBy correctly?

Q1. Will an ad-hoc (dynamic) repartition of the data one line before a join help to avoid shuffling, or will the shuffling happen anyway at the repartition with no way to escape it?
Q2. Should I repartition/partitionBy/bucketBy? What is the right approach if I will join on columns day and user_id in the future? (I am saving the results as Hive tables with .write.saveAsTable.) I guess I should partition by day and bucket by user_id, but that seems to create thousands of files (see Why is Spark saveAsTable with bucketBy creating thousands of files?).
Some 'guidance' off the top of my head, noting that title and body of text differ to a degree:
Question 1:
A JOIN will do any (hash) partitioning / repartitioning required automatically, if needed and if not using a Broadcast JOIN. You may set the number of partitions for shuffling or use the default of 200. If there are more parties (DataFrames) involved, the same considerations apply.
repartition is a transformation, so an up-front repartition may not be executed at all due to Catalyst optimization; see the physical plan generated by .explain. That is the deal with lazy evaluation: whether something is necessary is determined upon Action invocation.
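To see whether an up-front repartition actually survives, read the physical plan. A small spark-shell style sketch; df1, df2 and the user_id join key are assumptions for illustration.

import spark.implicits._

// Hypothetical DataFrames; the point is only to inspect the plan.
val df1 = spark.range(1000).withColumnRenamed("id", "user_id")
val df2 = spark.range(1000).withColumnRenamed("id", "user_id")

// Any manual repartition here is a transformation; Catalyst decides whether
// it is actually needed once the join forces an Action.
val joined = df1.repartition($"user_id").join(df2, "user_id")

// Look for Exchange (shuffle) nodes in the physical plan.
joined.explain()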
Question 2:
If you have a use case to JOIN certain input / output regularly, then using Spark's bucketBy is a good approach. It obviates shuffling. The Databricks docs show this clearly.
A Spark schema using bucketBy is NOT compatible with Hive, so these remain Spark-only tables, unless this has changed recently.
Using Hive partitioning as you state depends on push-down logic, partition pruning, etc. It should work as well, but you may have a different number of partitions inside the Spark framework after the read. It is a bit more complicated than saying I have N partitions, so I will get N partitions on the initial read.
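A sketch of the day/user_id layout the question describes; df, the column names, and the bucket count of 50 are assumptions, not a recommendation.

// Partition by the low-cardinality column, bucket by the high-cardinality one.
// Note: bucketBy requires saveAsTable, i.e. a Spark-managed table.
df.write
  .partitionBy("day")
  .bucketBy(50, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed")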

When should we go for Spark-sql and when should we go for Spark RDD

In which scenarios should we prefer Spark RDDs to write a solution, and in which scenarios should we choose Spark SQL? I know Spark SQL gives better performance and works best with structured and semi-structured data. But what other factors do we need to take into consideration while choosing between Spark RDDs and Spark SQL?
I don't see many reasons to still use RDDs.
Assuming you are using a JVM-based language, you can use Dataset, which is a mix of Spark SQL and RDDs (DataFrame == Dataset[Row]). According to the Spark documentation:
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
The problem is that Python does not support Dataset, so you will use RDDs and lose the Spark SQL optimizations when you work with non-structured data.
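A small Scala illustration of the DataFrame/Dataset relationship; the Person case class and the sample rows are assumptions for illustration.

import org.apache.spark.sql.{Dataset, Row, SparkSession}

val spark = SparkSession.builder().appName("ds-example").getOrCreate()
import spark.implicits._

case class Person(name: String, age: Int)

// Typed Dataset: strong typing plus Catalyst optimization.
val people: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

// A DataFrame is just Dataset[Row]: untyped rows, same optimized engine.
val df: Dataset[Row] = people.toDF()

// Lambdas work on the typed view; SQL-style expressions on the untyped one.
people.filter(_.age > 26).show()
df.filter($"age" > 26).show()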
I found DFs easier to use than DSs; the latter are still subject to development, imho. The comment on pyspark is indeed still relevant.
RDDs are still handy for zipWithIndex to put ascending, contiguous sequence numbers on items, as sketched at the end of this answer.
DFs / DSs have a columnar store and better Catalyst (optimizer) support.
Also, many things with RDDs are painful, like a JOIN requiring a key/value structure and a multi-step join if you need to JOIN more than two tables. They are legacy. The problem is the internet is full of legacy and thus RDD jazz.
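For example, the zipWithIndex trick mentioned above; the sample data is illustrative.

// Assign ascending, contiguous sequence numbers to items; the DataFrame
// alternative, monotonically_increasing_id, is unique but not contiguous.
val rdd = sc.parallelize(Seq("x", "y", "z"))
val indexed = rdd.zipWithIndex()   // RDD[(String, Long)]
indexed.collect().foreach { case (value, idx) => println(s"$idx -> $value") }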
RDD
RDD is a collection of data spread across the cluster, and it handles both unstructured and structured data. It is typically processed with functional operations.
DF
Data frames are basically a two-dimensional array of objects defining the data in rows and columns. It is similar to relational tables in a database. Data frames handle only structured data.

How to decide number of buckets in Spark

I have read quite a few articles on bucketing in Spark but still haven't been able to get a clear picture of it. What I have understood so far is that "bucketing is like a partition inside a partition, and it is used for candidates having very high cardinality, which helps in avoiding reshuffling operations".
Even in the Spark documentation, I can't find enough explanation. Pasting an example from the documentation:
peopleDF.write.bucketBy(42, "name").sortBy("age").saveAsTable("people_bucketed")
I am unable to understand how the number '42' is decided for bucketing here. Kindly help me understand the same. Also, any clearer explanation of bucketing would be great.
42 is just like "what is the meaning of life?". It is simply an example.
Spark bucketing is handy for ETL in Spark whereby Spark Job A writes out the data for t1 according to the bucketing definition, Spark Job B writes out the data for t2 likewise, and Spark Job C joins t1 and t2 using those bucketing definitions, avoiding shuffles aka exchanges. Optimization.
There is no general formula. It depends on volumes, available executors, etc. The main point is avoiding shuffling. As a guideline, the default number of shuffle partitions for JOINs and aggregations is 200, so 200 or greater could be an approach, but again, how many resources do you have on your cluster?
But to satisfy your quest for knowledge, one could argue that the 42 should really be set to the number of executors (at 1 core each) you have allocated to the Spark job/app, leaving aside the issue of skewness.
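To make the Job A / Job B / Job C flow concrete, a hedged sketch; the table names, dfA/dfB, and the bucket count of 200 are assumptions for illustration.

// Job A and Job B: write both sides bucketed the same way on the join key.
dfA.write.bucketBy(200, "user_id").sortBy("user_id").saveAsTable("t1")
dfB.write.bucketBy(200, "user_id").sortBy("user_id").saveAsTable("t2")

// Job C: join the bucketed tables; with matching bucket definitions the
// physical plan should show no Exchange on either side of the join.
val t1 = spark.table("t1")
val t2 = spark.table("t2")
t1.join(t2, "user_id").explain()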

Do wide dependency and shuffle always occur at the same time in Spark?

I am reading a Spark book and can hardly understand the sentence below. I cannot imagine a case which is a wide dependency but where we don't need a shuffle. Can anyone give me an example?
"In certain instances, for example, when Spark already knows the data is partitioned in a certain way, operations with wide dependencies do not cause a shuffle." -- "High Performance Spark" by Holden Karau
RDD dependencies are actually defined in terms of partitions and how partitions are created.
Note: the definitions below are simplified for ease of understanding.
If each partition of an RDD is created from only one partition of a single parent RDD, then it is a narrow dependency.
On the other hand, if a partition of an RDD is created from more than one partition (of the same or different RDDs), then it is a wide dependency.
A shuffle operation is required whenever the data needed to create a partition is not in one place (that is, it has to be taken from different locations/partitions).
If the data is already grouped into one or more partitions (using operations like groupBy, partitionBy, etc.), you just have to take the corresponding items from each of the partitions and merge them. In this case, a shuffle operation is not necessary, as the example below shows.
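For example, a join normally implies a wide dependency, but when both inputs already share the same partitioner Spark can compute it without moving data. A hedged spark-shell style sketch; the sample data and partition count are illustrative.

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(4)
val left  = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(part).cache()
val right = sc.parallelize(Seq(("a", 3), ("b", 4))).partitionBy(part).cache()

// Both sides are already hash-partitioned identically, so the join can be
// computed partition by partition; the join adds no new shuffle stage.
val joined = left.join(right)
println(joined.toDebugString)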
For more details refer this, especially the visual example images.

Spark streaming RDD partitions

In Spark streaming, is it possible to assign specific RDD partitions to specific nodes in the cluster (for data locality?)
For example, I get a stream of events [a,a,a,b,b,b] and have a 2 node Spark cluster.
I want all a's to always go to Node 1 and all b's to always go to Node 2.
Thanks!
This is possible by specifying a custom partitioner for your RDD. The RangePartitioner will partition your RDD based on ranges, but you can implement any partitioning logic you want with a custom partitioner. It is generally useful/important for partitions to be relatively balanced, and depending on your input data, doing something like this could cause problems (e.g. stragglers), so be careful.
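A hedged sketch of such a custom partitioner; the key-to-partition rule is illustrative only, and note that Spark itself decides which executor holds which partition, so "partition 0" is not pinned to a particular physical node.

import org.apache.spark.Partitioner

// Route all "a" keys to partition 0 and everything else to partition 1.
class AbPartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (key.asInstanceOf[String] == "a") 0 else 1
}

val events = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)))
val routed = events.partitionBy(new AbPartitioner)

// Inspect the result: all "a" records land in one partition, "b" in the other.
routed.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: ${part.mkString(", ")}")
}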
