Spark streaming RDD partitions - apache-spark

In Spark streaming, is it possible to assign specific RDD partitions to specific nodes in the cluster (for data locality?)
For example, I get a stream of events [a,a,a,b,b,b] and have a 2 node Spark cluster.
I want all a's to always go to Node 1 and all b's to always go to Node 2.
Thanks!

This is possible by specifying a custom partitioner for your RDD. The built-in RangePartitioner will partition your RDD by key ranges, but you can implement any partitioning logic you want with a custom partitioner. It's generally useful/important for partitions to be relatively balanced, and depending on your input data, doing something like this could cause problems (e.g. stragglers), so be careful.
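Here is a minimal sketch of what such a partitioner could look like, assuming events are keyed by their letter; the class name and the partition mapping are made-up for this example:

import org.apache.spark.Partitioner

// Hypothetical partitioner that pins key "a" to partition 0 and key "b" to partition 1.
class KeyPinningPartitioner extends Partitioner {
  override def numPartitions: Int = 2

  override def getPartition(key: Any): Int = key match {
    case "a"   => 0
    case "b"   => 1
    case other => (other.hashCode & Integer.MAX_VALUE) % numPartitions // fall back to hashing
  }
}

// Usage on an RDD of (key, event) pairs, e.g. inside transform/foreachRDD on a DStream:
// val pinned = keyedRdd.partitionBy(new KeyPinningPartitioner)

Note that a partitioner only decides which partition a key lands in, not which physical node runs that partition; the scheduler makes that placement, preferring data locality where it can.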

Related

How to distribute data into X partitions on read with Spark?

I'm trying to read data from Hive with Spark DF and distribute it into a specific, configurable number of partitions (correlated with the number of cores). My job is pretty straightforward and it does not contain any joins or aggregations. I've read about the spark.sql.shuffle.partitions property, but the documentation says:
Configures the number of partitions to use when shuffling data for joins or aggregations.
Does this mean that it would be irrelevant for me to configure this property? Or is the read operation considered a shuffle? If not, what is the alternative? Repartition and coalesce seem a bit like overkill for that matter.
To verify my understanding of your problem: you want to increase the number of partitions in the RDD/DataFrame that is created immediately after reading the data.
In this case the property you are after is spark.sql.files.maxPartitionBytes, which controls the maximum amount of data that can be packed into a single partition (please refer to https://spark.apache.org/docs/2.4.0/sql-performance-tuning.html).
The default value is 128 MB, which can be overridden to improve parallelism.
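As a rough sketch, assuming a Hive-backed table (the table name and the 64 MB value are only illustrative, not recommendations), the property can be set when building the session:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-partitioning")
  .enableHiveSupport()
  .config("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024) // 64 MB instead of the 128 MB default
  .getOrCreate()

val df = spark.table("my_hive_table") // hypothetical table name
println(df.rdd.getNumPartitions)      // inspect how many partitions the read produced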
Read is not a shuffle as such. You need to get the data in at some stage.
Either the approach in the other answer can be used, or Spark's own algorithm sets the number of partitions upon a read.
You do not state if you are using the RDD or the DataFrame API. With an RDD you can set the number of partitions at read time. With a DataFrame you generally need to repartition after the read (see the sketch below).
Your point on controlling parallelism is less relevant when joining or aggregating, as you note.
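A minimal illustration of that difference, assuming the spark-shell's built-in sc and spark (the path, table name and the count of 48 are placeholders):

// RDD API: a minimum number of partitions can be passed directly to the read.
val rdd = sc.textFile("hdfs:///some/path", 48) // hypothetical path; 48 is a lower bound, not an exact count

// DataFrame API: the read decides its own partitioning; repartition afterwards if needed.
val df = spark.table("my_hive_table")          // hypothetical table
val spread = df.repartition(48)                // full shuffle into exactly 48 partitions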

How to find partitions that go to same node?

Say that I have a HashPartitioner and I use it to partition two RDDs. Now, if those two RDDs have some common values, they will end up on the same node as they are partitioned by the same partitioner. What I'd like to do is find those partitions.
In other words, how can I find the partitions of two RDDs that end up on the same node when partitioned by the same partitioner?
I do two things. First, one trick I like to use, particularly when I am experimenting, is glom. This is a method on RDD that expresses it as an Array[Array[T]], where each inner array represents a partition. So when I am in the Spark shell or writing a quick driver program to experiment, I find glom helpful for reasoning about the effect of my partitioning strategy and how it is maintained or changed over the course of my transformations.
Then, if I care to know which node has which partition(s), I consult my resource manager (typically Mesos, YARN, or Spark Standalone) to see those details.
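For example, in the spark-shell (where sc is already defined), glom makes the partition contents directly visible; the data here is made-up:

val rdd = sc.parallelize(1 to 10, numSlices = 3)

// glom() turns each partition into an Array, giving one inner Array per partition.
val byPartition: Array[Array[Int]] = rdd.glom().collect()

byPartition.zipWithIndex.foreach { case (contents, idx) =>
  println(s"partition $idx -> ${contents.mkString(", ")}")
}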
The method I was looking for was zipPartitions().
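For reference, a minimal zipPartitions sketch over two RDDs that share the same HashPartitioner (the data and the partition count are made-up); it pairs up partition i of one RDD with partition i of the other, so co-partitioned keys can be compared without a shuffle:

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(4)
val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3))).partitionBy(p)
val right = sc.parallelize(Seq(("a", 10), ("c", 30), ("d", 40))).partitionBy(p)

// The function receives the two co-located partition iterators and must return an iterator.
val commonKeys = left.zipPartitions(right) { (leftIter, rightIter) =>
  val rightKeySet = rightIter.map(_._1).toSet
  leftIter.filter { case (k, _) => rightKeySet.contains(k) }
}
commonKeys.collect().foreach(println)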

Number of Partitions of Spark Dataframe

Can anyone explain how the number of partitions is determined for a Spark DataFrame?
I know that for an RDD, we can specify the number of partitions at creation time, like below.
val RDD1 = sc.textFile("path", 6)
But for a Spark DataFrame there does not seem to be an option to specify the number of partitions at creation time like there is for an RDD.
The only possibility I can think of is to use the repartition API after creating the DataFrame.
df.repartition(4)
So can anyone please let me know whether we can specify the number of partitions while creating a DataFrame?
You cannot, or at least not in the general case, but it is not that different compared to an RDD. For example, the textFile code you've provided sets only a lower bound on the number of partitions.
In general:
Datasets generated locally using methods like range, or toDF on a local collection, will use spark.default.parallelism.
Datasets created from an RDD inherit the number of partitions from the parent RDD.
Datasets created using the data source API:
In Spark 1.x this typically depends on the Hadoop configuration (min / max split size).
In Spark 2.x there is a Spark SQL specific configuration in use.
Some data sources may provide additional options which give more control over partitioning. For example, the JDBC source allows you to set the partitioning column, the value range and the desired number of partitions (see the sketch after this answer).
Default number of shuffle partitions for a Spark DataFrame: 200.
Default number of partitions for an RDD: 10.
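For example, the JDBC option mentioned above looks roughly like this (the URL, table and bounds are placeholders, assuming an existing SparkSession spark):

val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "events")
  .option("partitionColumn", "id")   // numeric column used to split the reads
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "16")     // desired number of read partitions
  .load()

println(jdbcDf.rdd.getNumPartitions) // should report 16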

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in spark, or will I need to hack something together that kind of goes around spark?
The Master node in Spark allocates resources to a particular job; once the resources are allocated, the Driver ships the complete code with all its dependencies to the various executors.
The first step in every job is to load the data into the Spark cluster. You can read the data from any underlying data repository like a database, a filesystem, web services etc.
Once the data is loaded it is wrapped into an RDD, which is partitioned across the nodes in the cluster and stored in the workers'/executors' memory. You can control the number of partitions through the various RDD APIs, but you should only do so when you have valid reasons to.
All operations are then performed over RDDs using the various methods/operations exposed by the RDD API. The RDD keeps track of partitions and partitioned data and, depending on the need or request, automatically queries the appropriate partition.
In a nutshell, you do not have to worry about the way data is partitioned by the RDD, which partition stores which data, or how they communicate with each other; but if you do care, you can write your own custom partitioner, instructing Spark how to partition your data.
Secondly, if your data cannot be partitioned then I do not think Spark would be an ideal choice, because everything would be processed on one single machine, which is contrary to the idea of distributed computing.
I'm not sure what exactly your use case is, but there are people who have been leveraging Spark for image processing; see here for the comments from Databricks.

Is there a way to check if a variable in Spark is parallelizable?

So I am using the groupByKey function in Spark, but it's not being parallelized: I can see that during its execution only 1 core is being used. It seems that the data I'm working with doesn't allow parallelization. Is there a way in Spark to know whether the input data is amenable to parallelization, or whether it's not a proper RDD?
The unit of parallelization in Spark is the 'partition'. That is, RDDs are split into partitions and transformations are applied to each partition in parallel. How RDD data is distributed across partitions is determined by the Partitioner. By default, the HashPartitioner is used, which should work fine for most purposes.
You can check how many partitions your RDD is split into using:
rdd.partitions // Array of partitions
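Building on that, a quick spark-shell check (using the built-in sc) that usually explains a single-core groupByKey: if all records share one key, every value hashes to the same output partition, so only one task does the grouping. The data here is made-up:

val pairs = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3)), numSlices = 4)

println(pairs.getNumPartitions)    // how many partitions the input RDD has
println(pairs.partitioner)         // None until a partitioner is applied

val grouped = pairs.groupByKey()
println(grouped.getNumPartitions)  // number of output partitions after the shuffle
// With a single distinct key, all values land in one output partition,
// so only one task (one core) does the grouping work.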
